# BioEvolve Bench — Agent Skill Guide

You are interacting with BioEvolve Bench, a platform for LLM-driven evolution of bioinformatics algorithms.

## Authentication

All write operations require a platform API token (`beb_xxx`).

```
Authorization: Bearer beb_<token>
```

All actions performed with this token are attributed to the token's owner.

### If you don't have a token

Ask the user to provide one. They can get a token by:

1. **Sign in**: Go to the platform and click "Login with GitHub"
   - Production: `https://bioevolve.example.com`
   - Local: `http://localhost:8000/api/auth/login`
2. **Generate token**: Go to Profile → API Tokens (`/profile/tokens/`)
3. **Copy the token**: It starts with `beb_` and is only shown once

If the user hasn't registered yet, direct them to sign in with GitHub first — account creation is automatic.

Once they give you the token, use it in all subsequent requests:

```bash
export BEB_TOKEN="beb_..."
curl -H "Authorization: Bearer $BEB_TOKEN" https://bioevolve.example.com/api/auth/me
```

This should return their user info. If you get a 401, the token is invalid.
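For scripted use, the same check can be wrapped in a small helper. This is a sketch: only the base URL, the `beb_` token prefix, and the `/auth/me` path come from this guide; the helper names are made up.

```python
import urllib.request

BASE_URL = "https://bioevolve.example.com/api"

def auth_headers(token: str) -> dict:
    """Build the Authorization header required by all write operations."""
    if not token.startswith("beb_"):
        raise ValueError("platform tokens start with 'beb_'")
    return {"Authorization": f"Bearer {token}"}

def whoami(token: str) -> bytes:
    """GET /auth/me; returns the raw response body. A 401 means the
    token is invalid (urllib raises HTTPError in that case)."""
    req = urllib.request.Request(BASE_URL + "/auth/me",
                                 headers=auth_headers(token))
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```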

### Token permissions

A token has the same permissions as its owner:
- All users can: create entities, submit harnesses, post results, browse data
- Admins can also: approve/reject harness submissions

## Base URL

```
https://bioevolve.example.com/api
```

For local development: `http://localhost:8000/api`

## Quick Reference

| Action | Method | Endpoint |
|--------|--------|----------|
| List tasks | GET | `/api/tasks` |
| Get task detail | GET | `/api/tasks/{id}` |
| List harnesses | GET | `/api/registry/harnesses` |
| List metrics | GET | `/api/registry/metrics` |
| List problems | GET | `/api/registry/problems` |
| List datasets | GET | `/api/registry/datasets` |
| List algorithms | GET | `/api/registry/algorithms` |
| Create task | POST | `/api/benchmarks` |
| Submit harness | POST | `/api/harness-submit` |
| Submit result | POST | `/api/results` |
| Who am I | GET | `/api/auth/me` |

---

## 1. Create an Evolution Task

A task defines: what algorithm to evolve, on what data, with what metrics.

Tasks use separate datasets for evolution (train) and final evaluation (test):
- **train**: The harness sees this during evolution. Fast iteration feedback.
- **test**: The harness never sees this. Platform evaluates after evolution. Only test scores go on the leaderboard.

Test datasets should be from different experiments/organisms than train to measure true generalization.

A working task needs **three** API calls in order:
1. `POST /api/benchmarks` — register the metadata (this writes
   `benchmark.yaml` and a default `run_config.yaml`)
2. `POST /api/benchmarks/<id>/files` — upload `solve.py` and
   `evaluator.py` (the seed code + scoring script)
3. Make sure each declared dataset already has its files uploaaded
   (see "Uploading dataset files" below)

```bash
# Step 1 — metadata
curl -X POST /api/benchmarks \
  -H "Authorization: Bearer beb_<token>" \
  -H "Content-Type: application/json" \
  -d '{
    "entity_id": "my-task-id",
    "name": "My Task Name",
    "yaml_content": {
      "description": "What this task optimizes",
      "problem": "graph-clustering",
      "algorithm": "leiden",
      "datasets": {
        "train": "pbmc3k",
        "test": ["pbmc3k-perturbed-50k"]
      },
      "compatible_harnesses": [
        "claude-code", "skydiscover-adaevolve", "skydiscover-topk",
        "skydiscover-beam-search", "skydiscover-best-of-n"
      ],
      "metrics": [
        {"id": "runtime-seconds", "role": "objective", "primary": true},
        {"id": "ari", "role": "constraint", "threshold": 0.9, "threshold_op": ">="}
      ],
      "limits": {"timeout": 1800, "max_cost_usd": 10.0},
      "baseline": {
        "command": "python3 solve.py",
        "scores": {"runtime-seconds": 0.22, "ari": 1.0}
      },
      "prompt": "Make scanpy.tl.leiden faster while preserving ari >= 0.9 ..."
    }
  }'

# Step 2 — code (solve.py is the entry point, evaluator.py scores it)
curl -X POST /api/benchmarks/my-task-id/files \
  -H "Authorization: Bearer beb_<token>" \
  -F "files=@./solve.py" \
  -F "files=@./evaluator.py"
```

The seed `solve.py` is dropped into the sandbox as `seed/solve.py` and
reads input from `data/...` (the mounted train dataset). The evaluator
runs afterward, and the platform parses its `key: value` stdout lines
into score values.
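The `key: value` parsing described above can be sketched as follows. This is a guess at the platform's behavior (skip non-matching lines, coerce values to float), not its actual implementation:

```python
def parse_evaluator_stdout(stdout: str) -> dict:
    """Parse 'key: value' lines from evaluator stdout into float scores.
    Lines without a colon or with a non-numeric value are ignored."""
    scores = {}
    for line in stdout.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no colon on this line
        try:
            scores[key.strip()] = float(value.strip())
        except ValueError:
            continue  # non-numeric value; skip
    return scores

# An evaluator that prints two metrics plus a log line:
print(parse_evaluator_stdout("runtime-seconds: 0.22\nari: 1.0\nlog: done"))
# → {'runtime-seconds': 0.22, 'ari': 1.0}
```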

### Available entities

Before creating a task, check what problems/algorithms/datasets/metrics exist:

```bash
# List all problems
curl /api/registry/problems

# List all metrics
curl /api/registry/metrics

# List all harnesses
curl /api/registry/harnesses
```

### Creating new entities

If you need a new problem, algorithm, dataset, or metric:

```bash
curl -X POST /api/registry/problems \
  -H "Authorization: Bearer beb_<token>" \
  -H "Content-Type: application/json" \
  -d '{
    "entity_id": "batch-correction",
    "name": "Batch Effect Correction",
    "yaml_content": {
      "description": "Remove batch effects from single-cell data",
      "domain": "genomics",
      "tags": ["single-cell", "integration"]
    }
  }'
```

Same pattern for `/api/registry/algorithms`, `/api/registry/datasets`, `/api/registry/metrics`.

Required fields per type:
- **problem**: id, name, description, domain
- **algorithm**: id, name, description, problem (reference a problem id)
- **metric**: id, name, description, direction (higher_is_better | lower_is_better | informational)
- **dataset**: id, name, description, files (list of `{name, description}`)
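A quick client-side check against the required-fields list above can catch a 422 before the request is sent. The field table is from this guide; flattening the payload into one dict is a simplification (the API splits `entity_id`/`name` from `yaml_content`):

```python
# Required keys per entity type, per the list above
REQUIRED = {
    "problem": {"id", "name", "description", "domain"},
    "algorithm": {"id", "name", "description", "problem"},
    "metric": {"id", "name", "description", "direction"},
    "dataset": {"id", "name", "description", "files"},
}

def missing_fields(entity_type: str, payload: dict) -> set:
    """Return the required keys absent from a (flattened) registry payload."""
    return REQUIRED[entity_type] - payload.keys()

print(missing_fields("metric", {"id": "ari", "name": "ARI",
                                "description": "Adjusted Rand Index"}))
# → {'direction'}
```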

### Uploading dataset files

After registering a dataset's metadata you also need to upload its actual
files — they live on the platform's shared `bioevolve-data` Modal Volume
so sandboxes mount them directly instead of paying a per-run upload cost.

```bash
curl -X POST /api/registry/datasets/my-dataset/upload \
  -H "Authorization: Bearer beb_<token>" \
  -F "files=@./X_pca.npy" \
  -F "files=@./barcodes.csv" \
  -F "files=@./reference_graph.npz"
```

Rules:
- Each uploaded filename **must** be declared in the dataset yaml's `files:`
  list (otherwise 422). Declaring up front makes the schema explicit.
- The endpoint streams each upload, hashes it, pushes to the volume at
  `<dataset_id>/<filename>`, and writes `sha256` + `size_bytes` back into
  the dataset registry so what's tracked always matches what's on disk.
- Re-uploading the same filename overwrites in place.
- Inside any sandbox, the file is then available at `/volumes/data/<dataset_id>/<filename>`.
  Train sandboxes additionally get a `data → /volumes/data/<train-ds>` symlink so
  `data/foo.h5ad` style relative paths Just Work in seed code.
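Since the endpoint records a `sha256` per file, you can hash files locally before (or after) uploading and compare against the registry entry. A minimal sketch using only the standard library:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Hash a dataset file in 1 MiB chunks, to compare against the
    sha256 the upload endpoint writes into the dataset registry."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```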

---

## 2. Submit a Harness

A harness is an LLM-based evolution framework. Upload its code with a `run.py` entry point.

```bash
curl -X POST /api/harness-submit \
  -H "Authorization: Bearer beb_<token>" \
  -F "harness_id=my-harness" \
  -F "name=My Custom Harness" \
  -F "description=An agent that uses GPT-5 with chain-of-thought" \
  -F "harness_type=agent-loop" \
  -F "llm_provider=openai" \
  -F "llm_model=gpt-5" \
  -F "credentials=OPENAI_API_KEY" \
  -F "files=@run.py" \
  -F "files=@requirements.txt"
```

### run.py interface

Your `run.py` must implement:

```python
def run(workspace: str) -> None:
    """
    workspace contains:
      task.yaml     - Task config (read this first)
      seed/         - Modifiable seed code
      evaluator/    - Read-only evaluation scripts
      data/         - Read-only input data
      reference/    - Read-only reference algorithm source

    Write results to:
      results/metrics.json   - Final best scores (JSON dict)
      results/best/          - Best evolved code snapshot
      results/history.jsonl  - Iteration log (optional)
    """
```
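A minimal do-nothing harness satisfying this interface might look like the sketch below: it snapshots the seed unchanged and writes an empty metrics dict. The file layout comes from the docstring above; a real harness would run an LLM loop and fill in evaluator scores.

```python
import json
import shutil
from pathlib import Path

def run(workspace: str) -> None:
    """Trivial harness: copy the seed as the 'best' code and report
    no scores. A real harness would iterate on seed/ with an LLM here."""
    ws = Path(workspace)
    results = ws / "results"
    results.mkdir(exist_ok=True)
    # Snapshot the (unmodified) seed as the best evolved code
    shutil.copytree(ws / "seed", results / "best", dirs_exist_ok=True)
    # Final best scores; a real harness fills these from evaluator runs
    (results / "metrics.json").write_text(json.dumps({}))
```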

### task.yaml format

```yaml
id: leiden-1m-autodiscover
name: "Leiden Clustering Speed on 1M Neurons"
description: "..."
seed_code: seed/solve.py
evaluator:
  command: "python3 evaluator/evaluator.py data/ref.csv results/labels.csv"
  output: json
metrics:
  - id: runtime-seconds
    role: objective
    primary: true
    direction: lower_is_better
  - id: ari
    role: constraint
    threshold: 1.0
    threshold_op: ">="
    direction: higher_is_better
baseline:
  runtime-seconds: 510.5
  ari: 1.0
timeout: 28800
```
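The apparent semantics of the metric config above — constraints must meet their threshold before the objective counts — can be sketched like this (a reading of the YAML, not platform code; only `>=`/`<=` operators are assumed):

```python
OPS = {">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b}

def satisfies_constraints(metrics: list, scores: dict) -> bool:
    """True if every role=constraint metric meets its threshold."""
    for m in metrics:
        if m.get("role") == "constraint":
            op = OPS[m.get("threshold_op", ">=")]
            if not op(scores[m["id"]], m["threshold"]):
                return False
    return True

metrics = [
    {"id": "runtime-seconds", "role": "objective",
     "direction": "lower_is_better"},
    {"id": "ari", "role": "constraint",
     "threshold": 1.0, "threshold_op": ">="},
]
print(satisfies_constraints(metrics, {"runtime-seconds": 45.2, "ari": 1.0}))
# → True
print(satisfies_constraints(metrics, {"runtime-seconds": 45.2, "ari": 0.98}))
# → False
```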

Harness submissions require admin approval before they appear in the registry.

---

## 3. Submit Results

After running an evolution task (manually or via the platform), submit results:

```bash
curl -X POST /api/results \
  -H "Authorization: Bearer beb_<token>" \
  -H "Content-Type: application/json" \
  -d '{
    "benchmark_id": "leiden-1m-autodiscover",
    "harness_id": "claude-code",
    "scores": {
      "runtime-seconds": 45.2,
      "ari": 1.0,
      "speed-ratio": 11.3
    },
    "cost_usd": 12.50,
    "model": "claude-sonnet-4-20250514",
    "description": "Parallel C++ Leiden via OpenMP"
  }'
```

---

## 4. Browse & Query

```bash
# All tasks with results and leaderboard
curl /api/tasks

# Specific task detail
curl /api/tasks/leiden-1m-autodiscover

# All results for a benchmark
curl "/api/results?benchmark=leiden-1m-autodiscover"

# Health check
curl /api/health
```

---

## Error Handling

All errors return JSON:

```json
{"detail": "Error message here"}
```

Common status codes:
- `401` — Missing or invalid token
- `403` — Admin-only operation
- `404` — Entity not found
- `409` — Duplicate ID
- `422` — Validation error (missing fields, invalid values)
