FARBench: How Far Are LLM Agents from Autonomous Research?

Can LLM agents move beyond assisting researchers to autonomously make progress on frontier problems? FARBench evaluates the end-to-end research loop — explore, code, run, and iterate from an empty workspace — across 29 tasks in 5 domains.

01 · How it works

At each step the agent reads an observation (task spec, files, code, command output, metrics), writes an action (edit files, run a command, submit for evaluation), and the sandbox returns results that feed the next step — a closed empirical loop under a fixed budget.

Tasks: task.yaml (description, metric, budget)
  Domain labels: Vision, Speech, NLP, Robotics, Science, Forecast
Observation:
  Task spec: description + hints
  Files: workspace listing
  Code: file contents
  Command: stdout / stderr
  Error: failure signal
  Metrics: eval + budget
Agent: reads the observation, writes an Action
  Action fields: files_to_write, command, submit_eval, done
Environment:
  Agent Sandbox: writes workspace, runs training commands, returns logs
  Eval Sandbox (isolated evaluation): runs predict.py on test inputs, evaluator scores predictions
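
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's harness: the action field names (files_to_write, command, submit_eval, done) come from the diagram, while the `Observation` layout, the `run_episode` signature, the `"score"` key, and the toy agent, command runner, and evaluator are all assumptions made for the example. The 30-iteration cap matches the budget mentioned elsewhere on this page; the wall-clock budget is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Action:
    """One agent step. Field names follow the diagram; the types are assumed."""
    files_to_write: Dict[str, str] = field(default_factory=dict)
    command: Optional[str] = None
    submit_eval: bool = False
    done: bool = False


@dataclass
class Observation:
    """What the agent sees before choosing its next action (assumed layout)."""
    task_spec: str
    files: Dict[str, str]
    command_output: str = ""
    metrics: Optional[Dict[str, float]] = None
    iteration: int = 0


def run_episode(
    agent: Callable[[Observation], Action],
    run_command: Callable[[str, Dict[str, str]], str],
    evaluate: Callable[[Dict[str, str]], Dict[str, float]],
    task_spec: str,
    max_iters: int = 30,
) -> Optional[float]:
    """Drive the observe -> act -> sandbox loop and return the best score seen."""
    workspace: Dict[str, str] = {}  # the episode starts from an empty workspace
    obs = Observation(task_spec=task_spec, files=dict(workspace))
    best: Optional[float] = None
    for i in range(1, max_iters + 1):
        action = agent(obs)
        workspace.update(action.files_to_write)  # apply file edits
        output = run_command(action.command, workspace) if action.command else ""
        metrics = evaluate(workspace) if action.submit_eval else None
        if metrics is not None:
            score = metrics["score"]  # hypothetical key for the primary metric
            best = score if best is None else max(best, score)
        if action.done:
            break
        # Results of this step feed the next observation.
        obs = Observation(task_spec, dict(workspace), output, metrics, i)
    return best


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_agent(obs: Observation) -> Action:
    if obs.iteration == 0:
        return Action(files_to_write={"predict.py": "v1"},
                      command="python train.py", submit_eval=True)
    if obs.iteration == 1:
        return Action(files_to_write={"predict.py": "v2"}, submit_eval=True)
    return Action(done=True)


def toy_run(command: str, workspace: Dict[str, str]) -> str:
    return f"ran: {command}"


def toy_eval(workspace: Dict[str, str]) -> Dict[str, float]:
    return {"score": 0.2 if workspace["predict.py"] == "v1" else 0.35}


best_score = run_episode(toy_agent, toy_run, toy_eval, task_spec="toy task")
```

In this toy run the agent submits twice and then signals done, so `best_score` is the better of the two evaluations.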

Agent Workflow Replay

Legend: ▸ obs · ▸ think · ▸ act · $ cmd · ⟳ sandbox · running · best-so-far
Observation (examples):
  YAML: name: ..., description: ...
  Directory: workspace/, training_data/
  Source: class Model(nn.Module):
  Artifacts: best_ckpt.pt, predict.npy
  Command: command: $python... / STDOUT: INFO...
  Error: CUDA out of memory
  Dataset: Training data contains 414075...
  Metric: Accuracy=18.4%
  Budget: 19/30 iters, 9.48/10 h
Action (agent examples; writable scope: /workspace):
  inspect: read task.yaml
  execute: python train.py
  edit: change backbone to ViT-B
  debug: batch_size=16
  submit: best_ckpt.pt
Environment:
  Agent Sandbox: training data, training environment
  Eval Sandbox: test data, eval environment
EVAL_LOG
[eval_harness] Phase1: Running predict script: /workspace/predict.py
[eval_harness] Phase2: Running evaluator
[eval_harness] Result: {'accuracy': ...}
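
The two phases in the EVAL_LOG lines above (run the predict script, then run the evaluator) could look roughly like the following. This is a hedged sketch, not the benchmark's eval harness: the file names `test_inputs.json` and `predictions.json`, the command-line contract of `predict.py`, the 60-second timeout, and the accuracy evaluator are all assumptions chosen so the example is self-contained. Only the log-line wording is taken from the page.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_eval(workspace: Path, test_inputs: list, labels: list) -> dict:
    """Two-phase evaluation: run the predict script, then score its output."""
    inputs_path = workspace / "test_inputs.json"   # hypothetical file name
    preds_path = workspace / "predictions.json"    # hypothetical file name
    inputs_path.write_text(json.dumps(test_inputs))

    # Phase 1: run the agent's predict script in a separate process.
    script = workspace / "predict.py"
    print(f"[eval_harness] Phase1: Running predict script: {script}")
    subprocess.run(
        [sys.executable, str(script), str(inputs_path), str(preds_path)],
        check=True,
        timeout=60,
    )

    # Phase 2: the evaluator compares predictions with held-out labels.
    print("[eval_harness] Phase2: Running evaluator")
    preds = json.loads(preds_path.read_text())
    if len(preds) != len(labels):
        raise ValueError("prediction count does not match label count")
    correct = sum(p == y for p, y in zip(preds, labels))
    result = {"accuracy": correct / len(labels)}
    print(f"[eval_harness] Result: {result}")
    return result


# A toy predict.py (hypothetical): predicts the parity of each integer input.
TOY_PREDICT = """
import json, sys
with open(sys.argv[1]) as f:
    inputs = json.load(f)
with open(sys.argv[2], "w") as f:
    json.dump([x % 2 for x in inputs], f)
"""

with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    (ws / "predict.py").write_text(TOY_PREDICT)
    result = run_eval(ws, test_inputs=[1, 2, 3, 4], labels=[1, 0, 1, 1])
```

Keeping the labels on the evaluator side, and handing the predict script only the inputs, mirrors the separation between the Agent Sandbox and the Eval Sandbox described above.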
02 · Coverage

29 real-world ML tasks across five research domains. Each task ships with a description, a primary metric, a compute budget, and an automated evaluator.

Domain composition donut: 29 tasks across 5 domains (Computer Vision, AI for Science, Robotics, NLP, Audio/Speech).
One representative task per domain (GUI grounding, humanoid control, weather forecasting, speech enhancement, and code generation), each shown with its task, data instance, agent work, and evaluation.
03 · What we learned

Headline takeaways from agent–task episodes across frontier models.

Current frontier agents are useful research assistants, not yet reliable autonomous researchers.
  1. 01

    Top agent still falls well short of 100 / 100.

    Even the strongest agent leaves points still to be earned and is far from doing research on its own.

  2. 02

    Running the loop is easy; improving within it is hard.

    8 of 11 agents reliably read the task, write code, run experiments, and read the results. The gap appears after that: even the strongest agent recovers less than a quarter of the remaining room to improve after its first valid result.

  3. 03

    Most agents barely improve after their first try.

    Across 301 valid agent–task attempts, the median improvement from first to best is only 3 percentage points of normalized achievement. 30% don't improve at all; 69% improve by less than 10 percentage points. The problem isn't that agents never see whether an idea worked — they fail to convert that signal into a better next decision.

  4. 04

    More iterations don't help.

    Whether an episode ends from the agent's own done signal, the 30-iteration cap, or wall-clock exhaustion, average improvement plateaus at the same ceiling. More tries alone don't unlock better research — high-gain iteration isn't yet the default behavior.

  5. 05

    Agents keep tweaking the same thing.

    In ~30% of episodes, when a result is poor the agent keeps editing the same layer of the solution — the hypothesis, the data pipeline, the modeling assumption, or the evaluation contract — instead of revising it. Productive trajectories are hypothesis-driven; the rest are continuation-driven. The paper calls this wrong-layer fixation.

04 · Citation

Paper in preparation (2026). The BibTeX entry will be updated on release.

@misc{farbench2026,
  title  = {FARBench: How Far Are LLM Agents from Autonomous Research?},
  author = {Anonymous Authors},
  year   = {2026},
  note   = {Under review. Code \& data: https://anonymous.4open.science/r/FARBench/}
}
FARBench — a benchmark for autonomous ML-research agents.
Anonymous submission · 2026 · Code & Data
Final Score is the mean of domain scores. Each domain denominator is fixed to all active tasks in that domain. Task Score is clipped achievement from task.score, scaled 0-100. A displayed 0.0 can be a valid scored result; no-score cells also count as 0 in aggregate scores. Raw Metrics show the original task metrics. Arrows in headers show whether higher or lower is better.

Metric Definitions

Definitions for the leaderboard agent scores above and for the capability diagnostics used in the analysis below.

Leaderboard Agent Scores
Metric: Final Score
Formula: \( \displaystyle \mathrm{FARBenchScore}_a = \frac{1}{|D|}\sum_{d\in D}\mathrm{DomainScore}_{a,d} \)
Definition: Primary leaderboard ranking in the Domain Scores view; summarizes each agent across all five domains with equal domain weight.

Metric: Domain Score
Formula: \( \displaystyle \mathrm{DomainScore}_{a,d}=100\cdot\frac{1}{|T_d|}\sum_{t\in T_d} A_{a,t} \)
Definition: Average task-normalized achievement within domain \( d \). These values are shown as the per-domain columns in the Domain Scores view and show which research areas drive or limit the final score.

Metric: Task Score
Formula: \( \displaystyle \mathrm{TaskScore}_{a,t}=100\cdot A_{a,t} \), where \( \displaystyle A_{a,t}=\operatorname{clip}\!\left(\frac{s_{a,t}-s^{\mathrm{floor}}_t}{s^{\mathrm{target}}_t-s^{\mathrm{floor}}_t},0,1\right) \)
Definition: Cells in the Task Scores view; compares agents on each task after metric orientation, floor/target normalization, and clipping. Valid-zero scores, no-score cells, and missing active model-task cells use \( A_{a,t}=0 \) in leaderboard aggregation, keeping active-task denominators fixed across agents.

Metric: Raw Metric
Formula: \( s^{\mathrm{raw}}_{a,t} \)
Definition: The evaluator's native primary metric before task normalization. Header arrows indicate whether higher or lower is better. This view exposes the original evaluator outputs before conversion into task-normalized achievement.

For lower-is-better native metrics, the raw score is first mapped to an oriented score \( s_{a,t} \) where larger is better, then the clipped achievement formula is applied. The task-specific floor and reference target are fixed benchmark properties, not the best score achieved by evaluated agents. Invalid or unscorable episodes receive \( A_{a,t}=0 \); domain denominators are fixed to all active tasks in that domain.
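
As a worked illustration of the formulas above, the sketch below computes clipped achievement, Domain Score, and Final Score. It is an assumption-laden example rather than the benchmark's scoring code: the page does not say how lower-is-better metrics are oriented, so negation is assumed here, and the task names, floors, and targets are made up for the example.

```python
from typing import Dict, List, Optional


def achievement(raw: Optional[float], floor: float, target: float,
                higher_is_better: bool = True) -> float:
    """Clipped achievement A in [0, 1]; invalid or missing scores map to 0."""
    if raw is None:
        return 0.0
    # Assumed orientation: negate lower-is-better metrics so larger is better.
    sign = 1.0 if higher_is_better else -1.0
    s, s_floor, s_target = sign * raw, sign * floor, sign * target
    a = (s - s_floor) / (s_target - s_floor)
    return min(1.0, max(0.0, a))


def domain_score(task_achievement: Dict[str, Optional[float]],
                 active_tasks: List[str]) -> float:
    """100 * mean achievement over ALL active tasks; missing cells count as 0."""
    total = sum(task_achievement.get(t) or 0.0 for t in active_tasks)
    return 100.0 * total / len(active_tasks)


def final_score(domain_scores: Dict[str, float]) -> float:
    """Equal-weight mean of the domain scores."""
    return sum(domain_scores.values()) / len(domain_scores)


# Hypothetical tasks: an accuracy metric (higher is better) and an error
# metric (lower is better), each halfway between floor and target.
a_acc = achievement(30.0, floor=10.0, target=50.0)
a_err = achievement(1.25, floor=2.0, target=0.5, higher_is_better=False)

# "v2" has no score, but the denominator still counts it.
vision = domain_score({"v1": a_acc}, active_tasks=["v1", "v2"])
speech = domain_score({"s1": a_err}, active_tasks=["s1"])
overall = final_score({"vision": vision, "speech": speech})
```

Because the denominator is fixed to all active tasks, the missing `v2` cell halves the vision score instead of being skipped.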
Agent Capability Diagnostics
Metric: Instruction Following
Formula: \[ \mathrm{IF}_{a,t}= \begin{cases} 1-\tau^{\mathrm{first}}_{a,t}, & \text{if a valid evaluation exists}\\ 0, & \text{otherwise} \end{cases} \] where \( \displaystyle \tau^{\mathrm{first}}_{a,t}=\max\!\left(\frac{i^{\mathrm{first}}_{a,t}}{I_t},\frac{e^{\mathrm{first}}_{a,t}}{H_t}\right) \)
Definition: Measures early valid evaluator feedback: whether the agent quickly produces a runnable, accepted artifact that can guide later iterations. A valid evaluation is an accepted evaluator run that returns the task's scorable primary metric. \( i^{\mathrm{first}} \) and \( e^{\mathrm{first}} \) are the iteration and elapsed time used at that first valid feedback; \( I_t \) and \( H_t \) are the task budgets. A smaller \( \tau^{\mathrm{first}} \) means the agent entered the empirical feedback loop earlier, so IF is higher.

Metric: Code Execution Success
Formula: \( \displaystyle \mathrm{CES}=\frac{N^{\mathrm{ok}}_{\mathrm{exec}}}{N^{\mathrm{run}}_{\mathrm{exec}}} \)
Definition: Measures command stability using recorded shell-command outputs. \( N^{\mathrm{run}}_{\mathrm{exec}} \) counts iterations with a command_output.json record; \( N^{\mathrm{ok}}_{\mathrm{exec}} \) counts those with exit code 0 and no timeout. Episodes with no recorded command output receive zero.

Metric: First Achievement
Formula: \( \displaystyle \mathrm{FA}=f_{a,t} \)
Definition: Measures the quality of the first valid scorable artifact. It separates agents that quickly produce a meaningful baseline from agents whose first runnable result is weak.

Metric: Headroom Gain
Formula: \( \displaystyle \mathrm{HG}=\operatorname{clip}_{[0,1]}\!\left(\frac{b_{a,t}-f_{a,t}}{\max(1-f_{a,t},\epsilon)}\right) \)
Definition: Measures how much of the remaining room to improve is recovered after the first valid result. It rewards adaptation to feedback rather than merely starting from a low baseline.

Metric: Progress Efficiency
Formula: \( \displaystyle \mathrm{PE}=\int_0^1 M_{a,t}(\tau)\,d\tau \)
Definition: Measures the left-continuous area under the best-so-far clipped achievement curve over normalized budget. High PE means the agent reaches good results early and maintains them through the episode.

Metric: Pass at 0.5
Formula: \( \displaystyle \mathrm{Pass@0.5}=\mathbb{I}[b_{a,t}\ge 0.5] \)
Definition: Measures whether the episode reaches at least half of the task-normalized target gap. After aggregation, it reflects breadth of competent cross-task performance.

Here \( \tau \) is normalized budget, \( f_{a,t} \) is first valid normalized achievement, \( b_{a,t} \) is best normalized achievement, and \( M_{a,t}(\tau) \) is the best-so-far achievement curve. Episodes without a valid scorable artifact set \( f_{a,t}=b_{a,t}=0 \). Each metric is evaluated at the agent-task episode level and then aggregated under the same domain-balanced averaging protocol as the FARBench Score. These six diagnostics separate operational control (IF, CES) from research judgment (FA, HG, PE, Pass@0.5); the FARBench Score itself is computed from clipped per-task achievements rather than from a composite of these six diagnostics.
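
The diagnostics above can be computed per episode along the following lines. This is a sketch under stated assumptions, not the paper's implementation: the input layout (a list of `(iteration, elapsed_hours, achievement)` tuples for valid evaluations) is hypothetical, normalized budget for every evaluation is assumed to use the same max-of-ratios rule given for \( \tau^{\mathrm{first}} \), and the value of \( \epsilon \) is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Diagnostics:
    IF: float            # Instruction Following
    FA: float            # First Achievement
    HG: float            # Headroom Gain
    PE: float            # Progress Efficiency
    pass_at_half: bool   # Pass@0.5


def normalized_budget(iteration: int, elapsed_h: float,
                      max_iters: int, max_hours: float) -> float:
    """Assumed tau: the larger of the iteration and wall-clock budget fractions."""
    return min(1.0, max(iteration / max_iters, elapsed_h / max_hours))


def diagnose(valid_evals: List[Tuple[int, float, float]],
             max_iters: int, max_hours: float, eps: float = 1e-8) -> Diagnostics:
    """valid_evals holds (iteration, elapsed_hours, achievement) per valid eval."""
    if not valid_evals:
        # No valid scorable artifact: f = b = 0 and every diagnostic is zero.
        return Diagnostics(0.0, 0.0, 0.0, 0.0, False)
    points = sorted(
        ((normalized_budget(i, e, max_iters, max_hours), a) for i, e, a in valid_evals),
        key=lambda p: p[0],
    )
    tau_first, f = points[0]
    b = max(a for _, a in points)
    hg = min(1.0, max(0.0, (b - f) / max(1.0 - f, eps)))
    # PE: area under the best-so-far step curve from the first eval to tau = 1.
    pe, best = 0.0, 0.0
    for k, (tau, a) in enumerate(points):
        best = max(best, a)
        next_tau = points[k + 1][0] if k + 1 < len(points) else 1.0
        pe += best * (next_tau - tau)
    return Diagnostics(1.0 - tau_first, f, hg, pe, b >= 0.5)


def code_execution_success(records: List[Tuple[int, bool]]) -> float:
    """records holds (exit_code, timed_out) per iteration with command output."""
    if not records:
        return 0.0
    ok = sum(1 for code, timed_out in records if code == 0 and not timed_out)
    return ok / len(records)


# Hypothetical episode with budgets of 30 iterations and 10 hours.
episode = [(3, 1.0, 0.2), (15, 4.0, 0.6), (24, 9.0, 0.5)]
d = diagnose(episode, max_iters=30, max_hours=10.0)
ces = code_execution_success([(0, False), (1, False), (0, True), (0, False)])
```

In this example the first valid evaluation lands at 10% of the budget, and the best result (0.6) recovers half of the headroom left after the first result (0.2). The third evaluation scores lower than the second, so it does not move the best-so-far curve.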

Analysis

Detailed capability and iteration analysis for the trajectory-level agent capability diagnostics.

Capability Diagnostics

Section 4 trajectory metrics and outcome regimes across active FARBench tasks.
Capability diagnostics for FARBench agents. (a) Agents projected onto 6 trajectory-level diagnostics, split into operational metrics (IF, CES) and research-quality metrics (FA, HG, PE, Pass@0.5). Productive marks episodes that both produced a scorable artifact and improved beyond their starting point.

Iteration Analysis

Per episode first/best achievement, valid evaluations consumed, and headroom gain.
Iteration analysis at the per-(agent, task) episode level. (a) First achievement vs. best achievement, with the dashed diagonal marking no improvement. (b) Number of valid evaluations consumed vs. headroom gain; diamonds report bin means and the red dashed line marks the soft ceiling at HG ≈ 0.30, which the bin curve does not cross under any of the three episode-termination conditions (the agent's own done signal, the 30-iteration cap, or wall-clock exhaustion).

Five recurring regimes

Across 319 agent-task episodes, these regimes summarize how agents respond to empirical feedback: productive iteration, early plateau, stuck-low, submitted-but-scored-zero, and never-enters-loop.

The figure shows one representative trajectory per regime; the captions on the right explain each. Read them as evidence for the aggregate diagnostics, not as another leaderboard: the productive regime couples evaluator signal to a new hypothesis, while the other four exhibit wrong-layer fixation, where the agent keeps editing the same layer despite negative feedback.
Five recurring regimes: productive iteration, early plateau, stuck-low, submitted but scored zero, and never enters the loop.
x-axis = iteration · y-axis = task-normalized achievement. Blue = valid eval, orange step = best-so-far, gray ✗ = no valid score.