FARBench: How Far Are LLM Agents from Autonomous Research?

FARBench tests whether LLM agents can move beyond assistance and make autonomous progress, running the full research loop from an empty workspace across 29 ML tasks in 5 domains.

01 · How it works

At each step the agent reads an observation (task spec, files, code, command output, metrics), writes an action (edit files, run a command, submit for evaluation), and the sandbox returns results that feed the next step — a closed empirical loop under a fixed budget.

Tasks: each task ships a task.yaml (description, metric, budget). Domains shown: Vision, Speech, NLP, Robotics, Science, Forecast.

Observation: task spec (description + hints), files (workspace listing), code (file contents), command (stdout / stderr), error (failure signal), metrics (eval + budget).

Agent: reads the observation and writes an Action with the fields files_to_write, command, submit_eval, and done.

Environment: the Agent Sandbox writes the workspace, runs training commands, and returns logs. The Eval Sandbox is isolated from it: it runs predict.py on test inputs, and an evaluator scores the predictions.
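The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not FARBench's implementation: the four Action field names come from the diagram, but their types, the `Observation` fields, the `ToySandbox` class, `scripted_policy`, and `run_episode` are all assumptions made for the example, and the wall-clock budget is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Action:
    # Field names follow the diagram; the types are assumptions.
    files_to_write: Dict[str, str] = field(default_factory=dict)
    command: Optional[str] = None
    submit_eval: bool = False
    done: bool = False


@dataclass
class Observation:
    # Hypothetical container for what the agent sees at each step.
    task_spec: str
    files: Dict[str, str]
    command_output: str = ""
    metric: Optional[float] = None
    iteration: int = 0


class ToySandbox:
    """Stand-in for the agent and eval sandboxes (illustrative only)."""

    def __init__(self, task_spec: str):
        self.task_spec = task_spec
        self.files: Dict[str, str] = {}

    def step(self, action: Action, iteration: int) -> Observation:
        self.files.update(action.files_to_write)
        output = f"ran: {action.command}" if action.command else ""
        # A real eval sandbox would run predict.py and score its output;
        # here the "metric" is simply the number of files written.
        metric = float(len(self.files)) if action.submit_eval else None
        return Observation(self.task_spec, dict(self.files), output, metric, iteration)


def run_episode(policy: Callable[[Observation], Action],
                sandbox: ToySandbox, max_iters: int = 30) -> Optional[float]:
    """Run the observe/act loop until `done` or the iteration cap."""
    obs = Observation(sandbox.task_spec, {})
    best = None
    for i in range(1, max_iters + 1):
        action = policy(obs)
        if action.done:
            break
        obs = sandbox.step(action, i)
        if obs.metric is not None:
            best = obs.metric if best is None else max(best, obs.metric)
    return best


def scripted_policy(obs: Observation) -> Action:
    # Write a file and submit on the first step, then stop.
    if obs.iteration == 0:
        return Action(files_to_write={"predict.py": "print('hi')"},
                      command="python predict.py", submit_eval=True)
    return Action(done=True)


best_metric = run_episode(scripted_policy, ToySandbox("toy task"))
print(best_metric)
```

In a real episode the policy would be an LLM call and the sandbox would execute commands in isolation; the sketch only shows how observations, actions, and the best-so-far metric relate.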

Agent Workflow Replay

Replay legend: ▸ obs, ▸ think, ▸ act, $ cmd, ⟳ sandbox (running), best-so-far.

Observation (example panels)
YAML: name: ..., description: ...
Directory: workspace/, training_data/
Source: class Model(nn.Module):
Artifacts: best_ckpt.pt, predict.npy
Command: command: $python..., STDOUT: INFO...
Error: CUDA out of memory
Dataset: Training data contains 414075...
Metric: Accuracy=18.4%
Budget: 19/30 iters, 9.48/10 h

Action (agent)
inspect: read task.yaml
execute: python train.py
edit: change backbone to ViT-B
debug: batch_size=16
submit: best_ckpt.pt
Writable scope: /workspace

Environment
Agent Sandbox: training data, training environment
Eval Sandbox: test data, eval environment

EVAL_LOG
[eval_harness] Phase1: Running predict script: /workspace/predict.py
[eval_harness] Phase2: Running evaluator
[eval_harness] Result: {'accuracy': ...}
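
A result line like the one in the log excerpt above could be read back with a small parser. The sketch below assumes the harness emits a `Result:` line holding a Python-style dict literal, which is inferred only from the three log lines shown. The function name `parse_eval_result` and the metric value in the sample log are made up for illustration.

```python
import ast
from typing import Dict, Optional

RESULT_PREFIX = "[eval_harness] Result:"


def parse_eval_result(log_text: str) -> Optional[Dict[str, float]]:
    """Return the metrics dict from the last Result line, or None."""
    result = None
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith(RESULT_PREFIX):
            payload = line[len(RESULT_PREFIX):].strip()
            try:
                parsed = ast.literal_eval(payload)
            except (ValueError, SyntaxError):
                continue  # malformed payload: treat as no valid score
            if isinstance(parsed, dict):
                result = parsed
    return result


# Sample log with a made-up metric value, for illustration only.
sample_log = """\
[eval_harness] Phase1: Running predict script: /workspace/predict.py
[eval_harness] Phase2: Running evaluator
[eval_harness] Result: {'accuracy': 0.5}
"""
metrics = parse_eval_result(sample_log)
print(metrics)
```

Using `ast.literal_eval` rather than `eval` keeps the parser from executing anything the log might contain.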

02 · Coverage

29 real-world ML tasks across five research domains. Each task ships with a description, a primary metric, a compute budget, and an automated evaluator.

Domain composition donut: 29 tasks across 5 domains (Computer Vision, AI for Science, Robotics, NLP, Audio/Speech).

One representative task per domain (GUI grounding, humanoid control, weather forecasting, speech enhancement, and code generation), each shown with its task, a data instance, the agent's work, and the evaluation.

03 · What we learned

Headline takeaways from agent–task episodes across frontier models.

Current frontier agents are useful research assistants — not yet reliable autonomous researchers.

01

Top agent reaches only / 100.

Even the strongest agent () leaves points still to be earned, far from doing research on its own.

02

Running the loop is easy; improving within it is hard.

8 of 11 agents reliably read the task, write code, run experiments, and read the results. The gap appears after that: even the strongest agent recovers less than a quarter of the remaining room to improve after its first valid result.

03

Most agents barely improve after their first try.

Across 301 valid agent–task attempts, the median improvement from first to best is only 3 percentage points of normalized achievement. 30% don't improve at all; 69% improve by less than 10 percentage points. The problem isn't that agents never see whether an idea worked; it is that they fail to convert that signal into a better next decision.

04

More iterations don't help.

Whether an episode ends from the agent's own done signal, the 30-iteration cap, or wall-clock exhaustion, average improvement plateaus at the same ceiling. More tries alone don't unlock better research: high-gain iteration isn't yet the default behavior.

05

Agents keep tweaking the same thing.

In ~30% of episodes, when a result is poor the agent keeps editing the same layer of the solution — the hypothesis, the data pipeline, the modeling assumption, or the evaluation contract — instead of revising it. Productive trajectories are hypothesis-driven; the rest are continuation-driven. The paper calls this wrong-layer fixation.

04 · Case Study

Five representative trajectories connect the aggregate findings above to concrete agent-task episodes: one productive loop, and four ways feedback fails to become a better research decision.

Case Study: Five Recurring Regimes. The figure shows one representative trajectory per regime. Read them as evidence for the aggregate diagnostics, not as another leaderboard: the productive regime couples evaluator signal to a new hypothesis, while the other four exhibit wrong-layer fixation, where the agent keeps editing the same layer despite negative feedback.

Five recurring regimes: productive iteration, early plateau, stuck-low, submitted but scored zero, and never enters the loop. x-axis = iteration; y-axis = task-normalized achievement. Blue = valid eval, orange step = best-so-far, gray ✗ = no valid score.

05 · Citation

Paper in preparation (2026). The BibTeX entry will be updated on release.

@misc{farbench2026,
  title  = {FARBench: How Far Are LLM Agents from Autonomous Research?},
  author = {Anonymous Authors},
  year   = {2026},
  note   = {Under review. Code \& data: https://anonymous.4open.science/r/FARBench/}
}

Leaderboard contents

Final Score is the mean of domain scores. Each domain denominator is fixed to all active tasks in that domain. Task Score is clipped achievement from task.score, scaled 0-100. A displayed 0.0 can be a valid scored result; no-score cells also count as 0 in aggregate scores. Raw Metrics show the original task metrics. Arrows in headers show whether higher or lower is better.

Metric Definitions

Definitions for the leaderboard agent scores above and for the capability diagnostics used in the analysis below.

Leaderboard Agent Scores

Metric	Formula	Definition
Final Score	$ \displaystyle \mathrm{FARBenchScore}_a = \frac{1}{\lvert D\rvert}\sum_{d\in D}\mathrm{DomainScore}_{a,d} $	Primary leaderboard ranking in the Domain Scores view; summarizes each agent across all five domains with equal domain weight.
Domain Score	$ \displaystyle \mathrm{DomainScore}_{a,d}=100\cdot\frac{1}{\lvert T_d\rvert}\sum_{t\in T_d} A_{a,t} $	Average task-normalized achievement within domain $ d $. These values are shown as the per-domain columns in the Domain Scores view and show which research areas drive or limit the final score.
Task Score	$ \displaystyle \mathrm{TaskScore}_{a,t}=100\cdot A_{a,t} $ $ \displaystyle A_{a,t}=\operatorname{clip}\!\left(\frac{s_{a,t}-s^{\mathrm{floor}}_t}{s^{\mathrm{target}}_t-s^{\mathrm{floor}}_t},0,1\right) $	Cells in the Task Scores view; compares agents on each task after metric orientation, floor/target normalization, and clipping. Valid-zero scores, no-score cells, and missing active model-task cells use $ A_{a,t}=0 $ in leaderboard aggregation, keeping active-task denominators fixed across agents.
Raw Metric	$ s^{\mathrm{raw}}_{a,t} $	The evaluator's native primary metric before task normalization. Header arrows indicate whether higher or lower is better. This view exposes the original evaluator outputs before conversion into task-normalized achievement.

For lower-is-better native metrics, the raw score is first mapped to an oriented score $ s_{a,t} $ where larger is better, then the clipped achievement formula is applied. The task-specific floor and reference target are fixed benchmark properties, not the best score achieved by evaluated agents. Invalid or unscorable episodes receive $ A_{a,t}=0 $; domain denominators are fixed to all active tasks in that domain.
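
The scoring rules above can be traced with a short Python sketch. It is illustrative only: the source does not say how lower-is-better metrics are oriented, so negation is assumed here, and the function names, domain labels, floors, targets, and raw values are all made up for the example.

```python
from typing import Dict, List, Optional


def clip(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))


def achievement(raw: Optional[float], floor: float, target: float,
                higher_is_better: bool = True) -> float:
    """Clipped task-normalized achievement A in [0, 1]."""
    if raw is None:  # invalid or unscorable episode counts as zero
        return 0.0
    # Assumption: orient lower-is-better metrics by negating them.
    sign = 1.0 if higher_is_better else -1.0
    s, s_floor, s_target = sign * raw, sign * floor, sign * target
    return clip((s - s_floor) / (s_target - s_floor))


def domain_score(achievements: List[float]) -> float:
    """100 * mean achievement over all active tasks in the domain."""
    return 100.0 * sum(achievements) / len(achievements)


def final_score(domain_scores: Dict[str, float]) -> float:
    """Unweighted mean of the domain scores."""
    return sum(domain_scores.values()) / len(domain_scores)


# Made-up tasks: two per domain, with hypothetical floors and targets.
vision = [
    achievement(0.50, floor=0.25, target=0.75),   # halfway from floor to target
    achievement(None, floor=0.25, target=0.75),   # no valid score
]
speech = [
    achievement(0.75, floor=1.00, target=0.50, higher_is_better=False),
    achievement(0.25, floor=1.00, target=0.50, higher_is_better=False),  # beats target, clipped
]
scores = {"vision": domain_score(vision), "speech": domain_score(speech)}
print(scores, final_score(scores))
```

Because the no-score task still sits in the denominator, an agent cannot raise its domain score by skipping hard tasks.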

Agent Capability Diagnostics

Metric	Formula	Definition
Instruction Following	\[ \mathrm{IF}_{a,t}= \begin{cases} 1-\tau^{\mathrm{first}}_{a,t}, & \text{if a valid evaluation exists}\\ 0, & \text{otherwise} \end{cases} \] $ \displaystyle \tau^{\mathrm{first}}_{a,t}=\max\!\left(\frac{i^{\mathrm{first}}_{a,t}}{I_t},\frac{e^{\mathrm{first}}_{a,t}}{H_t}\right) $	Measures early valid evaluator feedback: whether the agent quickly produces a runnable, accepted artifact that can guide later iterations. A valid evaluation is an accepted evaluator run that returns the task's scorable primary metric. $ i^{\mathrm{first}} $ and $ e^{\mathrm{first}} $ are the iteration and elapsed time used at that first valid feedback; $ I_t $ and $ H_t $ are the task budgets. A smaller $ \tau^{\mathrm{first}} $ means the agent entered the empirical feedback loop earlier, so IF is higher.
Code Execution Success	$ \displaystyle \mathrm{CES}=\frac{N^{\mathrm{ok}}_{\mathrm{exec}}}{N^{\mathrm{run}}_{\mathrm{exec}}} $	Measures command stability using recorded shell-command outputs. $ N^{\mathrm{run}}_{\mathrm{exec}} $ counts iterations with a command_output.json record; $ N^{\mathrm{ok}}_{\mathrm{exec}} $ counts those with exit code 0 and no timeout. Episodes with no recorded command output receive zero.
First Achievement	$ \displaystyle \mathrm{FA}=f_{a,t} $	Measures the quality of the first valid scorable artifact. It separates agents that quickly produce a meaningful baseline from agents whose first runnable result is weak.
Headroom Gain	$ \displaystyle \mathrm{HG}=\operatorname{clip}_{[0,1]}\!\left(\frac{b_{a,t}-f_{a,t}}{\max(1-f_{a,t},\epsilon)}\right) $	Measures how much of the remaining room to improve is recovered after the first valid result. It rewards adaptation to feedback rather than merely starting from a low baseline.
Progress Efficiency	$ \displaystyle \mathrm{PE}=\int_0^1 M_{a,t}(\tau)\,d\tau $	Measures the area under the left-continuous best-so-far clipped achievement curve over normalized budget. High PE means the agent reaches good results early and maintains them through the episode.
Pass at 0.5	$ \displaystyle \mathrm{Pass@0.5}=\mathbb{I}[b_{a,t}\ge 0.5] $	Measures whether the episode reaches at least half of the task-normalized target gap. After aggregation, it reflects breadth of competent cross-task performance.

Here $ \tau $ is normalized budget, $ f_{a,t} $ is first valid normalized achievement, $ b_{a,t} $ is best normalized achievement, and $ M_{a,t}(\tau) $ is the best-so-far achievement curve. Episodes without a valid scorable artifact set $ f_{a,t}=b_{a,t}=0 $. Each metric is evaluated at the agent-task episode level and then aggregated under the same domain-balanced averaging protocol as the FARBench Score. These six diagnostics separate operational control (IF, CES) from research judgment (FA, HG, PE, Pass@0.5); the FARBench Score itself is computed from clipped per-task achievements rather than from a composite of these six diagnostics.
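
As a worked illustration of the six diagnostics, the sketch below computes them for one made-up episode. It assumes each valid evaluation is recorded as a (normalized budget, achievement) pair and that commands are recorded as (exit code, timed out) pairs. The value of epsilon, the function names, and all sample numbers are assumptions rather than values from the benchmark.

```python
from typing import List, Optional, Tuple


def clip01(x: float) -> float:
    return max(0.0, min(1.0, x))


def instruction_following(i_first: Optional[int], e_first: Optional[float],
                          iter_budget: int, hour_budget: float) -> float:
    """IF = 1 - tau_first, or 0 when no valid evaluation exists."""
    if i_first is None or e_first is None:
        return 0.0
    return 1.0 - max(i_first / iter_budget, e_first / hour_budget)


def code_execution_success(records: List[Tuple[int, bool]]) -> float:
    """CES = share of recorded commands with exit code 0 and no timeout."""
    if not records:
        return 0.0
    ok = sum(1 for code, timed_out in records if code == 0 and not timed_out)
    return ok / len(records)


def headroom_gain(first: float, best: float, eps: float = 1e-8) -> float:
    """HG = clip((b - f) / max(1 - f, eps), 0, 1); eps is an assumed value."""
    return clip01((best - first) / max(1.0 - first, eps))


def progress_efficiency(evals: List[Tuple[float, float]]) -> float:
    """Area under the best-so-far step curve over normalized budget [0, 1]."""
    area, best, prev_tau = 0.0, 0.0, 0.0
    for tau, a in sorted(evals):
        area += best * (tau - prev_tau)    # curve holds the old best until tau
        best, prev_tau = max(best, a), tau
    return area + best * (1.0 - prev_tau)  # hold the final best to the end


def pass_at_half(best: float) -> float:
    return 1.0 if best >= 0.5 else 0.0


# Made-up episode: valid evaluations at 25%, 50%, and 75% of the budget.
evals = [(0.25, 0.5), (0.5, 0.25), (0.75, 0.75)]
first = evals[0][1]                  # First Achievement (FA)
best = max(a for _, a in evals)

if_score = instruction_following(i_first=6, e_first=2.5, iter_budget=30, hour_budget=10.0)
ces = code_execution_success([(0, False), (1, False), (0, True), (0, False)])
hg = headroom_gain(first, best)
pe = progress_efficiency(evals)
print(if_score, ces, first, hg, pe, pass_at_half(best))
```

The second evaluation scores lower than the first, so it leaves the best-so-far curve flat: PE and HG depend only on improvements, never on regressions.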

Analysis

Detailed capability and iteration analysis for the trajectory-level agent capability diagnostics.

Capability Diagnostics

Section 4 trajectory metrics and outcome regimes across active FARBench tasks.

Capability diagnostics for FARBench agents. (a) Agents projected onto 6 trajectory-level diagnostics, split into operational metrics (IF, CES) and research-quality metrics (FA, HG, PE, Pass@0.5). Productive marks episodes that both produced a scorable artifact and improved beyond their starting point.

Iteration Analysis

Per episode first/best achievement, valid evaluations consumed, and headroom gain.

Iteration analysis at the per-(agent, task) episode level. (a) First achievement vs. best achievement, with the dashed diagonal marking no improvement. (b) Number of valid evaluations consumed vs. headroom gain; diamonds report bin means and the red dashed line marks the soft ceiling at HG ≈ 0.30, which the bin curve does not cross under any of the three episode-termination conditions discussed below.
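
The bin means in panel (b) could be computed along the following lines. This is a sketch under assumptions: the bin width, the function name `bin_means`, and the sample points are hypothetical, and the page does not state how the bins were actually chosen.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def bin_means(points: List[Tuple[int, float]],
              bin_width: int = 5) -> Dict[Tuple[int, int], float]:
    """Mean headroom gain per bin of valid-evaluation counts."""
    bins = defaultdict(list)
    for n_evals, hg in points:
        lo = (n_evals // bin_width) * bin_width
        bins[(lo, lo + bin_width - 1)].append(hg)
    return {b: mean(values) for b, values in sorted(bins.items())}


# Made-up (valid evaluations consumed, headroom gain) pairs.
points = [(1, 0.0), (3, 0.25), (7, 0.5), (8, 0.0), (12, 0.25)]
means = bin_means(points)
for (lo, hi), value in means.items():
    print(f"{lo}-{hi} evals: mean HG = {value:.3f}")
```

Comparing each bin mean against the HG ≈ 0.30 line is then a one-line check per bin.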
