RealDataAgentBench

Open-source LLM evaluation framework — 12 frontier models, 1,180+ runs, 39 data science tasks

RealDataAgentBench (RDAB) evaluates LLM data science agents across four dimensions, not just correctness. The core insight: most frontier models reach the right answer; RDAB also checks whether they got there the right way.

Evaluation Dimensions

Dimension              Weight    What it measures
Correctness            40–50%    Ground-truth alignment
Code Quality           15–20%    Vectorization, naming conventions
Efficiency             15%       Token and step budgets
Statistical Validity   15–30%    Uncertainty quantification, appropriate methods
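
How the per-dimension scores roll up is easiest to see in code. Below is a minimal sketch, assuming a weighted-average aggregation with illustrative mid-range weights; the names (DIMENSION_WEIGHTS, composite_score) are hypothetical, not RDAB's actual API, and the real weights vary by task category.

```python
# Minimal sketch: combine per-dimension scores (each in [0, 1]) into one
# composite task score. Weights are illustrative mid-points of the ranges
# above, not the benchmark's per-category values.
DIMENSION_WEIGHTS = {
    "correctness": 0.45,            # 40-50%
    "code_quality": 0.175,          # 15-20%
    "efficiency": 0.15,             # 15%
    "statistical_validity": 0.225,  # 15-30%
}

def composite_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores."""
    total_weight = sum(DIMENSION_WEIGHTS[d] for d in dimension_scores)
    return sum(
        DIMENSION_WEIGHTS[d] * s for d, s in dimension_scores.items()
    ) / total_weight

# A run that is correct but statistically sloppy still loses points.
print(composite_score({
    "correctness": 0.95,
    "code_quality": 0.80,
    "efficiency": 0.70,
    "statistical_validity": 0.40,
}))  # ~0.76
```

Keeping the dimensions separate is what lets a model top one dimension while trailing on another, as the findings below show.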

Key Findings

  • Claude leads on statistical validity (Sonnet 4.6: 0.851); GPT-4.1-mini dominates correctness (0.937). The two dimensions correlate at only r = 0.43 (see the correlation sketch after this list).
  • GPT-4.1 delivers 97% of GPT-5’s score at 1/15th the cost ($0.038 vs. $0.596 per task).
  • Free Llama 3.3-70B (0.798) beats GPT-5 (0.780).
  • Statistical validity varies by category: inference averages 0.897, feature engineering drops to 0.520.
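
The r = 0.43 figure above is a Pearson correlation between correctness and statistical-validity scores. Here is a minimal sketch of how such a number is computed with scipy.stats.pearsonr; the arrays are placeholder values for illustration, not the benchmark's actual per-model results.

```python
# Sketch: correlation between two evaluation dimensions across models.
# Placeholder values only -- substitute the per-model mean scores for
# correctness and statistical validity from the RDAB results.
import numpy as np
from scipy.stats import pearsonr

correctness = np.array([0.94, 0.91, 0.88, 0.93, 0.85, 0.90])
statistical_validity = np.array([0.70, 0.85, 0.62, 0.80, 0.55, 0.68])

r, p_value = pearsonr(correctness, statistical_validity)
print(f"r = {r:.2f}, p = {p_value:.3f}")
```

A low correlation is the point: ranking models on correctness alone hides large differences in statistical practice.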

Benchmark Structure

39 tasks · 5 categories · 6 real-world datasets · 1,180+ runs · 12 models · 95% confidence intervals
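
All reported scores carry 95% confidence intervals over the 1,180+ runs. A common way to produce these is a nonparametric percentile bootstrap over per-run scores; the sketch below shows that idea with hypothetical names and is not necessarily RDAB's exact procedure.

```python
# Sketch: percentile-bootstrap 95% CI for a model's mean composite score.
import numpy as np

def bootstrap_ci(run_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Return (mean, (lower, upper)) for the mean of `run_scores`."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(run_scores, dtype=float)
    resampled_means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lower, upper = np.percentile(
        resampled_means, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return scores.mean(), (lower, upper)

# Illustrative usage with random placeholder scores, not real results.
placeholder = np.random.default_rng(1).uniform(0.6, 0.95, size=100)
mean, (low, high) = bootstrap_ci(placeholder)
print(f"mean = {mean:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```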

Tech Stack

Python · LangChain · OpenAI API · Anthropic API · HuggingFace · scikit-learn · Pandas

GitHub