# RealDataAgentBench
Open-source LLM evaluation framework — 12 frontier models, 1,180+ runs, 39 data science tasks
RealDataAgentBench (RDAB) evaluates LLM data science agents across four dimensions, not just correctness. The core insight: most LLMs reach the right answer; RDAB checks whether they got there the right way.
## Evaluation Dimensions
| Dimension | Weight | What it measures |
|---|---|---|
| Correctness | 40–50% | Ground-truth alignment |
| Code Quality | 15–20% | Vectorization, naming conventions |
| Efficiency | 15% | Token and step budgets |
| Statistical Validity | 15–30% | Uncertainty quantification, appropriate methods |
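How the four dimensions combine into a single score is not spelled out above, so here is a minimal sketch under stated assumptions: the weight profile, function name, and example values below are illustrative, not RDAB's actual implementation. It shows the key consequence of per-category weights: a run can be fully correct yet score poorly overall if its statistics are sloppy.

```python
# Hypothetical weight profile for an inference-category task, assuming
# it weights statistical validity at the top of its 15-30% range.
# All names and values here are illustrative assumptions.
WEIGHTS_INFERENCE = {
    "correctness": 0.40,
    "code_quality": 0.15,
    "efficiency": 0.15,
    "statistical_validity": 0.30,
}

def composite_score(dimension_scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each in [0, 1]."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[d] * dimension_scores[d] for d in weights)

# A run that gets the right answer but skips uncertainty quantification:
# correctness alone looks great; the composite tells a different story.
run = {
    "correctness": 0.95,
    "code_quality": 0.80,
    "efficiency": 0.90,
    "statistical_validity": 0.50,
}
print(composite_score(run, WEIGHTS_INFERENCE))  # ~0.785
```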
## Key Findings
- Claude leads on statistical validity (Sonnet 4.6: 0.851) while GPT-4.1-mini dominates correctness (0.937); across models, the two dimensions correlate at only r=0.43.
- GPT-4.1 delivers 97% of GPT-5’s score at roughly 1/15th the cost ($0.038 vs. $0.596 per task); both findings are recomputed in the sketch after this list.
- Free Llama 3.3-70B (0.798) beats GPT-5 (0.780).
- Statistical validity varies by category: inference averages 0.897, feature engineering drops to 0.520.
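Two of the findings above reduce to quick arithmetic, recomputed below. The per-model dimension scores are made-up placeholders; only the GPT-5 score (0.780), the 97% figure, and the per-task costs come from the reported results.

```python
import numpy as np

# Cross-dimension correlation (cf. r=0.43 between correctness and
# statistical validity). The per-model scores here are placeholders;
# the point is that high correctness need not imply sound statistics.
correctness   = np.array([0.94, 0.88, 0.91, 0.85, 0.80, 0.78])
stat_validity = np.array([0.70, 0.85, 0.62, 0.83, 0.75, 0.64])
r = np.corrcoef(correctness, stat_validity)[0, 1]
print(f"correctness vs. statistical validity: r = {r:.2f}")

# Cost-efficiency arithmetic behind the GPT-4.1 vs. GPT-5 comparison,
# using the reported overall score and per-task costs.
gpt5_score, gpt5_cost = 0.780, 0.596            # overall score, $/task
gpt41_score, gpt41_cost = 0.97 * gpt5_score, 0.038
print(f"GPT-4.1: {gpt41_score:.3f} at {gpt5_cost / gpt41_cost:.1f}x "
      f"lower cost per task")                    # ~15.7x cheaper
```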
## Benchmark Structure
39 tasks · 5 categories · 6 real-world datasets · 1,180+ runs · 12 models · 95% confidence intervals
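RDAB reports 95% confidence intervals over its 1,180+ runs, though the exact procedure is not described here; a percentile bootstrap over per-run scores, sketched below, is one standard way to produce them. The function name and example scores are assumptions.

```python
import numpy as np

def bootstrap_ci(scores: np.ndarray, n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean score across repeated runs."""
    rng = np.random.default_rng(seed)
    # Resample runs with replacement; record each resample's mean.
    resamples = rng.choice(scores, size=(n_boot, len(scores)), replace=True)
    means = resamples.mean(axis=1)
    return (float(np.quantile(means, alpha / 2)),
            float(np.quantile(means, 1 - alpha / 2)))

# Example: composite scores from repeated runs of one model on one task
# (made-up values; RDAB aggregates 1,180+ such runs).
runs = np.array([0.81, 0.78, 0.85, 0.79, 0.83])
lo, hi = bootstrap_ci(runs)
print(f"mean={runs.mean():.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```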
## Tech Stack
Python · LangChain · OpenAI API · Anthropic API · HuggingFace · scikit-learn · Pandas