# RealDataAgentBench
Open-source LLM evaluation framework — 12 frontier models, 1,180+ runs, 39 data science tasks
RealDataAgentBench (RDAB) evaluates LLM data science agents across four dimensions, not just correctness. The core insight: most LLMs reach the right answer; RDAB checks whether they got there the right way.
## Evaluation Dimensions
| Dimension | Weight | What it measures |
|---|---|---|
| Correctness | 40–50% | Ground-truth alignment |
| Code Quality | 15–20% | Vectorization, naming conventions |
| Efficiency | 15% | Token and step budgets |
| Statistical Validity | 15–30% | Uncertainty quantification, appropriate methods |
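How the four dimensions combine into a single score is not spelled out above, so here is a minimal sketch under stated assumptions: the weight profile, function name, and example values below are illustrative, not RDAB's actual implementation. It shows the key consequence of per-category weights: a run can be fully correct yet score poorly overall if its statistics are sloppy.

```python
# Hypothetical weight profile for an inference-category task, assuming
# it weights statistical validity at the top of its 15-30% range.
# All names and values here are illustrative assumptions.
WEIGHTS_INFERENCE = {
    "correctness": 0.40,
    "code_quality": 0.15,
    "efficiency": 0.15,
    "statistical_validity": 0.30,
}

def composite_score(dimension_scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each in [0, 1]."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[d] * dimension_scores[d] for d in weights)

# A run that gets the right answer but skips uncertainty quantification:
# correctness alone looks great; the composite tells a different story.
run = {
    "correctness": 0.95,
    "code_quality": 0.80,
    "efficiency": 0.90,
    "statistical_validity": 0.50,
}
print(composite_score(run, WEIGHTS_INFERENCE))  # ~0.785
```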
## Key Findings
- Claude leads on statistical validity (Sonnet 4.6: 0.851) while GPT-4.1-mini dominates correctness (0.937); across models, the two dimensions correlate at only r=0.43.
- GPT-4.1 delivers 97% of GPT-5’s score at roughly 1/15th the cost ($0.038 vs. $0.596 per task); both findings are recomputed in the sketch after this list.
- Free Llama 3.3-70B (0.798) beats GPT-5 (0.780).
- Statistical validity varies by category: inference averages 0.897, feature engineering drops to 0.520.
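Two of the findings above reduce to quick arithmetic, recomputed below. The per-model dimension scores are made-up placeholders; only the GPT-5 score (0.780), the 97% figure, and the per-task costs come from the reported results.

```python
import numpy as np

# Cross-dimension correlation (cf. r=0.43 between correctness and
# statistical validity). The per-model scores here are placeholders;
# the point is that high correctness need not imply sound statistics.
correctness   = np.array([0.94, 0.88, 0.91, 0.85, 0.80, 0.78])
stat_validity = np.array([0.70, 0.85, 0.62, 0.83, 0.75, 0.64])
r = np.corrcoef(correctness, stat_validity)[0, 1]
print(f"correctness vs. statistical validity: r = {r:.2f}")

# Cost-efficiency arithmetic behind the GPT-4.1 vs. GPT-5 comparison,
# using the reported overall score and per-task costs.
gpt5_score, gpt5_cost = 0.780, 0.596            # overall score, $/task
gpt41_score, gpt41_cost = 0.97 * gpt5_score, 0.038
print(f"GPT-4.1: {gpt41_score:.3f} at {gpt5_cost / gpt41_cost:.1f}x "
      f"lower cost per task")                    # ~15.7x cheaper
```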
## Benchmark Structure
39 tasks · 5 categories · 6 real-world datasets · 1,180+ runs · 12 models · 95% confidence intervals
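RDAB reports 95% confidence intervals over its 1,180+ runs, though the exact procedure is not described here; a percentile bootstrap over per-run scores, sketched below, is one standard way to produce them. The function name and example scores are assumptions.

```python
import numpy as np

def bootstrap_ci(scores: np.ndarray, n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean score across repeated runs."""
    rng = np.random.default_rng(seed)
    # Resample runs with replacement; record each resample's mean.
    resamples = rng.choice(scores, size=(n_boot, len(scores)), replace=True)
    means = resamples.mean(axis=1)
    return (float(np.quantile(means, alpha / 2)),
            float(np.quantile(means, 1 - alpha / 2)))

# Example: composite scores from repeated runs of one model on one task
# (made-up values; RDAB aggregates 1,180+ such runs).
runs = np.array([0.81, 0.78, 0.85, 0.79, 0.83])
lo, hi = bootstrap_ci(runs)
print(f"mean={runs.mean():.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```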
## Tech Stack
Python · LangChain · OpenAI API · Anthropic API · HuggingFace · scikit-learn · Pandas