RealDataAgentBench

Automated model-selection engine for production LLM deployments — 12 frontier models, 1,412+ runs, real-world tasks

RealDataAgentBench (RDAB) is an automated benchmarking system built to answer a production engineering question: which model should I deploy, and what will it cost?

It runs 1,412+ evaluations across 12 frontier LLMs on real data science tasks — EDA, feature engineering, statistical inference, ML modeling — and produces cost-accuracy trade-off data that directly powers CostGuard’s model-selection and recommendation engine.

What It Produces

| Finding | Production implication |
| --- | --- |
| GPT-4.1 = 97% of GPT-5’s score at 1/15th the cost | Default to GPT-4.1 for most production tasks |
| Free Llama 3.3-70B (0.798) outscores GPT-5 (0.780) | Open-source is viable at scale |
| Statistical validity varies 0.45–0.87 across models | Correctness alone is a misleading selection metric |
| Claude leads statistical validity; GPT-4.1-mini leads correctness | Task-aware routing beats one-model-fits-all |
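To make the cost-accuracy trade-off concrete, here is a minimal Python sketch of one possible selection rule: pick the cheapest model whose score stays within a chosen fraction of the best score. It is not CostGuard's actual logic. The `ModelResult` type, the `cheapest_within` function, and the `min_fraction_of_best` threshold are hypothetical names introduced for illustration. The Llama 3.3-70B and GPT-5 scores are the ones quoted in the findings above; the GPT-4.1 score is derived from the "97% of GPT-5's score" figure, and treating the free Llama tier as zero cost is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str
    score: float          # mean benchmark score, 0 to 1
    relative_cost: float  # cost relative to GPT-5 (GPT-5 = 1.0)


# Llama 3.3-70B and GPT-5 scores are quoted in the findings above.
# The GPT-4.1 score is derived from "97% of GPT-5's score" and its cost from
# "1/15th the cost". Zero cost for the free Llama tier is an assumption.
RESULTS = [
    ModelResult("gpt-5", 0.780, 1.0),
    ModelResult("gpt-4.1", 0.97 * 0.780, 1 / 15),
    ModelResult("llama-3.3-70b", 0.798, 0.0),
]


def cheapest_within(results, min_fraction_of_best):
    """Return the cheapest model scoring at least min_fraction_of_best * best score."""
    best = max(r.score for r in results)
    eligible = [r for r in results if r.score >= min_fraction_of_best * best]
    # Break cost ties in favour of the higher score.
    return min(eligible, key=lambda r: (r.relative_cost, -r.score))


if __name__ == "__main__":
    pick = cheapest_within(RESULTS, min_fraction_of_best=0.95)
    print(pick.name, round(pick.score, 3), pick.relative_cost)
```

Restricting `RESULTS` to paid models and keeping the 0.95 threshold selects GPT-4.1, which matches the first finding. Raising the threshold to 0.99 pushes the choice back to GPT-5, so the threshold is the knob that encodes how much accuracy a deployment is willing to trade for cost.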

Engineering Design

  • Multi-dimensional scoring — correctness, code quality, efficiency, and statistical validity — because a model that gets the right answer via the wrong method fails in production
  • 95% confidence intervals across 1,412+ runs for statistically reliable comparisons
  • Category-level breakdowns across inference, feature engineering, EDA, modeling, and ML engineering — enabling task-aware model routing decisions
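The scoring and confidence-interval ideas above can be sketched as follows. This is an illustration under stated assumptions, not RDAB's implementation: the equal `WEIGHTS`, the `composite_score` and `mean_confidence_interval` helpers, and the `RUNS` data are all hypothetical, and the interval uses a normal approximation because the method RDAB uses is not stated here.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical equal weights over the four scoring dimensions named above.
WEIGHTS = {
    "correctness": 0.25,
    "code_quality": 0.25,
    "efficiency": 0.25,
    "statistical_validity": 0.25,
}


def composite_score(dimensions):
    """Weighted mean of per-dimension scores, each expected in [0, 1]."""
    return sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)


def mean_confidence_interval(scores, confidence=0.95):
    """Normal-approximation confidence interval for the mean of `scores`."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    centre = mean(scores)
    half_width = z * stdev(scores) / len(scores) ** 0.5
    return centre - half_width, centre, centre + half_width


# Made-up runs for a single model, for illustration only (not benchmark data).
RUNS = [
    {"correctness": 0.9, "code_quality": 0.8, "efficiency": 0.7, "statistical_validity": 0.6},
    {"correctness": 1.0, "code_quality": 0.7, "efficiency": 0.8, "statistical_validity": 0.5},
    {"correctness": 0.8, "code_quality": 0.9, "efficiency": 0.6, "statistical_validity": 0.9},
    {"correctness": 0.7, "code_quality": 0.8, "efficiency": 0.9, "statistical_validity": 0.8},
]

if __name__ == "__main__":
    scores = [composite_score(run) for run in RUNS]
    low, centre, high = mean_confidence_interval(scores)
    print(f"mean={centre:.3f}  95% CI=[{low:.3f}, {high:.3f}]")
```

Two models whose intervals overlap heavily should not be ranked against each other on the mean alone. With only a handful of runs per model, a t-distribution or bootstrap interval would be a more conservative choice than the normal approximation used here.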

Benchmark Structure

39 tasks · 5 categories · 6 real-world datasets · 1,412+ runs · 12 models · 95% confidence intervals

Tech Stack

Python · LangChain · OpenAI API · Anthropic API · HuggingFace · scikit-learn · Pandas

GitHub