Announcement_3
Released RealDataAgentBench, an open-source benchmark spanning 12 frontier LLMs and 1,180+ runs that surfaces the gap between correctness and statistical validity in these models.