Manideep's Blog

Writing about LLMs, benchmarking, and production AI systems

Every LLM Has a Superpower and a Blind Spot. I Built a Benchmark Around That Observation

Building a model-selection engine by mapping where each frontier LLM excels in production — and engineering around the gaps.

1 min read · April 24, 2025

2025 · ai llm testing benchmarking

I Prompted 5 Frontier LLMs to 'Report Uncertainty' — Here's What Happened to Their Statistical Validity Scores

Engineering uncertainty-aware prompting patterns for production LLM agents — and what actually happens to reliability scores when you do.

1 min read · April 18, 2025

2025 · ai llm benchmark rag

I Ran 163 Benchmarks Across 10 LLMs So You Don't Have To. Here's What I Found

Systematic evaluation across 163 tasks and 10 LLMs — practical model selection guidance for production deployments with real cost data.

1 min read · April 15, 2025

2025 · ai llm performance tooling

I Built a Benchmark That Proves Most LLM Agents Are Statistically Blind — And Why That Costs Companies Real Money

How a production evaluation system surfaced a critical LLM agent reliability failure — and what it costs when you deploy blind.

1 min read · April 11, 2025

2025 · llm ai machinelearning

Everyone Is Calling It Prompt Engineering. They're Already Behind.

Why context engineering is replacing prompt engineering in production AI systems — and what to build instead.

1 min read · April 10, 2025

2025 · ai llm