Skills
A complete inventory of tools, frameworks, and techniques drawn from production projects, open-source work, and professional experience.
Agent Engineering & Orchestration
Skills developed through Tether (durable execution for LLM agents), EnterpriseAgentEval (LangGraph-based multi-step agent pipelines), and client workflow-automation deployments as a Forward Deployed AI Engineer at SBL.
- Durable agent execution — automatic checkpoint/resume, idempotent replay, cross-provider failover (OpenAI ↔ Anthropic)
- LangGraph — multi-step agent workflows, state graphs, conditional edges, human-in-the-loop hooks
- Multi-channel outbound and lead-enrichment automation wired into CRMs and operational systems
- MCP (Model Context Protocol) — tooling integration for agent systems
- Drop-in client wrapping — adding reliability and resilience layers without rewriting agent logic
- RAG (Retrieval-Augmented Generation) pipeline design
- LangChain · Pinecone (vector store)
- Tool use, function calling, and structured output extraction
- Multi-agent coordination and agent-to-agent communication patterns
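The checkpoint/resume and idempotent-replay ideas listed above can be illustrated with a minimal sketch. This is not Tether's implementation: the function name `run_with_checkpoints`, the JSON file format, and the demo steps `fetch` and `enrich` are all hypothetical, chosen only to show the pattern of persisting each step's result and skipping completed steps on a rerun.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoints(steps, checkpoint_path):
    """Run named steps in order, skipping any whose result is already saved."""
    path = Path(checkpoint_path)
    # Load results from a previous (possibly interrupted) run, if any.
    done = json.loads(path.read_text()) if path.exists() else {}
    for name, fn in steps:
        if name in done:
            continue  # idempotent replay: a completed step is never re-executed
        done[name] = fn(done)
        # Persist after every step so a crash loses at most the step in flight.
        path.write_text(json.dumps(done))
    return done


# Demo with hypothetical steps: the second step fails once, then succeeds.
calls = []


def fetch(state):
    calls.append("fetch")
    return 2


def enrich(state):
    calls.append("enrich")
    if calls.count("enrich") == 1:
        raise RuntimeError("simulated provider outage")
    return state["fetch"] * 10


steps = [("fetch", fetch), ("enrich", enrich)]
checkpoint = Path(tempfile.mkdtemp()) / "run.json"

try:
    run_with_checkpoints(steps, checkpoint)
except RuntimeError:
    pass  # first attempt dies mid-run; "fetch" is already checkpointed

result = run_with_checkpoints(steps, checkpoint)  # resumes at "enrich"
```

On the second run `fetch` is not called again, which is the property that makes replay safe for steps with side effects or per-call cost.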
LLM Reliability & Production Proxying
Skills developed through CostGuard — a self-hostable reliability layer for LLM-powered agents.
- Real-time LLM response interception and validity filtering
- Per-provider circuit breakers (CLOSED / OPEN / HALF_OPEN state machine)
- Automatic fallback chain routing on rejection
- Exact per-call cost tracking at $0.000001 precision across 12 models and 5 providers
- 6-type alerting engine: validity drops · cost spikes · high failure rates · circuit breaker events · consecutive rejections · custom thresholds
- Slack webhook integration and custom webhook delivery
- Request-ID tracing, rate limiting, structured JSON logging
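As an illustration of the CLOSED / OPEN / HALF_OPEN state machine named above, here is a minimal sketch of a per-provider circuit breaker. The class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are assumptions made for the example and are not taken from CostGuard.

```python
import time


class CircuitBreaker:
    """Three-state breaker: CLOSED (normal), OPEN (blocked), HALF_OPEN (probing)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay OPEN
        self.clock = clock                          # injectable for testing
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def allow_request(self):
        # After the timeout, move from OPEN to HALF_OPEN to allow trial requests.
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.reset_timeout:
            self.state = "HALF_OPEN"
        return self.state != "OPEN"

    def record_success(self):
        # Any success closes the breaker and clears the failure count.
        self.state = "CLOSED"
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        # A failed trial request, or too many failures, (re)opens the breaker.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

A caller would typically check `allow_request()` before each provider call and route to the next provider in the fallback chain when it returns `False`.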
ML Engineering & Data Pipelines
Skills developed through the automated valuation model (AVM) contract work and lead-enrichment pipelines at SBL.
- Full ML lifecycle from data to deployment — cleaned ~38k subject–comp appraisal pairs, engineered 60+ features
- LightGBM regression for property valuation — 5.1% MAPE, 68% of valuations within ±5% on a production-shaped holdout
- Live RentCast API ingestion and a condition/quality scoring service
- Comparable-selection and condition-based triage logic
- Multi-source lead-enrichment pipelines — automated prospecting wired into CRMs and outbound
- LightGBM · scikit-learn · XGBoost · PyTorch · MLflow
- Regression · Random Forest · classification pipelines
- Pandas · NumPy · statistical analysis
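For readers unfamiliar with the two valuation metrics quoted above, the sketch below shows how MAPE and the share of predictions within ±5% are conventionally computed. The function names are hypothetical and the four prices are made-up illustrative values, not AVM data.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def share_within(actual, predicted, tolerance=0.05):
    """Percentage of predictions whose relative error is at most `tolerance`."""
    hits = sum(abs(p - a) / abs(a) <= tolerance for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)


# Illustrative values only: relative errors of 4%, 6%, 0%, and 20%.
actual = [200_000, 400_000, 500_000, 250_000]
predicted = [208_000, 376_000, 500_000, 300_000]

mape_pct = mape(actual, predicted)
within_5_pct = share_within(actual, predicted)
```

The two numbers answer different questions: MAPE is pulled up by a few large misses, while the within-±5% share reports how often a valuation is close enough to use.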
Engineering & Infrastructure
- Python · FastAPI · Streamlit · Docker · Docker Compose
- CI/CD (GitHub Actions)
- AWS: EC2 · ECR · S3 · RDS
- Azure
- SQL · REST APIs · CRM/API integration
- Pytest — 90+ unit tests, integration test suite
- SQLite (persistent state for circuit breakers and alert history)
- Railway · Render · Fly.io · Koyeb (deployment platforms)
Observability & Monitoring
Skills developed through CostGuard (Prometheus + Grafana), EnterpriseAgentEval (MLflow), and AVM model evaluation on production-shaped holdouts.
- Prometheus metrics — request counts, latency histograms, token usage, validity rates
- Grafana dashboards — real-time observability for LLM agent traffic
- MLflow experiment tracking — agent runs, prompts, scores, token counts, latency, cost
- Model evaluation in Tableau — MAPE / within-±5% tracking, error analysis, drift detection
- Holdout evaluation on live-API data — benchmarking the AVM against the data vendor’s own model
- Per-call cost and quality tracking across 12 models and 5 providers (CostGuard)
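One common way to track per-call cost at $0.000001 precision is decimal arithmetic, which avoids binary floating-point rounding. The sketch below assumes that approach; the model name, the prices, and the `call_cost` helper are hypothetical and do not reflect CostGuard's actual pricing table or code.

```python
from decimal import Decimal, ROUND_HALF_UP

MICRO_DOLLAR = Decimal("0.000001")

# Hypothetical prices in USD per 1,000,000 tokens.
PRICES = {
    "example-model": {"input": Decimal("2.50"), "output": Decimal("10.00")},
}


def call_cost(model, input_tokens, output_tokens):
    """Return the cost of one call in USD, rounded to the nearest micro-dollar."""
    price = PRICES[model]
    raw = (price["input"] * input_tokens + price["output"] * output_tokens) / Decimal(1_000_000)
    return raw.quantize(MICRO_DOLLAR, rounding=ROUND_HALF_UP)


cost = call_cost("example-model", input_tokens=1234, output_tokens=567)
```

Summing `Decimal` values across calls keeps running totals exact at the chosen precision, which matters when many small per-call costs are aggregated.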
LLM Evaluation & Benchmarking
Skills developed through RealDataAgentBench (1,412+ evaluation runs across 12 frontier models and 39 tasks) and CostGuard (RDAB-calibrated validity scoring in production).
- LLM evaluation framework design — multi-dimensional scoring (correctness, code quality, efficiency, statistical validity)
- Benchmark construction: task design, ground-truth labeling, 95% confidence intervals
- Statistical validity scoring — uncertainty quantification, p-value detection, failure-mode penalty heuristics
- Correctness vs. statistical-validity gap analysis across frontier models
- Prompt engineering, few-shot evaluation, chain-of-thought benchmarking
- OpenAI API · Anthropic API · Hugging Face Inference API
- Models evaluated: GPT-4o · GPT-4.1 · GPT-4.1-mini · GPT-5 · Claude Sonnet 4.6 · Llama 3.3-70B · DeepSeek-R1 · Gemini 1.5 Pro · and more
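The 95% confidence intervals mentioned above can be computed in several ways, and the source does not say which one RealDataAgentBench uses. As one possibility, the sketch below applies a normal approximation to the mean of per-run scores; the function name and the five scores are illustrative assumptions.

```python
import math
import statistics


def mean_confidence_interval(scores, confidence=0.95):
    """Normal-approximation confidence interval for the mean of `scores`."""
    mean = statistics.fmean(scores)
    # Standard error of the mean from the sample standard deviation.
    std_err = statistics.stdev(scores) / math.sqrt(len(scores))
    # Two-sided critical value, about 1.96 for 95%.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err


# Illustrative per-run scores for one model on one task.
scores = [0.6, 0.7, 0.8, 0.9, 1.0]
low, high = mean_confidence_interval(scores)
```

With only a handful of runs per task, a t-distribution or a bootstrap interval would usually be the more careful choice; the normal approximation is shown here for brevity.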
LLM Fine-tuning
Skills developed through LoRA Fine-tuning of DeepSeek-R1-Distill-Qwen-1.5B on GSM8K mathematical reasoning.
- LoRA (Low-Rank Adaptation) — rank 16, alpha 32, 98.8% parameter reduction (18M / 1.5B trainable)
- PEFT (Parameter-Efficient Fine-Tuning)
- Unsloth — 2× inference throughput on a single consumer GPU
- Hugging Face Transformers · datasets · trl
- GSM8K benchmark evaluation (achieved 58.2% accuracy on a 1.5B model)
- Training on resource-constrained hardware (single consumer GPU)
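The LoRA figures above can be checked with a few lines of arithmetic. The sketch below uses the rounded counts quoted in the list (18M trainable out of 1.5B); the `lora_params` helper and the 1536×1536 layer shape are illustrative assumptions, not a description of which layers were adapted.

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Rounded counts quoted above.
trainable = 18e6
total = 1.5e9
reduction_pct = 100 * (1 - trainable / total)  # share of weights left frozen

# LoRA scales the low-rank update by alpha / rank.
rank, alpha = 16, 32
scaling = alpha / rank

# Hypothetical 1536 x 1536 projection: adapter size versus the full matrix.
adapter = lora_params(1536, 1536, rank)
full = 1536 * 1536
```

The trainable share works out to 1.2%, which is where the 98.8% reduction comes from, and rank 16 with alpha 32 gives a scaling factor of 2.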
Vision AI & Other
Skills developed through AI-Assisted Medical Image Diagnosis.
- LLM-powered vision models for X-ray interpretation (Groq Cloud API)
- Structured findings report generation from radiological images
- Lightweight deployment for resource-constrained environments
- Groq Cloud API · Vision LLMs · Streamlit