Skills

A complete inventory of tools, frameworks, and techniques drawn from production projects, open-source work, and professional experience.


Agent Engineering & Orchestration

Skills developed through Tether (durable execution for LLM agents), EnterpriseAgentEval (LangGraph-based multi-step agent pipelines), and client workflow-automation deployments as a Forward Deployed AI Engineer at SBL.

  • Durable agent execution — automatic checkpoint/resume, idempotent replay, cross-provider failover (OpenAI ↔ Anthropic)
  • LangGraph — multi-step agent workflows, state graphs, conditional edges, human-in-the-loop hooks
  • Multi-channel outbound and lead-enrichment automation wired into CRMs and operational systems
  • MCP (Model Context Protocol) — tooling integration for agent systems
  • Drop-in client wrapping — adding reliability and resilience layers without rewriting agent logic
  • RAG (Retrieval-Augmented Generation) pipeline design
  • LangChain · Pinecone (vector store)
  • Tool use, function calling, and structured output extraction
  • Multi-agent coordination and agent-to-agent communication patterns
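
The checkpoint/resume and idempotent-replay pattern listed above can be illustrated with a minimal sketch. This is not Tether's implementation: the `run_with_checkpoints` function, the step names, and the JSON checkpoint layout are hypothetical and chosen only to show the idea that completed steps are persisted and skipped on a re-run.

```python
import json
import os

def run_with_checkpoints(steps, path):
    """Run (name, fn) steps in order, persisting each result to `path`.

    On a re-run, steps whose results are already stored are skipped, so
    replaying after a crash does not repeat completed work.
    """
    # Load prior progress if a checkpoint file exists.
    done = {}
    if os.path.exists(path):
        with open(path) as f:
            done = json.load(f)

    for name, fn in steps:
        if name in done:
            continue  # completed in an earlier run; do not execute again
        done[name] = fn(done)
        # Write to a temp file and rename it, so a crash mid-write
        # cannot leave a partially written checkpoint behind.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(done, f)
        os.replace(tmp, path)
    return done
```

Each step receives the results gathered so far, so a later step can build on an earlier one. Results must be JSON-serializable in this sketch; a production system would likely use a database and per-step idempotency keys instead.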

LLM Reliability & Production Proxying

Skills developed through CostGuard — a self-hostable reliability layer for LLM-powered agents.

  • Real-time LLM response interception and validity filtering
  • Per-provider circuit breakers (CLOSED / OPEN / HALF_OPEN state machine)
  • Automatic fallback chain routing on rejection
  • Exact per-call cost tracking at $0.000001 precision across 12 models and 5 providers
  • 6-type alerting engine: validity drops · cost spikes · high failure rates · circuit breaker events · consecutive rejections · custom thresholds
  • Slack webhook integration and custom webhook delivery
  • Request-ID tracing, rate limiting, structured JSON logging
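
The CLOSED / OPEN / HALF_OPEN state machine named above can be sketched in a few lines. This is an illustrative sketch, not CostGuard's code: the class name, the `failure_threshold` and `cooldown_s` parameters, and the injected `clock` are assumptions made for the example.

```python
import time

class CircuitBreaker:
    """Three-state breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED or OPEN."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def allow_request(self):
        """Return True if a call to the provider may proceed."""
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "HALF_OPEN"  # cooldown elapsed: allow probe traffic
                return True
            return False
        return True

    def record_success(self):
        # Any success closes the breaker and resets the failure count.
        self.state = "CLOSED"
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        # A failed probe, or too many consecutive failures, trips the breaker.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

Keeping one breaker instance per provider is what makes fallback routing possible: a caller can skip any provider whose `allow_request()` returns False and try the next one in the chain.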

ML Engineering & Data Pipelines

Skills developed through the automated valuation model (AVM) contract work and lead-enrichment pipelines at SBL.

  • Full ML lifecycle from data to deployment — cleaned ~38k subject–comp appraisal pairs, engineered 60+ features
  • LightGBM regression for property valuation — 5.1% MAPE, 68% of valuations within ±5% on a production-shaped holdout
  • Live RentCast API ingestion and a condition/quality scoring service
  • Comparable-selection and condition-based triage logic
  • Multi-source lead-enrichment pipelines — automated prospecting wired into CRMs and outbound
  • LightGBM · scikit-learn · XGBoost · PyTorch · MLflow
  • Regression · Random Forest · classification pipelines
  • Pandas · NumPy · statistical analysis
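
For reference, the two headline metrics quoted above (MAPE and the share of valuations within ±5%) can be computed as in the sketch below. The function names and the sample prices are made up for illustration and are not taken from the AVM project.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

def share_within(actual, predicted, tolerance=0.05):
    """Percentage of predictions within +/- tolerance of the actual value."""
    hits = sum(abs(p - a) <= tolerance * abs(a) for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)

# Made-up sale prices and model estimates, for illustration only.
actual = [200_000, 400_000, 500_000, 250_000]
predicted = [208_000, 384_000, 500_000, 300_000]

print(f"MAPE: {mape(actual, predicted):.1f}%")
print(f"Within 5%: {share_within(actual, predicted):.0f}%")
```

The two numbers answer different questions: MAPE is pulled up by a few large misses, while the within-±5% share shows how often an estimate is close enough to use.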

Engineering & Infrastructure

  • Python · FastAPI · Streamlit · Docker · Docker Compose
  • CI/CD (GitHub Actions)
  • AWS: EC2 · ECR · S3 · RDS
  • Azure
  • SQL · REST APIs · CRM/API integration
  • Pytest — 90+ unit tests, integration test suite
  • SQLite (persistent state for circuit breakers and alert history)
  • Railway · Render · Fly.io · Koyeb (deployment platforms)

Observability & Monitoring

Skills developed through CostGuard (Prometheus + Grafana), EnterpriseAgentEval (MLflow), and AVM model evaluation on production-shaped holdouts.

  • Prometheus metrics — request counts, latency histograms, token usage, validity rates
  • Grafana dashboards — real-time observability for LLM agent traffic
  • MLflow experiment tracking — agent runs, prompts, scores, token counts, latency, cost
  • Model evaluation in Tableau — MAPE / within-±5% tracking, error analysis, drift detection
  • Holdout evaluation on live-API data — benchmarking the AVM against the data vendor’s own model
  • Per-call cost and quality tracking across 12 models and 5 providers (CostGuard)
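
One way to keep per-call cost figures exact to the micro-dollar is to do the arithmetic in `Decimal` rather than floating point. The sketch below assumes hypothetical model names and per-million-token prices; it is not CostGuard's pricing table or code.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical prices in USD per one million tokens: (input, output).
PRICES = {
    "model-a": (Decimal("2.50"), Decimal("10.00")),
    "model-b": (Decimal("0.15"), Decimal("0.60")),
}
MICRO = Decimal("0.000001")  # one micro-dollar

def call_cost(model, input_tokens, output_tokens):
    """Cost of one call in USD, rounded to the nearest micro-dollar."""
    price_in, price_out = PRICES[model]
    cost = (price_in * input_tokens + price_out * output_tokens) / Decimal(1_000_000)
    return cost.quantize(MICRO, rounding=ROUND_HALF_UP)
```

Summing `Decimal` values across many calls avoids the drift that accumulates when very small float amounts are added together.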

LLM Evaluation & Benchmarking

Skills developed through RealDataAgentBench (1,412+ evaluation runs across 12 frontier models and 39 tasks) and CostGuard (RDAB-calibrated validity scoring in production).

  • LLM evaluation framework design — multi-dimensional scoring (correctness, code quality, efficiency, statistical validity)
  • Benchmark construction: task design, ground-truth labeling, 95% confidence intervals
  • Statistical validity scoring — uncertainty quantification, p-value detection, failure-mode penalty heuristics
  • Correctness vs. statistical-validity gap analysis across frontier models
  • Prompt engineering, few-shot evaluation, chain-of-thought benchmarking
  • OpenAI API · Anthropic API · Hugging Face Inference API
  • Models evaluated: GPT-4o · GPT-4.1 · GPT-4.1-mini · GPT-5 · Claude Sonnet 4.6 · Llama 3.3-70B · DeepSeek-R1 · Gemini 1.5 Pro · and more
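
As a rough illustration of the 95% confidence intervals mentioned above, the sketch below uses a normal approximation around the mean score of repeated runs. The function name is hypothetical, and the method is an assumption: the benchmark itself may use a different interval, such as a bootstrap or a t-based one.

```python
import math
import statistics

def mean_ci95(scores):
    """Mean and normal-approximation 95% confidence interval for run scores."""
    n = len(scores)
    mean = statistics.fmean(scores)
    # Standard error uses the sample standard deviation (n - 1 denominator).
    sem = statistics.stdev(scores) / math.sqrt(n)
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width
```

With only a handful of runs per task, the normal approximation understates the uncertainty, so a t-distribution or bootstrap interval would be the more cautious choice.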

LLM Fine-tuning

Skills developed through LoRA fine-tuning of DeepSeek-R1-Distill-Qwen-1.5B on GSM8K mathematical reasoning.

  • LoRA (Low-Rank Adaptation): rank 16, alpha 32, 98.8% fewer trainable parameters (18M trainable of 1.5B total)
  • PEFT (Parameter-Efficient Fine-Tuning)
  • Unsloth — 2× inference throughput on a single consumer GPU
  • Hugging Face Transformers · datasets · trl
  • GSM8K benchmark evaluation (achieved 58.2% accuracy on a 1.5B model)
  • Training on resource-constrained hardware (single consumer GPU)
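
The LoRA figures listed above follow from simple arithmetic, worked through below. The helper `lora_param_count` is a hypothetical name; it relies only on the standard LoRA formulation, in which a frozen weight matrix gains two small trainable factors and the update is scaled by alpha / rank.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns B (d_out x rank) and A (rank x d_in) and leaves W frozen.
    """
    return rank * (d_in + d_out)

rank, alpha = 16, 32
scaling = alpha / rank  # the low-rank update is scaled by alpha / rank

# Figures quoted above: 18M trainable parameters out of 1.5B total.
trainable, total = 18_000_000, 1_500_000_000
trainable_pct = 100.0 * trainable / total
reduction_pct = 100.0 - trainable_pct

print(f"scaling = {scaling}, trainable = {trainable_pct:.1f}%, reduction = {reduction_pct:.1f}%")
```

So 18M of 1.5B is 1.2% of the weights, which is where the 98.8% reduction comes from.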

Vision AI & Other

Skills developed through AI-Assisted Medical Image Diagnosis.

  • LLM-powered vision models for X-ray interpretation (Groq Cloud API)
  • Structured findings report generation from radiological images
  • Lightweight deployment for resource-constrained environments
  • Groq Cloud API · Vision LLMs · Streamlit