GenAI evaluation framework
Open-source-style offline evaluation harness for RAG and LLM systems. Tracks answer relevance, faithfulness, context precision, and hallucination rate across prompt and retrieval config changes.
Tech stack
Python, YAML (golden set definitions), MLflow, GitHub Actions
Problem
AI teams building RAG systems had no systematic way to measure whether a prompt change, a different chunk size, or a new retrieval model actually improved quality — or just felt better in ad-hoc testing. Regressions were caught in production, not in CI.
What I built
A Python evaluation library and CLI with three components:
Golden set management: YAML-based QA pair definitions with expected answers, source chunk IDs, and metadata tags, versioned in Git alongside the application code (example schema sketched below).
Metric suite: RAGAS metrics (answer relevance, faithfulness, context precision, context recall) plus a custom hallucination detector using an LLM judge with structured output; all metrics are logged to MLflow as nested runs (logging pattern sketched below).
CI integration: a GitHub Actions workflow runs the full evaluation on every PR that touches prompt templates, retrieval config, or embedding models; the PR is blocked if faithfulness drops below the configured threshold (default: 97%). The gate logic is sketched below.
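The following is a minimal sketch of what a golden set entry and its loader might look like. The field names (id, question, expected_answer, source_chunk_ids, tags) and the GoldenExample dataclass are assumptions for illustration, not the harness's actual schema.

```python
# Illustrative only: field names and example values are assumptions,
# showing the general shape of a YAML golden-set entry and a loader for it.
from dataclasses import dataclass, field

import yaml

EXAMPLE_GOLDEN_SET = """
- id: refund-policy-001                # hypothetical entry
  question: "How long do customers have to request a refund?"
  expected_answer: "Customers can request a refund within 30 days of purchase."
  source_chunk_ids: [kb-042, kb-043]   # chunks the answer must be grounded in
  tags: [billing, policy]
"""


@dataclass
class GoldenExample:
    id: str
    question: str
    expected_answer: str
    source_chunk_ids: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)


def load_golden_set(raw_yaml: str) -> list[GoldenExample]:
    """Parse a YAML golden set into typed examples."""
    return [GoldenExample(**entry) for entry in yaml.safe_load(raw_yaml)]


examples = load_golden_set(EXAMPLE_GOLDEN_SET)
```

Keeping the golden set as plain YAML in the repo means a reviewer can diff it like any other source file when an expected answer changes.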
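For the metric suite, this is a sketch of the nested-run logging pattern in MLflow, assuming the per-example scores have already been computed; the log_evaluation helper, run names, and metric values are hypothetical, not the project's real API.

```python
# A minimal sketch, assuming metric scores are precomputed elsewhere.
# One parent run per evaluation, one nested child run per metric.
import mlflow


def log_evaluation(config_name: str, scores: dict[str, float]) -> None:
    """Log one evaluation run, with each metric recorded in a nested child run."""
    with mlflow.start_run(run_name=f"eval-{config_name}"):
        mlflow.log_param("config", config_name)
        for metric_name, value in scores.items():
            # Nested runs keep per-metric parameters and artifacts separated
            # while still grouping everything under the parent evaluation.
            with mlflow.start_run(run_name=metric_name, nested=True):
                mlflow.log_metric(metric_name, value)


log_evaluation(
    "chunk-512-hybrid-retrieval",  # hypothetical config label
    {
        "answer_relevance": 0.91,
        "faithfulness": 0.97,
        "context_precision": 0.88,
        "hallucination_rate": 0.02,
    },
)
```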
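The CI gate could work roughly like the sketch below: a small CLI exits non-zero when faithfulness falls below the threshold, which fails the GitHub Actions step and therefore blocks the PR. The flag names and this script are assumptions, not the project's actual CLI.

```python
# Hypothetical PR gate: fail the CI step when faithfulness regresses.
import argparse
import sys


def main() -> int:
    parser = argparse.ArgumentParser(
        description="Block PRs when faithfulness drops below the threshold"
    )
    parser.add_argument("--faithfulness", type=float, required=True,
                        help="Aggregate faithfulness score from the evaluation run")
    parser.add_argument("--threshold", type=float, default=0.97,
                        help="Minimum acceptable faithfulness")
    args = parser.parse_args()

    if args.faithfulness < args.threshold:
        print(f"FAIL: faithfulness {args.faithfulness:.3f} "
              f"is below threshold {args.threshold:.3f}")
        return 1

    print(f"PASS: faithfulness {args.faithfulness:.3f} "
          f"meets threshold {args.threshold:.3f}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Returning a non-zero exit code is what fails the workflow step, so no extra Actions configuration is needed beyond running the script after the evaluation.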
Results
Adopted across three production RAG systems. Caught four prompt regressions before deployment in the first two months and reduced the hallucination debugging cycle from days to hours.