← Back to Catalog
Agent Skill

rag-eval

Replace eyeballing with structured RAG sweeps that track cost and quality.

Iterate on RAG systems with structured evals instead of eyeballing. Runs cost-aware sweep grids, ranks variants, and returns structured feedback on architecture, stack, and likely issues.

What it does

Replaces the “tweak, squint, swap model, burn credits” loop with a structured eval pipeline. Runs a grid of retrieval variants against a human-labeled gold set, ranks them by a cost-aware score, and returns concrete next experiments. Learns from prior runs so you never re-test a rejected variant.

Key features

  • Cost-aware scoring — Ranks variants by quality x (1 / log(1 + cost)), so cheap-and-good beats expensive-and-slightly-better.
  • Session ingest — Extracts signals from prior Claude Code transcripts or Fathom meetings without pasting 100k tokens of noise.
  • Stack audit — Inspects your repo, vector store, and chunking strategy against an evidence-based checklist before proposing experiments.
  • Budget guardrails — Hard dollar cap per sweep. Halts mid-run if budget is reached. Always confirms cost estimate before executing.
  • Iterative memory — Reads history.jsonl across runs to surface patterns and avoid re-testing dead ends.
  • Gold set bootstrapping — If you don’t have labeled data, generates a starter set from your corpus for human review.

When to use

When you’re tuning a RAG pipeline and want structured experiments instead of gut-feel adjustments. Also when eval costs are climbing with no clear signal of improvement.