Evaluation & LLMOps

You cannot ship an LLM application the way you ship ordinary code, because its output is non-deterministic -- the same input can produce different output. Evaluation answers "is it good enough, often enough?" and LLMOps is the discipline of operating the whole system in production. This page is the deep dive behind the LLMOps overview on the Tooling page.

LLM evaluation

LLM evaluation ("eval") measures whether an LLM application actually does its job -- the equivalent of a test suite for a system you cannot test with assertEquals. It replaces "did it return exactly X?" with "is the output good enough by these criteria, often enough?"

An eval is built from three parts:

  1. A dataset -- representative inputs, optionally paired with reference answers. Curate it from real or realistic cases, including edge cases and known failures.
  2. A predict function -- the thing under test: your prompt, chain, or agent.
  3. Scorers -- functions that grade each output. The result is aggregate scores across the dataset, not a single pass/fail.
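
The three parts above can be sketched in a few lines of Python. This is a minimal illustration, not a specific framework's API: the dataset, the canned `predict` stand-in, and the names `run_eval`, `exact_match`, and `case_insensitive_match` are all hypothetical.

```python
from statistics import mean
from typing import Callable

# 1. Dataset: hypothetical inputs paired with reference answers.
DATASET = [
    {"input": "2 + 2", "reference": "4"},
    {"input": "capital of France", "reference": "Paris"},
    {"input": "largest planet", "reference": "Jupiter"},
]

# 2. Predict function: a canned stand-in for the prompt, chain, or agent under test.
def predict(text: str) -> str:
    canned = {"2 + 2": "4", "capital of France": "paris", "largest planet": "Saturn"}
    return canned[text]

# 3. Scorers: each grades one output against its example and returns a float.
def exact_match(output: str, example: dict) -> float:
    return 1.0 if output == example["reference"] else 0.0

def case_insensitive_match(output: str, example: dict) -> float:
    same = output.strip().lower() == example["reference"].strip().lower()
    return 1.0 if same else 0.0

Scorer = Callable[[str, dict], float]

def run_eval(dataset: list[dict], predict_fn: Callable[[str], str],
             scorers: dict[str, Scorer]) -> dict[str, float]:
    """Run predict_fn over the dataset and return the mean score per scorer."""
    outputs = [predict_fn(example["input"]) for example in dataset]
    return {
        name: mean([scorer(out, ex) for out, ex in zip(outputs, dataset)])
        for name, scorer in scorers.items()
    }

scores = run_eval(DATASET, predict, {
    "exact_match": exact_match,
    "case_insensitive_match": case_insensitive_match,
})
print(scores)
```

The result is one aggregate score per scorer (1 of 3 exact matches, 2 of 3 case-insensitive matches here) rather than a single pass/fail.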

Kinds of scorers

Scorer type | Examples | When to use
Deterministic / heuristic | exact match, regex, JSON-schema validity, latency, cost | Objective, cheap checks
LLM-as-judge | a second LLM rates against a rubric (helpfulness, correctness, tone) | Open-ended output with no single right answer
Reference-based | semantic similarity / factual overlap vs a golden answer | When known-good answers exist
RAG-specific | groundedness / faithfulness, context relevance, answer relevance | Did the answer stick to retrieved sources or hallucinate?
Safety | refusal rate, toxicity, PII leakage | Overlaps with guardrails

LLM-as-judge is powerful but must be calibrated against human judgment -- treat the judge as a model that itself needs validation.
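
One common way to calibrate a judge is to have humans label a sample of outputs and measure how far the judge agrees with them beyond chance. The sketch below computes Cohen's kappa by hand. The labels are made up for illustration, and kappa is only one reasonable agreement statistic, chosen here as an assumption rather than prescribed by this page.

```python
from collections import Counter

def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Fraction of items where the judge and the human gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected if both raters labeled independently at their own base rates.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical labels for eight outputs.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]

kappa = cohens_kappa(human_labels, judge_labels)
print(f"kappa = {kappa:.2f}")
```

Raw agreement in this example is 6 of 8, but half of that is expected by chance, so kappa comes out at 0.5. What counts as an acceptable value is a team decision.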

Eval-driven development

Build the eval first, then iterate the application against it -- the LLM analogue of test-driven development. Every prompt tweak, model swap, or retrieval change is judged by whether eval scores improve, not by eyeballing a few outputs. Golden traces (known-good example runs) act as regression tests. This is what separates "I read some outputs and they looked fine" from a defensible quality story.
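
Judging a change by its eval scores can be as simple as comparing a candidate run against a stored baseline. The sketch below assumes higher scores are better; the metric names, the numbers, and the 0.02 tolerance are hypothetical.

```python
def find_regressions(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.02) -> dict[str, float]:
    """Return each metric whose score dropped by more than `tolerance`.

    A metric missing from the candidate run is treated as a score of 0.0,
    so silently dropping a scorer also shows up as a regression.
    """
    regressions = {}
    for metric, base_score in baseline.items():
        drop = base_score - candidate.get(metric, 0.0)
        if drop > tolerance:
            regressions[metric] = drop
    return regressions

# Hypothetical aggregate scores before and after a prompt tweak.
baseline = {"groundedness": 0.91, "helpfulness": 0.84}
candidate = {"groundedness": 0.86, "helpfulness": 0.85}

regressions = find_regressions(baseline, candidate)
print(regressions or "no regressions")
```

A CI job could fail the build whenever the returned dictionary is non-empty. The tolerance exists because small score movements are expected from a non-deterministic system.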

Harness engineering

Harness engineering is building the evaluation and execution scaffolding that turns a non-deterministic model into a testable, observable, comparable system. A harness consists of: eval datasets, a runner, judges, a diff/regression view, trace capture (every model and tool call recorded), and CI integration that blocks deploys when metrics regress.

Two reasons the field coined a new word instead of "tests":

  1. Non-determinism is the default. A unit test asserts equality; a harness asserts statistical properties over many runs, so it must resample and aggregate.
  2. The system under test is dynamic. Prompt, model, tools, retrieval index, and even tokenizer all change. The harness pins them as a versioned bundle and re-evaluates whenever any of them changes.

Put differently: prompt engineering optimizes the input; harness engineering optimizes your ability to know whether the optimization worked. For agents, a single harness over the top-level prompt is too coarse -- you need harnesses at the tool-selection, planner, and final-answer levels, all backed by replayable traces.
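
The two properties listed above (a pinned, versioned bundle and assertions over many runs) can be sketched as follows. Everything here is illustrative: the bundle fields and values are placeholders, and `flaky_model` is a deterministic stand-in that mimics a model giving an off-format answer on 1 call in 10.

```python
import hashlib
import itertools
import json

def bundle_fingerprint(bundle: dict) -> str:
    """Hash the pinned components so any change yields a new version id."""
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def pass_rate(run_once, check, n_runs: int) -> float:
    """Call the system n_runs times and return the fraction of outputs passing check."""
    return sum(1 for _ in range(n_runs) if check(run_once())) / n_runs

# Hypothetical bundle; every value is a placeholder.
bundle = {
    "model": "example-model-v1",
    "system_prompt": "Answer with a number only.",
    "tools": ["calculator"],
    "retrieval_index": "index-2024-01",
}

# Stand-in for a non-deterministic model: 1 call in 10 is off-format.
_outputs = itertools.cycle(["42"] * 9 + ["forty-two"])

def flaky_model() -> str:
    return next(_outputs)

rate = pass_rate(flaky_model, str.isdigit, n_runs=200)
print(bundle_fingerprint(bundle), rate)

# The harness asserts a statistical property, not equality on a single run.
assert rate >= 0.85, "pass rate fell below the agreed threshold"
```

Storing the fingerprint next to each eval result makes runs comparable: two scores are only meaningfully diffed when you know which bundle produced each.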

LLMOps

LLMOps is the operational discipline of running LLM apps and agents in production: the LLM-flavored sibling of MLOps. The central shift is from retraining as the core loop to prompt + retrieval + tool changes as the core loop.

Concern | Classical MLOps | LLMOps
Primary artifact | model weights | prompt + tools + model + RAG index
Versioning unit | model checkpoint | prompt × model × tool spec
Failure mode | distribution drift | hallucination, jailbreak, agent stall
Evaluation | accuracy / AUC vs labels | groundedness, helpfulness, safety (often LLM-as-judge)
Latency | usually static | tail-sensitive, scales with token count
Cost profile | training-heavy, serving-cheap | training-free, serving-expensive per token

LLMOps covers:

  • Model + prompt versioning -- pin model + tokenizer + system prompt as one artifact.
  • Prompt management -- registry, A/B testing, rollback.
  • Evaluation in CI and in production.
  • Cost/latency optimization -- token budgeting, caching, model cascading from small to large.
  • Monitoring -- input/output drift, refusal rate, groundedness, p50/p95 latency, per-tenant cost.
  • Incident response runbooks for non-deterministic failures.
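
As one illustration of model cascading from small to large, the sketch below tries cheaper tiers first and escalates only when a tier reports low confidence. It rests on loud assumptions: each tier is a hypothetical callable returning an answer together with a confidence score in [0, 1], and the 0.8 threshold is arbitrary. How confidence is actually obtained (a verifier, a judge, log-probabilities) is outside this sketch.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tier:
    name: str
    # Hypothetical interface: returns (answer, confidence in [0, 1]).
    generate: Callable[[str], tuple[str, float]]

def cascade(prompt: str, tiers: list[Tier],
            min_confidence: float = 0.8) -> tuple[str, str]:
    """Return (tier_name, answer) from the first tier that is confident enough.

    The last tier is the fallback: its answer is returned regardless of confidence.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    for tier in tiers[:-1]:
        answer, confidence = tier.generate(prompt)
        if confidence >= min_confidence:
            return tier.name, answer
    fallback = tiers[-1]
    answer, _ = fallback.generate(prompt)
    return fallback.name, answer

# Stub models: the small one is only confident on short prompts.
def small_model(prompt: str) -> tuple[str, float]:
    return "short answer", 0.9 if len(prompt.split()) <= 5 else 0.4

def large_model(prompt: str) -> tuple[str, float]:
    return "detailed answer", 0.95

tiers = [Tier("small", small_model), Tier("large", large_model)]
easy = cascade("What is 2+2?", tiers)
hard = cascade("Summarize the trade-offs between the three retrieval strategies", tiers)
print(easy, hard)
```

The design trade-off is that an escalated request pays for both calls, so a cascade only saves money when the cheaper tier handles a large share of traffic.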

Evaluation runs continuously, not once

Eval is not a pre-launch gate. Wire it into the production loop:

  • In CI -- run the eval on every change so a prompt edit cannot silently regress quality before shipping.
  • In production -- sample live traffic and score it continuously to catch quality drift as real inputs diverge from your test set.
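
For the production side, one simple way to sample live traffic is to hash each trace id into a bucket, so the decision to score a trace is stable across services and retries. This is a sketch under assumptions: the `trace-<n>` id format and the 10% sample rate are made up, and the scoring step itself is left out.

```python
import hashlib

def should_score(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of traces for scoring."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes to an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(sample_rate * 2**64)

# Hypothetical trace ids; in practice these come from your tracing system.
trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if should_score(t, sample_rate=0.1)]
print(f"scoring {len(sampled)} of {len(trace_ids)} traces")
```

The sampled traces would then go through the same scorers used in CI, which keeps offline and online numbers comparable.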

The MLOps foundation underneath

When the model is custom rather than a hosted LLM, classical MLOps applies. Its maturity is often described in three levels (Google Cloud's model):

  • Level 0 -- Manual. Notebook-driven, manual handoffs, infrequent releases, minimal monitoring. Failure mode: silent model staleness and training-serving skew.
  • Level 1 -- Pipeline automation. Continuous training triggered by new data, with a feature store, data validation, and metadata/lineage tracking.
  • Level 2 -- CI/CD automation. Automated testing, building, and deployment of pipelines, with a model registry, ML metadata store, and orchestrator closing the loop.

The key MLOps failure modes -- training-serving skew, model staleness, and unmanaged infrastructure debt -- are exactly what monitoring and testing exist to prevent. LLMOps inherits all of them and adds prompt versioning, RAG/embedding pipelines, token economics, and eval harnesses on top.

See also