
Debugging LLM Apps

LLM applications fail in ways traditional software does not: the same input can produce different output, "bugs" look like bad judgment, and the fault may sit in the prompt, retrieval, tools, model tier, or user data rather than in your business logic. This page is a production runbook: classify the symptom, inspect the right layer, and avoid tweaking prompts at random when the problem lies elsewhere. Pair it with Evaluation & LLMOps for prevention and Which Pattern When? for architecture choices.

Step zero: classify the failure

| Symptom class | Likely layers | First checks |
| --- | --- | --- |
| Infrastructure | API, network, quotas | Status codes, rate limits, timeouts, provider status |
| Wrong or hallucinated content | Prompt, RAG, model tier | Retrieved chunks, citations, eval set regression |
| Unparseable / schema errors | Structured output, tools | Validator errors, repair loop logs |
| Agent stuck or looping | Tools, agent instructions | Round count, repeated tool calls, context size |
| Slow or expensive | Routing, context, agent depth | Token counts, model ID, retrieval size |
| Unsafe or policy violation | Safety, injection, tools | User input, retrieved text, tool permissions |
| Intermittent | Nondeterminism, routing, cache | Temperature, fallback tier, stale cache |

Write down which class you are in before changing code. Mixed symptoms are common -- fix infrastructure first, then content quality.
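A first-pass triage can be sketched as a small rule function. This is only an illustration: the trace field names (`status_code`, `timed_out`, `validation_errors`, `rounds`, `max_rounds`, `flagged_unsafe`, `user_reported_wrong`) are hypothetical, and the rules are deliberately coarse. The returned order puts infrastructure first and content quality last, matching the advice above.

```python
def classify(trace: dict) -> list:
    """Return symptom classes for a trace, in the order they should be addressed.

    All field names are hypothetical; adapt them to whatever your tracing tool records.
    """
    classes = []
    status = trace.get("status_code", 200)
    # Rate limits, server errors, and timeouts are infrastructure problems.
    if status == 429 or status >= 500 or trace.get("timed_out", False):
        classes.append("infrastructure")
    # Any validator error points at structured output or tool arguments.
    if trace.get("validation_errors"):
        classes.append("schema")
    # Hitting the round cap suggests a stuck or looping agent.
    if trace.get("rounds", 0) >= trace.get("max_rounds", 10):
        classes.append("agent_loop")
    if trace.get("flagged_unsafe", False):
        classes.append("safety")
    # Content quality is addressed last, once the layers above are ruled out.
    if trace.get("user_reported_wrong", False):
        classes.append("content")
    return classes


mixed = classify({"status_code": 429, "user_reported_wrong": True})
print(mixed)
```

A trace that matches no rule returns an empty list, which is itself a hint to look at the intermittent class (temperature, fallback tier, stale cache).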

What to collect for every incident

Minimum debug bundle (most LLMOps tools capture this):

  • Request ID, timestamp, user/tenant (redacted per privacy)
  • Model ID and parameters (temperature, max tokens)
  • Full prompt messages, or a hash plus a pointer to the stored trace if the content is PII-heavy
  • For RAG: query, retrieved chunk IDs/scores, reranker output
  • For agents: each tool name, input, output, latency per round
  • Token counts and estimated cost
  • Final output and any validation errors

If you cannot reproduce the failure from this bundle, you are guessing.
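One way to keep the bundle consistent is to give it an explicit shape. The sketch below is an assumption about how such a record might look, not the schema of any particular LLMOps tool; every class and field name is illustrative.

```python
from dataclasses import dataclass, field, asdict
import hashlib
import json


@dataclass
class ToolCall:
    name: str
    input: dict
    output: str
    latency_ms: float


@dataclass
class DebugBundle:
    request_id: str
    timestamp: str      # ISO 8601
    tenant: str         # redacted or pseudonymous identifier
    model_id: str
    temperature: float
    max_tokens: int
    messages: list      # full prompt messages
    retrieved: list = field(default_factory=list)    # e.g. [{"chunk_id": ..., "score": ...}]
    tool_calls: list = field(default_factory=list)   # list of ToolCall, one per round
    input_tokens: int = 0
    output_tokens: int = 0
    output: str = ""
    validation_errors: list = field(default_factory=list)

    def prompt_hash(self) -> str:
        """Stable SHA-256 of the prompt, for when the prompt itself is stored elsewhere."""
        payload = json.dumps(self.messages, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def to_log_line(self, include_prompt: bool = True) -> str:
        """Serialize to one JSON line; drop the prompt text for PII-heavy traffic."""
        record = asdict(self)
        record["prompt_hash"] = self.prompt_hash()
        if not include_prompt:
            record["messages"] = []
        return json.dumps(record, sort_keys=True)
```

With `include_prompt=False` the log line carries only the hash, so the trace can be matched to a separately stored prompt without copying user text into general-purpose logs.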

Wrong answers (chat and RAG)

1. Is it hallucination or missing context?

  • No relevant retrieval -- empty or low-score chunks → indexing, chunking, embedding model, or query rewrite
  • Relevant chunks retrieved but ignored -- prompt burying context, context rot, or model tier too weak → reorder prompt, summarize chunks, upgrade tier on hard queries
  • Answer contradicts sources -- groundedness failure → cite-or-refuse instructions, eval faithfulness scorers, reranker
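The three cases above can be told apart mechanically from a saved trace. The following is a rough sketch that assumes each retrieved chunk carries a `chunk_id` and a similarity `score`, and that the answer's citations are available as a list of chunk IDs. The `min_score` threshold of 0.5 is an arbitrary placeholder that depends on your embedding model.

```python
def triage_retrieval(chunks, cited_chunk_ids, min_score=0.5):
    """Classify a wrong answer by what retrieval returned and what the answer cited.

    chunks: list of {"chunk_id": str, "score": float} (hypothetical trace shape)
    cited_chunk_ids: chunk IDs the answer cited
    """
    relevant = [c for c in chunks if c["score"] >= min_score]
    if not relevant:
        # Empty or low-score retrieval: look at indexing, chunking, embeddings, query rewrite.
        return "no_relevant_retrieval"
    relevant_ids = {c["chunk_id"] for c in relevant}
    if not relevant_ids & set(cited_chunk_ids):
        # Good chunks were available but the answer did not use them.
        return "retrieved_but_ignored"
    # Relevant chunks were cited; if the answer is still wrong, test faithfulness next.
    return "check_groundedness"
```

The last branch does not prove the answer is grounded. It only says retrieval and citation look healthy, so the next check is whether the answer contradicts the cited text.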

2. Did retrieval poison the answer?

Prompt injection via documents is common in RAG. Check whether the retrieved text contains instructions aimed at the model. Mitigations: source trust tiers, sanitization, and separate channels for system instructions and document text.
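When reviewing a trace, a crude pattern scan can point you at suspicious chunks quickly. The phrase list below is a small illustrative sample chosen for this sketch; it is a debugging aid for finding candidates, not a defense, since real injections are easy to phrase in ways no fixed list will catch.

```python
import re

# Illustrative phrases only; extend from what you actually see in incidents.
SUSPICIOUS_PHRASES = [
    r"ignore (all |any )?(previous|prior|above) instructions",
    r"disregard (the )?(system|previous) (prompt|instructions)",
    r"you are now",
    r"system prompt",
]
PATTERN = re.compile("|".join(SUSPICIOUS_PHRASES), re.IGNORECASE)


def flag_chunks(chunks):
    """Return IDs of retrieved chunks whose text looks like an instruction to the model.

    chunks: list of {"chunk_id": str, "text": str} (hypothetical trace shape)
    """
    return [c["chunk_id"] for c in chunks if PATTERN.search(c["text"])]
```

A flagged chunk is a reason to read that chunk and its source, not proof of an attack; documentation about prompts will match these patterns too.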

3. Regression or drift?

Compare against a fixed eval set. If a case passed yesterday and fails today, look for a prompt or version change, a model swap, an incomplete embedding reindex, or a corpus update.
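A minimal sketch of that comparison, assuming each eval run is stored as a mapping from case ID to pass/fail and each deployment records its configuration as a flat dictionary (both shapes are assumptions):

```python
def diff_eval_runs(baseline: dict, current: dict):
    """Compare two runs of the same fixed eval set.

    Both arguments map case_id -> bool (passed). A case missing from the
    current run is counted as a regression so that dropped cases are noticed.
    """
    regressions = sorted(k for k in baseline if baseline[k] and not current.get(k, False))
    fixes = sorted(k for k in current if current[k] and not baseline.get(k, False))
    return regressions, fixes


def changed_config(previous: dict, current: dict) -> dict:
    """Return {key: (old, new)} for every setting that differs between two deployments."""
    keys = sorted(set(previous) | set(current))
    return {k: (previous.get(k), current.get(k)) for k in keys if previous.get(k) != current.get(k)}
```

Reading the two results together narrows the search: a regression alongside exactly one changed key (say a hypothetical `prompt_version` or `index_version`) is a much smaller space than "something changed since yesterday".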

Quick fixes (in order): better chunks → reranker → prompt/citations → model tier → eval gate before deploy.

Structured output and tool failures

From Structured Outputs:

  • Log validation errors verbatim -- missing fields, wrong enum, extra keys
  • Check whether native schema mode is enabled or you are relying on prompt-only JSON
  • One repair retry is normal; repeated failure → schema too large or ambiguous instructions
  • Tool failures: wrong arguments often mean bad tool descriptions (treat as prompt engineering)
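The logging and single-repair behaviour described above might look like the sketch below. It uses only the standard library; the ticket schema (`title`, `priority`) and the `call_model` callable are hypothetical stand-ins for your own schema and client.

```python
import json

# Hypothetical schema for illustration: two required string fields, one with an enum.
REQUIRED = {"title": str, "priority": str}
ALLOWED_PRIORITY = {"low", "medium", "high"}


def validate(raw: str) -> list:
    """Return a list of human-readable validation errors (empty if valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for key, typ in REQUIRED.items():
        if key not in data:
            errors.append(f"missing field: {key}")
        elif not isinstance(data[key], typ):
            errors.append(f"wrong type for {key}")
    for key in data:
        if key not in REQUIRED:
            errors.append(f"extra key: {key}")
    if isinstance(data.get("priority"), str) and data["priority"] not in ALLOWED_PRIORITY:
        errors.append(f"wrong enum for priority: {data['priority']}")
    return errors


def generate_with_repair(call_model, prompt, max_repairs=1, log=print):
    """Call the model, validate, and allow a bounded number of repair retries.

    call_model: any callable taking a prompt string and returning raw text.
    """
    raw = call_model(prompt)
    for attempt in range(max_repairs + 1):
        errors = validate(raw)
        if not errors:
            return json.loads(raw)
        log(f"attempt={attempt} validation_errors={errors}")  # verbatim, for the trace
        if attempt == max_repairs:
            break
        raw = call_model(
            f"{prompt}\n\nYour previous output was invalid: {errors}. Return corrected JSON only."
        )
    raise ValueError(f"output still invalid after {max_repairs} repair(s): {errors}")
```

Raising after the retry budget is spent keeps repeated failures visible, which is the signal that the schema or instructions need attention rather than another retry.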

Agent loops

Symptoms: never finishes, repeats the same tool, or cost spikes.

| Observation | Likely cause | Fix |
| --- | --- | --- |
| Same tool, same args repeatedly | No exit condition; tool error swallowed | Max rounds; surface tool errors to model; fix tool |
| Explores forever | Goal too vague | Narrow task; structured plan step; sub-agent with summary |
| Wrong tool chosen | Tool sprawl / overlap | Fewer tools; clearer names and descriptions |
| Context exceeded mid-loop | Tool outputs too large | Truncate/summarize results; compaction |

Always cap max iterations and log when the cap hits -- that is a product signal, not just a safety net. High-impact tools still need human-in-the-loop.
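Several of these fixes fit into one loop wrapper: a round cap, a repeated-call detector, tool errors passed back to the model instead of swallowed, and truncated tool output. The sketch below assumes hypothetical `step` and `execute_tool` callables and arbitrary default limits; it is not tied to any agent framework.

```python
import json


def run_agent(step, execute_tool, max_rounds=8, max_repeats=2,
              max_output_chars=2000, log=print):
    """Drive an agent loop with basic guards.

    step(history) returns {"tool": name, "args": {...}} or {"final": text}.
    execute_tool(name, args) returns the tool result. Both are hypothetical interfaces.
    """
    history = []
    seen = {}  # (tool, canonical args) -> number of times requested
    for round_no in range(1, max_rounds + 1):
        action = step(history)
        if "final" in action:
            return {"status": "done", "rounds": round_no, "output": action["final"]}
        key = (action["tool"], json.dumps(action["args"], sort_keys=True))
        seen[key] = seen.get(key, 0) + 1
        if seen[key] > max_repeats:
            log(f"loop_detected tool={action['tool']} round={round_no}")
            return {"status": "loop_detected", "rounds": round_no, "output": None}
        try:
            result = str(execute_tool(action["tool"], action["args"]))
        except Exception as exc:
            # Surface the failure to the model rather than swallowing it.
            result = f"TOOL_ERROR: {type(exc).__name__}: {exc}"
        history.append({
            "tool": action["tool"],
            "args": action["args"],
            "result": result[:max_output_chars],  # keep large outputs out of the context
        })
    log(f"max_rounds_hit rounds={max_rounds}")
    return {"status": "max_rounds", "rounds": max_rounds, "output": None}
```

Returning a distinct `status` for each exit path makes cap hits countable on a dashboard, which is what turns the cap into a product signal.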

Latency and cost spikes

See Cost, Latency & Model Routing:

  • Sudden cost ↑ -- new traffic, agent loop, larger context, wrong model ID, cache miss
  • p95 latency ↑ -- longer prompts, retrieval slowness, queue on provider, sequential agent steps

Dashboards: tokens/request, cost/request, rounds/agent session, retrieval latency, model tier mix.
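These dashboard figures can be computed directly from saved traces. In the sketch below the model names and per-token prices are invented placeholders, and the trace fields (`model_id`, `input_tokens`, `output_tokens`, `latency_ms`) are assumed; substitute your provider's actual rates and your own field names.

```python
# Hypothetical prices in USD per million tokens; replace with your provider's real rates.
PRICES = {
    "small-model": {"input": 0.5, "output": 1.5},
    "large-model": {"input": 5.0, "output": 15.0},
}


def request_cost(model_id, input_tokens, output_tokens):
    """Estimated cost in USD for one request."""
    price = PRICES[model_id]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def summarize(traces):
    """Aggregate per-request traces into the dashboard numbers listed above."""
    costs = [request_cost(t["model_id"], t["input_tokens"], t["output_tokens"]) for t in traces]
    tokens = [t["input_tokens"] + t["output_tokens"] for t in traces]
    tier_mix = {}
    for t in traces:
        tier_mix[t["model_id"]] = tier_mix.get(t["model_id"], 0) + 1
    return {
        "requests": len(traces),
        "cost_per_request": sum(costs) / len(traces),
        "tokens_per_request": sum(tokens) / len(traces),
        "p95_latency_ms": p95([t["latency_ms"] for t in traces]),
        "tier_mix": tier_mix,
    }
```

A shift in `tier_mix` toward the larger model with flat traffic is the quickest way to spot a wrong model ID or a fallback route that has become the default.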

Safety and abuse

  • Spike in refusals -- policy or guardrail change
  • Unexpected harmful output -- jailbreak or injection; check full trace including retrieved content
  • Tool exfiltration -- agent fetched data it should not have; tighten tool queries and ACLs

Cross-reference Safety & Guardrails and Privacy & Data Handling.

Local vs cloud

| Issue | Cloud | Local / self-hosted |
| --- | --- | --- |
| OOM / crash | N/A (provider) | Quantization, smaller model, GPU RAM |
| Stale model behavior | Provider update | Pin model file/version; document in runbook |
| CORS / browser | API key exposure risk | Local LLM app proxy pattern |

Fix workflow (do not skip steps)

  1. Reproduce with saved trace or minimal fixture
  2. Isolate layer -- disable RAG, then tools, then swap model tier
  3. One change at a time -- LLM stacks punish shotgun edits
  4. Add eval case so the bug does not return silently
  5. Document in team runbook or postmortem
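Steps 1 to 3 can be run as a small ablation harness: replay the same saved fixture while changing exactly one setting per variant. This is a sketch under assumptions; `run`, `check`, the config keys `use_rag`, `use_tools`, and `model_id`, and the stronger model's name are all hypothetical.

```python
def ablate(run, fixture, base_config, check, stronger_model_id="hypothetical-large-model"):
    """Replay one fixture under single-change variants of the baseline config.

    run(fixture, config) -> output; check(output) -> bool (True means the bug is gone).
    Each variant differs from the baseline in exactly one setting.
    """
    variants = {
        "baseline": {},
        "no_rag": {"use_rag": False},
        "no_tools": {"use_tools": False},
        "stronger_model": {"model_id": stronger_model_id},
    }
    results = {}
    for name, override in variants.items():
        config = {**base_config, **override}  # never mutate the baseline
        results[name] = check(run(fixture, config))
    return results
```

If only `no_rag` passes, the fault is most likely in retrieval; if only `stronger_model` passes, it is a capability or routing question. Because LLM output is nondeterministic, it is worth replaying each variant a few times before trusting a single pass or fail.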

When to escalate vs patch

  • Patch in prompt -- single edge case, clear missing instruction
  • Fix retrieval/tools -- systematic wrong facts or actions
  • Change architecture -- chronic loop/cost issues (Which Pattern When?)
  • Human queue -- high-stakes errors until eval proves fix (Human-in-the-Loop)
