Debugging LLM Apps

LLM applications fail in ways traditional software does not: the same input can produce different outputs, "bugs" look like bad judgment, and the fault may sit in the prompt, retrieval, tools, model tier, or user data rather than in your business logic. This page is a production runbook: classify the symptom, inspect the right layer, and avoid editing prompts at random when the problem is elsewhere. Pair it with Evaluation & LLMOps for prevention and Which Pattern When? for architecture choices.

Step zero: classify the failure

Symptom class	Likely layers	First checks
Infrastructure	API, network, quotas	Status codes, rate limits, timeouts, provider status
Wrong or hallucinated content	Prompt, RAG, model tier	Retrieved chunks, citations, eval set regression
Unparseable / schema errors	Structured output, tools	Validator errors, repair loop logs
Agent stuck or looping	Tools, agent instructions	Round count, repeated tool calls, context size
Slow or expensive	Routing, context, agent depth	Token counts, model ID, retrieval size
Unsafe or policy violation	Safety, injection, tools	User input, retrieved text, tool permissions
Intermittent	Nondeterminism, routing, cache	Temperature, fallback tier, stale cache

Write down which class you are in before changing code. Mixed symptoms are common - fix infrastructure first, then content quality.

What to collect for every incident

Minimum debug bundle (most LLMOps tools capture this):

Request ID, timestamp, user/tenant (redacted per privacy)
Model ID and parameters (temperature, max tokens)
Full prompt messages, or a hash plus a stored trace if the prompt is PII-heavy
For RAG: query, retrieved chunk IDs/scores, reranker output
For agents: each tool name, input, output, latency per round
Token counts and estimated cost
Final output and any validation errors

Without this bundle you cannot reproduce the failure, and any fix is a guess.
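As a minimal sketch of the bundle listed above, the record could be modeled as a dataclass. The names `DebugBundle`, `ToolCall`, `prompt_hash`, and `to_log_record` are illustrative assumptions, not part of any particular LLMOps tool; most tracing products define their own schema.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ToolCall:
    """One agent round: which tool ran, with what input, and how long it took."""
    name: str
    input: dict
    output: str
    latency_ms: float


@dataclass
class DebugBundle:
    """Hypothetical per-request record covering the checklist above."""
    request_id: str
    tenant: str  # assumed to be redacted or pseudonymized upstream
    model_id: str
    temperature: float
    max_tokens: int
    messages: list[dict]
    retrieved: list[dict] = field(default_factory=list)  # e.g. {"chunk_id": ..., "score": ...}
    tool_calls: list[ToolCall] = field(default_factory=list)
    prompt_tokens: int = 0
    completion_tokens: int = 0
    output: str = ""
    validation_errors: list[str] = field(default_factory=list)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def prompt_hash(self) -> str:
        # Stable hash of the messages, so a PII-heavy prompt can be matched
        # against a separately stored trace without logging its text.
        canonical = json.dumps(self.messages, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def to_log_record(self, include_prompt: bool = False) -> dict:
        record = asdict(self)
        record["prompt_hash"] = self.prompt_hash()
        if not include_prompt:
            del record["messages"]  # keep only the hash in the log line
        return record
```

Calling `to_log_record()` with the default arguments drops the raw messages and keeps the hash, which matches the "hash + stored trace" option for PII-heavy prompts.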

Wrong answers (chat and RAG)

1. Is it hallucination or missing context?

No relevant retrieval - empty or low-score chunks → indexing, chunking, embedding model, or query rewrite
Relevant chunks retrieved but ignored - prompt burying context, context rot, or model tier too weak → reorder prompt, summarize chunks, upgrade tier on hard queries
Answer contradicts sources - groundedness failure → cite-or-refuse instructions, eval faithfulness scorers, reranker
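The three cases above can be sketched as a small triage helper. This is an illustration only: the function name, the `min_score` threshold of 0.5, and the two boolean inputs (which would come from a reviewer or a faithfulness scorer) are assumptions, and a real score cutoff depends on your embedding model and reranker.

```python
def triage_wrong_answer(chunk_scores, answer_uses_context, answer_supported,
                        min_score=0.5):
    """Map trace signals to one of the three cases described above.

    chunk_scores:        similarity or reranker scores of the retrieved chunks
    answer_uses_context: whether the answer draws on any retrieved chunk
    answer_supported:    whether the cited chunks actually back the answer
    min_score:           hypothetical relevance cutoff; tune per system
    """
    if not chunk_scores or max(chunk_scores) < min_score:
        # Look at indexing, chunking, embedding model, or query rewrite.
        return "no_relevant_retrieval"
    if not answer_uses_context:
        # Look at prompt ordering, context length, or model tier.
        return "context_ignored"
    if not answer_supported:
        # Look at cite-or-refuse instructions and faithfulness evals.
        return "contradicts_sources"
    return "retrieval_and_grounding_ok"
```

The order of the checks mirrors the order of the list: rule out retrieval before blaming the prompt or the model.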

2. Did retrieval poison the answer?

Prompt injection via documents is common in RAG. Check whether retrieved text contains instructions. Mitigate: source trust tiers, sanitization, separate system vs document channels.
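A rough way to check retrieved text for embedded instructions during triage is a pattern scan over the chunks in the trace. The patterns and the `flag_suspicious_chunks` helper below are illustrative assumptions; a keyword heuristic like this helps you find a poisoned chunk in a saved trace but is not a defense on its own.

```python
import re

# Hypothetical patterns; extend them with phrases seen in your own incidents.
SUSPICIOUS_PATTERNS = [
    r"ignore (all |any )?(previous|prior|above) instructions",
    r"disregard (the )?(system|previous) (prompt|instructions)",
    r"you are now\b",
    r"^\s*system\s*:",
]
_COMPILED = [re.compile(p, re.IGNORECASE | re.MULTILINE) for p in SUSPICIOUS_PATTERNS]


def flag_suspicious_chunks(chunks):
    """Return {chunk_id: [matched patterns]} for chunks that look like instructions.

    Each chunk is assumed to be a dict with "id" and "text" keys.
    """
    flagged = {}
    for chunk in chunks:
        hits = [p.pattern for p in _COMPILED if p.search(chunk["text"])]
        if hits:
            flagged[chunk["id"]] = hits
    return flagged
```

Flagged chunk IDs can then be traced back to their source documents to decide which trust tier they belong in.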

3. Regression or drift?

Compare against a fixed eval set. If yesterday passed and today fails: prompt/version change, model swap, embedding reindex incomplete, or corpus update.
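A minimal sketch of that comparison, assuming each eval run is stored as a mapping of case ID to pass/fail plus a small metadata dict (prompt version, model ID, index version). The function names and the metadata keys are hypothetical.

```python
def find_regressions(baseline, current):
    """Return case IDs that passed in the baseline run but do not pass now.

    A case missing from the current run is treated as failing.
    """
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not current.get(case_id, False)
    )


def changed_settings(baseline_meta, current_meta):
    """Return {key: (old, new)} for every run setting that differs."""
    keys = baseline_meta.keys() | current_meta.keys()
    return {
        key: (baseline_meta.get(key), current_meta.get(key))
        for key in sorted(keys)
        if baseline_meta.get(key) != current_meta.get(key)
    }
```

If `find_regressions` returns cases and `changed_settings` shows exactly one difference, that difference is the first suspect; if several settings changed at once, revert them one at a time.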

Quick fixes (in order): better chunks → reranker → prompt/citations → model tier → eval gate before deploy.

Structured output and tool failures

From Structured Outputs:

Log validation errors verbatim - missing fields, wrong enum, extra keys
Check whether native schema mode is enabled or you are relying on prompt-only JSON
One repair retry is normal; repeated failure → schema too large or ambiguous instructions
Tool failures: wrong arguments often mean bad tool descriptions (treat as prompt engineering)
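The logging and single-repair points above can be sketched with the standard library alone. Everything specific here is an assumption for illustration: the two-field schema, the `validate` and `generate_with_repair` names, and `call_model`, which stands in for whatever function sends a prompt to your provider and returns raw text.

```python
import json

# Hypothetical schema: two required string fields, one of them an enum.
REQUIRED = {"category": str, "priority": str}
ALLOWED_PRIORITY = {"low", "medium", "high"}


def validate(raw):
    """Return (parsed_or_None, errors) with errors worded for verbatim logging."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not an object"]
    errors = []
    for key, expected_type in REQUIRED.items():
        if key not in data:
            errors.append(f"missing field: {key}")
        elif not isinstance(data[key], expected_type):
            errors.append(f"wrong type for {key}")
    extra = sorted(set(data) - set(REQUIRED))
    if extra:
        errors.append(f"extra keys: {extra}")
    priority = data.get("priority")
    if isinstance(priority, str) and priority not in ALLOWED_PRIORITY:
        errors.append(f"wrong enum for priority: {priority!r}")
    return (data if not errors else None), errors


def generate_with_repair(call_model, prompt, log, max_repairs=1):
    """Call the model, validate, and allow a bounded number of repair retries."""
    raw = call_model(prompt)
    for attempt in range(max_repairs + 1):
        data, errors = validate(raw)
        if not errors:
            return data
        log.append({"attempt": attempt, "errors": errors, "raw": raw})
        if attempt < max_repairs:
            raw = call_model(
                f"{prompt}\n\nYour previous reply was invalid: {errors}. "
                "Reply with corrected JSON only."
            )
    raise ValueError(f"output still invalid after {max_repairs} repair(s): {errors}")
```

Counting how often the log holds more than one entry per request gives the "repeated failure" signal that points at the schema or instructions rather than at a one-off slip.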

Agent loops

Symptoms: never finishes, repeats the same tool, or cost spikes.

Observation	Likely cause	Fix
Same tool, same args repeatedly	No exit condition; tool error swallowed	Max rounds; surface tool errors to model; fix tool
Explores forever	Goal too vague	Narrow task; structured plan step; sub-agent with summary
Wrong tool chosen	Tool sprawl / overlap	Fewer tools; clearer names and descriptions
Context exceeded mid-loop	Tool outputs too large	Truncate/summarize results; compaction

Always cap max iterations and log every time the cap is hit - that is a product signal, not just a safety net. High-impact tools still need human-in-the-loop.
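One way to sketch the cap and the repeated-call check is a small guard object consulted before each tool call. `LoopGuard` and its default limits are hypothetical; agent frameworks usually expose their own iteration limit, which should be preferred where it exists.

```python
import json


class LoopGuard:
    """Tracks rounds and repeated identical tool calls for one agent session."""

    def __init__(self, max_rounds=8, max_repeats=2):
        self.max_rounds = max_rounds      # hard cap on tool-calling rounds
        self.max_repeats = max_repeats    # identical (tool, args) calls allowed
        self.rounds = 0
        self.seen = {}

    def check(self, tool_name, args):
        """Return a stop reason ("max_rounds" or "repeated_call"), or None to continue."""
        self.rounds += 1
        if self.rounds > self.max_rounds:
            return "max_rounds"
        # Serialize args with sorted keys so equal dicts compare equal.
        key = (tool_name, json.dumps(args, sort_keys=True))
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.seen[key] > self.max_repeats:
            return "repeated_call"
        return None
```

In the agent loop, a non-None result would end the session and be logged with the request ID, so the frequency of each stop reason can be tracked over time.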

Latency and cost spikes

See Cost, Latency & Model Routing:

Sudden cost ↑ - new traffic, agent loop, larger context, wrong model ID, cache miss
p95 latency ↑ - longer prompts, retrieval slowness, queue on provider, sequential agent steps

Dashboards: tokens/request, cost/request, rounds/agent session, retrieval latency, model tier mix.
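Two of those dashboard numbers, cost per request and p95 latency, can be computed from logged traces with a few lines. The price table passed in is a placeholder you would fill from your provider's current pricing; the function names are illustrative.

```python
def request_cost(model_id, prompt_tokens, completion_tokens, prices):
    """Estimate cost from token counts.

    prices maps model_id -> (input_price, output_price) per 1,000 tokens;
    the values are assumptions supplied by the caller, not real prices.
    """
    input_price, output_price = prices[model_id]
    return (prompt_tokens / 1000) * input_price + (completion_tokens / 1000) * output_price


def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("p95 of an empty sequence is undefined")
    # ceil(0.95 * n) computed in integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Grouping `request_cost` by model ID also surfaces the "wrong model ID" case: an unexpected tier shows up as a new row in the mix.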

Safety and abuse

Spike in refusals - policy or guardrail change
Unexpected harmful output - jailbreak or injection; check full trace including retrieved content
Tool exfiltration - agent fetched data it should not have; tighten tool queries and ACLs

Cross-reference Safety & Guardrails and Privacy & Data Handling.

Local vs cloud

Issue	Cloud	Local / self-hosted
OOM / crash	N/A (provider)	Quantization, smaller model, GPU RAM
Stale model behavior	Provider update	Pin model file/version; document in runbook
CORS / browser	API key exposure risk	Local LLM app proxy pattern

Fix workflow (do not skip steps)

1. Reproduce with saved trace or minimal fixture
2. Isolate layer - disable RAG, then tools, then swap model tier
3. One change at a time - LLM stacks punish shotgun edits
4. Add eval case so the bug does not return silently
5. Document in team runbook or postmortem
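The isolation step can be sketched as a loop that replays one saved case under single-setting overrides. This assumes your app can be driven by a config dict; `run_case`, the config keys `use_rag`, `use_tools`, `model_tier`, and the value `"larger"` are all hypothetical stand-ins.

```python
def isolate_layer(run_case, base_config):
    """Replay a failing case with one layer changed at a time.

    run_case(config) is assumed to return True when the saved case passes.
    Returns the first layer whose change makes the case pass, None if the
    failure does not reproduce at all, or "unknown" if no single change helps.
    """
    ablations = [
        ("rag", {"use_rag": False}),
        ("tools", {"use_tools": False}),
        ("model_tier", {"model_tier": "larger"}),
    ]
    if run_case(base_config):
        return None  # not reproduced; go back to the trace before editing anything
    for layer, override in ablations:
        # Each override is applied to the original config, never stacked.
        if run_case({**base_config, **override}):
            return layer
    return "unknown"
```

Because model output is nondeterministic, a single passing run is weak evidence; repeating each configuration several times before drawing a conclusion is a reasonable precaution.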

When to escalate vs patch

Patch in prompt - single edge case, clear missing instruction
Fix retrieval/tools - systematic wrong facts or actions
Change architecture - chronic loop/cost issues (Which Pattern When?)
Human queue - high-stakes errors until eval proves fix (Human-in-the-Loop)
