Skip to main content

AI Safety & Guardrails

LLMs hallucinate, can be manipulated into harmful output, and act on tools in the real world. Safety is therefore not a single filter but a layered, defense-in-depth discipline. This page covers guardrails (what they are and where they fall short), the attack surface, red-teaming, and how the pieces fit together in production.

Guardrails

Guardrails are systems -- rule-based or ML-based -- that decide whether a given text (a user query or a model response) is allowed or forbidden under a specified policy. They operationalize normative principles by evaluating inputs and outputs for policy compliance, keeping system behavior within ethical, legal, and safety boundaries.

The field evolved from rule-based filters, to trained classifiers for fixed harm types (toxicity, hate speech), to modern instruction-tuned guardrails that frame safety as instruction-following and accept a policy description alongside the input. Two architectural patterns dominate:

  • Multi-class single-pass (e.g. Llama Guard) -- process the input alongside the full policy taxonomy in one forward pass, returning a label plus violating categories.
  • Binary per-category (e.g. ShieldGemma, Granite Guardian) -- one call per risk category, evaluating each policy independently.

Major open-source families: Llama Guard (Meta), ShieldGemma (Google), Granite Guardian (IBM), and Aegis (NVIDIA). Managed options include Bedrock Guardrails (see Cloud vs Local Models).

What guardrails get right -- and wrong

They reach strong precision on in-distribution policies and cover the bulk of widely recognized hazards (violence, sexual content, hate, self-harm, illegal activity). But the current generation has documented limits -- worth knowing so you do not over-trust a single guardrail:

  1. Recall is the bottleneck. Guardrails systematically favor precision (few false positives) at the cost of missing genuinely unsafe content.
  2. Poor generalization to unseen policies. Moving from a standard taxonomy to a domain-specific one can drop F1 by 24+ points -- sometimes below the model's own non-safety-tuned base model.
  3. Prompt extension is not enough. Bolting new categories onto the policy prompt tends to either not improve recall or trade recall for large false-positive spikes.
  4. Domain-specific risks are nearly invisible. In financial-services red-teaming, evaluated guardrails caught roughly a third or less of unsafe queries even with extended taxonomies.

Mitigations split into training-time (e.g. perturbing policies during training so the model attends to the supplied policy text rather than memorizing one taxonomy) and deployment-time (multi-layer strategies, governance, disclaimers, human review).

The attack surface: prompt injection and jailbreaking

Prompt injection and jailbreaking are attacks meant to override the limitations imposed on an LLM system to elicit harmful or undesirable output. A common method disguises a malicious instruction as normal input and manipulates the system into ignoring its original instructions.

  • It is a method, not an outcome -- it describes how an attack happens, not what the harmful content is. Attackers often use injection to achieve some other category of violation.
  • Indirect prompt injection compromises LLM-integrated apps via malicious content hidden in retrieved data -- a direct risk for any RAG or agent system that ingests untrusted text.
  • The OWASP Top 10 for LLM Applications ranks prompt injection as the #1 risk class. It is also largely not covered by general content guardrails -- dedicated detectors (e.g. Meta's Prompt Guard) exist for it.

For agents this compounds: a single user turn fans out to many tool calls, and an injected instruction can trigger real-world actions. Constrain what tools exist and what they may do, and treat tool inputs/outputs as untrusted (see Agents).

Red-teaming

Red-teaming is a safety evaluation method where evaluators continuously and adversarially probe a system to discover new failure modes -- in contrast to static benchmarks that test against a fixed set of examples.

  • Adaptive -- evaluators steer exploration using the risk taxonomy and the intended use case; multi-turn attacks can grow progressively complex.
  • Complementary to benchmarks -- red-teaming data should be frozen into static benchmarks for regression testing and to accumulate institutional domain expertise over time.
  • Diverse participants matter -- security backgrounds drive injection attempts, AI engineers know model failure modes, domain experts know which questions probe real regulatory boundaries.

Red-teaming inputs slot directly into the same eval runner described in Evaluation and LLMOps.

Defense in depth

No single control is sufficient. A responsible deployment layers them:

  • Input side -- guardrail classification plus prompt-injection detection on untrusted text.
  • Model/agent side -- least-privilege tools, explicit user consent for sensitive actions, and grounding via RAG to reduce hallucination.
  • Output side -- output guardrails, groundedness/faithfulness checks, and citations.
  • Process side -- governance (logging, escalation, manual review, access suspension), continuous monitoring, and frameworks like the NIST AI Risk Management Framework (Govern / Map / Measure / Manage).

Safety is ultimately a sociotechnical problem: it depends on the context the system operates in, not just the model. Evaluate risk holistically -- in context, with humans in the loop where the stakes warrant it.

See also