AI Safety & Guardrails

LLMs hallucinate, can be manipulated into harmful output, and act on tools in the real world. Safety is therefore not a single filter but a layered, defense-in-depth discipline. This page covers guardrails (what they are and where they fall short), the attack surface, red-teaming, and how the pieces fit together in production.

Guardrails

Guardrails are systems - rule-based or ML-based - that decide whether a given text (a user query or a model response) is allowed or forbidden under a specified policy. They operationalize normative principles by evaluating inputs and outputs for policy compliance, keeping system behavior within ethical, legal, and safety boundaries.

The field evolved from rule-based filters, to trained classifiers for fixed harm types (toxicity, hate speech), to modern instruction-tuned guardrails that frame safety as instruction-following and accept a policy description alongside the input. Two architectural patterns dominate:

Multi-class single-pass (e.g. Llama Guard) - process the input alongside the full policy taxonomy in one forward pass, returning a label plus violating categories.
Binary per-category (e.g. ShieldGemma, Granite Guardian) - one call per risk category, evaluating each policy independently.

Major open-source families: Llama Guard (Meta), ShieldGemma (Google), Granite Guardian (IBM), and Aegis (NVIDIA). Managed options include Bedrock Guardrails (see Cloud vs Local Models).

What guardrails get right - and wrong

They reach strong precision on in-distribution policies and cover the bulk of widely recognized hazards (violence, sexual content, hate, self-harm, illegal activity). But the current generation has documented limits - worth knowing so you do not over-trust a single guardrail:

Recall is the bottleneck. Guardrails systematically favor precision (few false positives) at the cost of missing genuinely unsafe content.
Poor generalization to unseen policies. Moving from a standard taxonomy to a domain-specific one can drop F1 by 24+ points - sometimes below the model's own non-safety-tuned base model.
Prompt extension is not enough. Bolting new categories onto the policy prompt tends to either not improve recall or trade recall for large false-positive spikes.
Domain-specific risks are nearly invisible. In financial-services red-teaming, evaluated guardrails caught roughly a third or less of unsafe queries even with extended taxonomies.

Mitigations split into training-time (e.g. perturbing policies during training so the model attends to the supplied policy text rather than memorizing one taxonomy) and deployment-time (multi-layer strategies, governance, disclaimers, human review).

The attack surface: prompt injection and jailbreaking

Prompt injection and jailbreaking are attacks meant to override the limitations imposed on an LLM system to elicit harmful or undesirable output. A common method disguises a malicious instruction as normal input and manipulates the system into ignoring its original instructions.

It is a method, not an outcome - it describes how an attack happens, not what the harmful content is. Attackers often use injection to achieve some other category of violation.
Indirect prompt injection compromises LLM-integrated apps via malicious content hidden in retrieved data - a direct risk for any RAG or agent system that ingests untrusted text.
The OWASP Top 10 for LLM Applications ranks prompt injection as the #1 risk class. It is also largely not covered by general content guardrails - dedicated detectors (e.g. Meta's Prompt Guard) exist for it.

For agents this compounds: a single user turn fans out to many tool calls, and an injected instruction can trigger real-world actions. Constrain what tools exist and what they may do, and treat tool inputs/outputs as untrusted (see Agents).

Red-teaming

Red-teaming is a safety evaluation method where evaluators continuously and adversarially probe a system to discover new failure modes - in contrast to static benchmarks that test against a fixed set of examples.

Adaptive - evaluators steer exploration using the risk taxonomy and the intended use case; multi-turn attacks can grow progressively complex.
Complementary to benchmarks - red-teaming data should be frozen into static benchmarks for regression testing and to accumulate institutional domain expertise over time.
Diverse participants matter - security backgrounds drive injection attempts, AI engineers know model failure modes, domain experts know which questions probe real regulatory boundaries.

Red-teaming inputs slot directly into the same eval runner described in Evaluation and LLMOps.

Defense in depth

No single control is sufficient. A responsible deployment layers them:

Input side - guardrail classification plus prompt-injection detection on untrusted text.
Model/agent side - least-privilege tools, explicit user consent for sensitive actions, and grounding via RAG to reduce hallucination.
Output side - output guardrails, groundedness/faithfulness checks, and citations.
Process side - governance (logging, escalation, manual review, access suspension), continuous monitoring, and frameworks like the NIST AI Risk Management Framework (Govern / Map / Measure / Manage).

Safety is ultimately a sociotechnical problem: it depends on the context the system operates in, not just the model. Evaluate risk holistically - in context, with humans in the loop where the stakes warrant it.

Guardrails​

What guardrails get right - and wrong​

The attack surface: prompt injection and jailbreaking​

Red-teaming​

Defense in depth​

See also​

Guardrails

What guardrails get right - and wrong

The attack surface: prompt injection and jailbreaking

Red-teaming

Defense in depth

See also