Skip to content

AI Engineering Safety & Governance

Research-oriented AI safety notes explain attacks and defenses in principle. Engineering safety asks a different question: once the system is live, what controls must actually exist? For LLM applications, a “safer model” is not enough. Permissions, audit logs, content governance, model documentation, and compliance controls all need to work together.

This page focuses on the application and governance layers. For lower-level system-security topics such as tool boundaries, secret isolation, hardware leakage, and multi-tenant accelerator risk, continue to LLM & Agent System Security.

1. Layered engineering controls

flowchart TD
    A[Input / Retrieval] --> B[Model Policy]
    B --> C[Tool Invocation]
    C --> D[Output Delivery]
    D --> E[Logging / Audit]
    E --> F[Governance / Compliance]

The point of this layered view is simple: controls must span the full call chain, not just the model prompt and response.

2. Prompt injection and input governance

Prompt injection is often discussed under LLM Jailbreaking, but in production it is fundamentally a trust-boundary problem. High-risk sources include:

  • raw user input
  • retrieved web pages, PDFs, emails, and database rows
  • tool outputs
  • long-lived memory or cached state

Useful controls include:

  • injection classifiers and pattern filters
  • provenance-aware message schemas
  • length and complexity limits
  • PII redaction before logging or reuse

System prompts help, but they are not a security boundary by themselves.

3. Privacy, logs, and retention

Teams often focus on training-data privacy and forget that production risk often comes from:

  • chat logs
  • retrieved snippets
  • tool results
  • debug traces
  • analytics events

Good engineering defaults are:

  • collect the minimum necessary data
  • retain it for the minimum justified duration
  • separate access by role and workflow

Training-time privacy methods such as differential privacy belong with Privacy Attacks; engineering work is about deletion flows, auditability, and operational consistency.

4. Permissions and tool use

Once an LLM can invoke tools, safety becomes a capability-management problem.

Layer What must be controlled Typical mechanism
User who may initiate requests authN / authZ / RBAC
Session what memory or docs are visible scoped sessions
Model which model can be routed where policy routing
Tool which APIs / DBs / shells are callable allowlists + approvals
Output what content is visible downstream redaction + policy gates

High-risk actions such as sending email, writing databases, deleting files, or executing shell commands should never be governed only by prompt text.

5. Content governance

Common categories include:

  • dangerous instructions
  • PII leakage
  • hallucinated claims
  • biased or discriminatory content
  • copyrighted or policy-restricted outputs

Output moderation matters, but it does not replace model governance. In mature systems:

  • the main model handles normal task behavior
  • a review model or rule layer performs additional risk checks
  • external policy or human approval gates the highest-risk actions

6. Red teaming, regression, and model cards

Red teaming should be a continuous pipeline, not a one-time launch checklist:

  1. define dangerous capabilities and misuse cases
  2. test them manually and automatically
  3. add successful failures to regression suites
  4. rerun before every major release

Model cards matter because they document:

  • what the model is
  • how it was trained or adapted
  • what has been evaluated
  • what is explicitly out of scope
  • who owns incidents and limitations

7. Regulation, standards, and audit

Regulations and standards do not replace technical controls. In practice they help organizations turn safety expectations into:

  • auditable evidence
  • documented responsibilities
  • repeatable change-management and incident workflows

A minimal audit trail should answer:

  • who initiated the request
  • which model version and template were used
  • what was retrieved
  • which tools were called
  • what was blocked or rewritten
  • whether human approval was required

Relations to other topics

References

  • Tufts EE141 Trusted AI Course Slides, LLM Security and System Security lectures, Spring 2026.
  • Mitchell et al., "Model Cards for Model Reporting", FAT* 2019.
  • Greshake et al., "Not What You've Signed Up For", AISec 2023.
  • OWASP, "Top 10 for Large Language Model Applications", 2025.

评论 #