AI Engineering Safety & Governance
Research-oriented AI safety notes explain attacks and defenses in principle. Engineering safety asks a different question: once the system is live, what controls must actually exist? For LLM applications, a “safer model” is not enough. Permissions, audit logs, content governance, model documentation, and compliance controls all need to work together.
This page focuses on the application and governance layers. For lower-level system-security topics such as tool boundaries, secret isolation, hardware leakage, and multi-tenant accelerator risk, continue to LLM & Agent System Security.
1. Layered engineering controls
flowchart TD
A[Input / Retrieval] --> B[Model Policy]
B --> C[Tool Invocation]
C --> D[Output Delivery]
D --> E[Logging / Audit]
E --> F[Governance / Compliance]
The point of this layered view is simple: controls must span the full call chain, not just the model prompt and response.
2. Prompt injection and input governance
Prompt injection is often discussed under LLM Jailbreaking, but in production it is fundamentally a trust-boundary problem. High-risk sources include:
- raw user input
- retrieved web pages, PDFs, emails, and database rows
- tool outputs
- long-lived memory or cached state
Useful controls include:
- injection classifiers and pattern filters
- provenance-aware message schemas
- length and complexity limits
- PII redaction before logging or reuse
System prompts help, but they are not a security boundary by themselves.
3. Privacy, logs, and retention
Teams often focus on training-data privacy and forget that production risk often comes from:
- chat logs
- retrieved snippets
- tool results
- debug traces
- analytics events
Good engineering defaults are:
- collect the minimum necessary data
- retain it for the minimum justified duration
- separate access by role and workflow
Training-time privacy methods such as differential privacy belong with Privacy Attacks; engineering work is about deletion flows, auditability, and operational consistency.
4. Permissions and tool use
Once an LLM can invoke tools, safety becomes a capability-management problem.
| Layer | What must be controlled | Typical mechanism |
|---|---|---|
| User | who may initiate requests | authN / authZ / RBAC |
| Session | what memory or docs are visible | scoped sessions |
| Model | which model can be routed where | policy routing |
| Tool | which APIs / DBs / shells are callable | allowlists + approvals |
| Output | what content is visible downstream | redaction + policy gates |
High-risk actions such as sending email, writing databases, deleting files, or executing shell commands should never be governed only by prompt text.
5. Content governance
Common categories include:
- dangerous instructions
- PII leakage
- hallucinated claims
- biased or discriminatory content
- copyrighted or policy-restricted outputs
Output moderation matters, but it does not replace model governance. In mature systems:
- the main model handles normal task behavior
- a review model or rule layer performs additional risk checks
- external policy or human approval gates the highest-risk actions
6. Red teaming, regression, and model cards
Red teaming should be a continuous pipeline, not a one-time launch checklist:
- define dangerous capabilities and misuse cases
- test them manually and automatically
- add successful failures to regression suites
- rerun before every major release
Model cards matter because they document:
- what the model is
- how it was trained or adapted
- what has been evaluated
- what is explicitly out of scope
- who owns incidents and limitations
7. Regulation, standards, and audit
Regulations and standards do not replace technical controls. In practice they help organizations turn safety expectations into:
- auditable evidence
- documented responsibilities
- repeatable change-management and incident workflows
A minimal audit trail should answer:
- who initiated the request
- which model version and template were used
- what was retrieved
- which tools were called
- what was blocked or rewritten
- whether human approval was required
Relations to other topics
- For the overall framing, see AI Safety Overview
- For jailbreaks and prompt attacks, see LLM Jailbreaking
- For red-team methodology, see Red Teaming
- For system-level isolation and hardware/runtime risks, see LLM & Agent System Security
References
- Tufts EE141 Trusted AI Course Slides, LLM Security and System Security lectures, Spring 2026.
- Mitchell et al., "Model Cards for Model Reporting", FAT* 2019.
- Greshake et al., "Not What You've Signed Up For", AISec 2023.
- OWASP, "Top 10 for Large Language Model Applications", 2025.