AI Engineering Safety & Governance

Research-oriented AI safety notes explain attacks and defenses in principle. Engineering safety asks a different question: once the system is live, what controls must actually exist? For LLM applications, a “safer model” is not enough. Permissions, audit logs, content governance, model documentation, and compliance controls all need to work together.

This page focuses on the application and governance layers. For lower-level system-security topics such as tool boundaries, secret isolation, hardware leakage, and multi-tenant accelerator risk, continue to LLM & Agent System Security.

1. Layered engineering controls

flowchart TD
    A[Input / Retrieval] --> B[Model Policy]
    B --> C[Tool Invocation]
    C --> D[Output Delivery]
    D --> E[Logging / Audit]
    E --> F[Governance / Compliance]

The point of this layered view is simple: controls must span the full call chain, not just the model prompt and response.

2. Prompt injection and input governance

Prompt injection is often discussed under LLM Jailbreaking, but in production it is fundamentally a trust-boundary problem. High-risk sources include:

raw user input
retrieved web pages, PDFs, emails, and database rows
tool outputs
long-lived memory or cached state

Useful controls include:

injection classifiers and pattern filters
provenance-aware message schemas
length and complexity limits
PII redaction before logging or reuse

System prompts help, but they are not a security boundary by themselves.

3. Privacy, logs, and retention

Teams often focus on training-data privacy and forget that production risk often comes from:

chat logs
retrieved snippets
tool results
debug traces
analytics events

Good engineering defaults are:

collect the minimum necessary data
retain it for the minimum justified duration
separate access by role and workflow

Training-time privacy methods such as differential privacy belong with Privacy Attacks; engineering work is about deletion flows, auditability, and operational consistency.

4. Permissions and tool use

Once an LLM can invoke tools, safety becomes a capability-management problem.

Layer	What must be controlled	Typical mechanism
User	who may initiate requests	authN / authZ / RBAC
Session	what memory or docs are visible	scoped sessions
Model	which model can be routed where	policy routing
Tool	which APIs / DBs / shells are callable	allowlists + approvals
Output	what content is visible downstream	redaction + policy gates

High-risk actions such as sending email, writing databases, deleting files, or executing shell commands should never be governed only by prompt text.

5. Content governance

Common categories include:

dangerous instructions
PII leakage
hallucinated claims
biased or discriminatory content
copyrighted or policy-restricted outputs

Output moderation matters, but it does not replace model governance. In mature systems:

the main model handles normal task behavior
a review model or rule layer performs additional risk checks
external policy or human approval gates the highest-risk actions

6. Red teaming, regression, and model cards

Red teaming should be a continuous pipeline, not a one-time launch checklist:

define dangerous capabilities and misuse cases
test them manually and automatically
add successful failures to regression suites
rerun before every major release

Model cards matter because they document:

what the model is
how it was trained or adapted
what has been evaluated
what is explicitly out of scope
who owns incidents and limitations

7. Regulation, standards, and audit

Regulations and standards do not replace technical controls. In practice they help organizations turn safety expectations into:

auditable evidence
documented responsibilities
repeatable change-management and incident workflows

A minimal audit trail should answer:

who initiated the request
which model version and template were used
what was retrieved
which tools were called
what was blocked or rewritten
whether human approval was required

Relations to other topics

For the overall framing, see AI Safety Overview
For jailbreaks and prompt attacks, see LLM Jailbreaking
For red-team methodology, see Red Teaming
For system-level isolation and hardware/runtime risks, see LLM & Agent System Security

References

Tufts EE141 Trusted AI Course Slides, LLM Security and System Security lectures, Spring 2026.
Mitchell et al., "Model Cards for Model Reporting", FAT* 2019.
Greshake et al., "Not What You've Signed Up For", AISec 2023.
OWASP, "Top 10 for Large Language Model Applications", 2025.