AI Safety & Trustworthiness
This section organizes AI safety through a trustworthy-AI lens, covering risks across training, inference, deployment, and governance: adversarial attacks, backdoors, privacy leakage, jailbreaks, system security, interpretability, alignment, and red teaming. Threat models, figure evidence, and engineering controls are grouped by canonical knowledge topic rather than by lecture order. For a fast overview, start with AI Safety Overview, then branch into the attack surface that matters most to you.
Section Map
1. Overview and framing
- AI Safety Overview: the capability-risk-control-governance frame, unified threat models, and the reading map for the section
2. Model attacks and privacy
- Adversarial Attack & Defense: white-box, black-box, physical attacks, transferability, and major defenses
- FGSM & PGD: mathematical foundations of gradient-based white-box attacks
- Backdoor Attacks: training-time poisoning, trigger design, detection, and mitigation
- Privacy Attacks: membership inference, model inversion, model extraction, differential privacy, and unlearning
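As a taste of the gradient-based attacks covered under FGSM & PGD, the one-step FGSM update x' = x + ε · sign(∇ₓL) can be sketched in a few lines. The model here is a deliberately toy stand-in (a logistic-regression loss with an analytic input gradient); the function names and example values are illustrative, not from any page in this section:

```python
import numpy as np

def fgsm(x, grad_x, epsilon):
    """Fast Gradient Sign Method: one-step perturbation x' = x + eps * sign(dL/dx)."""
    return x + epsilon * np.sign(grad_x)

def input_gradient(w, x, y):
    """Analytic dL/dx for the toy logistic loss L = -log sigmoid(y * w.x)."""
    margin = y * np.dot(w, x)
    return -y * w / (1.0 + np.exp(margin))

# Toy linear model and input (illustrative values only).
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, -0.3])
y = 1.0

x_adv = fgsm(x, input_gradient(w, x, y), epsilon=0.1)
# Each coordinate of x moves by exactly +/- epsilon in the loss-increasing direction.
```

PGD, discussed on the same page, simply iterates this step with a projection back into the ε-ball around the original input after each update.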
3. LLM and system security
- LLM Jailbreaking: prompt-based, token-based, and multi-turn jailbreaks plus layered defenses
- Visual Instruction Injection: malicious instructions hidden in multimodal inputs
- Red Teaming: systematic security evaluation, dangerous capability evals, and regression testing
- For engineering deployment, see AI Engineering Safety & Governance and LLM & Agent System Security
4. Trustworthiness and governance
- AI Ethics & Governance: responsibility, fairness, transparency, regulation, and institutional governance
- AI Alignment: RLHF, Constitutional AI, reward hacking, and scalable oversight
- Explainability & Robustness: XAI, Grad-CAM, LIME, mechanistic interpretability, and trustworthiness evidence
- Authenticity & Privacy Protection: provenance, authenticity, and privacy-preserving deployment
Suggested Reading Order
- Start with AI Safety Overview to get the capability-risk-control-governance frame.
- Then move into Adversarial Attack & Defense, Backdoor Attacks, Privacy Attacks, or LLM Jailbreaking depending on the threat surface.
- Finish with Red Teaming, AI Alignment, AI Ethics & Governance, and Explainability & Robustness to connect research, deployment, and governance.