
Security and Sandboxing

Overview

Security is the most critical consideration in AI Agent deployment. Agents can execute code, call APIs, and operate file systems, meaning security vulnerabilities could lead to severe consequences. This section systematically discusses agent sandboxing technology, permission models, prompt injection defense, and data protection strategies.

Sandboxing Strategies

Sandbox Technology Comparison

| Technology | Isolation Level | Performance Overhead | Security | Applicable Scenario |
| --- | --- | --- | --- | --- |
| Docker | Container-level | Low | Medium | General scenarios |
| gVisor | Kernel-level | Medium | High | High-security needs |
| E2B | Micro-VM | Medium | High | Code execution |
| Firecracker | Micro-VM | Low | High | AWS Lambda |
| WebAssembly | In-process | Very low | Medium | Lightweight isolation |

Docker Sandbox

# Secure Agent code execution sandbox
FROM python:3.11-slim

# Security hardening: apply base-image updates, install nothing extra,
# and remove the apt cache to keep the image minimal
RUN apt-get update && apt-get upgrade -y \
    && rm -rf /var/lib/apt/lists/*

# Non-root user
RUN useradd -m -s /bin/bash sandbox
USER sandbox
WORKDIR /home/sandbox

# Network access restricted (via Docker network policies)
# File system access restricted (read-only mounts)
# Resource usage limited (Docker resource limits)

Launch parameters:

# --network=none          : disable network access
# --read-only             : read-only root file system
# --tmpfs /tmp:size=100m  : small temporary write space
# --pids-limit=50         : limit process count
docker run \
  --memory=512m \
  --cpus=1 \
  --network=none \
  --read-only \
  --tmpfs /tmp:size=100m \
  --security-opt no-new-privileges \
  --pids-limit=50 \
  agent-sandbox
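In an orchestration service, the same launch parameters can be assembled programmatically. A minimal sketch, assuming a local image named `agent-sandbox`; the `build_sandbox_cmd` helper and its default limits are illustrative, not part of any SDK:

```python
import subprocess

def build_sandbox_cmd(image: str, memory: str = "512m", cpus: int = 1,
                      pids_limit: int = 50) -> list[str]:
    """Assemble a locked-down `docker run` command for agent code execution."""
    return [
        "docker", "run", "--rm",
        f"--memory={memory}",
        f"--cpus={cpus}",
        "--network=none",             # no outbound network
        "--read-only",                # read-only root file system
        "--tmpfs", "/tmp:size=100m",  # scratch space only
        "--security-opt", "no-new-privileges",
        f"--pids-limit={pids_limit}",
        image,
    ]

cmd = build_sandbox_cmd("agent-sandbox")
# subprocess.run(cmd, timeout=60)  # add a wall-clock limit on top of the cgroup limits
```

Keeping the flags in one helper makes it harder for a code path to launch a container with some restriction accidentally omitted.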

gVisor

gVisor provides an application-level kernel that intercepts all system calls:

graph TD
    A[Agent Code] --> B[gVisor Sentry]
    B --> C{System Call}
    C -->|Allowed| D[gVisor Kernel Implementation]
    C -->|Denied| E[Return Error]
    D --> F[Host Kernel]

    style B fill:#fff3e0
    style E fill:#ffcdd2

Advantages:

  • Stronger isolation than Docker (implements a Linux kernel subset)
  • Compatible with existing container images
  • Used by default in Google Cloud Run

E2B (Code Interpreter SDK)

A code execution sandbox designed specifically for AI Agents:

from e2b_code_interpreter import Sandbox

sandbox = Sandbox()

# Securely execute agent-generated code
execution = sandbox.run_code("""
import pandas as pd
df = pd.read_csv('data.csv')
print(df.describe())
""")

print(execution.logs)
sandbox.close()

Features:

  • Creates a fresh micro-VM for each execution
  • Supports file upload and download
  • Built-in timeout and resource limits
  • Supports Python, JavaScript, R, and other languages

Permission Model

Principle of Least Privilege

Agents should only possess the minimum permissions needed to complete a task:

\[ \text{Permissions}(agent) = \min\{P : P \text{ sufficient for task}\} \]

Capability-based Permissions

# Capability-based permission model
import fnmatch

class AgentCapabilities:
    def __init__(self):
        self.capabilities = {
            "file_read": {
                "allowed_paths": ["/workspace/*"],
                "denied_paths": ["/etc/*", "/root/*"],
            },
            "file_write": {
                "allowed_paths": ["/workspace/output/*"],
                "max_file_size": "10MB",
            },
            "network": {
                "allowed_domains": ["api.openai.com", "pypi.org"],
                "denied_ports": [22, 23, 3389],
            },
            "code_execution": {
                "allowed_languages": ["python"],
                "timeout": 30,  # seconds
                "max_memory": "512MB",
            },
        }

    def check_permission(self, action, resource):
        cap = self.capabilities.get(action)
        if cap is None:
            return False  # Default deny
        return self._match_resource(cap, resource)

    def _match_resource(self, cap, resource):
        # Deny rules take precedence over allow rules
        if any(fnmatch.fnmatch(resource, p) for p in cap.get("denied_paths", [])):
            return False
        allowed = cap.get("allowed_paths")
        if allowed is not None:
            return any(fnmatch.fnmatch(resource, p) for p in allowed)
        return True  # capability exists and carries no path constraint
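The path-matching step can also be illustrated in isolation. A minimal sketch using `fnmatch` glob patterns, with deny rules taking precedence; the rule set itself is hypothetical:

```python
import fnmatch

def path_allowed(path: str, allowed: list[str], denied: list[str]) -> bool:
    """Deny patterns win; otherwise the path must match an allow pattern."""
    if any(fnmatch.fnmatch(path, p) for p in denied):
        return False
    return any(fnmatch.fnmatch(path, p) for p in allowed)

allowed = ["/workspace/*"]
denied = ["/etc/*", "/root/*"]
print(path_allowed("/workspace/data.csv", allowed, denied))  # True
print(path_allowed("/etc/passwd", allowed, denied))          # False
```

Note that `fnmatch`'s `*` also matches path separators, so `/workspace/*` covers nested paths; a stricter matcher would handle `/` explicitly.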

Tiered Permissions

| Permission Level | Allowed Operations | Approval Required |
| --- | --- | --- |
| Level 0 | Read-only (search, read) | None |
| Level 1 | Create (new files, new messages) | None |
| Level 2 | Modify (edit files, update data) | First-time confirmation |
| Level 3 | Delete/Send (delete files, send emails) | Every-time confirmation |
| Level 4 | System operations (install software, modify config) | Human approval |
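Such a tier table maps naturally to a gate in the tool-call loop. A minimal sketch; the action-to-level assignments and the `requires_approval` helper are illustrative:

```python
# Hypothetical mapping of tool actions to the tiers above
ACTION_LEVELS = {
    "search": 0, "read_file": 0,
    "create_file": 1,
    "edit_file": 2,
    "delete_file": 3, "send_email": 3,
    "install_package": 4,
}

def requires_approval(action: str, first_time: bool) -> str:
    """Return the approval policy for a proposed action: "none", "confirm", or "human"."""
    level = ACTION_LEVELS.get(action, 4)  # unknown actions: treat as most sensitive
    if level <= 1:
        return "none"
    if level == 2:
        return "confirm" if first_time else "none"
    if level == 3:
        return "confirm"
    return "human"

print(requires_approval("read_file", first_time=True))    # none
print(requires_approval("send_email", first_time=False))  # confirm
```

Defaulting unknown actions to the highest tier keeps the gate consistent with the default-deny principle discussed under Security Best Practices.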

Prompt Injection Defense

Attack Types

graph TD
    A[Prompt Injection Attacks] --> B[Direct Injection]
    A --> C[Indirect Injection]

    B --> B1[User directly inputs malicious instructions]

    C --> C1[Web page content contains hidden instructions]
    C --> C2[Documents embed malicious text]
    C --> C3[Tool return values contain injection]

    style A fill:#ffcdd2
    style C fill:#fff3e0

Input Validation

import re

class InputValidator:
    # Suspicious pattern detection
    SUSPICIOUS_PATTERNS = [
        r"ignore\s+(previous|above|all)\s+instructions",
        r"you\s+are\s+now\s+",
        r"new\s+instructions?\s*:",
        r"system\s*prompt\s*:",
        r"</?(system|user|assistant)>",
    ]

    def validate(self, user_input: str) -> tuple[bool, str]:
        for pattern in self.SUSPICIOUS_PATTERNS:
            if re.search(pattern, user_input, re.IGNORECASE):
                return False, f"Suspicious pattern detected: {pattern}"
        return True, "OK"

Output Filtering

Security checks before agent output reaches the user:

  • PII detection: Detect and redact personally identifiable information
  • Content policy: Filter harmful or non-compliant content
  • Format validation: Ensure output format meets expectations
  • Link checking: Verify the safety of generated URLs
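The link-checking step, for example, can be reduced to an allowlist check on each URL's host. A minimal sketch; the allowlist and the `safe_links_only` helper are hypothetical:

```python
import re
from urllib.parse import urlparse

ALLOWED_HOSTS = {"docs.python.org", "github.com"}  # hypothetical allowlist

def safe_links_only(text: str) -> bool:
    """Return False if the output contains a URL whose host is not allowlisted."""
    for url in re.findall(r"https?://\S+", text):
        host = urlparse(url).hostname or ""
        if host not in ALLOWED_HOSTS:
            return False
    return True

print(safe_links_only("See https://docs.python.org/3/ for details"))  # True
print(safe_links_only("Download from https://evil.example/payload"))  # False
```

Parsing the host with `urlparse` rather than substring matching avoids trivial bypasses such as `https://evil.example/docs.python.org`.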

Defense Strategy Summary

| Strategy | Defense Layer | Implementation Difficulty | Effectiveness |
| --- | --- | --- | --- |
| Input validation | Input layer | Low | Medium |
| System prompt hardening | Prompt layer | Low | Medium |
| Output filtering | Output layer | Medium | High |
| Tool permission control | Execution layer | Medium | High |
| Sandbox isolation | Infrastructure layer | High | High |
| Multi-model review | Review layer | High | Very high |

PII Protection

Personal Information Types

| Type | Example | Risk Level |
| --- | --- | --- |
| Name | John Smith | Medium |
| National ID | 110101199001011234 | High |
| Phone number | 13800138000 | High |
| Email address | user@example.com | Medium |
| Bank card number | 6222021234567890 | Very high |
| Address | 123 Main Street... | Medium |

Redaction Strategy

\[ \text{Redacted}(text) = \text{replace}(text, \text{PII}_i, \text{mask}_i) \quad \forall i \]

# PII redaction example
import re

def redact_pii(text):
    # Phone numbers (Chinese mobile format)
    text = re.sub(r'1[3-9]\d{9}', '[PHONE]', text)
    # National ID numbers (18-digit Chinese format)
    text = re.sub(r'\d{17}[\dXx]', '[ID_NUMBER]', text)
    # Email addresses
    text = re.sub(r'\S+@\S+\.\S+', '[EMAIL]', text)
    # Bank card numbers
    text = re.sub(r'\d{16,19}', '[CARD_NUMBER]', text)
    return text

Content Filtering

Multi-layer Filtering Architecture

User Input → Input Filter → Agent Processing → Output Filter → User
                ↓                                   ↓
            Reject/Modify                       Redact/Filter
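The two filter stages can be chained as a simple pipeline around the agent call. A minimal sketch in which the filter functions are placeholders for the real checks described above:

```python
def input_filter(text: str) -> str:
    # Placeholder: reject an obvious injection attempt before the agent sees it
    if "ignore previous instructions" in text.lower():
        raise ValueError("input rejected")
    return text

def output_filter(text: str) -> str:
    # Placeholder: redact a marker standing in for detected PII or policy violations
    return text.replace("SECRET", "[REDACTED]")

def handle(user_input: str, agent) -> str:
    """User Input -> Input Filter -> Agent Processing -> Output Filter -> User."""
    return output_filter(agent(input_filter(user_input)))

# A stub agent stands in for the real model call
reply = handle("summarize the report", lambda q: f"Summary of SECRET report: {q}")
print(reply)  # Summary of [REDACTED] report: summarize the report
```

Because both filters sit outside the agent, they apply uniformly regardless of which model or tool produced the text in between.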

Filtering Rules

  • Harmful content: Violence, hate speech, pornography
  • Illegal content: Criminal solicitation, fraud
  • Enterprise policies: Competitor information, confidential data
  • Compliance requirements: Industry-specific content restrictions

Security Best Practices

  1. Defense in depth: Multiple security layers, not relying on a single line of defense
  2. Default deny: Operations not explicitly allowed are denied by default
  3. Audit logging: Record all agent operations for post-hoc review
  4. Regular penetration testing: Conduct periodic security assessments
  5. Emergency stop: Implement kill switch mechanisms
  6. Security updates: Promptly update dependencies and security patches
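The emergency-stop practice (item 5), for instance, can be as simple as a shared flag checked before every tool call. A minimal sketch using `threading.Event`; the `KillSwitch` class is illustrative:

```python
import threading

class KillSwitch:
    """Shared stop flag consulted before every agent action."""

    def __init__(self):
        self._stop = threading.Event()

    def trip(self):
        # Called by an operator or an automated monitoring rule
        self._stop.set()

    def check(self):
        # Called by the agent loop before each tool call
        if self._stop.is_set():
            raise RuntimeError("agent halted by kill switch")

switch = KillSwitch()
switch.check()   # passes while the switch is not tripped
switch.trip()
# switch.check() would now raise RuntimeError, halting the agent loop
```

Using `threading.Event` lets a monitoring thread trip the switch while the agent loop runs, without additional locking.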

References

  1. Greshake, K., et al. "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." AISec 2023.
  2. E2B. "Code Interpreter SDK." 2024.
  3. Google. "gVisor: Container Runtime Sandbox." 2024.

Cross-references:

  • Code execution sandbox → Code Execution and Sandboxing
  • Reliability → Reliability and Robustness
  • Alignment safety → Alignment and Safety Strategies
