
Security and Sandboxing

Overview

Security is the most critical consideration in AI Agent deployment. Agents can execute code, call APIs, and operate file systems, meaning security vulnerabilities could lead to severe consequences. This section systematically discusses agent sandboxing technology, permission models, prompt injection defense, and data protection strategies.

Sandboxing Strategies

Sandbox Technology Comparison

| Technology | Isolation Level | Performance Overhead | Security | Applicable Scenario |
| --- | --- | --- | --- | --- |
| Docker | Container-level | Low | Medium | General scenarios |
| gVisor | Kernel-level | Medium | High | High-security needs |
| E2B | Micro-VM | Medium | High | Code execution |
| Firecracker | Micro-VM | Low | High | AWS Lambda |
| WebAssembly | In-process | Very low | Medium | Lightweight isolation |

Docker Sandbox

# Secure Agent code execution sandbox
FROM python:3.11-slim

# Security hardening: apply base-image updates, install nothing extra,
# and remove the apt cache to keep the image minimal
RUN apt-get update && apt-get upgrade -y \
    && rm -rf /var/lib/apt/lists/*

# Non-root user
RUN useradd -m -s /bin/bash sandbox
USER sandbox
WORKDIR /home/sandbox

# Network access restricted (via Docker network policies)
# File system access restricted (read-only mounts)
# Resource usage limited (Docker resource limits)

Launch parameters:

# --network=none          : disable network access
# --read-only             : read-only root file system
# --tmpfs /tmp:size=100m  : small temporary write space
# --pids-limit=50         : limit process count
docker run \
  --memory=512m \
  --cpus=1 \
  --network=none \
  --read-only \
  --tmpfs /tmp:size=100m \
  --security-opt no-new-privileges \
  --pids-limit=50 \
  agent-sandbox
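In an orchestration service, the same launch parameters can be assembled programmatically. A minimal sketch, assuming a local image named `agent-sandbox`; the `build_sandbox_cmd` helper and its default limits are illustrative, not part of any SDK:

```python
import subprocess

def build_sandbox_cmd(image: str, memory: str = "512m", cpus: int = 1,
                      pids_limit: int = 50) -> list[str]:
    """Assemble a locked-down `docker run` command for agent code execution."""
    return [
        "docker", "run", "--rm",
        f"--memory={memory}",
        f"--cpus={cpus}",
        "--network=none",             # no outbound network
        "--read-only",                # read-only root file system
        "--tmpfs", "/tmp:size=100m",  # scratch space only
        "--security-opt", "no-new-privileges",
        f"--pids-limit={pids_limit}",
        image,
    ]

cmd = build_sandbox_cmd("agent-sandbox")
# subprocess.run(cmd, timeout=60)  # add a wall-clock limit on top of the cgroup limits
```

Keeping the flags in one helper makes it harder for a code path to launch a container with some restriction accidentally omitted.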

gVisor

gVisor provides an application-level kernel that intercepts all system calls:

graph TD
    A[Agent Code] --> B[gVisor Sentry]
    B --> C{System Call}
    C -->|Allowed| D[gVisor Kernel Implementation]
    C -->|Denied| E[Return Error]
    D --> F[Host Kernel]

    style B fill:#fff3e0
    style E fill:#ffcdd2

Advantages:

  • Stronger isolation than Docker (implements a Linux kernel subset)
  • Compatible with existing container images
  • Used by default in Google Cloud Run

E2B (Code Interpreter SDK)

A code execution sandbox designed specifically for AI Agents:

from e2b_code_interpreter import Sandbox

sandbox = Sandbox()

# Securely execute agent-generated code
execution = sandbox.run_code("""
import pandas as pd
df = pd.read_csv('data.csv')
print(df.describe())
""")

print(execution.logs)
sandbox.close()

Features:

  • Creates a fresh micro-VM for each execution
  • Supports file upload and download
  • Built-in timeout and resource limits
  • Supports Python, JavaScript, R, and other languages

Permission Model

Principle of Least Privilege

Agents should only possess the minimum permissions needed to complete a task:

\[ \text{Permissions}(agent) = \min\{P : P \text{ sufficient for task}\} \]

Capability-based Permissions

# Capability-based permission model
import fnmatch

class AgentCapabilities:
    def __init__(self):
        self.capabilities = {
            "file_read": {
                "allowed_paths": ["/workspace/*"],
                "denied_paths": ["/etc/*", "/root/*"],
            },
            "file_write": {
                "allowed_paths": ["/workspace/output/*"],
                "max_file_size": "10MB",
            },
            "network": {
                "allowed_domains": ["api.openai.com", "pypi.org"],
                "denied_ports": [22, 23, 3389],
            },
            "code_execution": {
                "allowed_languages": ["python"],
                "timeout": 30,  # seconds
                "max_memory": "512MB",
            },
        }

    def check_permission(self, action, resource):
        cap = self.capabilities.get(action)
        if cap is None:
            return False  # Default deny
        return self._match_resource(cap, resource)

    def _match_resource(self, cap, resource):
        # Deny rules take precedence over allow rules
        if any(fnmatch.fnmatch(resource, p) for p in cap.get("denied_paths", [])):
            return False
        allowed = cap.get("allowed_paths")
        if allowed is not None:
            return any(fnmatch.fnmatch(resource, p) for p in allowed)
        return True  # capability exists and carries no path constraint
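The path-matching step can also be illustrated in isolation. A minimal sketch using `fnmatch` glob patterns, with deny rules taking precedence; the rule set itself is hypothetical:

```python
import fnmatch

def path_allowed(path: str, allowed: list[str], denied: list[str]) -> bool:
    """Deny patterns win; otherwise the path must match an allow pattern."""
    if any(fnmatch.fnmatch(path, p) for p in denied):
        return False
    return any(fnmatch.fnmatch(path, p) for p in allowed)

allowed = ["/workspace/*"]
denied = ["/etc/*", "/root/*"]
print(path_allowed("/workspace/data.csv", allowed, denied))  # True
print(path_allowed("/etc/passwd", allowed, denied))          # False
```

Note that `fnmatch`'s `*` also matches path separators, so `/workspace/*` covers nested paths; a stricter matcher would handle `/` explicitly.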

Tiered Permissions

| Permission Level | Allowed Operations | Approval Required |
| --- | --- | --- |
| Level 0 | Read-only (search, read) | None |
| Level 1 | Create (new files, new messages) | None |
| Level 2 | Modify (edit files, update data) | First-time confirmation |
| Level 3 | Delete/Send (delete files, send emails) | Every-time confirmation |
| Level 4 | System operations (install software, modify config) | Human approval |
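Such a tier table maps naturally to a gate in the tool-call loop. A minimal sketch; the action-to-level assignments and the `requires_approval` helper are illustrative:

```python
# Hypothetical mapping of tool actions to the tiers above
ACTION_LEVELS = {
    "search": 0, "read_file": 0,
    "create_file": 1,
    "edit_file": 2,
    "delete_file": 3, "send_email": 3,
    "install_package": 4,
}

def requires_approval(action: str, first_time: bool) -> str:
    """Return the approval policy for a proposed action: "none", "confirm", or "human"."""
    level = ACTION_LEVELS.get(action, 4)  # unknown actions: treat as most sensitive
    if level <= 1:
        return "none"
    if level == 2:
        return "confirm" if first_time else "none"
    if level == 3:
        return "confirm"
    return "human"

print(requires_approval("read_file", first_time=True))    # none
print(requires_approval("send_email", first_time=False))  # confirm
```

Defaulting unknown actions to the highest tier keeps the gate consistent with the default-deny principle discussed under Security Best Practices.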

Prompt Injection Defense

Attack Types

graph TD
    A[Prompt Injection Attacks] --> B[Direct Injection]
    A --> C[Indirect Injection]

    B --> B1[User directly inputs malicious instructions]

    C --> C1[Web page content contains hidden instructions]
    C --> C2[Documents embed malicious text]
    C --> C3[Tool return values contain injection]

    style A fill:#ffcdd2
    style C fill:#fff3e0

Input Validation

import re

class InputValidator:
    # Suspicious pattern detection
    SUSPICIOUS_PATTERNS = [
        r"ignore\s+(previous|above|all)\s+instructions",
        r"you\s+are\s+now\s+",
        r"new\s+instructions?\s*:",
        r"system\s*prompt\s*:",
        r"</?(system|user|assistant)>",
    ]

    def validate(self, user_input: str) -> tuple[bool, str]:
        for pattern in self.SUSPICIOUS_PATTERNS:
            if re.search(pattern, user_input, re.IGNORECASE):
                return False, f"Suspicious pattern detected: {pattern}"
        return True, "OK"

Output Filtering

Security checks before agent output reaches the user:

  • PII detection: Detect and redact personally identifiable information
  • Content policy: Filter harmful or non-compliant content
  • Format validation: Ensure output format meets expectations
  • Link checking: Verify the safety of generated URLs
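The link-checking step, for example, can be reduced to an allowlist check on each URL's host. A minimal sketch; the allowlist and the `safe_links_only` helper are hypothetical:

```python
import re
from urllib.parse import urlparse

ALLOWED_HOSTS = {"docs.python.org", "github.com"}  # hypothetical allowlist

def safe_links_only(text: str) -> bool:
    """Return False if the output contains a URL whose host is not allowlisted."""
    for url in re.findall(r"https?://\S+", text):
        host = urlparse(url).hostname or ""
        if host not in ALLOWED_HOSTS:
            return False
    return True

print(safe_links_only("See https://docs.python.org/3/ for details"))  # True
print(safe_links_only("Download from https://evil.example/payload"))  # False
```

Parsing the host with `urlparse` rather than substring matching avoids trivial bypasses such as `https://evil.example/docs.python.org`.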

Defense Strategy Summary

| Strategy | Defense Layer | Implementation Difficulty | Effectiveness |
| --- | --- | --- | --- |
| Input validation | Input layer | Low | Medium |
| System prompt hardening | Prompt layer | Low | Medium |
| Output filtering | Output layer | Medium | High |
| Tool permission control | Execution layer | Medium | High |
| Sandbox isolation | Infrastructure layer | High | High |
| Multi-model review | Review layer | High | Very high |

PII Protection

Personal Information Types

| Type | Example | Risk Level |
| --- | --- | --- |
| Name | John Smith | Medium |
| National ID | 110101199001011234 | High |
| Phone number | 13800138000 | High |
| Email address | user@example.com | Medium |
| Bank card number | 6222021234567890 | Very high |
| Address | 123 Main Street... | Medium |

Redaction Strategy

\[ \text{Redacted}(text) = \text{replace}(text, \text{PII}_i, \text{mask}_i) \quad \forall i \]

# PII redaction example
import re

def redact_pii(text):
    # Phone numbers (Chinese mobile format)
    text = re.sub(r'1[3-9]\d{9}', '[PHONE]', text)
    # National ID numbers (18-digit Chinese format)
    text = re.sub(r'\d{17}[\dXx]', '[ID_NUMBER]', text)
    # Email addresses
    text = re.sub(r'\S+@\S+\.\S+', '[EMAIL]', text)
    # Bank card numbers
    text = re.sub(r'\d{16,19}', '[CARD_NUMBER]', text)
    return text

Content Filtering

Multi-layer Filtering Architecture

User Input → Input Filter → Agent Processing → Output Filter → User
                ↓                                   ↓
            Reject/Modify                       Redact/Filter
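The two filter stages can be chained as a simple pipeline around the agent call. A minimal sketch in which the filter functions are placeholders for the real checks described above:

```python
def input_filter(text: str) -> str:
    # Placeholder: reject an obvious injection attempt before the agent sees it
    if "ignore previous instructions" in text.lower():
        raise ValueError("input rejected")
    return text

def output_filter(text: str) -> str:
    # Placeholder: redact a marker standing in for detected PII or policy violations
    return text.replace("SECRET", "[REDACTED]")

def handle(user_input: str, agent) -> str:
    """User Input -> Input Filter -> Agent Processing -> Output Filter -> User."""
    return output_filter(agent(input_filter(user_input)))

# A stub agent stands in for the real model call
reply = handle("summarize the report", lambda q: f"Summary of SECRET report: {q}")
print(reply)  # Summary of [REDACTED] report: summarize the report
```

Because both filters sit outside the agent, they apply uniformly regardless of which model or tool produced the text in between.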

Filtering Rules

  • Harmful content: Violence, hate speech, pornography
  • Illegal content: Criminal solicitation, fraud
  • Enterprise policies: Competitor information, confidential data
  • Compliance requirements: Industry-specific content restrictions

Security Best Practices

  1. Defense in depth: Multiple security layers, not relying on a single line of defense
  2. Default deny: Operations not explicitly allowed are denied by default
  3. Audit logging: Record all agent operations for post-hoc review
  4. Regular penetration testing: Conduct periodic security assessments
  5. Emergency stop: Implement kill switch mechanisms
  6. Security updates: Promptly update dependencies and security patches
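The emergency-stop practice (item 5), for instance, can be as simple as a shared flag checked before every tool call. A minimal sketch using `threading.Event`; the `KillSwitch` class is illustrative:

```python
import threading

class KillSwitch:
    """Shared stop flag consulted before every agent action."""

    def __init__(self):
        self._stop = threading.Event()

    def trip(self):
        # Called by an operator or an automated monitoring rule
        self._stop.set()

    def check(self):
        # Called by the agent loop before each tool call
        if self._stop.is_set():
            raise RuntimeError("agent halted by kill switch")

switch = KillSwitch()
switch.check()   # passes while the switch is not tripped
switch.trip()
# switch.check() would now raise RuntimeError, halting the agent loop
```

Using `threading.Event` lets a monitoring thread trip the switch while the agent loop runs, without additional locking.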

References

  1. Greshake, K., et al. "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." AISec 2023.
  2. E2B. "Code Interpreter SDK." 2024.
  3. Google. "gVisor: Container Runtime Sandbox." 2024.

Cross-references:

  • Code execution sandbox → Code Execution and Sandboxing
  • Reliability → Reliability and Robustness
  • Alignment safety → Alignment and Safety Strategies
