Setting Up a Docker Swarm AI Agent Cluster for Security Research
Build an AI agent lab on Raspberry Pi with Docker Swarm: encrypted networks, HMAC auth, and security monitoring. Production-grade patterns on $600 of hardware.
Securing AI systems requires understanding how they work in production environments. Reading about distributed architectures isn’t enough—you need hands-on experience with orchestration, networking, secrets management, and monitoring.
This post documents building a distributed AI agent security lab using Docker Swarm on Raspberry Pi hardware. It’s deliberately simpler than Kubernetes while teaching the same security fundamentals.
Why Build a Lab?
Theory vs Practice:
- Reading about container security ≠ Understanding container escape vulnerabilities
- Studying networking concepts ≠ Configuring encrypted overlay networks
- Reviewing agent architectures ≠ Debugging inter-agent authentication failures
Hands-on labs teach:
- How attacks actually work (not just theory)
- Where security controls fail (firsthand experience)
- How to detect compromises (real monitoring data)
- How to respond to incidents (practice under pressure)
Hardware: Raspberry Pi 5 Cluster
Why Raspberry Pi instead of cloud?
- Cost: ~$200 per node, one-time, vs $50+/month for cloud, forever
- Control: your own hardware, no vendor dependencies
- Learning: physical networking teaches concepts the cloud abstracts away
- Permanence: the lab stays up, with no billing concerns
My configuration:
- 3× Raspberry Pi 5 (8GB RAM each)
- 1× Manager node, 2× Worker nodes
- Gigabit Ethernet networking
- External NVMe storage for performance
Total cost: ~$600 (including cases, power, networking)
Architecture Overview
                Internet
                    │
                    │ HTTPS
                    ▼
        ┌───────────────────────┐
        │  Manager Node (node-1)│
        ├───────────────────────┤
        │ • Traefik   :80/:443  │
        │ • Registry  :5000     │
        │ • Redis     :6379     │
        └───────────────────────┘
                    │
        Encrypted Overlay Network
                    │
        ┌───────────┴───────────┐
        │                       │
        ▼                       ▼
┌────────────────┐      ┌────────────────┐
│ Worker (node-2)│      │ Worker (node-3)│
├────────────────┤      ├────────────────┤
│ • AI Agent A   │      │ • AI Agent B   │
│   (Claude)     │      │   (GPT-4)      │
│                │      │ • AI Agent C   │
│                │      │   (GLM)        │
└────────────────┘      └────────────────┘
Key security features:
- Manager node handles orchestration only (no agent workloads)
- Worker nodes run agents in isolation
- Encrypted overlay network (IPsec, AES-GCM)
- Inter-agent HMAC authentication
- Traefik reverse proxy with TLS termination
Step 1: Initialize Docker Swarm
On manager node:
# Initialize swarm
docker swarm init --advertise-addr 192.168.1.10
# Output includes worker join token
# docker swarm join --token SWMTKN-1-... 192.168.1.10:2377
On worker nodes:
# Join swarm as worker
docker swarm join --token SWMTKN-1-xxxxxxxxxxxxx 192.168.1.10:2377
Verify cluster:
docker node ls
# Output:
# ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
# abc123 * node-1 Ready Active Leader
# def456 node-2 Ready Active
# ghi789 node-3 Ready Active
Security considerations:
- Join tokens are sensitive (rotate regularly)
- Manager nodes should be separate from worker nodes
- Limit manager node exposure (no public services)
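Token rotation is a single command on the manager. Rotating invalidates previously issued tokens without affecting nodes that have already joined:
# Rotate the worker join token (do the same for the manager token when needed)
docker swarm join-token --rotate worker
# Print the current worker join command without rotating
docker swarm join-token worker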
Step 2: Create Encrypted Overlay Network
Why overlay networks:
- Spans multiple hosts transparently
- Built-in encryption (optional but recommended)
- Automatic service discovery
- Network isolation
Create encrypted network:
docker network create \
--driver overlay \
--encrypted \
--attachable \
agent-network
Verify:
docker network ls
docker network inspect agent-network
What --encrypted does:
- Encrypts all traffic between containers on different nodes
- IPsec (ESP) encapsulation using AES-GCM
- Automatic key rotation handled by the manager nodes
- Performance impact: ~10% (acceptable for security benefit)
Security benefit: Even if physical network is compromised, inter-container traffic remains encrypted.
Step 3: Setup Secrets Management
Never hardcode credentials. Docker Swarm has built-in secrets management.
Create secrets:
# API keys for AI providers
echo "sk-..." | docker secret create anthropic_api_key -
echo "sk-..." | docker secret create openai_api_key -
echo "..." | docker secret create glm_api_key -
# HMAC signing key for inter-agent auth
openssl rand -base64 32 | docker secret create agent_hmac_key -
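A note on the echo pattern above: echo appends a trailing newline (harmless here because the reader strips it, but easy to forget) and the literal key ends up in shell history. A variant that avoids both, assuming the key is already exported as an environment variable (ANTHROPIC_API_KEY is illustrative):
# printf '%s' writes the key without a trailing newline;
# reading from an env var keeps the literal key out of shell history
printf '%s' "$ANTHROPIC_API_KEY" | docker secret create anthropic_api_key -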
List secrets:
docker secret ls
# Output:
# ID NAME CREATED UPDATED
# abc123 anthropic_api_key 2 minutes ago 2 minutes ago
# def456 openai_api_key 1 minute ago 1 minute ago
Security properties:
- Secrets encrypted at rest
- Only accessible to authorized services
- Never stored in images or logs
- Transmitted over encrypted channels only
Access in containers:
# Secrets mounted as files in /run/secrets/
with open('/run/secrets/anthropic_api_key', 'r') as f:
api_key = f.read().strip()
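A slightly fuller sketch of the same idea: a small helper that prefers the Swarm secret file and falls back to an environment variable for local development. The load_secret name and the fallback convention are my own, not part of the stack above.
import os

def load_secret(name, default_env=None):
    """Read a Docker Swarm secret from /run/secrets/, optionally
    falling back to an environment variable for local development."""
    path = f"/run/secrets/{name}"
    try:
        with open(path, "r") as f:
            return f.read().strip()
    except FileNotFoundError:
        if default_env and default_env in os.environ:
            return os.environ[default_env]
        raise RuntimeError(f"Secret {name} not found at {path}")

# Usage
api_key = load_secret("anthropic_api_key", default_env="ANTHROPIC_API_KEY")
hmac_key = load_secret("agent_hmac_key")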
Step 4: Deploy AI Agent Stack
docker-compose.yml (stack file):
version: '3.8'
services:
ai-agent-claude:
image: registry.local:5000/ai-agent:latest
deploy:
replicas: 1
placement:
constraints:
- node.role == worker
- node.labels.agent_type == premium
resources:
limits:
cpus: '2.0'
memory: 4G
reservations:
cpus: '1.0'
memory: 2G
update_config:
parallelism: 1
delay: 10s
failure_action: rollback
restart_policy:
condition: on-failure
delay: 5s
max_attempts: 3
environment:
- AI_PROVIDER=anthropic
- AGENT_PROFILE=claude
- LOG_LEVEL=info
secrets:
- anthropic_api_key
- agent_hmac_key
networks:
- agent-network
healthcheck:
test: ['CMD', 'curl', '-f', 'http://localhost:8000/health']
interval: 30s
timeout: 3s
retries: 3
start_period: 40s
ai-agent-gpt:
image: registry.local:5000/ai-agent:latest
deploy:
replicas: 1
placement:
constraints:
- node.role == worker
resources:
limits:
cpus: '2.0'
memory: 4G
environment:
- AI_PROVIDER=openai
- AGENT_PROFILE=gpt4
secrets:
- openai_api_key
- agent_hmac_key
networks:
- agent-network
ai-agent-glm:
image: registry.local:5000/ai-agent:latest
deploy:
replicas: 1
placement:
constraints:
- node.role == worker
- node.labels.agent_type == budget
resources:
limits:
cpus: '1.0'
memory: 2G
environment:
- AI_PROVIDER=z.ai
- AGENT_PROFILE=glm
- ENABLE_BIAS_MONITORING=true
secrets:
- glm_api_key
- agent_hmac_key
networks:
- agent-network
traefik:
image: traefik:v2.10
command:
- '--api.insecure=false'
- '--providers.docker.swarmMode=true'
- '--providers.docker.exposedbydefault=false'
- '--entrypoints.web.address=:80'
- '--entrypoints.websecure.address=:443'
ports:
- '80:80'
- '443:443'
deploy:
placement:
constraints:
- node.role == manager
volumes:
- /var/run/docker.sock:/var/run/docker.sock:ro
- ./traefik/certs:/certs:ro
networks:
- agent-network
secrets:
anthropic_api_key:
external: true
openai_api_key:
external: true
glm_api_key:
external: true
agent_hmac_key:
external: true
networks:
agent-network:
external: true
Deploy stack:
docker stack deploy -c docker-compose.yml ai-agents
Verify deployment:
docker stack ps ai-agents
docker service ls
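If a service doesn't converge, the task list and service logs usually explain why. Service names follow Swarm's stack_service convention, so the Claude agent from this stack is ai-agents_ai-agent-claude:
# Show why individual tasks failed or were rescheduled
docker service ps --no-trunc ai-agents_ai-agent-claude
# Tail logs for a single service across all nodes
docker service logs -f ai-agents_ai-agent-claude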
Step 5: Implement Inter-Agent Authentication
Problem: Agents need to communicate, but must authenticate each other to prevent spoofing.
Solution: HMAC signatures on all inter-agent messages.
Implementation:
import hmac
import hashlib
import json
import time

class AgentAuthenticator:
    def __init__(self, agent_id, shared_secret):
        self.agent_id = agent_id
        self.secret = shared_secret.encode()

    def sign_message(self, message_dict):
        """Add an HMAC signature to an outgoing message."""
        message_dict['timestamp'] = time.time()
        message_dict['sender'] = self.agent_id

        # Create a canonical representation so sender and receiver hash identical bytes
        canonical = json.dumps(message_dict, sort_keys=True)

        # Generate HMAC over the canonical message
        signature = hmac.new(
            self.secret,
            canonical.encode(),
            hashlib.sha256
        ).hexdigest()

        message_dict['signature'] = signature
        return message_dict

    def verify_message(self, message_dict):
        """Verify the HMAC signature on an incoming message."""
        # Extract the signature before recomputing the canonical form
        received_sig = message_dict.pop('signature', None)
        if not received_sig:
            raise ValueError("No signature present")

        # Check timestamp (prevents replay attacks)
        timestamp = message_dict.get('timestamp', 0)
        if time.time() - timestamp > 60:  # 60-second window
            raise ValueError("Message expired")

        # Recreate the canonical representation
        canonical = json.dumps(message_dict, sort_keys=True)

        # Compute the expected signature
        expected_sig = hmac.new(
            self.secret,
            canonical.encode(),
            hashlib.sha256
        ).hexdigest()

        # Constant-time comparison
        if not hmac.compare_digest(received_sig, expected_sig):
            raise ValueError("Invalid signature")

        return message_dict
Usage:
# Agent A sends a command to Agent B
auth = AgentAuthenticator("agent-a", shared_secret)

message = {
    "command": "analyze_code",
    "code": code_sample,
    "priority": "high"
}

signed_message = auth.sign_message(message)
send_to_agent_b(signed_message)

# Agent B verifies before processing
try:
    verified_message = auth.verify_message(received_message)
    process_command(verified_message)
except ValueError as e:
    log_security_event(f"Authentication failed: {e}")
    return error_response()
Security benefits:
- Prevents agent impersonation
- Detects message tampering
- Timestamp prevents replay attacks
- Shared secret from Docker Secrets (never in code)
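A quick self-test of those properties, using the AgentAuthenticator class above: a tampered field and a stale timestamp should both be rejected (the secret here is a throwaway value for the test, not a real key).
import time

auth = AgentAuthenticator("agent-a", "test-secret-not-for-production")

# Tampering: modify a field after signing
msg = auth.sign_message({"command": "analyze_code", "priority": "low"})
msg["priority"] = "high"
try:
    auth.verify_message(dict(msg))
except ValueError as e:
    print(f"Tampering rejected: {e}")   # Invalid signature

# Replay: re-send a message whose timestamp is older than the 60-second window
old = auth.sign_message({"command": "analyze_code"})
old["timestamp"] = time.time() - 120
try:
    auth.verify_message(dict(old))
except ValueError as e:
    print(f"Replay rejected: {e}")      # Message expired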
Step 6: Comprehensive Logging
Centralized logging architecture:
logging-stack:
image: grafana/loki:latest
deploy:
placement:
constraints:
- node.role == manager
networks:
- agent-network
promtail:
image: grafana/promtail:latest
deploy:
mode: global # Runs on every node
volumes:
- /var/lib/docker/containers:/var/lib/docker/containers:ro
networks:
- agent-network
What to log:
1. All AI interactions:
log_entry = {
    "timestamp": datetime.now().isoformat(),
    "agent_id": agent.id,
    "provider": "anthropic",
    "model": "claude-3.5-sonnet",
    # Don't log full prompts (they may contain PII); a stable hash is enough to correlate entries
    "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),
    "prompt_length": len(prompt),
    "response_length": len(response),
    "tokens_used": usage.total_tokens,
    "latency_ms": latency,
    "cost_usd": calculate_cost(usage),
}
logger.info("ai_interaction", extra=log_entry)
2. Security events (see the detection sketch after this list):
security_event = {
"event_type": "potential_prompt_injection",
"severity": "medium",
"agent_id": agent.id,
"indicators": ["ignore previous", "system override"],
"action_taken": "blocked",
}
security_logger.warning("security_event", extra=security_event)
3. Agent authentication:
auth_event = {
"event": "agent_auth_failure",
"sender_claimed": message.get("sender"),
"signature_valid": False,
"source_ip": request.remote_addr,
}
security_logger.error("auth_failure", extra=auth_event)
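The potential_prompt_injection event in item 2 needs something to emit it. A minimal sketch, assuming a simple substring-based indicator list; the screen_prompt helper is illustrative, and real detection needs far more than substring matching.
import logging

security_logger = logging.getLogger("security")  # same logger as the events above

# Illustrative indicator list only
INJECTION_INDICATORS = ["ignore previous", "system override"]

def screen_prompt(prompt, agent_id):
    """Return False and log a security event if the prompt matches known injection phrases."""
    found = [phrase for phrase in INJECTION_INDICATORS if phrase in prompt.lower()]
    if found:
        security_logger.warning("security_event", extra={
            "event_type": "potential_prompt_injection",
            "severity": "medium",
            "agent_id": agent_id,
            "indicators": found,
            "action_taken": "blocked",
        })
        return False
    return True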
Query logs and metrics for analysis:
# Find all failed authentication attempts (LogQL)
{job="ai-agents"} |= "auth_failure"
# Track cost per agent (PromQL, using the counter from Step 7)
sum by (agent_id) (increase(ai_cost_usd_total[1h]))
# Detect anomalous response patterns (LogQL metric query over the structured logs)
avg_over_time({job="ai-agents"} | json | unwrap response_length [1h]) by (agent_id)
Step 7: Monitoring and Alerting
Prometheus metrics:
from prometheus_client import Counter, Histogram, Gauge
# Request metrics
ai_requests_total = Counter(
'ai_requests_total',
'Total AI requests',
['agent_id', 'provider', 'status']
)
ai_request_duration = Histogram(
'ai_request_duration_seconds',
'AI request duration',
['agent_id', 'provider']
)
# Cost tracking
ai_cost_total = Counter(
'ai_cost_usd_total',
'Total AI cost in USD',
['agent_id', 'provider']
)
# Security metrics
auth_failures_total = Counter(
'agent_auth_failures_total',
'Total authentication failures',
['agent_id']
)
# Record metrics (agent.id as in the logging examples above)
ai_requests_total.labels(agent_id=agent.id, provider="anthropic", status="success").inc()
ai_cost_total.labels(agent_id=agent.id, provider="anthropic").inc(cost)
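For Prometheus to scrape these counters, each agent needs to expose a metrics endpoint; prometheus_client ships a minimal HTTP server for that (port 8001 is an arbitrary choice here):
from prometheus_client import start_http_server

# Serve /metrics for Prometheus to scrape; call once at agent startup
start_http_server(8001)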
Alert rules:
# Alert on high authentication failure rate
- alert: HighAuthFailureRate
  expr: rate(agent_auth_failures_total[5m]) > 0.1
  for: 5m
  annotations:
    summary: 'High authentication failure rate detected'

# Alert on cost anomaly (more than $10 spent in the last hour)
- alert: UnexpectedCostSpike
  expr: increase(ai_cost_usd_total[1h]) > 10
  for: 10m
  annotations:
    summary: 'AI cost spike detected'

# Alert on service down
- alert: AgentDown
  expr: up{job="ai-agents"} == 0
  for: 2m
  annotations:
    summary: 'AI agent is down'
Security Features Implemented
1. Role Segregation
Manager node:
- ✅ Orchestration and coordination only
- ✅ No agent workloads (prevents compromise affecting orchestration)
- ✅ Limited external exposure
Worker nodes:
- ✅ Run agent containers
- ✅ Isolated from management plane
- ✅ No direct inbound exposure (external traffic reaches agents only through Traefik)
2. Resource Limits
Prevent resource exhaustion:
resources:
limits:
cpus: '2.0' # Maximum CPU
memory: 4G # Maximum memory
reservations:
cpus: '1.0' # Guaranteed minimum
memory: 2G
Benefit: Compromised container cannot DoS entire cluster.
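To confirm the limits are actually enforced, docker stats on a worker node shows current usage next to the configured cap:
# On a worker node: per-container CPU and MEM USAGE / LIMIT at a glance
docker stats --no-stream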
3. Health Checks and Auto-Recovery
healthcheck:
test: ['CMD', 'curl', '-f', 'http://localhost:8000/health']
interval: 30s
timeout: 3s
retries: 3
What happens:
- Swarm monitors health every 30 seconds
- After 3 consecutive failures, container marked unhealthy
- Swarm automatically restarts unhealthy containers
- Zero-downtime deployments through rolling updates
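Rolling updates follow the update_config defined in the stack file (one replica at a time, 10s delay, rollback on failure). A sketch of pushing a new image version; the :v2 tag is hypothetical, and the service name follows Swarm's stack_service convention:
# Roll the Claude agent to a new image, honoring the stack's update_config
docker service update \
  --image registry.local:5000/ai-agent:v2 \
  ai-agents_ai-agent-claude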
4. Placement Constraints
placement:
constraints:
- node.role == worker
- node.labels.agent_type == premium
Control where workloads run:
- Separate premium models (Claude) from budget (GLM)
- Keep sensitive workloads on specific nodes
- Implement hardware-based isolation
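The agent_type labels referenced above aren't created automatically; they are added per node from the manager, for example:
# Label workers so placement constraints can target them
docker node update --label-add agent_type=premium node-2
docker node update --label-add agent_type=budget node-3
# Confirm the labels took effect
docker node inspect node-2 --format '{{ .Spec.Labels }}'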
Lessons Learned
1. Docker Swarm is Simpler Than Kubernetes
For learning security fundamentals, Swarm provides:
- Simpler configuration (YAML is straightforward)
- Faster setup (minutes vs hours)
- Built-in secrets management
- Adequate for most labs and small deployments
Kubernetes advantages (when you need them):
- Larger ecosystem (Helm charts, operators)
- More granular control
- Better for very large clusters
- Industry standard for production
For learning AI security: Swarm is sufficient and faster to iterate.
2. Encrypted Overlay Networks Are Mandatory
Network encryption is not optional for distributed AI systems:
- Prevents eavesdropping on inter-agent communication
- Protects API keys in transit
- Ensures prompt confidentiality
- Minimal performance impact (~10%)
3. HMAC Authentication Prevents Agent Impersonation
Without inter-agent authentication:
- Attacker can send commands pretending to be legitimate agent
- Difficult to audit which agent performed which action
- No way to detect man-in-the-middle attacks
With HMAC signatures:
- Each message cryptographically authenticated
- Tampering detected immediately
- Replay attacks prevented via timestamps
- Simple to implement, strong security guarantees
4. Comprehensive Logging Enables Forensics
When (not if) security incidents occur:
- Logs show exactly what happened when
- Can correlate events across multiple agents
- Cost tracking identifies abuse
- Behavioral baselines detect anomalies
Conclusion: Production-Grade Patterns in a Lab
This architecture demonstrates production-grade security patterns:
- ✅ Zero-trust networking (encrypted, authenticated)
- ✅ Secrets management (never in code or logs)
- ✅ Role segregation (manager vs worker)
- ✅ Resource limits (prevent DoS)
- ✅ Comprehensive logging (audit trail)
- ✅ Health monitoring (auto-recovery)
- ✅ Multi-vendor architecture (no lock-in)
Built on $600 of Raspberry Pi hardware in a homelab environment.
Why this matters: You don’t need enterprise budgets to learn enterprise security patterns. A Raspberry Pi cluster teaches the same fundamentals as a Kubernetes cluster—but faster, cheaper, and with more control.
The best way to learn AI security is building systems and intentionally breaking them. This architecture provides the foundation for security research: a distributed, production-like environment where you can safely experiment with attacks and defenses.
Start building. Start breaking. That’s how you learn.