Setting Up a Docker Swarm AI Agent Cluster for Security Research
Build AI agent lab on Raspberry Pi with Docker Swarm: encrypted networks, HMAC auth, and security monitoring. Production-grade patterns on $600 hardware.
If you want to secure AI systems, you need to understand how they behave in production. Reading about distributed architectures only gets you so far — at some point you have to wire up the orchestration, configure the networking, manage the secrets, and watch what happens when things go wrong.
This post walks through building a distributed AI agent security lab with Docker Swarm on Raspberry Pi hardware. Swarm is deliberately simpler than Kubernetes, but it teaches the same core security fundamentals.
Why Build a Lab?
Theory vs Practice:
- Reading about container security ≠ Understanding container escape vulnerabilities
- Studying networking concepts ≠ Configuring encrypted overlay networks
- Reviewing agent architectures ≠ Debugging inter-agent authentication failures
Hands-on labs teach:
- How attacks actually work (not just theory)
- Where security controls fail (firsthand experience)
- How to detect compromises (real monitoring data)
- How to respond to incidents (practice under pressure)
Hardware: Raspberry Pi 5 Cluster
Why Raspberry Pi instead of cloud?
- Cost: $200 one-time vs $50+/month cloud forever
- Control: own hardware, no vendor dependencies
- Learning: physical networking teaches concepts the cloud abstracts away
- Permanence: the lab stays up, no billing concerns
My configuration:
- 3x Raspberry Pi 5 (8GB RAM each)
- 1x Manager node, 2x Worker nodes
- Gigabit Ethernet networking
- External NVMe storage for performance
Total cost: ~$600 (including cases, power, networking)
Architecture Overview
                 Internet
                    │
                    │ HTTPS
                    ▼
        ┌───────────────────────┐
        │ Manager Node (node-1) │
        ├───────────────────────┤
        │ • Traefik  :80/:443   │
        │ • Registry :5000      │
        │ • Redis    :6379      │
        └───────────────────────┘
                    │
        Encrypted Overlay Network
                    │
        ┌───────────┴───────────┐
        │                       │
        ▼                       ▼
┌───────────────┐       ┌───────────────┐
│Worker (node-2)│       │Worker (node-3)│
├───────────────┤       ├───────────────┤
│ • AI Agent A  │       │ • AI Agent B  │
│   (Claude)    │       │   (GPT-4)     │
│               │       │               │
│               │       │ • AI Agent C  │
│               │       │   (GLM)       │
└───────────────┘       └───────────────┘
Key security features:
- Manager node handles orchestration only (no agent workloads)
- Worker nodes run agents in isolation
- Encrypted overlay network (AES-256)
- Inter-agent HMAC authentication
- Traefik reverse proxy with TLS termination
Step 1: Initialize Docker Swarm
On manager node:
# Initialize swarm
docker swarm init --advertise-addr 192.168.1.10
# Output includes worker join token
# docker swarm join --token SWMTKN-1-... 192.168.1.10:2377
On worker nodes:
# Join swarm as worker
docker swarm join --token SWMTKN-1-xxxxxxxxxxxxx 192.168.1.10:2377
Verify cluster:
docker node ls
# Output:
# ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
# abc123 * node-1 Ready Active Leader
# def456 node-2 Ready Active
# ghi789 node-3 Ready Active
Security considerations:
- Join tokens are sensitive (rotate regularly)
- Manager nodes should be separate from worker nodes
- Limit manager node exposure (no public services)
Step 2: Create Encrypted Overlay Network
Why overlay networks:
- Spans multiple hosts transparently
- Built-in encryption (optional but recommended)
- Automatic service discovery
- Network isolation
Create encrypted network:
docker network create \
--driver overlay \
--encrypted \
--attachable \
agent-network
Verify:
docker network ls
docker network inspect agent-network
What --encrypted does:
- All inter-node traffic encrypted with AES-256
- Automatic key rotation
- IPsec encapsulation
- Performance impact: ~10% (acceptable for security benefit)
Security benefit: Even if the physical network is compromised, inter-container traffic stays encrypted. That single flag buys you a lot of protection for very little overhead.
Step 3: Setup Secrets Management
Never hardcode credentials. Docker Swarm has built-in secrets management that makes this straightforward.
Create secrets:
# API keys for AI providers
echo "sk-..." | docker secret create anthropic_api_key -
echo "sk-..." | docker secret create openai_api_key -
echo "..." | docker secret create glm_api_key -
# HMAC signing key for inter-agent auth
openssl rand -base64 32 | docker secret create agent_hmac_key -
List secrets:
docker secret ls
# Output:
# ID NAME CREATED UPDATED
# abc123 anthropic_api_key 2 minutes ago 2 minutes ago
# def456 openai_api_key 1 minute ago 1 minute ago
Security properties:
- Secrets encrypted at rest
- Only accessible to authorized services
- Never stored in images or logs
- Transmitted over encrypted channels only
Access in containers:
# Secrets mounted as files in /run/secrets/
with open('/run/secrets/anthropic_api_key', 'r') as f:
    api_key = f.read().strip()
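For local development outside Swarm, it helps to fall back to an environment variable when the secret file is absent. This is a sketch; the `read_secret` helper and its env-var fallback convention are assumptions, not part of Docker's API:

```python
import os

def read_secret(name, secrets_dir="/run/secrets"):
    """Read a Docker secret, falling back to an env var for local dev.

    Inside a Swarm service the secret appears as a file under
    /run/secrets/<name>; outside Swarm we look for NAME in the environment.
    """
    path = os.path.join(secrets_dir, name)
    try:
        with open(path) as f:
            return f.read().strip()
    except FileNotFoundError:
        value = os.environ.get(name.upper())
        if value is None:
            raise RuntimeError(f"secret {name!r} not found in {secrets_dir} or env")
        return value
```

Usage: `api_key = read_secret("anthropic_api_key")` works unchanged in the container and on a dev laptop.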
Step 4: Deploy AI Agent Stack
docker-compose.yml (stack file):
version: '3.8'

services:
  ai-agent-claude:
    image: registry.local:5000/ai-agent:latest
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == worker
          - node.labels.agent_type == premium
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '1.0'
          memory: 2G
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    environment:
      - AI_PROVIDER=anthropic
      - AGENT_PROFILE=claude
      - LOG_LEVEL=info
    secrets:
      - anthropic_api_key
      - agent_hmac_key
    networks:
      - agent-network
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:8000/health']
      interval: 30s
      timeout: 3s
      retries: 3
      start_period: 40s

  ai-agent-gpt:
    image: registry.local:5000/ai-agent:latest
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == worker
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
    environment:
      - AI_PROVIDER=openai
      - AGENT_PROFILE=gpt4
    secrets:
      - openai_api_key
      - agent_hmac_key
    networks:
      - agent-network

  ai-agent-glm:
    image: registry.local:5000/ai-agent:latest
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == worker
          - node.labels.agent_type == budget
      resources:
        limits:
          cpus: '1.0'
          memory: 2G
    environment:
      - AI_PROVIDER=z.ai
      - AGENT_PROFILE=glm
      - ENABLE_BIAS_MONITORING=true
    secrets:
      - glm_api_key
      - agent_hmac_key
    networks:
      - agent-network

  traefik:
    image: traefik:v2.10
    command:
      - '--api.insecure=false'
      - '--providers.docker.swarmMode=true'
      - '--providers.docker.exposedbydefault=false'
      - '--entrypoints.web.address=:80'
      - '--entrypoints.websecure.address=:443'
    ports:
      - '80:80'
      - '443:443'
    deploy:
      placement:
        constraints:
          - node.role == manager
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik/certs:/certs:ro
    networks:
      - agent-network

secrets:
  anthropic_api_key:
    external: true
  openai_api_key:
    external: true
  glm_api_key:
    external: true
  agent_hmac_key:
    external: true

networks:
  agent-network:
    external: true
Deploy stack:
docker stack deploy -c docker-compose.yml ai-agents
Verify deployment:
docker stack ps ai-agents
docker service ls
If you’re running distributed agents, ask yourself: what happens if one agent starts sending commands the others blindly trust?
Step 5: Implement Inter-Agent Authentication
Problem: Agents need to talk to each other, but every message must be authenticated to prevent spoofing.
Solution: HMAC signatures on all inter-agent messages.
Implementation:
import hmac
import hashlib
import json
import time

class AgentAuthenticator:
    def __init__(self, shared_secret, agent_id):
        self.secret = shared_secret.encode()
        self.agent_id = agent_id

    def sign_message(self, message_dict):
        """Add HMAC signature to message"""
        message_dict['timestamp'] = time.time()
        message_dict['sender'] = self.agent_id

        # Create canonical representation
        canonical = json.dumps(message_dict, sort_keys=True)

        # Generate HMAC
        signature = hmac.new(
            self.secret,
            canonical.encode(),
            hashlib.sha256
        ).hexdigest()

        message_dict['signature'] = signature
        return message_dict

    def verify_message(self, message_dict):
        """Verify HMAC signature"""
        # Extract signature
        received_sig = message_dict.pop('signature', None)
        if not received_sig:
            raise ValueError("No signature present")

        # Check timestamp (prevent replay attacks)
        timestamp = message_dict.get('timestamp', 0)
        if time.time() - timestamp > 60:  # 60 second window
            raise ValueError("Message expired")

        # Recreate canonical representation
        canonical = json.dumps(message_dict, sort_keys=True)

        # Compute expected signature
        expected_sig = hmac.new(
            self.secret,
            canonical.encode(),
            hashlib.sha256
        ).hexdigest()

        # Constant-time comparison
        if not hmac.compare_digest(received_sig, expected_sig):
            raise ValueError("Invalid signature")

        return message_dict
Usage:
# Agent A sends command to Agent B
auth = AgentAuthenticator(shared_secret, agent_id="agent-a")

message = {
    "command": "analyze_code",
    "code": code_sample,
    "priority": "high"
}

signed_message = auth.sign_message(message)
send_to_agent_b(signed_message)

# Agent B verifies before processing
try:
    verified_message = auth.verify_message(received_message)
    process_command(verified_message)
except ValueError as e:
    log_security_event(f"Authentication failed: {e}")
    return error_response()
Security benefits:
- Prevents agent impersonation
- Detects message tampering
- Timestamp prevents replay attacks
- Shared secret from Docker Secrets (never in code)
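To see these guarantees concretely, here is a condensed, standalone sketch of the same sign/verify scheme — not the full class above, and with a demo secret hardcoded purely for illustration (in the lab it comes from `/run/secrets/agent_hmac_key`):

```python
import hashlib
import hmac
import json
import time

SECRET = b"demo-shared-secret"  # demo only; never hardcode the real key

def sign(msg: dict) -> dict:
    """Timestamp the message and attach an HMAC over its canonical JSON."""
    msg = dict(msg, timestamp=time.time())
    canonical = json.dumps(msg, sort_keys=True).encode()
    msg["signature"] = hmac.new(SECRET, canonical, hashlib.sha256).hexdigest()
    return msg

def verify(msg: dict, max_age=60) -> bool:
    """Reject stale messages, then recompute and compare the HMAC."""
    msg = dict(msg)
    sig = msg.pop("signature", "")
    if time.time() - msg.get("timestamp", 0) > max_age:
        return False  # expired: treated as a replay attempt
    canonical = json.dumps(msg, sort_keys=True).encode()
    expected = hmac.new(SECRET, canonical, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

signed = sign({"command": "analyze_code"})
assert verify(signed)                       # untouched message passes
tampered = dict(signed, command="rm -rf /")
assert not verify(tampered)                 # any modification breaks the HMAC
```

Flipping a single field invalidates the signature, which is exactly the impersonation and tampering protection the list above describes.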
Step 6: Comprehensive Logging
Centralized logging architecture:
services:
  loki:
    image: grafana/loki:latest
    deploy:
      placement:
        constraints:
          - node.role == manager
    networks:
      - agent-network

  promtail:
    image: grafana/promtail:latest
    deploy:
      mode: global  # Runs on every node
    volumes:
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    networks:
      - agent-network
What to log:
1. All AI interactions:
import hashlib
from datetime import datetime

log_entry = {
    "timestamp": datetime.now().isoformat(),
    "agent_id": agent.id,
    "provider": "anthropic",
    "model": "claude-3.5-sonnet",
    # Don't log full prompts (may contain PII); use a stable digest --
    # Python's built-in hash() is salted per process and unusable for forensics
    "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),
    "prompt_length": len(prompt),
    "response_length": len(response),
    "tokens_used": usage.total_tokens,
    "latency_ms": latency,
    "cost_usd": calculate_cost(usage),
}
logger.info("ai_interaction", extra=log_entry)
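`calculate_cost` is a placeholder in the snippet above; one way to sketch it is a per-provider rate table keyed by token direction. The prices below are illustrative assumptions, not current rate cards — check your provider's pricing page:

```python
# Illustrative USD prices per million tokens -- assumptions for this sketch,
# not real rate cards; update from your provider's current pricing.
PRICING = {
    "anthropic": {"input": 3.00, "output": 15.00},
    "openai":    {"input": 2.50, "output": 10.00},
}

def calculate_cost(provider, input_tokens, output_tokens):
    """Estimate the USD cost of one request from raw token counts."""
    rates = PRICING[provider]
    return (input_tokens * rates["input"]
            + output_tokens * rates["output"]) / 1_000_000
```

This variant takes raw token counts rather than the provider's usage object, so in practice you would unpack `usage` first, e.g. `calculate_cost("anthropic", usage.input_tokens, usage.output_tokens)`.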
2. Security events:
security_event = {
    "event_type": "potential_prompt_injection",
    "severity": "medium",
    "agent_id": agent.id,
    "indicators": ["ignore previous", "system override"],
    "action_taken": "blocked",
}
security_logger.warning("security_event", extra=security_event)
3. Agent authentication:
auth_event = {
    "event": "agent_auth_failure",
    "sender_claimed": message.get("sender"),
    "signature_valid": False,
    "source_ip": request.remote_addr,
}
security_logger.error("auth_failure", extra=auth_event)
Query logs and metrics for analysis:
# Find all failed authentication attempts (LogQL)
{job="ai-agents"} |= "auth_failure"

# Track cost per agent (PromQL, from the metrics in Step 7)
sum by (agent_id) (ai_cost_usd_total)

# Detect anomalous response patterns (PromQL)
avg by (agent_id) (response_length)
Step 7: Monitoring and Alerting
Prometheus metrics:
from prometheus_client import Counter, Histogram, Gauge

# Request metrics
ai_requests_total = Counter(
    'ai_requests_total',
    'Total AI requests',
    ['agent_id', 'provider', 'status']
)

ai_request_duration = Histogram(
    'ai_request_duration_seconds',
    'AI request duration',
    ['agent_id', 'provider']
)

# Cost tracking
ai_cost_total = Counter(
    'ai_cost_usd_total',
    'Total AI cost in USD',
    ['agent_id', 'provider']
)

# Security metrics
auth_failures_total = Counter(
    'agent_auth_failures_total',
    'Total authentication failures',
    ['agent_id']
)

# Record metrics
ai_requests_total.labels(agent_id=id, provider="anthropic", status="success").inc()
ai_cost_total.labels(agent_id=id, provider="anthropic").inc(cost)
Alert rules:
# Alert on high authentication failure rate
- alert: HighAuthFailureRate
  expr: rate(agent_auth_failures_total[5m]) > 0.1
  for: 5m
  annotations:
    summary: 'High authentication failure rate detected'

# Alert on cost anomaly
- alert: UnexpectedCostSpike
  expr: rate(ai_cost_usd_total[1h]) > 10
  for: 10m
  annotations:
    summary: 'AI cost spike detected'

# Alert on service down
- alert: AgentDown
  expr: up{job="ai-agents"} == 0
  for: 2m
  annotations:
    summary: 'AI agent is down'
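The alert expressions above compute per-second event rates over a sliding window. The same logic in plain Python, as a sketch mirroring the `HighAuthFailureRate` rule (threshold and window are the rule's values, the helper names are mine):

```python
import time

def failure_rate(event_times, window=300, now=None):
    """Per-second rate of events in the last `window` seconds --
    a simplified analogue of PromQL's rate(...[5m]) over a counter."""
    now = time.time() if now is None else now
    recent = [t for t in event_times if now - t <= window]
    return len(recent) / window

def should_alert(event_times, threshold=0.1, window=300, now=None):
    """True when the windowed failure rate crosses the alert threshold."""
    return failure_rate(event_times, window, now) > threshold

# 40 auth failures inside five minutes is 40/300 per second, above 0.1/s
```

Prometheus's real `rate()` extrapolates from counter samples rather than raw events, but the intuition — events per second over a trailing window — is the same.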
Security Features Implemented
1. Role Segregation
Manager node:
- Orchestration and coordination only
- No agent workloads (prevents compromise from affecting orchestration)
- Limited external exposure
Worker nodes:
- Run agent containers
- Isolated from the management plane
- No direct internet access (traffic routes through Traefik only)
2. Resource Limits
Prevent resource exhaustion:
resources:
  limits:
    cpus: '2.0'   # Maximum CPU
    memory: 4G    # Maximum memory
  reservations:
    cpus: '1.0'   # Guaranteed minimum
    memory: 2G
A compromised container cannot DoS the entire cluster when resource limits are in place.
3. Health Checks and Auto-Recovery
healthcheck:
  test: ['CMD', 'curl', '-f', 'http://localhost:8000/health']
  interval: 30s
  timeout: 3s
  retries: 3
Swarm monitors health every 30 seconds. After three consecutive failures, it marks the container unhealthy and automatically restarts it. Rolling updates give you zero-downtime deployments on top of that.
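The `/health` endpoint the check curls is something the agent process has to serve itself. A minimal stdlib sketch — the port, payload shape, and `serve` helper are assumptions for illustration:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep container logs quiet; real checks go to structured logs

def serve(port=8000):
    """Blocking health server; the container entrypoint would call this."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

A production agent would extend the handler to verify its own dependencies (API reachability, secret availability) before answering 200, so that Swarm restarts genuinely broken containers rather than merely crashed ones.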
4. Placement Constraints
placement:
  constraints:
    - node.role == worker
    - node.labels.agent_type == premium
Control where workloads run:
- Separate premium models (Claude) from budget (GLM)
- Keep sensitive workloads on specific nodes
- Implement hardware-based isolation
Lessons Learned
1. Docker Swarm is Simpler Than Kubernetes
For learning security fundamentals, Swarm gives you:
- Simpler configuration (straightforward YAML)
- Faster setup (minutes vs hours)
- Built-in secrets management
- Adequate tooling for most labs and small deployments
Kubernetes advantages (when you actually need them):
- Larger ecosystem (Helm charts, operators)
- More granular control
- Better suited for very large clusters
- Industry standard for production
For learning AI security: Swarm gets out of your way and lets you focus on the security problems, not the orchestration plumbing.
2. Encrypted Overlay Networks Are Non-Negotiable
Network encryption is not optional for distributed AI systems:
- Prevents eavesdropping on inter-agent communication
- Protects API keys in transit
- Ensures prompt confidentiality
- Minimal performance impact (~10%)
3. HMAC Authentication Prevents Agent Impersonation
Without inter-agent authentication:
- An attacker can send commands while pretending to be a legitimate agent
- Auditing which agent performed which action becomes nearly impossible
- Man-in-the-middle attacks go undetected
With HMAC signatures:
- Every message is cryptographically authenticated
- Tampering is detected immediately
- Replay attacks are blocked via timestamps
- The implementation is simple, but the security guarantees are strong
4. Comprehensive Logging Enables Forensics
When (not if) security incidents occur:
- Logs reveal exactly what happened and when
- You can correlate events across multiple agents
- Cost tracking surfaces abuse
- Behavioral baselines make anomalies visible
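A behavioral baseline can be as simple as flagging values several standard deviations from an agent's recent mean. A toy sketch of the idea (the helper and thresholds are mine, not from a monitoring library):

```python
import statistics

def is_anomalous(history, value, sigma=3.0):
    """Flag `value` when it sits more than `sigma` standard deviations
    from the mean of this agent's recent observations (e.g. response lengths)."""
    if len(history) < 10:
        return False  # not enough data to form a baseline yet
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) > sigma * stdev

# Typical responses run ~500 chars; a sudden 50,000-char response stands out
baseline = [480, 510, 495, 520, 505, 490, 515, 500, 498, 507]
```

In practice you would keep a rolling window per agent and feed anomalies into the security event log from Step 6, but even this crude z-score check catches the gross deviations that matter first.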
Production-Grade Patterns on Lab Hardware
This architecture puts production-grade security patterns into practice:
- Zero-trust networking (encrypted, authenticated)
- Secrets management (never in code or logs)
- Role segregation (manager vs worker)
- Resource limits (prevent DoS)
- Comprehensive logging (full audit trail)
- Health monitoring (auto-recovery)
- Multi-vendor architecture (no lock-in)
All of it runs on ~$600 of Raspberry Pi hardware in a homelab.
You do not need enterprise budgets to learn enterprise security patterns. A Raspberry Pi cluster teaches the same fundamentals as a managed Kubernetes deployment — faster, cheaper, and with more direct control over every layer.
The best way to understand AI security is to build these systems and then deliberately break them. This architecture gives you that foundation: a distributed, production-like environment where you can safely experiment with attacks and defenses.
What Does Your Cluster Look Like?
Are you running AI agents on Docker Swarm, Kubernetes, or something else entirely? Whether it’s a Raspberry Pi homelab or a cloud-based setup, I’d like to hear about your architecture choices — what security patterns you implemented, what surprised you, and what you’d change. Share your setup in the comments. Comparing real-world configurations teaches more than any documentation.