Setting Up a Docker Swarm AI Agent Cluster for Security Research
Build an AI agent lab on Raspberry Pi with Docker Swarm: encrypted networks, HMAC auth, and security monitoring. Production-grade patterns on $600 of hardware.
Securing AI systems requires understanding how they work in production environments. Reading about distributed architectures isn’t enough—you need hands-on experience with orchestration, networking, secrets management, and monitoring.
This post documents building a distributed AI agent security lab using Docker Swarm on Raspberry Pi hardware. It’s deliberately simpler than Kubernetes while teaching the same security fundamentals.
Why Build a Lab?
Theory vs Practice:
- Reading about container security ≠ Understanding container escape vulnerabilities
- Studying networking concepts ≠ Configuring encrypted overlay networks
- Reviewing agent architectures ≠ Debugging inter-agent authentication failures
Hands-on labs teach:
- How attacks actually work (not just theory)
- Where security controls fail (firsthand experience)
- How to detect compromises (real monitoring data)
- How to respond to incidents (practice under pressure)
Hardware: Raspberry Pi 5 Cluster
Why Raspberry Pi instead of cloud?
- Cost: ~$200 per node, one-time, vs $50+/month for cloud, forever
- Control: your own hardware, no vendor dependencies
- Learning: physical networking teaches concepts the cloud abstracts away
- Permanence: the lab stays up, with no billing concerns
My configuration:
- 3× Raspberry Pi 5 (8GB RAM each)
- 1× Manager node, 2× Worker nodes
- Gigabit Ethernet networking
- External NVMe storage for performance
Total cost: ~$600 (including cases, power, networking)
Architecture Overview
                Internet
                    │
                    │ HTTPS
                    ▼
        ┌───────────────────────┐
        │  Manager Node (node-1)│
        ├───────────────────────┤
        │ • Traefik   :80/:443  │
        │ • Registry  :5000     │
        │ • Redis     :6379     │
        └───────────────────────┘
                    │
        Encrypted Overlay Network
                    │
        ┌───────────┴───────────┐
        │                       │
        ▼                       ▼
┌────────────────┐      ┌────────────────┐
│ Worker (node-2)│      │ Worker (node-3)│
├────────────────┤      ├────────────────┤
│ • AI Agent A   │      │ • AI Agent B   │
│   (Claude)     │      │   (GPT-4)      │
│                │      │ • AI Agent C   │
│                │      │   (GLM)        │
└────────────────┘      └────────────────┘
Key security features:
- Manager node handles orchestration only (no agent workloads)
- Worker nodes run agents in isolation
- Encrypted overlay network (IPsec, AES-GCM)
- Inter-agent HMAC authentication
- Traefik reverse proxy with TLS termination
Step 1: Initialize Docker Swarm
On manager node:
# Initialize swarm
docker swarm init --advertise-addr 192.168.1.10
# Output includes worker join token
# docker swarm join --token SWMTKN-1-... 192.168.1.10:2377
On worker nodes:
# Join swarm as worker
docker swarm join --token SWMTKN-1-xxxxxxxxxxxxx 192.168.1.10:2377
Verify cluster:
docker node ls
# Output:
# ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
# abc123 * node-1 Ready Active Leader
# def456 node-2 Ready Active
# ghi789 node-3 Ready Active
Security considerations:
- Join tokens are sensitive (rotate regularly)
- Manager nodes should be separate from worker nodes
- Limit manager node exposure (no public services)
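Token rotation is a single command on the manager. Rotating invalidates previously issued tokens without affecting nodes that have already joined:
# Rotate the worker join token (do the same for the manager token when needed)
docker swarm join-token --rotate worker
# Print the current worker join command without rotating
docker swarm join-token worker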
Step 2: Create Encrypted Overlay Network
Why overlay networks:
- Spans multiple hosts transparently
- Built-in encryption (optional but recommended)
- Automatic service discovery
- Network isolation
Create encrypted network:
docker network create \
--driver overlay \
--encrypted \
--attachable \
agent-network
Verify:
docker network ls
docker network inspect agent-network
What --encrypted does:
- Encrypts all traffic between containers on different nodes
- IPsec (ESP) encapsulation using AES-GCM
- Automatic key rotation handled by the manager nodes
- Performance impact: ~10% (acceptable for security benefit)
Security benefit: Even if physical network is compromised, inter-container traffic remains encrypted.
Step 3: Setup Secrets Management
Never hardcode credentials. Docker Swarm has built-in secrets management.
Create secrets:
# API keys for AI providers
echo "sk-..." | docker secret create anthropic_api_key -
echo "sk-..." | docker secret create openai_api_key -
echo "..." | docker secret create glm_api_key -
# HMAC signing key for inter-agent auth
openssl rand -base64 32 | docker secret create agent_hmac_key -
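A note on the echo pattern above: echo appends a trailing newline (harmless here because the reader strips it, but easy to forget) and the literal key ends up in shell history. A variant that avoids both, assuming the key is already exported as an environment variable (ANTHROPIC_API_KEY is illustrative):
# printf '%s' writes the key without a trailing newline;
# reading from an env var keeps the literal key out of shell history
printf '%s' "$ANTHROPIC_API_KEY" | docker secret create anthropic_api_key -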
List secrets:
docker secret ls
# Output:
# ID NAME CREATED UPDATED
# abc123 anthropic_api_key 2 minutes ago 2 minutes ago
# def456 openai_api_key 1 minute ago 1 minute ago
Security properties:
- Secrets encrypted at rest
- Only accessible to authorized services
- Never stored in images or logs
- Transmitted over encrypted channels only
Access in containers:
# Secrets mounted as files in /run/secrets/
with open('/run/secrets/anthropic_api_key', 'r') as f:
api_key = f.read().strip()
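A slightly fuller sketch of the same idea: a small helper that prefers the Swarm secret file and falls back to an environment variable for local development. The load_secret name and the fallback convention are my own, not part of the stack above.
import os

def load_secret(name, default_env=None):
    """Read a Docker Swarm secret from /run/secrets/, optionally
    falling back to an environment variable for local development."""
    path = f"/run/secrets/{name}"
    try:
        with open(path, "r") as f:
            return f.read().strip()
    except FileNotFoundError:
        if default_env and default_env in os.environ:
            return os.environ[default_env]
        raise RuntimeError(f"Secret {name} not found at {path}")

# Usage
api_key = load_secret("anthropic_api_key", default_env="ANTHROPIC_API_KEY")
hmac_key = load_secret("agent_hmac_key")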
Step 4: Deploy AI Agent Stack
docker-compose.yml (stack file):
version: '3.8'
services:
ai-agent-claude:
image: registry.local:5000/ai-agent:latest
deploy:
replicas: 1
placement:
constraints:
- node.role == worker
- node.labels.agent_type == premium
resources:
limits:
cpus: '2.0'
memory: 4G
reservations:
cpus: '1.0'
memory: 2G
update_config:
parallelism: 1
delay: 10s
failure_action: rollback
restart_policy:
condition: on-failure
delay: 5s
max_attempts: 3
environment:
- AI_PROVIDER=anthropic
- AGENT_PROFILE=claude
- LOG_LEVEL=info
secrets:
- anthropic_api_key
- agent_hmac_key
networks:
- agent-network
healthcheck:
test: ['CMD', 'curl', '-f', 'http://localhost:8000/health']
interval: 30s
timeout: 3s
retries: 3
start_period: 40s
ai-agent-gpt:
image: registry.local:5000/ai-agent:latest
deploy:
replicas: 1
placement:
constraints:
- node.role == worker
resources:
limits:
cpus: '2.0'
memory: 4G
environment:
- AI_PROVIDER=openai
- AGENT_PROFILE=gpt4
secrets:
- openai_api_key
- agent_hmac_key
networks:
- agent-network
ai-agent-glm:
image: registry.local:5000/ai-agent:latest
deploy:
replicas: 1
placement:
constraints:
- node.role == worker
- node.labels.agent_type == budget
resources:
limits:
cpus: '1.0'
memory: 2G
environment:
- AI_PROVIDER=z.ai
- AGENT_PROFILE=glm
- ENABLE_BIAS_MONITORING=true
secrets:
- glm_api_key
- agent_hmac_key
networks:
- agent-network
traefik:
image: traefik:v2.10
command:
- '--api.insecure=false'
- '--providers.docker.swarmMode=true'
- '--providers.docker.exposedbydefault=false'
- '--entrypoints.web.address=:80'
- '--entrypoints.websecure.address=:443'
ports:
- '80:80'
- '443:443'
deploy:
placement:
constraints:
- node.role == manager
volumes:
- /var/run/docker.sock:/var/run/docker.sock:ro
- ./traefik/certs:/certs:ro
networks:
- agent-network
secrets:
anthropic_api_key:
external: true
openai_api_key:
external: true
glm_api_key:
external: true
agent_hmac_key:
external: true
networks:
agent-network:
external: true
Deploy stack:
docker stack deploy -c docker-compose.yml ai-agents
Verify deployment:
docker stack ps ai-agents
docker service ls
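If a service doesn't converge, the task list and service logs usually explain why. Service names follow Swarm's stack_service convention, so the Claude agent from this stack is ai-agents_ai-agent-claude:
# Show why individual tasks failed or were rescheduled
docker service ps --no-trunc ai-agents_ai-agent-claude
# Tail logs for a single service across all nodes
docker service logs -f ai-agents_ai-agent-claude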
Step 5: Implement Inter-Agent Authentication
Problem: Agents need to communicate, but must authenticate each other to prevent spoofing.
Solution: HMAC signatures on all inter-agent messages.
Implementation:
import hmac
import hashlib
import json
import time

class AgentAuthenticator:
    def __init__(self, agent_id, shared_secret):
        self.agent_id = agent_id
        self.secret = shared_secret.encode()

    def sign_message(self, message_dict):
        """Add an HMAC signature to an outgoing message."""
        message_dict['timestamp'] = time.time()
        message_dict['sender'] = self.agent_id

        # Create a canonical representation so sender and receiver hash identical bytes
        canonical = json.dumps(message_dict, sort_keys=True)

        # Generate HMAC over the canonical message
        signature = hmac.new(
            self.secret,
            canonical.encode(),
            hashlib.sha256
        ).hexdigest()

        message_dict['signature'] = signature
        return message_dict

    def verify_message(self, message_dict):
        """Verify the HMAC signature on an incoming message."""
        # Extract the signature before recomputing the canonical form
        received_sig = message_dict.pop('signature', None)
        if not received_sig:
            raise ValueError("No signature present")

        # Check timestamp (prevents replay attacks)
        timestamp = message_dict.get('timestamp', 0)
        if time.time() - timestamp > 60:  # 60-second window
            raise ValueError("Message expired")

        # Recreate the canonical representation
        canonical = json.dumps(message_dict, sort_keys=True)

        # Compute the expected signature
        expected_sig = hmac.new(
            self.secret,
            canonical.encode(),
            hashlib.sha256
        ).hexdigest()

        # Constant-time comparison
        if not hmac.compare_digest(received_sig, expected_sig):
            raise ValueError("Invalid signature")

        return message_dict
Usage:
# Agent A sends a command to Agent B
auth = AgentAuthenticator("agent-a", shared_secret)

message = {
    "command": "analyze_code",
    "code": code_sample,
    "priority": "high"
}

signed_message = auth.sign_message(message)
send_to_agent_b(signed_message)

# Agent B verifies before processing
try:
    verified_message = auth.verify_message(received_message)
    process_command(verified_message)
except ValueError as e:
    log_security_event(f"Authentication failed: {e}")
    return error_response()
Security benefits:
- Prevents agent impersonation
- Detects message tampering
- Timestamp prevents replay attacks
- Shared secret from Docker Secrets (never in code)
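A quick self-test of those properties, using the AgentAuthenticator class above: a tampered field and a stale timestamp should both be rejected (the secret here is a throwaway value for the test, not a real key).
import time

auth = AgentAuthenticator("agent-a", "test-secret-not-for-production")

# Tampering: modify a field after signing
msg = auth.sign_message({"command": "analyze_code", "priority": "low"})
msg["priority"] = "high"
try:
    auth.verify_message(dict(msg))
except ValueError as e:
    print(f"Tampering rejected: {e}")   # Invalid signature

# Replay: re-send a message whose timestamp is older than the 60-second window
old = auth.sign_message({"command": "analyze_code"})
old["timestamp"] = time.time() - 120
try:
    auth.verify_message(dict(old))
except ValueError as e:
    print(f"Replay rejected: {e}")      # Message expired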
Step 6: Comprehensive Logging
Centralized logging architecture:
logging-stack:
image: grafana/loki:latest
deploy:
placement:
constraints:
- node.role == manager
networks:
- agent-network
promtail:
image: grafana/promtail:latest
deploy:
mode: global # Runs on every node
volumes:
- /var/lib/docker/containers:/var/lib/docker/containers:ro
networks:
- agent-network
What to log:
1. All AI interactions:
log_entry = {
    "timestamp": datetime.now().isoformat(),
    "agent_id": agent.id,
    "provider": "anthropic",
    "model": "claude-3.5-sonnet",
    # Don't log full prompts (they may contain PII); a stable hash is enough to correlate entries
    "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),
    "prompt_length": len(prompt),
    "response_length": len(response),
    "tokens_used": usage.total_tokens,
    "latency_ms": latency,
    "cost_usd": calculate_cost(usage),
}
logger.info("ai_interaction", extra=log_entry)
2. Security events (see the detection sketch after this list):
security_event = {
"event_type": "potential_prompt_injection",
"severity": "medium",
"agent_id": agent.id,
"indicators": ["ignore previous", "system override"],
"action_taken": "blocked",
}
security_logger.warning("security_event", extra=security_event)
3. Agent authentication:
auth_event = {
"event": "agent_auth_failure",
"sender_claimed": message.get("sender"),
"signature_valid": False,
"source_ip": request.remote_addr,
}
security_logger.error("auth_failure", extra=auth_event)
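The potential_prompt_injection event in item 2 needs something to emit it. A minimal sketch, assuming a simple substring-based indicator list; the screen_prompt helper is illustrative, and real detection needs far more than substring matching.
import logging

security_logger = logging.getLogger("security")  # same logger as the events above

# Illustrative indicator list only
INJECTION_INDICATORS = ["ignore previous", "system override"]

def screen_prompt(prompt, agent_id):
    """Return False and log a security event if the prompt matches known injection phrases."""
    found = [phrase for phrase in INJECTION_INDICATORS if phrase in prompt.lower()]
    if found:
        security_logger.warning("security_event", extra={
            "event_type": "potential_prompt_injection",
            "severity": "medium",
            "agent_id": agent_id,
            "indicators": found,
            "action_taken": "blocked",
        })
        return False
    return True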
Query logs and metrics for analysis:
# Find all failed authentication attempts (LogQL)
{job="ai-agents"} |= "auth_failure"
# Track cost per agent (PromQL, using the counter from Step 7)
sum by (agent_id) (increase(ai_cost_usd_total[1h]))
# Detect anomalous response patterns (LogQL metric query over the structured logs)
avg_over_time({job="ai-agents"} | json | unwrap response_length [1h]) by (agent_id)
Step 7: Monitoring and Alerting
Prometheus metrics:
from prometheus_client import Counter, Histogram, Gauge
# Request metrics
ai_requests_total = Counter(
'ai_requests_total',
'Total AI requests',
['agent_id', 'provider', 'status']
)
ai_request_duration = Histogram(
'ai_request_duration_seconds',
'AI request duration',
['agent_id', 'provider']
)
# Cost tracking
ai_cost_total = Counter(
'ai_cost_usd_total',
'Total AI cost in USD',
['agent_id', 'provider']
)
# Security metrics
auth_failures_total = Counter(
'agent_auth_failures_total',
'Total authentication failures',
['agent_id']
)
# Record metrics (agent.id as in the logging examples above)
ai_requests_total.labels(agent_id=agent.id, provider="anthropic", status="success").inc()
ai_cost_total.labels(agent_id=agent.id, provider="anthropic").inc(cost)
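For Prometheus to scrape these counters, each agent needs to expose a metrics endpoint; prometheus_client ships a minimal HTTP server for that (port 8001 is an arbitrary choice here):
from prometheus_client import start_http_server

# Serve /metrics for Prometheus to scrape; call once at agent startup
start_http_server(8001)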
Alert rules:
# Alert on high authentication failure rate
- alert: HighAuthFailureRate
  expr: rate(agent_auth_failures_total[5m]) > 0.1
  for: 5m
  annotations:
    summary: 'High authentication failure rate detected'

# Alert on cost anomaly (more than $10 spent in the last hour)
- alert: UnexpectedCostSpike
  expr: increase(ai_cost_usd_total[1h]) > 10
  for: 10m
  annotations:
    summary: 'AI cost spike detected'

# Alert on service down
- alert: AgentDown
  expr: up{job="ai-agents"} == 0
  for: 2m
  annotations:
    summary: 'AI agent is down'
Security Features Implemented
1. Role Segregation
Manager node:
- ✅ Orchestration and coordination only
- ✅ No agent workloads (prevents compromise affecting orchestration)
- ✅ Limited external exposure
Worker nodes:
- ✅ Run agent containers
- ✅ Isolated from management plane
- ✅ No direct inbound exposure (external traffic reaches agents only through Traefik)
2. Resource Limits
Prevent resource exhaustion:
resources:
limits:
cpus: '2.0' # Maximum CPU
memory: 4G # Maximum memory
reservations:
cpus: '1.0' # Guaranteed minimum
memory: 2G
Benefit: Compromised container cannot DoS entire cluster.
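To confirm the limits are actually enforced, docker stats on a worker node shows current usage next to the configured cap:
# On a worker node: per-container CPU and MEM USAGE / LIMIT at a glance
docker stats --no-stream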
3. Health Checks and Auto-Recovery
healthcheck:
test: ['CMD', 'curl', '-f', 'http://localhost:8000/health']
interval: 30s
timeout: 3s
retries: 3
What happens:
- Swarm monitors health every 30 seconds
- After 3 consecutive failures, container marked unhealthy
- Swarm automatically restarts unhealthy containers
- Zero-downtime deployments through rolling updates
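Rolling updates follow the update_config defined in the stack file (one replica at a time, 10s delay, rollback on failure). A sketch of pushing a new image version; the :v2 tag is hypothetical, and the service name follows Swarm's stack_service convention:
# Roll the Claude agent to a new image, honoring the stack's update_config
docker service update \
  --image registry.local:5000/ai-agent:v2 \
  ai-agents_ai-agent-claude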
4. Placement Constraints
placement:
constraints:
- node.role == worker
- node.labels.agent_type == premium
Control where workloads run:
- Separate premium models (Claude) from budget (GLM)
- Keep sensitive workloads on specific nodes
- Implement hardware-based isolation
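The agent_type labels referenced above aren't created automatically; they are added per node from the manager, for example:
# Label workers so placement constraints can target them
docker node update --label-add agent_type=premium node-2
docker node update --label-add agent_type=budget node-3
# Confirm the labels took effect
docker node inspect node-2 --format '{{ .Spec.Labels }}'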
Lessons Learned
1. Docker Swarm is Simpler Than Kubernetes
For learning security fundamentals, Swarm provides:
- Simpler configuration (YAML is straightforward)
- Faster setup (minutes vs hours)
- Built-in secrets management
- Adequate for most labs and small deployments
Kubernetes advantages (when you need them):
- Larger ecosystem (Helm charts, operators)
- More granular control
- Better for very large clusters
- Industry standard for production
For learning AI security: Swarm is sufficient and faster to iterate.
2. Encrypted Overlay Networks Are Mandatory
Network encryption is not optional for distributed AI systems:
- Prevents eavesdropping on inter-agent communication
- Protects API keys in transit
- Ensures prompt confidentiality
- Minimal performance impact (~10%)
3. HMAC Authentication Prevents Agent Impersonation
Without inter-agent authentication:
- Attacker can send commands pretending to be legitimate agent
- Difficult to audit which agent performed which action
- No way to detect man-in-the-middle attacks
With HMAC signatures:
- Each message cryptographically authenticated
- Tampering detected immediately
- Replay attacks prevented via timestamps
- Simple to implement, strong security guarantees
4. Comprehensive Logging Enables Forensics
When (not if) security incidents occur:
- Logs show exactly what happened when
- Can correlate events across multiple agents
- Cost tracking identifies abuse
- Behavioral baselines detect anomalies
Conclusion: Production-Grade Patterns in a Lab
This architecture demonstrates production-grade security patterns:
- ✅ Zero-trust networking (encrypted, authenticated)
- ✅ Secrets management (never in code or logs)
- ✅ Role segregation (manager vs worker)
- ✅ Resource limits (prevent DoS)
- ✅ Comprehensive logging (audit trail)
- ✅ Health monitoring (auto-recovery)
- ✅ Multi-vendor architecture (no lock-in)
Built on $600 of Raspberry Pi hardware in a homelab environment.
Why this matters: You don’t need enterprise budgets to learn enterprise security patterns. A Raspberry Pi cluster teaches the same fundamentals as a Kubernetes cluster—but faster, cheaper, and with more control.
The best way to learn AI security is building systems and intentionally breaking them. This architecture provides the foundation for security research: a distributed, production-like environment where you can safely experiment with attacks and defenses.
Start building. Start breaking. That’s how you learn.