Setting Up a Docker Swarm AI Agent Cluster for Security Research

Build AI agent lab on Raspberry Pi with Docker Swarm: encrypted networks, HMAC auth, and security monitoring. Production-grade patterns on $600 hardware.

HomeLab

If you want to secure AI systems, you need to understand how they behave in production. Reading about distributed architectures only gets you so far — at some point you have to wire up the orchestration, configure the networking, manage the secrets, and watch what happens when things go wrong.

This post walks through building a distributed AI agent security lab with Docker Swarm on Raspberry Pi hardware. Swarm is deliberately simpler than Kubernetes, but it teaches the same core security fundamentals.

Why Build a Lab?

Theory vs Practice:

  • Reading about container security ≠ Understanding container escape vulnerabilities
  • Studying networking concepts ≠ Configuring encrypted overlay networks
  • Reviewing agent architectures ≠ Debugging inter-agent authentication failures

Hands-on labs teach:

  • How attacks actually work (not just theory)
  • Where security controls fail (firsthand experience)
  • How to detect compromises (real monitoring data)
  • How to respond to incidents (practice under pressure)

Hardware: Raspberry Pi 5 Cluster

Why Raspberry Pi instead of cloud?

  • Cost: $200 one-time vs $50+/month cloud forever
  • Control: Own hardware, no vendor dependencies
  • Learning: Physical networking teaches concepts cloud abstracts
  • Permanence: Lab stays up, no billing concerns

My configuration:

  • 3x Raspberry Pi 5 (8GB RAM each)
  • 1x Manager node, 2x Worker nodes
  • Gigabit Ethernet networking
  • External NVMe storage for performance

Total cost: ~$600 (including cases, power, networking)

Architecture Overview

                 Internet

                    │ HTTPS

        ┌───────────────────────┐
        │  Manager Node (node-1)│
        ├───────────────────────┤
        │ • Traefik    :80/:443 │
        │ • Registry   :5000    │
        │ • Redis      :6379    │
        └───────────────────────┘

          Encrypted Overlay Network

        ┌───────────┴───────────┐
        │                       │
        ▼                       ▼
┌───────────────┐       ┌───────────────┐
│Worker (node-2)│       │Worker (node-3)│
├───────────────┤       ├───────────────┤
│ • AI Agent A  │       │ • AI Agent B  │
│   (Claude)    │       │   (GPT-4)     │
│               │       │               │
│               │       │ • AI Agent C  │
│               │       │   (GLM)       │
└───────────────┘       └───────────────┘

Key security features:

  • Manager node handles orchestration only (no agent workloads)
  • Worker nodes run agents in isolation
  • Encrypted overlay network (IPsec, AES-GCM)
  • Inter-agent HMAC authentication
  • Traefik reverse proxy with TLS termination

Step 1: Initialize Docker Swarm

On manager node:

# Initialize swarm
docker swarm init --advertise-addr 192.168.1.10

# Output includes worker join token
# docker swarm join --token SWMTKN-1-... 192.168.1.10:2377

On worker nodes:

# Join swarm as worker
docker swarm join --token SWMTKN-1-xxxxxxxxxxxxx 192.168.1.10:2377

Verify cluster:

docker node ls

# Output:
# ID         HOSTNAME   STATUS  AVAILABILITY  MANAGER STATUS
# abc123 *   node-1     Ready   Active        Leader
# def456     node-2     Ready   Active
# ghi789     node-3     Ready   Active

Security considerations:

  • Join tokens are sensitive (rotate regularly with docker swarm join-token --rotate worker)
  • Manager nodes should be separate from worker nodes
  • Limit manager node exposure (no public services)

Step 2: Create Encrypted Overlay Network

Why overlay networks:

  • Spans multiple hosts transparently
  • Built-in encryption (optional but recommended)
  • Automatic service discovery
  • Network isolation

Create encrypted network:

docker network create \
  --driver overlay \
  --encrypted \
  --attachable \
  agent-network

Verify:

docker network ls
docker network inspect agent-network

What --encrypted does:

  • All inter-node container traffic encrypted (AES in GCM mode)
  • Automatic key rotation, managed by the swarm managers
  • IPsec (ESP) encapsulation of the vxlan traffic
  • Performance impact: ~10% (acceptable for the security benefit)

Security benefit: Even if the physical network is compromised, inter-container traffic stays encrypted. That single flag buys you a lot of protection for very little overhead.

Step 3: Setup Secrets Management

Never hardcode credentials. Docker Swarm has built-in secrets management that makes this straightforward.

Create secrets:

# API keys for AI providers
echo "sk-..." | docker secret create anthropic_api_key -
echo "sk-..." | docker secret create openai_api_key -
echo "..." | docker secret create glm_api_key -

# HMAC signing key for inter-agent auth
openssl rand -base64 32 | docker secret create agent_hmac_key -

List secrets:

docker secret ls

# Output:
# ID         NAME                CREATED          UPDATED
# abc123     anthropic_api_key   2 minutes ago    2 minutes ago
# def456     openai_api_key      1 minute ago     1 minute ago

Security properties:

  • Secrets encrypted at rest
  • Only accessible to authorized services
  • Never stored in images or logs
  • Transmitted over encrypted channels only

Access in containers:

# Secrets mounted as files in /run/secrets/
with open('/run/secrets/anthropic_api_key', 'r') as f:
    api_key = f.read().strip()
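In practice it helps to wrap this in a small helper that falls back to environment variables when running outside Swarm. A sketch; the env-var naming convention (secret name uppercased) is my own convenience assumption, not something Swarm provides:

```python
import os

def read_secret(name, secrets_dir="/run/secrets"):
    """Read a Docker secret file, falling back to an env var
    (e.g. ANTHROPIC_API_KEY) for local development outside Swarm."""
    path = os.path.join(secrets_dir, name)
    try:
        with open(path) as f:
            return f.read().strip()
    except FileNotFoundError:
        # Assumed convention for local runs only; never set real keys this way in prod
        value = os.environ.get(name.upper())
        if value is None:
            raise RuntimeError(f"secret {name!r} not found in {secrets_dir} or environment")
        return value
```

This keeps agent code identical in the cluster and on a dev laptop, with the secret path as the preferred source.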

Step 4: Deploy AI Agent Stack

docker-compose.yml (stack file):

version: '3.8'

services:
  ai-agent-claude:
    image: registry.local:5000/ai-agent:latest
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == worker
          - node.labels.agent_type == premium
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '1.0'
          memory: 2G
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    environment:
      - AI_PROVIDER=anthropic
      - AGENT_PROFILE=claude
      - LOG_LEVEL=info
    secrets:
      - anthropic_api_key
      - agent_hmac_key
    networks:
      - agent-network
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:8000/health']
      interval: 30s
      timeout: 3s
      retries: 3
      start_period: 40s

  ai-agent-gpt:
    image: registry.local:5000/ai-agent:latest
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == worker
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
    environment:
      - AI_PROVIDER=openai
      - AGENT_PROFILE=gpt4
    secrets:
      - openai_api_key
      - agent_hmac_key
    networks:
      - agent-network

  ai-agent-glm:
    image: registry.local:5000/ai-agent:latest
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == worker
          - node.labels.agent_type == budget
      resources:
        limits:
          cpus: '1.0'
          memory: 2G
    environment:
      - AI_PROVIDER=z.ai
      - AGENT_PROFILE=glm
      - ENABLE_BIAS_MONITORING=true
    secrets:
      - glm_api_key
      - agent_hmac_key
    networks:
      - agent-network

  traefik:
    image: traefik:v2.10
    command:
      - '--api.insecure=false'
      - '--providers.docker.swarmMode=true'
      - '--providers.docker.exposedbydefault=false'
      - '--entrypoints.web.address=:80'
      - '--entrypoints.websecure.address=:443'
    ports:
      - '80:80'
      - '443:443'
    deploy:
      placement:
        constraints:
          - node.role == manager
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik/certs:/certs:ro
    networks:
      - agent-network

secrets:
  anthropic_api_key:
    external: true
  openai_api_key:
    external: true
  glm_api_key:
    external: true
  agent_hmac_key:
    external: true

networks:
  agent-network:
    external: true

Deploy stack:

docker stack deploy -c docker-compose.yml ai-agents

Verify deployment:

docker stack ps ai-agents
docker service ls

If you’re running distributed agents, ask yourself: what happens if one agent starts sending commands the others blindly trust?

Step 5: Implement Inter-Agent Authentication

Problem: Agents need to talk to each other, but every message must be authenticated to prevent spoofing.

Solution: HMAC signatures on all inter-agent messages.

Implementation:

import hmac
import hashlib
import json
import time

class AgentAuthenticator:
    def __init__(self, shared_secret, agent_id):
        self.secret = shared_secret.encode()
        self.agent_id = agent_id

    def sign_message(self, message_dict):
        """Add HMAC signature to message"""
        message_dict['timestamp'] = time.time()
        message_dict['sender'] = self.agent_id

        # Create canonical representation
        canonical = json.dumps(message_dict, sort_keys=True)

        # Generate HMAC
        signature = hmac.new(
            self.secret,
            canonical.encode(),
            hashlib.sha256
        ).hexdigest()

        message_dict['signature'] = signature
        return message_dict

    def verify_message(self, message_dict):
        """Verify HMAC signature"""
        # Extract signature
        received_sig = message_dict.pop('signature', None)
        if not received_sig:
            raise ValueError("No signature present")

        # Check timestamp (prevent replay attacks)
        timestamp = message_dict.get('timestamp', 0)
        if time.time() - timestamp > 60:  # 60 second window
            raise ValueError("Message expired")

        # Recreate canonical representation
        canonical = json.dumps(message_dict, sort_keys=True)

        # Compute expected signature
        expected_sig = hmac.new(
            self.secret,
            canonical.encode(),
            hashlib.sha256
        ).hexdigest()

        # Constant-time comparison
        if not hmac.compare_digest(received_sig, expected_sig):
            raise ValueError("Invalid signature")

        return message_dict

Usage:

# Agent A sends command to Agent B
auth = AgentAuthenticator(shared_secret, agent_id="agent-a")

message = {
    "command": "analyze_code",
    "code": code_sample,
    "priority": "high"
}

signed_message = auth.sign_message(message)
send_to_agent_b(signed_message)

# Agent B verifies before processing
try:
    verified_message = auth.verify_message(received_message)
    process_command(verified_message)
except ValueError as e:
    log_security_event(f"Authentication failed: {e}")
    return error_response()

Security benefits:

  • Prevents agent impersonation
  • Detects message tampering
  • Timestamp prevents replay attacks
  • Shared secret from Docker Secrets (never in code)
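These properties are easy to check end to end. Here is a compact, self-contained sketch of the same sign/verify flow; the secret and sender name are demo values, and in the lab the key would come from /run/secrets/agent_hmac_key:

```python
import hashlib
import hmac
import json
import time

SECRET = b"demo-shared-secret"  # demo only; load from Docker Secrets in the lab

def sign(message, sender):
    # Stamp, canonicalize, then sign the canonical bytes
    msg = dict(message, timestamp=time.time(), sender=sender)
    canonical = json.dumps(msg, sort_keys=True).encode()
    msg["signature"] = hmac.new(SECRET, canonical, hashlib.sha256).hexdigest()
    return msg

def verify(message, max_age=60.0):
    # Returns True only if the signature matches and the message is fresh
    msg = dict(message)
    sig = msg.pop("signature", None)
    if sig is None or time.time() - msg.get("timestamp", 0) > max_age:
        return False
    canonical = json.dumps(msg, sort_keys=True).encode()
    expected = hmac.new(SECRET, canonical, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

Sign a message, flip one field, and verification fails; backdate the timestamp past the window and it fails too. That is the whole guarantee in a dozen lines.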

Step 6: Comprehensive Logging

Centralized logging architecture:

loki:
  image: grafana/loki:latest
  deploy:
    placement:
      constraints:
        - node.role == manager
  networks:
    - agent-network

promtail:
  image: grafana/promtail:latest
  deploy:
    mode: global # Runs on every node
  volumes:
    - /var/lib/docker/containers:/var/lib/docker/containers:ro
  networks:
    - agent-network

What to log:

1. All AI interactions:

log_entry = {
    "timestamp": datetime.now().isoformat(),
    "agent_id": agent.id,
    "provider": "anthropic",
    "model": "claude-3.5-sonnet",
    "prompt_hash": hash(prompt),  # Don't log full prompts (may contain PII)
    "prompt_length": len(prompt),
    "response_length": len(response),
    "tokens_used": usage.total_tokens,
    "latency_ms": latency,
    "cost_usd": calculate_cost(usage),
}
logger.info("ai_interaction", extra=log_entry)

2. Security events:

security_event = {
    "event_type": "potential_prompt_injection",
    "severity": "medium",
    "agent_id": agent.id,
    "indicators": ["ignore previous", "system override"],
    "action_taken": "blocked",
}
security_logger.warning("security_event", extra=security_event)

3. Agent authentication:

auth_event = {
    "event": "agent_auth_failure",
    "sender_claimed": message.get("sender"),
    "signature_valid": False,
    "source_ip": request.remote_addr,
}
security_logger.error("auth_failure", extra=auth_event)

Query logs and metrics for analysis (LogQL for the log filter, PromQL-style for the aggregations):

# Find all failed authentication attempts
{job="ai-agents"} |= "auth_failure"

# Track cost per agent
sum by (agent_id) (cost_usd)

# Detect anomalous response patterns
avg(response_length) by (agent_id)

Step 7: Monitoring and Alerting

Prometheus metrics:

from prometheus_client import Counter, Histogram, Gauge

# Request metrics
ai_requests_total = Counter(
    'ai_requests_total',
    'Total AI requests',
    ['agent_id', 'provider', 'status']
)

ai_request_duration = Histogram(
    'ai_request_duration_seconds',
    'AI request duration',
    ['agent_id', 'provider']
)

# Cost tracking
ai_cost_total = Counter(
    'ai_cost_usd_total',
    'Total AI cost in USD',
    ['agent_id', 'provider']
)

# Security metrics
auth_failures_total = Counter(
    'agent_auth_failures_total',
    'Total authentication failures',
    ['agent_id']
)

# Record metrics
ai_requests_total.labels(agent_id=agent.id, provider="anthropic", status="success").inc()
ai_cost_total.labels(agent_id=agent.id, provider="anthropic").inc(cost)
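One detail worth knowing: a Histogram's .time() context manager records elapsed time even when the wrapped call raises, which keeps latency data honest during provider outages. A sketch (call_provider is a placeholder; the metric definition is repeated so the snippet stands alone, but in real code reuse the earlier object, since registering the same name twice in one process raises):

```python
from prometheus_client import Histogram

ai_request_duration = Histogram(
    'ai_request_duration_seconds',
    'AI request duration',
    ['agent_id', 'provider']
)

def timed_request(agent_id, provider, call_provider, prompt):
    # .time() observes elapsed seconds on exit, including on exceptions
    with ai_request_duration.labels(agent_id=agent_id, provider=provider).time():
        return call_provider(prompt)
```

Wrapping every provider call this way means the p99 latency panels stay accurate even for requests that time out or error.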

Alert rules:

# Alert on high authentication failure rate
- alert: HighAuthFailureRate
  expr: rate(agent_auth_failures_total[5m]) > 0.1
  for: 5m
  annotations:
    summary: 'High authentication failure rate detected'

# Alert on cost anomaly
- alert: UnexpectedCostSpike
  expr: rate(ai_cost_usd_total[1h]) > 10
  for: 10m
  annotations:
    summary: 'AI cost spike detected'

# Alert on service down
- alert: AgentDown
  expr: up{job="ai-agents"} == 0
  for: 2m
  annotations:
    summary: 'AI agent is down'

Security Features Implemented

1. Role Segregation

Manager node:

  • Orchestration and coordination only
  • No agent workloads (prevents compromise from affecting orchestration)
  • Limited external exposure

Worker nodes:

  • Run agent containers
  • Isolated from the management plane
  • No direct internet access (traffic routes through Traefik only)

2. Resource Limits

Prevent resource exhaustion:

resources:
  limits:
    cpus: '2.0' # Maximum CPU
    memory: 4G # Maximum memory
  reservations:
    cpus: '1.0' # Guaranteed minimum
    memory: 2G

A compromised container cannot DoS the entire cluster when resource limits are in place.

3. Health Checks and Auto-Recovery

healthcheck:
  test: ['CMD', 'curl', '-f', 'http://localhost:8000/health']
  interval: 30s
  timeout: 3s
  retries: 3

Swarm monitors health every 30 seconds. After three consecutive failures, it marks the container unhealthy and automatically restarts it. Rolling updates give you zero-downtime deployments on top of that.
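The /health endpoint that check curls can be tiny. A stdlib-only sketch; port 8000 matches the healthcheck above, and agent_is_ready is a placeholder for real readiness logic:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def agent_is_ready():
    # Placeholder: check the model client, queue depth, HMAC key presence, etc.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health" and agent_is_ready():
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # Anything non-2xx makes curl -f (and therefore the healthcheck) fail
            self.send_response(503)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep container logs free of per-probe noise

def serve(port=8000):
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Returning 503 when a dependency is down is the important part: it turns internal agent failures into restarts Swarm can act on.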

4. Placement Constraints

placement:
  constraints:
    - node.role == worker
    - node.labels.agent_type == premium

Control where workloads run:

  • Separate premium models (Claude) from budget (GLM)
  • Keep sensitive workloads on specific nodes
  • Implement hardware-based isolation

The labels referenced in constraints must be set on the nodes first, e.g. docker node update --label-add agent_type=premium node-2.

Lessons Learned

1. Docker Swarm is Simpler Than Kubernetes

For learning security fundamentals, Swarm gives you:

  • Simpler configuration (straightforward YAML)
  • Faster setup (minutes vs hours)
  • Built-in secrets management
  • Adequate tooling for most labs and small deployments

Kubernetes advantages (when you actually need them):

  • Larger ecosystem (Helm charts, operators)
  • More granular control
  • Better suited for very large clusters
  • Industry standard for production

For learning AI security: Swarm gets out of your way and lets you focus on the security problems, not the orchestration plumbing.

2. Encrypted Overlay Networks Are Non-Negotiable

Network encryption is not optional for distributed AI systems:

  • Prevents eavesdropping on inter-agent communication
  • Protects API keys in transit
  • Ensures prompt confidentiality
  • Minimal performance impact (~10%)

3. HMAC Authentication Prevents Agent Impersonation

Without inter-agent authentication:

  • An attacker can send commands while pretending to be a legitimate agent
  • Auditing which agent performed which action becomes nearly impossible
  • Man-in-the-middle attacks go undetected

With HMAC signatures:

  • Every message is cryptographically authenticated
  • Tampering is detected immediately
  • Replay attacks are blocked via timestamps
  • The implementation is simple, but the security guarantees are strong

4. Comprehensive Logging Enables Forensics

When (not if) security incidents occur:

  • Logs reveal exactly what happened and when
  • You can correlate events across multiple agents
  • Cost tracking surfaces abuse
  • Behavioral baselines make anomalies visible

Production-Grade Patterns on Lab Hardware

This architecture puts production-grade security patterns into practice:

  • Zero-trust networking (encrypted, authenticated)
  • Secrets management (never in code or logs)
  • Role segregation (manager vs worker)
  • Resource limits (prevent DoS)
  • Comprehensive logging (full audit trail)
  • Health monitoring (auto-recovery)
  • Multi-vendor architecture (no lock-in)

All of it runs on ~$600 of Raspberry Pi hardware in a homelab.

You do not need enterprise budgets to learn enterprise security patterns. A Raspberry Pi cluster teaches the same fundamentals as a managed Kubernetes deployment — faster, cheaper, and with more direct control over every layer.

The best way to understand AI security is to build these systems and then deliberately break them. This architecture gives you that foundation: a distributed, production-like environment where you can safely experiment with attacks and defenses.


What Does Your Cluster Look Like?

Are you running AI agents on Docker Swarm, Kubernetes, or something else entirely? Whether it’s a Raspberry Pi homelab or a cloud-based setup, I’d like to hear about your architecture choices — what security patterns you implemented, what surprised you, and what you’d change. Share your setup in the comments. Comparing real-world configurations teaches more than any documentation.