May 18, 2026
Running 1,000 Hermes Agents on Three Machines
There's a class of AI product where every customer gets their own agent. Not a shared chatbot behind an API. A dedicated, always-on process that maintains state, remembers conversations, browses the web, and acts autonomously on behalf of one specific customer.
This is fundamentally different from serving LLM completions. The LLM call is stateless — you can load-balance it across a fleet. The agent wrapping it is stateful. It has a database, a config file, learned preferences, conversation history. It needs to be running when the customer sends a message at 2am. It needs to survive server restarts without losing memory.
The naive way to run this is one VM per customer. It works. It's also absurdly wasteful. Here's how to pack agents onto a three-machine cluster, and what it takes to grow that cluster to a thousand of them.
The Utilization Problem
We measured ten production agents after a week of real customer usage:
┌──────────┬───────────┬───────────┬───────────────┐
│ Agent    │ RAM (MB)  │ CPU (%)   │ VM Allocated  │
├──────────┼───────────┼───────────┼───────────────┤
│ #1       │ 290       │ 1.2       │ 4096 MB       │
│ #2       │ 268       │ 1.2       │ 4096 MB       │
│ #3       │ 329       │ 0.7       │ 4096 MB       │
│ #4       │ 275       │ 0.9       │ 4096 MB       │
│ #5       │ 312       │ 1.1       │ 4096 MB       │
│ #6       │ 284       │ 0.8       │ 4096 MB       │
│ #7       │ 301       │ 1.3       │ 4096 MB       │
│ #8       │ 258       │ 0.6       │ 4096 MB       │
│ #9       │ 347       │ 1.0       │ 4096 MB       │
│ #10      │ 291       │ 1.1       │ 4096 MB       │
├──────────┼───────────┼───────────┼───────────────┤
│ Average  │ 295       │ 1.0       │ 4096 MB       │
│ Util %   │ 7.2%      │ 0.5%      │               │
└──────────┴───────────┴───────────┴───────────────┘
Seven percent memory utilization. Half a percent CPU. Each agent is an LLM gateway process (Python), a headless Chromium instance (for web research), and a SQLite database (for state). The process sits idle 99% of the time, waiting for the customer to send a message. When it does work, it makes an API call to Claude, maybe opens a browser tab, and goes back to sleep.
We were paying for 4GB of RAM per customer to use 300MB. At about $4.50 per VM, a thousand customers means burning $4,500/month on mostly empty VMs.
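The utilization figures follow directly from the table. A minimal sketch of that arithmetic, using the ten RAM measurements and the 4096 MB allocation shown above (the variable names are illustrative, not from our tooling):

```python
# RAM in MB for the ten measured agents, copied from the table above.
ram_mb = [290, 268, 329, 275, 312, 284, 301, 258, 347, 291]
vm_allocated_mb = 4096

avg_ram_mb = sum(ram_mb) / len(ram_mb)
ram_utilization_pct = 100 * avg_ram_mb / vm_allocated_mb
unused_mb = vm_allocated_mb - avg_ram_mb

print(f"average RAM:   {avg_ram_mb:.1f} MB")
print(f"utilization:   {ram_utilization_pct:.1f}%")
print(f"unused per VM: {unused_mb:.1f} MB")
```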
Why Containers Aren't Enough
The obvious step: containerize the agents and pack them onto fewer, larger VMs. You can fit 20 agents on one 8GB machine with Docker Compose.
But now you're writing your own orchestrator:
- Customer signs up. Which machine has room? You check. You deploy.
- Machine dies at 3am. Which agents were on it? You SSH into the others. You redeploy manually.
- Agent needs persistent state. You mount a local directory. Machine dies, state is gone.
- You need to update agent code. You SSH into every machine. You pull. You restart.
This is fine for 5 machines. It's untenable for 50.
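The tool that handles scheduling, failover, and rollouts is Kubernetes, and a storage layer on top of it takes care of persistent state. Specifically, we use k3s: a lightweight distribution that ships as a single binary, defaults to SQLite instead of etcd for a single-server control plane, and runs on machines as small as 2 CPU / 4GB RAM.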
┌─────────────────────────┬───────────────┬──────────────────┐
│ Capability              │ Manual VMs    │ k3s Cluster      │
├─────────────────────────┼───────────────┼──────────────────┤
│ Scheduling              │ You decide    │ Automatic        │
│ Node failure            │ You fix it    │ Auto-reschedule  │
│ Persistent state        │ Local disk    │ Replicated vols  │
│ Rolling updates         │ SSH + restart │ Zero-downtime    │
│ Resource tracking       │ Guesswork     │ Requests/limits  │
│ Service discovery       │ IP tracking   │ DNS              │
│ Adding capacity         │ Provision VM  │ Join node        │
└─────────────────────────┴───────────────┴──────────────────┘
The Cluster
Three nodes on Hetzner Cloud, connected via a private network:
┌───────────────────────────────────────────────────────────┐
│ Private Network (10.0.0.0/16) │
│ │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Control Plane │ │ Worker 1 │ │ Worker 2 │ │
│ │ 2 CPU / 4 GB │ │ 4 CPU / 8 GB │ │ 4 CPU / 8 GB │ │
│ │ $7/mo │ │ $15/mo │ │ $15/mo │ │
│ │ │ │ │ │ │ │
│ │ k3s server │ │ k3s agent │ │ k3s agent │ │
│ │ traefik │ │ longhorn │ │ longhorn │ │
│ │ cert-manager │ │ ~24 agents │ │ ~24 agents │ │
│ │ cloud-ctrl │ │ │ │ │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ │
│ Total: ~$37/month for the base cluster │
└───────────────────────────────────────────────────────────┘
Supporting tools:
- Traefik: ingress controller, TLS termination. Bundled with k3s.
- Longhorn: replicated block storage for persistent volumes.
- cert-manager: automated Let's Encrypt certificates.
- hcloud-ccm: provisions Hetzner Load Balancers for Kubernetes Services of type LoadBalancer, configured through annotations.
The request flow:
         INTERNET
             │
    ┌────────┴────────┐
    │   Hetzner LB    │
    │  (auto-created  │
    │  by hcloud-ccm) │
    └────────┬────────┘
             │
    ┌────────┴────────┐
    │     Traefik     │
    │     TLS via     │
    │  cert-manager   │
    └────────┬────────┘
             │
    ┌────────┴────────┐
    │    Your API     │
    │  (routes msgs   │
    │   to agents)    │
    └──┬──────┬────┬──┘
       │      │    │      internal DNS
       ▼      ▼    ▼
    agent-A agent-B agent-C ...
       │       │       │
      PVC     PVC     PVC
    (Longhorn, replicated)
No public IPs on agents. No per-customer firewall rules. Your API is the single entry point. Agents are reached by cluster DNS names.
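Routing by cluster DNS can be as small as a string template. The sketch below assumes one Service per customer named `agent-<customer_id>` in a namespace called `agents`, listening on port 8080, and the default `cluster.local` cluster domain. All four are illustrative assumptions, not details of our deployment.

```python
def agent_url(customer_id: str, namespace: str = "agents", port: int = 8080) -> str:
    """Build the in-cluster URL for one customer's agent Service (hypothetical naming)."""
    service = f"agent-{customer_id}"
    # Kubernetes Service DNS follows <service>.<namespace>.svc.<cluster-domain>.
    host = f"{service}.{namespace}.svc.cluster.local"
    return f"http://{host}:{port}"


print(agent_url("a1b2"))
```

The API only needs the customer ID to find the agent. It never needs to know which node the pod landed on.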
The Density Math
This is the key to the whole approach. Kubernetes schedules pods based on resource requests — the guaranteed minimum each pod needs. The limit is the burst ceiling.
┌─────────────────────────────────────────────────────┐
│ Per-Agent Resource Profile │
│ │
│ Request (guaranteed): 256 MB RAM │ 0.1 CPU │
│ Limit (burst ceiling): 1 GB RAM │ 0.5 CPU │
│ │
├─────────────────────────────────────────────────────┤
│ │
│ Worker node: 8 GB total │
│ System + storage overhead: ~2 GB │
│ Available for agents: ~6 GB │
│ │
│ Agents per worker: 6144 / 256 = 24 │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ 2 workers = ~48 agents │ │
│ │ 5 workers = ~120 agents │ │
│ │ 10 workers = ~240 agents │ │
│ │ 42 workers = ~1,000 agents │ │
│ └────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────┘
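The 4:1 overcommit (256MB request vs 1GB limit) works because agents don't burst simultaneously. At any moment, maybe 5% are in active LLM conversations with Chromium open. The rest sit idle at ~295MB. That idle footprint is slightly above the 256MB request: 24 idle agents hold about 6.9GB, so they already eat roughly 0.9GB of the ~2GB set aside for system and storage overhead. If that reserve is tight, raise the request toward the measured idle footprint. The burst is short-lived, a few seconds of API calls and browser activity, then back to sleep.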
If you set requests equal to limits (no overcommit), you'd fit 6 agents per worker instead of 24. Four times the hardware for no practical benefit.
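The density figures can be reproduced with a few lines of arithmetic. This is a sketch using the numbers from the box above plus the ~295MB idle average measured earlier. The function and constant names are illustrative.

```python
import math

NODE_MB = 8192        # worker node RAM
OVERHEAD_MB = 2048    # system + storage overhead
REQUEST_MB = 256      # per-agent memory request
LIMIT_MB = 1024       # per-agent memory limit
IDLE_MB = 295         # measured average idle footprint


def agents_per_worker(node_mb: int, overhead_mb: int, per_agent_mb: int) -> int:
    """How many agents the scheduler will place, based on the reserved amount."""
    return (node_mb - overhead_mb) // per_agent_mb


def workers_needed(agents: int, per_worker: int) -> int:
    return math.ceil(agents / per_worker)


per_worker = agents_per_worker(NODE_MB, OVERHEAD_MB, REQUEST_MB)
no_overcommit = agents_per_worker(NODE_MB, OVERHEAD_MB, LIMIT_MB)
idle_total_mb = per_worker * IDLE_MB

print("agents per worker (requests):", per_worker)
print("agents per worker (requests == limits):", no_overcommit)
print("workers for 1,000 agents:", workers_needed(1000, per_worker))
print("idle memory held on a full worker (MB):", idle_total_mb)
```

Comparing `idle_total_mb` with the memory left for agents shows how much real headroom a given request value leaves.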
State That Survives Failures
Each agent maintains persistent state:
- Conversation history (SQLite)
- Customer context (brand, products, preferences)
- Agent identity and learned knowledge
- Integration credentials and config flags
- Task management state
Losing this state means the agent forgets everything. It's the difference between a useful assistant and a stranger.
Longhorn replicates each agent's volume across worker nodes:
     Worker 1                        Worker 2
┌────────────────┐              ┌────────────────┐
│ agent-A (pod)  │              │                │
│      │         │  replicate   │                │
│ ┌────┴──────┐  │ ───────────► │ ┌────────────┐ │
│ │ 5 GB vol  │  │              │ │  replica   │ │
│ │ (state)   │  │              │ │  (state)   │ │
│ └───────────┘  │              │ └────────────┘ │
└────────────────┘              └────────────────┘
Worker 1 dies:
                                ┌────────────────┐
                                │ agent-A (pod)  │ ← rescheduled
                                │      │         │   automatically
                                │ ┌────┴──────┐  │
                                │ │ 5 GB vol  │  │ ← already here
                                │ │ (intact)  │  │
                                │ └───────────┘  │
                                └────────────────┘
Recovery time: ~15 seconds
Data loss:     zero
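The tradeoff is disk space. Replication factor 2 means 1,000 agents at 5GB each provision 10TB in total across nodes. Longhorn volumes are thin-provisioned, so the disk actually consumed tracks what each agent has written rather than the full 5GB. At Hetzner's prices, the extra storage comes to roughly $15/month. Nothing compared to the cost of losing customer data.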
Provisioning in Seconds, Not Minutes
With VMs, provisioning a new agent meant creating a Hetzner server from a Packer snapshot, writing cloud-init config, waiting for boot, polling for health. Two to five minutes per customer.
With k3s, provisioning means creating four Kubernetes resources via the API:
Customer signs up
│
├─ Create Secret (env vars: API keys, config)
├─ Create StatefulSet (1 pod, resource limits, health probes)
├─ Create Service (internal DNS name)
└─ Create PVC (5 GB, Longhorn)
│
│ k3s scheduler places pod on a worker with room
│ Init container fixes volume permissions
│ Entrypoint copies runtime, boots agent
│ Readiness probe passes: GET /health → 200
│
▼
Agent active. Total: 30-45 seconds.
The provisioner is ~200 lines of Go using client-go. No Helm charts, no YAML templates. The Kubernetes resources are ephemeral artifacts created by application code, not static manifests committed to a repo.
Deprovisioning is the reverse: delete Service, StatefulSet, Secret. The PVC can be retained for a grace period or deleted immediately.
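To make the four resources concrete, here is a sketch of their shape as plain Python dicts, ready to serialize as JSON for the Kubernetes API. Our provisioner is Go with client-go, so this is an illustration of the objects, not that code. The namespace, port 8080, the `/data` mount path, the label scheme, and the `agent-<id>` naming are assumptions. The init container is omitted for brevity.

```python
import json


def agent_manifests(customer_id: str, image: str, env: dict[str, str],
                    namespace: str = "agents") -> list[dict]:
    """Return Secret, PVC, Service and StatefulSet for one customer's agent."""
    name = f"agent-{customer_id}"
    labels = {"app": name}

    def meta() -> dict:
        return {"name": name, "namespace": namespace, "labels": dict(labels)}

    secret = {"apiVersion": "v1", "kind": "Secret", "metadata": meta(),
              "type": "Opaque", "stringData": dict(env)}

    pvc = {"apiVersion": "v1", "kind": "PersistentVolumeClaim", "metadata": meta(),
           "spec": {"accessModes": ["ReadWriteOnce"],
                    "storageClassName": "longhorn",
                    "resources": {"requests": {"storage": "5Gi"}}}}

    # Headless Service: gives the pod a stable internal DNS name.
    service = {"apiVersion": "v1", "kind": "Service", "metadata": meta(),
               "spec": {"clusterIP": "None", "selector": dict(labels),
                        "ports": [{"port": 8080, "targetPort": 8080}]}}

    container = {
        "name": "agent",
        "image": image,
        "envFrom": [{"secretRef": {"name": name}}],
        "ports": [{"containerPort": 8080}],
        # Request = what the scheduler reserves; limit = burst ceiling.
        "resources": {"requests": {"memory": "256Mi", "cpu": "100m"},
                      "limits": {"memory": "1Gi", "cpu": "500m"}},
        "readinessProbe": {"httpGet": {"path": "/health", "port": 8080},
                           "periodSeconds": 5},
        "volumeMounts": [{"name": "state", "mountPath": "/data"}],
    }
    statefulset = {
        "apiVersion": "apps/v1", "kind": "StatefulSet", "metadata": meta(),
        "spec": {
            "serviceName": name,
            "replicas": 1,
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [container],
                    "volumes": [{"name": "state",
                                 "persistentVolumeClaim": {"claimName": name}}],
                },
            },
        },
    }
    return [secret, pvc, service, statefulset]


manifests = agent_manifests("a1b2", "registry.example.com/agent:latest",
                            {"EXAMPLE_API_KEY": "placeholder"})
print(json.dumps([m["kind"] for m in manifests]))
```

Deprovisioning walks the same list and deletes each object by name, optionally leaving the PVC in place for the grace period.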
The Container Image
The agent image is 2.4GB. It bundles a Python runtime, an LLM agent framework, headless Chromium, and FFmpeg. Large, but manageable with a multi-stage build:
┌───────────────────────────────────────────────┐
│ Stage 1: Builder (discarded) │
│ │
│ All build tools: git, Node.js, gcc, npm │
│ Clone + build agent framework venv │
│ Download + install Chromium │
│ Build app dependencies │
│ │
│ Size: ~4.2 GB │
├───────────────────────────────────────────────┤
│ Stage 2: Runtime (shipped) │
│ │
│ Python slim + Chromium libs + FFmpeg + curl │
│ COPY venvs, Chromium binaries, app code │
│ NO git, NO Node.js, NO build tools │
│ │
│ Size: 2.4 GB (-43%) │
└───────────────────────────────────────────────┘
First pull on a new worker: 2-3 minutes. After that, layers are cached. Code updates only change the top layers — new pods start in seconds.
The entrypoint handles two cases:
- First boot (empty volume): copies runtime into the persistent volume, initializes databases, writes default config. ~8 seconds.
- Restart (existing volume): syncs code from the image, skips init. ~2 seconds.
This means code updates are deployed by pushing a new image. State persists on the volume.
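The two entrypoint cases can be sketched as one function that branches on whether the volume has been initialized. The marker file, directory layout, table schema, and default config below are hypothetical stand-ins, not our actual entrypoint.

```python
import json
import shutil
import sqlite3
from pathlib import Path


def boot(image_runtime: Path, volume: Path) -> str:
    """Prepare the persistent volume and report which path was taken."""
    marker = volume / ".initialized"   # hypothetical "already set up" flag
    runtime_dst = volume / "runtime"

    if not marker.exists():
        # First boot: copy the runtime, create the database, write defaults.
        shutil.copytree(image_runtime, runtime_dst, dirs_exist_ok=True)
        db = sqlite3.connect(volume / "state.db")
        try:
            db.execute(
                "CREATE TABLE IF NOT EXISTS messages "
                "(id INTEGER PRIMARY KEY, role TEXT, body TEXT)"
            )
            db.commit()
        finally:
            db.close()
        (volume / "config.json").write_text(json.dumps({"version": 1}))
        marker.write_text("ok")
        return "first-boot"

    # Restart: refresh code from the image, leave database and config alone.
    shutil.copytree(image_runtime, runtime_dst, dirs_exist_ok=True)
    return "restart"
```

Calling `boot(Path("/opt/runtime"), Path("/data"))` on every container start would take the first branch once per customer and the second on every later restart or image update. Both paths in that call are illustrative.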
CI/CD: Push to Main, Deploy Everywhere
┌───────────────────────────────────────────────────┐
│ Push to main │
│ │ │
│ ├─ Build + test │
│ ├─ Docker build → push to registry │
│ ├─ Sync env vars: GitHub Secrets → k8s Secret │
│ ├─ kubectl set image → new version │
│ └─ kubectl rollout status → wait for healthy │
│ │
│ Total: ~3 minutes │
└───────────────────────────────────────────────────┘
The secret sync is important. Every deploy recreates the environment secret from CI secrets. Adding a new env var is one command + one line in the workflow. No SSH, no manual patching, no drift.
New customers get the latest agent image automatically. Existing agents keep running their current version until restarted.
The Cost Curve
[Chart: monthly cost ($) against number of agents, from 10 to 1,000. One VM per agent climbs linearly to $4,500; the k3s cluster steps up from $37 through $52 to $285.]
┌──────────┬─────────────┬─────────────┬──────────┐
│ Agents   │ VMs / mo    │ k3s / mo    │ Saving   │
├──────────┼─────────────┼─────────────┼──────────┤
│ 10       │ $45         │ $37         │ 18%      │
│ 50       │ $225        │ $37         │ 84%      │
│ 100      │ $450        │ $52         │ 88%      │
│ 500      │ $2,250      │ $150        │ 93%      │
│ 1,000    │ $4,500      │ $285        │ 94%      │
└──────────┴─────────────┴─────────────┴──────────┘
Breakdown at 1,000 agents:
1 control plane: $7/mo
~42 workers: $278/mo (24 agents each)
Load balancer: $5.50/mo
─────────────────────────
Total: ~$285/mo
For context, a single m5.xlarge on AWS costs $140/month. We're running 1,000 agents for the price of two EC2 instances.
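The cluster side of the curve is a step function of worker count, which makes it easy to model and re-check as prices change. The sketch below takes the worker price as a parameter because the result is sensitive to it. With the $15/month worker from the cluster diagram, the model gives about $640/month at 1,000 agents. The ~$278 worker line in the breakdown corresponds to roughly $6.60 per worker ($278 / 42). Treat the worker price as the input to verify against current Hetzner pricing. The function names and the two-worker minimum are assumptions.

```python
import math


def vm_cost(agents: int, per_vm: float = 4.50) -> float:
    """One VM per agent: cost grows linearly."""
    return agents * per_vm


def cluster_cost(agents: int, worker_price: float, agents_per_worker: int = 24,
                 control_plane: float = 7.0, load_balancer: float = 5.50,
                 min_workers: int = 2) -> float:
    """k3s cluster: fixed control plane and LB, plus one worker per 24 agents."""
    workers = max(min_workers, math.ceil(agents / agents_per_worker))
    return control_plane + load_balancer + workers * worker_price


for n in (10, 50, 100, 500, 1000):
    vms = vm_cost(n)
    k3s = cluster_cost(n, worker_price=15.0)
    print(f"{n:>5} agents  VMs ${vms:>8.2f}  k3s ${k3s:>7.2f}  saving {100 * (1 - k3s / vms):.0f}%")
```

Even at the higher worker price, the per-agent cost keeps falling as the fixed control plane and load balancer are spread over more agents.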
Lessons From Production
Idle connections are invisible killers. When services move from a local Docker network to communicating over real HTTPS, long-lived connections become fragile. SSE streams that worked fine over localhost die when an HTTP/2 proxy has a 10-second idle timeout. Send keepalive data. Always.
Abstracting the host early pays for itself. If your routing layer stores "the IP address where this customer's agent runs," you'll touch every file when migrating to DNS-based routing. Store a generic host field from day one. IP addresses and DNS names both work in a URL.
Don't hand-craft cluster state. Every manual kubectl command is a patch that will drift from what your code expects. The third time a feature breaks because an env var is missing from the cluster secret, you'll automate it. Save yourself the first two incidents.
Overcommit memory deliberately. The 4:1 ratio (256MB request, 1GB limit) isn't a hack — it's how you make the economics work. Agents don't burst simultaneously. If they ever do, the OOM killer takes out one pod, not the node. Kubernetes restarts it. Design for this.
Let the orchestrator orchestrate. Don't write custom health checks, custom scheduling, custom failover. That's what Kubernetes does. Your code creates resources and reads DNS names. Everything else is the cluster's job.
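The keepalive lesson above can be sketched as a generator that emits an SSE comment line whenever the stream goes quiet. Lines beginning with `:` are comments in the Server-Sent Events format, so clients ignore them while proxies still see traffic. The queue-based producer, the `None` sentinel, and the 5-second interval (chosen to sit under a 10-second proxy timeout) are assumptions for illustration. Payloads are assumed to be single-line strings.

```python
import queue
from typing import Iterator


def sse_stream(events: queue.Queue, idle_seconds: float = 5.0) -> Iterator[str]:
    """Yield SSE frames, inserting a comment frame when nothing arrives in time."""
    while True:
        try:
            item = events.get(timeout=idle_seconds)
        except queue.Empty:
            # Nothing to send: emit a comment so the connection is not idle.
            yield ": keepalive\n\n"
            continue
        if item is None:
            # Sentinel from the producer: the stream is finished.
            return
        yield f"data: {item}\n\n"
```

A web framework's streaming response can iterate over `sse_stream(q)` directly. The agent side puts messages on `q` and finishes with `q.put(None)`.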
The Stack
| Component | Tool | Why |
|---|---|---|
| Orchestrator | k3s | Kubernetes in 70MB. SQLite control plane. |
| Storage | Longhorn | Replicated volumes. Survives node failure. |
| Ingress | Traefik | Bundled with k3s. SSE-friendly with config. |
| TLS | cert-manager | Let's Encrypt. No Cloudflare dependency. |
| Cloud glue | hcloud-ccm | Auto-provisions Hetzner LBs. |
| K8s client | client-go | Programmatic resource creation from Go. |
| Infrastructure | Hetzner Cloud | $7/mo VMs. Private networking. EU data. |
The Takeaway
The hardest part of running a thousand AI agents isn't the AI. LLM calls are stateless — scale those with money. The hard part is the persistent process wrapping the LLM: its state, its uptime, its health, its isolation from other tenants.
One VM per agent is the right starting point for the first ten customers. It's simple and debuggable. But the cost curve diverges fast, and the migration isn't free.
If you know you'll cross 50 agents, start with containers and a lightweight orchestrator. k3s + Longhorn on Hetzner gets you from $4.50/agent to $0.28/agent, with better reliability than individual VMs ever had.
The setup takes a day. The debugging takes another. The savings compound every month after that.