Engineering · February 2026 · 8 min read

How We Built a Self-Healing Fleet Across 13 Cloud Providers

Most distributed systems run on one cloud. Ours runs on thirteen. Here's how we built a deployment engine that provisions instances across AWS, GCP, Azure, Alibaba, Tencent, IBM, Oracle, DigitalOcean, Vultr, Linode, Hetzner, OVH, and Scaleway, and automatically replaces spot instances when they're reclaimed.

Why 13 Clouds?

DDactic is a DDoS resilience testing platform. When customers want to test their infrastructure's ability to handle distributed traffic, the key word is distributed. Traffic from a single cloud provider is trivially identifiable and filterable. Traffic from 13 different ASNs, spread across dozens of regions, behaves much more like a real-world attack.

There's also a practical cost argument. Spot instances on AWS, GCP, Azure, Alibaba, and Tencent cost 60-90% less than on-demand, but they can be reclaimed at any time. If one provider reclaims your instances, having 12 others ensures the test continues.

13 cloud platforms · 5 spot-capable · ~15s deploy-to-running

The Deploy Service

The core is a single Go binary running on a dedicated server. It exposes three endpoints: POST /deploy, POST /destroy, and GET /status. Each cloud platform has a dedicated adapter that handles authentication, API translation, region selection, and error handling.

We deliberately avoided Terraform and Pulumi. Infrastructure-as-code tools are designed for static, declared infrastructure. Our fleet is ephemeral: instances spin up for a test, run for hours, and get destroyed. We needed imperative, fast, API-first deployment.

The Platform Quirks Nobody Warns You About

Building one cloud integration is straightforward. Building thirteen means encountering every edge case in cloud computing.

The Boot Sequence

Every instance, on every platform, runs the same boot.sh script on startup. It has one job: get the bot binary running and connected to the Fleet Controller.

# Simplified flow:
1. Self-update boot.sh from Binary Server
2. Detect platform via metadata endpoints
   169.254.42.42      → Scaleway
   100.100.100.200    → Alibaba
   169.254.0.23       → Tencent
   169.254.169.254    → AWS/GCP/Azure/OVH (differentiated by headers)
   /sys/class/dmi/id  → Linode/Vultr (vendor string)
   hostname pattern   → IBM (fallback)
3. Generate unique BOT_ID from instance metadata
4. Write Fleet Controller configuration
5. Download latest bot binary (18MB) from Binary Server
6. Start bot service, register with Fleet Controller

The critical insight: every metadata endpoint needs a timeout. Without --connect-timeout 3, a curl to a non-existent metadata IP hangs indefinitely. On the wrong platform, that's a boot script that never finishes.
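The same fail-fast idea, sketched in Go for illustration (boot.sh itself uses curl). The endpoint URLs mirror the table above; the `probe` helper and its header handling are our own sketch, not a library API:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probe hits a metadata endpoint with a hard timeout, so probing the
// wrong platform's link-local IP fails fast instead of hanging boot.
func probe(url string, headers map[string]string) bool {
	client := &http.Client{Timeout: 3 * time.Second} // never block boot
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return false
	}
	for k, v := range headers {
		req.Header.Set(k, v)
	}
	resp, err := client.Do(req)
	if err != nil {
		return false // timeout, no route, connection refused: all "not this platform"
	}
	resp.Body.Close()
	return resp.StatusCode == 200
}

func main() {
	// In production the probes walk the endpoint table, e.g.
	//   probe("http://100.100.100.200/latest/meta-data/", nil)               // Alibaba
	//   probe("http://169.254.169.254/computeMetadata/v1/",
	//         map[string]string{"Metadata-Flavor": "Google"})                // GCP (header-gated)
	// Here we probe a reserved TEST-NET address to show the fail-fast path.
	fmt.Println(probe("http://192.0.2.1/", nil)) // → false, within the timeout
}
```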

Spot Instance Recovery

Five of our thirteen platforms support spot/preemptible instances. The savings are significant, but any instance can be reclaimed with as little as 30 seconds' notice.

The Spot Monitor is a separate Go service that polls a /status/instances endpoint on the Deploy Service every 60 seconds. This endpoint queries all 13 platform adapters in parallel and returns every known instance with its spot status.

The recovery timeline: Cloud reclaims instance (T+0) → Spot Monitor detects on next poll (T+60s worst case) → Deploy Service provisions replacement (T+61s) → Bot running and registered with the Fleet Controller (T+75s). Total fleet downtime: one polling cycle.
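The detect-and-redeploy loop can be sketched as below. `Instance`, `reconcile`, and `monitor` are illustrative names; the real wire format of /status/instances differs:

```go
package main

import (
	"fmt"
	"time"
)

// Instance mirrors the shape of the /status/instances payload.
// Names are illustrative, not the exact wire format.
type Instance struct {
	Platform string
	ID       string
	Spot     bool
	State    string // "running", "terminated", ...
}

// reconcile returns the spot instances the platforms report as
// reclaimed; these are the ones that need a replacement deploy.
func reconcile(known []Instance) []Instance {
	var lost []Instance
	for _, in := range known {
		if in.Spot && in.State == "terminated" {
			lost = append(lost, in)
		}
	}
	return lost
}

// monitor is the outer loop: one poll per cycle, one redeploy per loss.
// The 60s sleep is why worst-case detection lag is one polling cycle.
func monitor(fetch func() []Instance, redeploy func(Instance)) {
	for {
		for _, in := range reconcile(fetch()) {
			redeploy(in) // replacement request also sets spot: true
		}
		time.Sleep(60 * time.Second)
	}
}

func main() {
	fleet := []Instance{
		{Platform: "aws", ID: "i-1", Spot: true, State: "terminated"},
		{Platform: "gcp", ID: "g-1", Spot: true, State: "running"},
	}
	for _, in := range reconcile(fleet) {
		fmt.Printf("redeploying %s/%s\n", in.Platform, in.ID) // → redeploying aws/i-1
	}
}
```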

A subtlety: when redeploying, the replacement request includes spot: true. The replacement is also a spot instance, maintaining cost savings. If the same platform keeps reclaiming, the monitor retries up to 3 times before giving up.

The GCP Gotcha

GCP spot VMs have a configuration option called instanceTerminationAction. Set it to DELETE and GCP automatically deletes the instance when it's reclaimed. This sounds convenient until you realize your monitoring system can't see an instance that no longer exists.

We set it to STOP instead. The instance appears as TERMINATED in the API, which the Spot Monitor detects and reacts to. After triggering a redeploy, the old terminated instance can be cleaned up.

Fleet-Wide Updates Without SSH

We never SSH into bot instances. Updates happen through the boot sequence: on every service restart, the bot downloads the latest binary from the Binary Server (Nginx on port 9999). To update the entire fleet, we:

  1. Upload the new binary to the Binary Server
  2. The next time any bot restarts (or a new one is provisioned), it pulls the new version
  3. For immediate rollout, the Fleet Controller can issue a restart command to the entire fleet

This is simpler than it sounds because bot instances are ephemeral. They exist for the duration of a test (hours, sometimes days) and then get destroyed. The natural churn means the fleet converges to the latest binary within a test cycle.

What We Learned

The result: a deployment engine that provisions load generation instances across 13 cloud providers in ~15 seconds, auto-recovers spot reclamations in ~75 seconds, and updates the entire fleet without SSH access to any machine.

DDactic is a DDoS resilience testing platform. If you're interested in our engineering challenges or want to learn more about multi-cloud orchestration, reach out at [email protected].