How We Built a Self-Healing Fleet Across 19 Cloud Providers
Why 19 Clouds?
DDactic is a DDoS resilience testing platform. When customers want to test their infrastructure's ability to handle distributed traffic, the keyword is distributed. Traffic from a single cloud provider is trivially identifiable and filterable. Traffic from 19 different ASNs, spread across dozens of regions, behaves much more like a real-world attack.
There's also a practical cost argument. Spot instances on AWS, GCP, Azure, Alibaba, and Tencent cost 60-90% less than on-demand, but they can be reclaimed at any time. If one provider reclaims your instances, having 18 others ensures the test continues.
The Deploy Service
The core is a single Go binary running on a dedicated server. It exposes three endpoints: POST /deploy, POST /destroy, and GET /status. Each cloud platform has a dedicated adapter that handles authentication, API translation, region selection, and error handling.
We deliberately avoided Terraform and Pulumi. Infrastructure-as-code tools are designed for static, declared infrastructure. Our fleet is ephemeral: instances spin up for a test, run for hours, and get destroyed. We needed imperative, fast, API-first deployment.
The Platform Quirks Nobody Warns You About
Building one cloud integration is straightforward. Building nineteen means encountering every edge case in cloud computing:
- OVH and AWS share the same metadata IP (169.254.169.254). Our boot script checks /openstack/latest/vendor_data.json first to distinguish them.
- Scaleway's cloud-init doesn't run on snapshot-based instances. The cached cloud-init state from the snapshot prevents API-provided userdata from executing. We had to make the bot binary handle its own configuration.
- Tencent's mainland China regions are GFW-blocked. Instances in Shanghai or Beijing can't reach our Fleet Controllers. We only use non-China regions (ap-singapore, etc.).
- IBM requires VPC infrastructure pre-provisioning. Unlike other providers where you just create an instance, IBM needs a VPC, subnet, and security group first.
- GCP's instanceTerminationAction: DELETE silently removes reclaimed instances. We discovered this when our spot monitor couldn't find terminated instances to redeploy. They'd been auto-deleted by GCP.
The Boot Sequence
Every instance, on every platform, runs the same boot.sh script on startup. It has one job: get the bot binary running and connected to the Fleet Controller.
# Simplified flow:
1. Self-update boot.sh from Binary Server
2. Detect platform via metadata endpoints
   169.254.42.42     → Scaleway
   100.100.100.200   → Alibaba
   169.254.0.23      → Tencent
   169.254.169.254   → AWS/GCP/Azure/OVH (differentiated by headers)
   /sys/class/dmi/id → Linode/Vultr (vendor string)
   hostname pattern  → IBM (fallback)
3. Generate unique BOT_ID from instance metadata
4. Write Fleet Controller configuration
5. Download latest bot binary (18MB) from Binary Server
6. Start bot service, register with Fleet Controller
The critical insight: every metadata endpoint needs a timeout. Without --connect-timeout 3, a curl to a non-existent metadata IP hangs indefinitely. On the wrong platform, that's a boot script that never finishes.
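The detection step plus the timeout rule looks roughly like this in Go (the boot script itself is shell, so this is a hedged translation; the map covers only the three unambiguous IPs from the flow above, since 169.254.169.254 and the DMI/hostname fallbacks need secondary checks):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// metadataIPs maps a probe address to the platform it identifies.
// 169.254.169.254 (AWS/GCP/Azure/OVH) is deliberately absent: it is
// shared and needs header/path-based disambiguation, just as
// Linode/Vultr need /sys/class/dmi/id and IBM a hostname pattern.
var metadataIPs = map[string]string{
	"169.254.42.42":   "scaleway",
	"100.100.100.200": "alibaba",
	"169.254.0.23":    "tencent",
}

// detectPlatform probes each candidate IP. The bounded client timeout
// is the whole point: on the wrong platform the address routes
// nowhere, and an unbounded GET would hang the boot script forever.
func detectPlatform(client *http.Client) string {
	for ip, platform := range metadataIPs {
		resp, err := client.Get("http://" + ip + "/")
		if err != nil {
			continue // not this platform, or endpoint unreachable
		}
		resp.Body.Close()
		return platform
	}
	return "unknown" // fall through to shared-IP and DMI checks
}

func main() {
	// Equivalent of curl's --connect-timeout 3: never hang the boot.
	client := &http.Client{Timeout: 3 * time.Second}
	fmt.Println("platform:", detectPlatform(client))
}
```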
Spot Instance Recovery
Five of our nineteen platforms support spot/preemptible instances. The savings are significant, but any instance can be reclaimed with as little as 30 seconds notice.
The Spot Monitor is a separate Go service that polls a /status/instances endpoint on the Deploy Service every 60 seconds. This endpoint queries all 19 platform adapters in parallel and returns every known instance with its spot status.
The recovery timeline: Cloud reclaims instance (T+0) → Spot Monitor detects on next poll (T+60s worst case) → Deploy Service provisions replacement (T+61s) → Bot running and registered with the Fleet Controller (T+75s). Total fleet downtime: one polling cycle.
A subtlety: when redeploying, the replacement request includes spot: true. The replacement is also a spot instance, maintaining cost savings. If the same platform keeps reclaiming, the monitor retries up to 3 times before giving up.
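The monitor's per-cycle decision logic can be sketched as follows; field and function names (`InstanceStatus`, `planRedeploys`) are illustrative, and a real monitor would reset the retry counter once a replacement survives a few cycles:

```go
package main

import "fmt"

// InstanceStatus is the per-instance view the Spot Monitor gets from
// the Deploy Service's /status/instances endpoint (fields assumed).
type InstanceStatus struct {
	ID        string
	Platform  string
	Spot      bool
	Reclaimed bool
}

const maxRetries = 3

// retries tracks redeploy attempts per instance slot.
var retries = map[string]int{}

// planRedeploys decides, for one 60-second polling cycle, which
// reclaimed spot instances get a replacement. Replacements keep
// spot: true; after 3 attempts on the same slot we give up, since
// that platform is evidently reclaiming faster than we can redeploy.
func planRedeploys(statuses []InstanceStatus) []string {
	var redeploy []string
	for _, s := range statuses {
		if !s.Spot || !s.Reclaimed {
			continue
		}
		if retries[s.ID] >= maxRetries {
			continue // give up on this slot
		}
		retries[s.ID]++
		redeploy = append(redeploy, s.ID)
	}
	return redeploy
}

func main() {
	// One polling cycle with a single reclaimed spot instance.
	statuses := []InstanceStatus{
		{ID: "gcp-1", Platform: "gcp", Spot: true, Reclaimed: true},
	}
	fmt.Println("redeploying:", planRedeploys(statuses))
}
```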
The GCP Gotcha
GCP spot VMs have a configuration option called instanceTerminationAction. Set it to DELETE and GCP automatically deletes the instance when it's reclaimed. This sounds convenient until you realize your monitoring system can't see an instance that no longer exists.
We set it to STOP instead. The instance appears as TERMINATED in the API, which the Spot Monitor detects and reacts to. After triggering a redeploy, the old terminated instance can be cleaned up.
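In the instance creation request, the relevant fragment of GCP's Compute Engine scheduling block looks like this (sketch; only these two fields matter for the behavior described):

```json
{
  "scheduling": {
    "provisioningModel": "SPOT",
    "instanceTerminationAction": "STOP"
  }
}
```

With STOP, a reclaimed VM sticks around in the TERMINATED state instead of vanishing, so the monitor has something to observe and act on.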
Fleet-Wide Updates Without SSH
We never SSH into bot instances. Updates happen through the boot sequence: on every service restart, the bot downloads the latest binary from the Binary Server (Nginx on port 9999). To update the entire fleet, we:
- Upload the new binary to the Binary Server
- The next time any bot restarts (or a new one is provisioned), it pulls the new version
- For immediate rollout, the Fleet Controller can issue a restart command to the entire fleet
This is simpler than it sounds because bot instances are ephemeral. They exist for the duration of a test (hours, sometimes days) and then get destroyed. The natural churn means the fleet converges to the latest binary within a test cycle.
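The bot's side of the update path reduces to "pull on restart, replace if changed". A minimal sketch, assuming a hypothetical binary path and URL (the real Binary Server is the Nginx instance on port 9999 mentioned above):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// binaryURL points at the Binary Server; host and path are assumed.
const binaryURL = "http://binary-server:9999/bot"

// needsUpdate compares the running binary against the downloaded one.
func needsUpdate(current, latest []byte) bool {
	return sha256.Sum256(current) != sha256.Sum256(latest)
}

// fetchLatest downloads the fleet's current bot binary with a bounded
// timeout, mirroring the boot sequence's pull-on-restart rule.
func fetchLatest() ([]byte, error) {
	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Get(binaryURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	current, _ := os.ReadFile("/opt/bot/bot") // path illustrative
	latest, err := fetchLatest()
	if err != nil {
		fmt.Println("binary server unreachable:", err)
		return
	}
	if needsUpdate(current, latest) {
		fmt.Println("new binary, replacing and restarting")
	}
}
```

Because every restart runs this check, a fleet-wide restart command doubles as a fleet-wide upgrade.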
What We Learned
- Every cloud API is different in surprising ways. Response formats, error codes, async vs sync operations, pagination styles. Nothing is standardized. Budget 2-3x the time you think you'll need for each new platform.
- Metadata endpoints are the universal bootstrapping mechanism, but they're not universal. Each platform's metadata has different paths, auth requirements, and response formats.
- Spot savings are real but require investment. The monitoring and recovery infrastructure costs engineering time. For us, the 60-90% savings across hundreds of instances justifies it.
- Don't use infrastructure-as-code for ephemeral workloads. Terraform state files and reconciliation loops add complexity without value when instances live for hours.
The result: A deployment engine that provisions load generation instances across 19 cloud providers in ~15 seconds, auto-recovers spot reclamations in ~75 seconds, and updates the entire fleet without SSH access to any machine.
DDactic is a DDoS resilience testing platform. If you're interested in our engineering challenges or want to learn more about multi-cloud orchestration, reach out at [email protected].