How We Find Every Subdomain: 13 Sources, AI Validation, and the Discovery Pipeline
The Single-Source Problem
If you search for your company on crt.sh right now, you'll get a list of subdomains pulled from Certificate Transparency logs. It's the most popular subdomain discovery tool in existence, and security teams use it daily.
The problem: crt.sh only sees domains that have had TLS certificates issued for them. It misses:
- Subdomains using wildcard certificates (the individual hostnames never appear in CT logs)
- Internal services that use self-signed or private CA certificates
- Domains registered but not yet deployed (no certificate issued)
- Subsidiary brands that use completely different domain names
- Legacy domains from pre-HTTPS era that still resolve
Every single data source has blind spots like these. Shodan finds internet-facing services but misses CDN-proxied domains. SecurityTrails has historical DNS but gaps in certain TLDs. VirusTotal aggregates malware reports but doesn't cover domains that were never flagged.
No single source gives you the full picture. The only way to approach completeness is to query many sources and merge the results.
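The merge step is mostly bookkeeping: normalize hostnames, deduplicate, and remember which sources saw each one. A minimal sketch (the source names and inputs here are illustrative, not DDactic's internals):

```python
# Merge subdomain results from multiple sources into one deduplicated
# mapping of hostname -> set of sources that reported it.
from collections import defaultdict

def merge_results(results_by_source: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each normalized hostname to the set of sources that found it."""
    merged: dict[str, set[str]] = defaultdict(set)
    for source, hostnames in results_by_source.items():
        for host in hostnames:
            # Normalize: lowercase, strip trailing dot and wildcard label.
            host = host.lower().rstrip(".")
            if host.startswith("*."):
                host = host[2:]
            merged[host].add(source)
    return dict(merged)

found = merge_results({
    "crt.sh":         ["API.example.com", "*.example.com", "mail.example.com."],
    "SecurityTrails": ["api.example.com", "vpn.example.com"],
    "Shodan":         ["vpn.example.com", "legacy.example.com"],
})
```

A useful side effect: a hostname confirmed by several independent sources is more likely to be real, so the source set doubles as a cheap confidence signal.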
The 13 Sources
DDactic's discovery pipeline queries these sources for every scan:
| Category | Source | What It Finds |
|---|---|---|
| Certificate Transparency | crt.sh | Every TLS certificate ever issued for a domain |
| Certificate Transparency | CertSpotter | Real-time CT log monitoring with different coverage than crt.sh |
| Threat Intelligence | VirusTotal | Subdomains seen in malware analysis, URL scans, passive DNS |
| Threat Intelligence | SecurityTrails | Historical DNS records, WHOIS history, subdomain enumeration |
| Threat Intelligence | CIRCL Passive DNS | European passive DNS collection from network sensors |
| Search & OSINT | Google CSE | Indexed subdomains from web crawling |
| Search & OSINT | WHOISXML | Domain registration data, reverse WHOIS lookups |
| Search & OSINT | Shodan | Internet-wide port scanning, service banners, SSL certificates |
| Network | RIPE Atlas | BGP routing data, IP prefix announcements |
| Network | BGP Routing Tables | ASN mapping, IP range ownership |
| Discovery | DNS Brute-force | Common subdomain patterns (dev, staging, api, vpn, mail) |
| Discovery | Reverse WHOIS | Other domains registered by the same organization |
| Credential | Breach Databases | Employee credentials exposed in known data breaches |
Rate limit bypass: Many of these APIs rate-limit or block requests from cloud IPs and known scanner ranges. DDactic routes discovery through residential IP infrastructure, bypassing bot detection, JS challenges, and IP reputation filters. This surfaces results that scanners running from AWS or GCP never see.
The Pipeline: Three Stages
Stage 1: SLD Discovery
The first stage identifies all Second-Level Domains (SLDs) that belong to the target organization. If we're scanning "Migdal Insurance," we need to find every domain they operate: migdal.co.il, migdalins.co.il, migdal-capital.co.il, and potentially dozens more.
We query crt.sh by organization name (not just domain), which returns certificates issued to "Migdal Insurance" regardless of domain. Then we cross-reference with reverse WHOIS, SecurityTrails, and other sources to find domains registered by the same entity.
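A sketch of the crt.sh half of this step, using its public JSON interface (`q` plus `output=json`). Treat the response field names, such as `name_value`, as details to verify against crt.sh itself:

```python
# Query crt.sh for certificates matching a search term (an org name works,
# not just a domain), then collect every hostname the certificates cover.
import json
import urllib.parse
import urllib.request

def extract_domains(certs: list[dict]) -> set[str]:
    """Pull hostnames out of crt.sh JSON records.

    name_value holds newline-separated subject alternative names.
    """
    domains = set()
    for cert in certs:
        for name in cert.get("name_value", "").splitlines():
            name = name.strip().lower().lstrip("*.")
            if name:
                domains.add(name)
    return domains

def crtsh_search(query: str) -> set[str]:
    url = "https://crt.sh/?" + urllib.parse.urlencode(
        {"q": query, "output": "json"}
    )
    with urllib.request.urlopen(url, timeout=30) as resp:
        return extract_domains(json.load(resp))

# crtsh_search("Migdal Insurance")  # search by organization name
```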
Stage 2: AI Validation and Expansion
This is where DDactic diverges from every other scanner we've seen.
Raw SLD discovery produces noise. When we search for "Migdal," crt.sh also returns migdal.de (a German company), migdal-emek.muni.il (a municipality), and various unrelated domains.
AI Validation: Every discovered SLD is sent to Claude Haiku with the company context. The AI scores each domain 0-100 for ownership confidence:
```
Input: Company "Migdal Insurance" (Israel)
Discovered SLD: migdal.de

AI Output:
Score: 8/100
Reason: German TLD (.de) for an Israeli insurance company.
        No evidence of international operations in Germany.
Verdict: EXCLUDE
```
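In code, the validation step reduces to building a prompt and parsing a structured reply. The prompt wording and the `score`/`reason`/`verdict` reply schema below are illustrative assumptions, not DDactic's actual prompts; the commented SDK call shows roughly how the request would be sent:

```python
# Sketch of the SLD-validation step: scoring prompt in, JSON verdict out.
import json

def build_validation_prompt(company: str, country: str, sld: str) -> str:
    return (
        f'Company: "{company}" ({country})\n'
        f"Discovered SLD: {sld}\n\n"
        "Score 0-100 how confident you are that this domain belongs to the "
        'company. Reply as JSON: {"score": int, "reason": str, '
        '"verdict": "INCLUDE" or "EXCLUDE"}.'
    )

def parse_validation_reply(reply: str, threshold: int = 50) -> bool:
    """Return True if the SLD should be kept."""
    data = json.loads(reply)
    return data["verdict"] == "INCLUDE" and data["score"] >= threshold

# With the anthropic SDK the call would be roughly:
#   client.messages.create(model="claude-3-haiku-20240307", max_tokens=256,
#                          messages=[{"role": "user", "content": prompt}])
```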
AI Expansion: After validation, a second AI call suggests SLDs that the 13 sources might have missed:
```
Input: Company "Migdal Insurance" (Israel)
Known SLDs: migdal.co.il, migdalins.co.il

AI Output:
Suggestions:
- migdal-capital.co.il (investment subsidiary)
- migdalor.co.il (brand variant)
- migdal-health.co.il (health insurance arm)

Verified via DNS: 2 of 3 resolve to active servers
```
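The DNS verification step is simple to sketch with the standard library: an AI suggestion only survives if it actually resolves, so hallucinated domains are dropped before they ever reach a report.

```python
# Keep only AI-suggested SLDs that resolve to at least one address.
import socket

def resolves(domain: str) -> bool:
    """True if the domain has at least one resolvable A/AAAA record."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        return False

def verify_suggestions(suggestions: list[str]) -> list[str]:
    return [d for d in suggestions if resolves(d)]
```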
Country context detection: If any confirmed SLD ends in .co.il, the AI automatically flags domains with foreign TLDs (.de, .fr, .it) as likely false positives. This single heuristic eliminated 25% of false positives in our Israeli financial sector scans.
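The heuristic itself fits in a few lines. This sketch hardcodes the Israeli case from the example, and the ccTLD list is a small illustrative subset:

```python
# Country-context heuristic: once confirmed SLDs establish a home ccTLD,
# candidates on foreign ccTLDs are flagged as likely false positives.
FOREIGN_CCTLDS = {".de", ".fr", ".it", ".es", ".nl"}

def flag_foreign(candidates: list[str], confirmed: list[str]) -> list[str]:
    """Return candidates whose ccTLD conflicts with the confirmed home TLD."""
    if not any(d.endswith(".co.il") for d in confirmed):
        return []  # no Israeli context established; flag nothing
    return [d for d in candidates
            if any(d.endswith(tld) for tld in FOREIGN_CCTLDS)]
```

Flagged domains are not dropped outright; they go to the AI validation stage with a strong prior toward EXCLUDE.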
Stage 3: Deep Enumeration and L7 Recon
For every validated SLD, we enumerate all subdomains across our 13 sources, then perform L7 reconnaissance on every discovered asset:
- HTTP fingerprinting: Server headers, response codes, TLS certificate details
- CDN/WAF detection: Which protection vendor fronts each asset
- Origin discovery: Finding the real server IP behind CDN proxies
- Login detection: Identifying customer portals, admin panels, API endpoints
- Technology stack: Frameworks, CMS, cloud platforms
- Breach exposure: Credential leaks associated with each domain
AI Analysis: Four Stages
AI isn't just used for SLD validation. Four distinct Claude Haiku calls happen during every scan:
| Stage | Purpose | Input | Output |
|---|---|---|---|
| SLD Validation | Filter false positives | Company name + discovered SLDs | Score 0-100 per SLD, include/exclude verdict |
| SLD Expansion | Find missing domains | Company name + confirmed SLDs | Suggested SLDs, verified via DNS |
| Platform Classification | Map hosting topology | Assets with HTTP headers | Platform labels (Cloudflare, Vercel, AWS, etc.) |
| Resilience Analysis | Assess DDoS risk | All assets + findings | Risk scores, attack vectors, remediation priority |
Total cost: under $0.02 per scan for four API calls to Claude Haiku. The model is fast enough that all four stages add less than 30 seconds to a scan that runs 5-15 minutes.
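The cost claim is easy to sanity-check. Assuming Claude 3 Haiku's published per-token pricing and deliberately generous per-stage token budgets (all four token counts below are illustrative):

```python
# Back-of-envelope per-scan AI cost: Claude 3 Haiku at $0.25 / $1.25
# per million input / output tokens.
INPUT_PER_MTOK, OUTPUT_PER_MTOK = 0.25, 1.25

def scan_cost(calls: list[tuple[int, int]]) -> float:
    """calls = [(input_tokens, output_tokens), ...], one pair per AI stage."""
    return sum(i * INPUT_PER_MTOK / 1e6 + o * OUTPUT_PER_MTOK / 1e6
               for i, o in calls)

# Validation, expansion, platform classification, resilience analysis:
cost = scan_cost([(4000, 1000), (2000, 500), (8000, 2000), (10000, 3000)])
# Even with these generous budgets, cost stays well under two cents.
```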
Real-World Results
Case: Israeli Financial Group
Single-source scan (crt.sh only): 47 domains found. 12 were false positives (foreign companies with similar names). 35 real domains.
DDactic full pipeline: 58 unique domains found across 13 sources. AI excluded 14 false positives. AI suggested 4 additional SLDs, 3 confirmed via DNS. Final count: 47 verified domains. 12 more than single-source, zero false positives.
What the Extra 12 Domains Revealed
- 3 unprotected API endpoints serving JSON without any CDN or WAF
- 2 staging environments with production database connections
- 1 legacy VPN portal on an old IP range, no rate limiting
- 4 subsidiary brand domains that the security team didn't know were still active
- 2 internal tools (admin panels) exposed to the public internet
None of these appeared in crt.sh. All of them represent real attack surface.
Why Breach Databases Matter for DDoS
This might seem counterintuitive: why does a DDoS resilience platform check breach databases?
Because attack surface is not just infrastructure. If 200 employee credentials from your organization appeared in a breach dump, an attacker can:
- Log in to your VPN portal (bypassing all network-level protection)
- Access internal dashboards that aren't behind CDN/WAF
- Authenticate to API endpoints and bypass rate limiting
- Pivot from credential access to service disruption
We found that companies with the strongest perimeter defense often had the highest credential exposure. The investment went to CDN/WAF/scrubbing, while credential hygiene and monitoring were deprioritized.
The Residential Proxy Advantage
Half of our 13 sources rate-limit or block requests from cloud IP ranges. This is a real problem for scanners running on AWS, GCP, or any cloud provider.
DDactic routes all discovery through residential IP infrastructure. The requests look like a regular user browsing from a home connection, not a scanner running from us-east-1.
This matters because:
- crt.sh: Aggressively rate-limits cloud IPs. Residential IPs get full access.
- VirusTotal: API quotas are stricter for known cloud ranges.
- Shodan: Some queries return filtered results for automated scanners.
- Google CSE: Cloud IP requests trigger CAPTCHA challenges.
The difference is measurable: residential-routed queries return 30-40% more results than direct cloud queries for the same targets.
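Routing a discovery request through a proxy takes only the standard library. The proxy URL below is a placeholder for whatever credentials a residential provider issues:

```python
# Build an opener that sends all HTTP/HTTPS traffic through a proxy and
# presents a browser-like User-Agent (scanner UAs are also fingerprinted).
import urllib.request

def build_proxied_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    handler = urllib.request.ProxyHandler({"http": proxy_url,
                                           "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return opener

opener = build_proxied_opener("http://user:pass@proxy.example:8080")
# opener.open("https://crt.sh/?q=example.com&output=json")
```

Real residential providers rotate exit IPs per session, so the pipeline also has to tolerate a given source seeing each request from a different address.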
Architecture
The pipeline runs on dedicated infrastructure (not serverless, not Lambda). Single-company scans execute on our Dedibox backend server, where the full Python pipeline has access to all 13 sources, AI validation, and breach database lookups.
For industry-wide scans (scanning hundreds of companies simultaneously), we use AWS Batch with a Go-based scanner binary that parallelizes across companies.
Single Company Scan Flow:
```
Dashboard (Cloudflare Pages)
        |
        v
Backend API (Dedibox :8080)
        |
        +---> crt.sh (residential proxy) --+
        +---> VirusTotal API               |
        +---> SecurityTrails API           |
        +---> Shodan API                   +--> SLD List
        +---> CertSpotter API              |
        +---> CIRCL Passive DNS            |
        +---> 7 more sources ... ----------+
        |
        v
AI Validation (Claude Haiku) --> Filtered SLDs
        |
        v
AI Expansion (Claude Haiku) --> Additional SLDs
        |
        v
Subdomain Enumeration (all sources, per SLD)
        |
        v
L7 Recon (HTTP, TLS, CDN, WAF, Origin)
        |
        v
AI Platform Classification (Claude Haiku)
        |
        v
AI Resilience Analysis (Claude Haiku)
        |
        v
Results --> S3 --> Dashboard
```
What's Next
We're expanding the pipeline in three directions:
- Continuous monitoring: Re-scanning at intervals to detect new subdomains, expired certificates, and configuration drift
- Mobile and desktop app recon: Traffic interception labs that discover API endpoints apps connect to (not visible in DNS or CT logs)
- Active validation: After passive discovery, optionally testing each asset's DDoS resilience with controlled load from our 24-platform bot fleet
The goal is a complete loop: discover the attack surface, assess its resilience, harden the gaps, and continuously verify.
Try it yourself: Run a free scan at ddactic.net/free-scan. No account required. See what your attack surface actually looks like across 13 intelligence sources.