How We Find Every Subdomain: 13 Sources, AI Validation, and the Discovery Pipeline
The Single-Source Problem
If you search for your company on crt.sh right now, you'll get a list of subdomains pulled from Certificate Transparency logs. It's the most popular subdomain discovery tool in existence, and security teams use it daily.
The problem: crt.sh only sees domains that have had TLS certificates issued for them. It misses:
- Subdomains using wildcard certificates (the individual hostnames never appear in CT logs)
- Internal services that use self-signed or private CA certificates
- Domains registered but not yet deployed (no certificate issued)
- Subsidiary brands that use completely different domain names
- Legacy domains from pre-HTTPS era that still resolve
Every single data source has blind spots like these. Shodan finds internet-facing services but misses CDN-proxied domains. SecurityTrails has historical DNS but gaps in certain TLDs. VirusTotal aggregates malware reports but doesn't cover domains that were never flagged.
No single source gives you the full picture. The only way to approach completeness is to query many sources and merge the results.
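The merge step is mostly bookkeeping: normalize hostnames, deduplicate, and remember which sources saw each one. A minimal sketch (the source names and inputs here are illustrative, not DDactic's internals):

```python
# Merge subdomain results from multiple sources into one deduplicated
# mapping of hostname -> set of sources that reported it.
from collections import defaultdict

def merge_results(results_by_source: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each normalized hostname to the set of sources that found it."""
    merged: dict[str, set[str]] = defaultdict(set)
    for source, hostnames in results_by_source.items():
        for host in hostnames:
            # Normalize: lowercase, strip trailing dot and wildcard label.
            host = host.lower().rstrip(".")
            if host.startswith("*."):
                host = host[2:]
            merged[host].add(source)
    return dict(merged)

found = merge_results({
    "crt.sh":         ["API.example.com", "*.example.com", "mail.example.com."],
    "SecurityTrails": ["api.example.com", "vpn.example.com"],
    "Shodan":         ["vpn.example.com", "legacy.example.com"],
})
```

A useful side effect: a hostname confirmed by several independent sources is more likely to be real, so the source set doubles as a cheap confidence signal.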
The 13 Sources
DDactic's discovery pipeline queries these sources for every scan:
| Category | Source | What It Finds |
|---|---|---|
| Certificate Transparency | crt.sh | Every TLS certificate ever issued for a domain |
| Certificate Transparency | CertSpotter | Real-time CT log monitoring with different coverage than crt.sh |
| Threat Intelligence | VirusTotal | Subdomains seen in malware analysis, URL scans, passive DNS |
| Threat Intelligence | SecurityTrails | Historical DNS records, WHOIS history, subdomain enumeration |
| Threat Intelligence | CIRCL Passive DNS | European passive DNS collection from network sensors |
| Search & OSINT | Google CSE | Indexed subdomains from web crawling |
| Search & OSINT | WHOISXML | Domain registration data, reverse WHOIS lookups |
| Search & OSINT | Shodan | Internet-wide port scanning, service banners, SSL certificates |
| Network | RIPE Atlas | BGP routing data, IP prefix announcements |
| Network | BGP Routing Tables | ASN mapping, IP range ownership |
| Discovery | DNS Brute-force | Common subdomain patterns (dev, staging, api, vpn, mail) |
| Discovery | Reverse WHOIS | Other domains registered by the same organization |
| Credential | Breach Databases | Employee credentials exposed in known data breaches |
Rate limit bypass: Many of these APIs rate-limit or block requests from cloud IPs and known scanner ranges. DDactic routes discovery through residential IP infrastructure, bypassing bot detection, JS challenges, and IP reputation filters. This surfaces results that scanners running from AWS or GCP never see.
The Pipeline: Three Stages
Stage 1: SLD Discovery
The first stage identifies all Second-Level Domains (SLDs) that belong to the target organization. If we're scanning "Migdal Insurance," we need to find every domain they operate: migdal.co.il, migdalins.co.il, migdal-capital.co.il, and potentially dozens more.
We query crt.sh by organization name (not just domain), which returns certificates issued to "Migdal Insurance" regardless of domain. Then we cross-reference with reverse WHOIS, SecurityTrails, and other sources to find domains registered by the same entity.
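A sketch of the crt.sh half of this step, using its public JSON interface (`q` plus `output=json`). Treat the response field names, such as `name_value`, as details to verify against crt.sh itself:

```python
# Query crt.sh for certificates matching a search term (an org name works,
# not just a domain), then collect every hostname the certificates cover.
import json
import urllib.parse
import urllib.request

def extract_domains(certs: list[dict]) -> set[str]:
    """Pull hostnames out of crt.sh JSON records.

    name_value holds newline-separated subject alternative names.
    """
    domains = set()
    for cert in certs:
        for name in cert.get("name_value", "").splitlines():
            name = name.strip().lower().lstrip("*.")
            if name:
                domains.add(name)
    return domains

def crtsh_search(query: str) -> set[str]:
    url = "https://crt.sh/?" + urllib.parse.urlencode(
        {"q": query, "output": "json"}
    )
    with urllib.request.urlopen(url, timeout=30) as resp:
        return extract_domains(json.load(resp))

# crtsh_search("Migdal Insurance")  # search by organization name
```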
Stage 2: AI Validation and Expansion
This is where DDactic diverges from every other scanner we've seen.
Raw SLD discovery produces noise. When we search for "Migdal," crt.sh also returns migdal.de (a German company), migdal-emek.muni.il (a municipality), and various unrelated domains.
AI Validation: Every discovered SLD is sent to Claude Haiku with the company context. The AI scores each domain 0-100 for ownership confidence:
```
Input: Company "Migdal Insurance" (Israel)
Discovered SLD: migdal.de

AI Output:
Score: 8/100
Reason: German TLD (.de) for an Israeli insurance company.
        No evidence of international operations in Germany.
Verdict: EXCLUDE
```
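In code, the validation step reduces to building a prompt and parsing a structured reply. The prompt wording and the `score`/`reason`/`verdict` reply schema below are illustrative assumptions, not DDactic's actual prompts; the commented SDK call shows roughly how the request would be sent:

```python
# Sketch of the SLD-validation step: scoring prompt in, JSON verdict out.
import json

def build_validation_prompt(company: str, country: str, sld: str) -> str:
    return (
        f'Company: "{company}" ({country})\n'
        f"Discovered SLD: {sld}\n\n"
        "Score 0-100 how confident you are that this domain belongs to the "
        'company. Reply as JSON: {"score": int, "reason": str, '
        '"verdict": "INCLUDE" or "EXCLUDE"}.'
    )

def parse_validation_reply(reply: str, threshold: int = 50) -> bool:
    """Return True if the SLD should be kept."""
    data = json.loads(reply)
    return data["verdict"] == "INCLUDE" and data["score"] >= threshold

# With the anthropic SDK the call would be roughly:
#   client.messages.create(model="claude-3-haiku-20240307", max_tokens=256,
#                          messages=[{"role": "user", "content": prompt}])
```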
AI Expansion: After validation, a second AI call suggests SLDs that the 13 sources might have missed:
```
Input: Company "Migdal Insurance" (Israel)
Known SLDs: migdal.co.il, migdalins.co.il

AI Output:
Suggestions:
- migdal-capital.co.il (investment subsidiary)
- migdalor.co.il (brand variant)
- migdal-health.co.il (health insurance arm)

Verified via DNS: 2 of 3 resolve to active servers
```
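The DNS verification step is simple to sketch with the standard library: an AI suggestion only survives if it actually resolves, so hallucinated domains are dropped before they ever reach a report.

```python
# Keep only AI-suggested SLDs that resolve to at least one address.
import socket

def resolves(domain: str) -> bool:
    """True if the domain has at least one resolvable A/AAAA record."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        return False

def verify_suggestions(suggestions: list[str]) -> list[str]:
    return [d for d in suggestions if resolves(d)]
```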
Country context detection: If any confirmed SLD ends in .co.il, the AI automatically flags domains with foreign TLDs (.de, .fr, .it) as likely false positives. This single heuristic eliminated 25% of false positives in our Israeli financial sector scans.
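The heuristic itself fits in a few lines. This sketch hardcodes the Israeli case from the example, and the ccTLD list is a small illustrative subset:

```python
# Country-context heuristic: once confirmed SLDs establish a home ccTLD,
# candidates on foreign ccTLDs are flagged as likely false positives.
FOREIGN_CCTLDS = {".de", ".fr", ".it", ".es", ".nl"}

def flag_foreign(candidates: list[str], confirmed: list[str]) -> list[str]:
    """Return candidates whose ccTLD conflicts with the confirmed home TLD."""
    if not any(d.endswith(".co.il") for d in confirmed):
        return []  # no Israeli context established; flag nothing
    return [d for d in candidates
            if any(d.endswith(tld) for tld in FOREIGN_CCTLDS)]
```

Flagged domains are not dropped outright; they go to the AI validation stage with a strong prior toward EXCLUDE.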
Stage 3: Deep Enumeration and L7 Recon
For every validated SLD, we enumerate all subdomains across our 13 sources, then perform L7 reconnaissance on every discovered asset:
- HTTP fingerprinting: Server headers, response codes, TLS certificate details
- CDN/WAF detection: Which protection vendor fronts each asset
- Origin discovery: Finding the real server IP behind CDN proxies
- Login detection: Identifying customer portals, admin panels, API endpoints
- Technology stack: Frameworks, CMS, cloud platforms
- Breach exposure: Credential leaks associated with each domain
AI Analysis: Four Stages
AI isn't just used for SLD validation. Four distinct Claude Haiku calls happen during every scan:
| Stage | Purpose | Input | Output |
|---|---|---|---|
| SLD Validation | Filter false positives | Company name + discovered SLDs | Score 0-100 per SLD, include/exclude verdict |
| SLD Expansion | Find missing domains | Company name + confirmed SLDs | Suggested SLDs, verified via DNS |
| Platform Classification | Map hosting topology | Assets with HTTP headers | Platform labels (Cloudflare, Vercel, AWS, etc.) |
| Resilience Analysis | Assess DDoS risk | All assets + findings | Risk scores, attack vectors, remediation priority |
Total cost: under $0.02 per scan for four API calls to Claude Haiku. The model is fast enough that all four stages add less than 30 seconds to a scan that runs 5-15 minutes.
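The cost claim is easy to sanity-check. Assuming Claude 3 Haiku's published per-token pricing and deliberately generous per-stage token budgets (all four token counts below are illustrative):

```python
# Back-of-envelope per-scan AI cost: Claude 3 Haiku at $0.25 / $1.25
# per million input / output tokens.
INPUT_PER_MTOK, OUTPUT_PER_MTOK = 0.25, 1.25

def scan_cost(calls: list[tuple[int, int]]) -> float:
    """calls = [(input_tokens, output_tokens), ...], one pair per AI stage."""
    return sum(i * INPUT_PER_MTOK / 1e6 + o * OUTPUT_PER_MTOK / 1e6
               for i, o in calls)

# Validation, expansion, platform classification, resilience analysis:
cost = scan_cost([(4000, 1000), (2000, 500), (8000, 2000), (10000, 3000)])
# Even with these generous budgets, cost stays well under two cents.
```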
Real-World Results
Case: Israeli Financial Group
Single-source scan (crt.sh only): 47 domains found. 12 were false positives (foreign companies with similar names). 35 real domains.
DDactic full pipeline: 58 unique domains found across 13 sources. AI excluded 14 false positives. AI suggested 4 additional SLDs, 3 confirmed via DNS. Final count: 47 verified domains. 12 more than single-source, zero false positives.
What the Extra 12 Domains Revealed
- 3 unprotected API endpoints serving JSON without any CDN or WAF
- 2 staging environments with production database connections
- 1 legacy VPN portal on an old IP range, no rate limiting
- 4 subsidiary brand domains that the security team didn't know were still active
- 2 internal tools (admin panels) exposed to the public internet
None of these appeared in crt.sh. All of them represent real attack surface.
Why Breach Databases Matter for DDoS
This might seem counterintuitive: why does a DDoS resilience platform check breach databases?
Because attack surface is not just infrastructure. If 200 employee credentials from your organization appeared in a breach dump, an attacker can:
- Log in to your VPN portal (bypassing all network-level protection)
- Access internal dashboards that aren't behind CDN/WAF
- Authenticate to API endpoints and bypass rate limiting
- Pivot from credential access to service disruption
We found that companies with the strongest perimeter defense often had the highest credential exposure. The investment went to CDN/WAF/scrubbing, while credential hygiene and monitoring were deprioritized.
The Residential Proxy Advantage
Half of our 13 sources rate-limit or block requests from cloud IP ranges. This is a real problem for scanners running on AWS, GCP, or any cloud provider.
DDactic routes all discovery through residential IP infrastructure. The requests look like a regular user browsing from a home connection, not a scanner running from us-east-1.
This matters because:
- crt.sh: Aggressively rate-limits cloud IPs. Residential IPs get full access.
- VirusTotal: API quotas are stricter for known cloud ranges.
- Shodan: Some queries return filtered results for automated scanners.
- Google CSE: Cloud IP requests trigger CAPTCHA challenges.
The difference is measurable: residential-routed queries return 30-40% more results than direct cloud queries for the same targets.
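Routing a discovery request through a proxy takes only the standard library. The proxy URL below is a placeholder for whatever credentials a residential provider issues:

```python
# Build an opener that sends all HTTP/HTTPS traffic through a proxy and
# presents a browser-like User-Agent (scanner UAs are also fingerprinted).
import urllib.request

def build_proxied_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    handler = urllib.request.ProxyHandler({"http": proxy_url,
                                           "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return opener

opener = build_proxied_opener("http://user:pass@proxy.example:8080")
# opener.open("https://crt.sh/?q=example.com&output=json")
```

Real residential providers rotate exit IPs per session, so the pipeline also has to tolerate a given source seeing each request from a different address.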
Architecture
The pipeline runs on dedicated infrastructure (not serverless, not Lambda). Single-company scans execute on our Dedibox backend server, where the full Python pipeline has access to all 13 sources, AI validation, and breach database lookups.
For industry-wide scans (scanning hundreds of companies simultaneously), we use AWS Batch with a Go-based scanner binary that parallelizes across companies.
Single Company Scan Flow:
```
Dashboard (Cloudflare Pages)
        |
        v
Backend API (Dedibox :8080)
        |
        +---> crt.sh (residential proxy) --+
        +---> VirusTotal API               |
        +---> SecurityTrails API           |
        +---> Shodan API                   +--> SLD List
        +---> CertSpotter API              |
        +---> CIRCL Passive DNS            |
        +---> 7 more sources ... ----------+
        |
        v
AI Validation (Claude Haiku) --> Filtered SLDs
        |
        v
AI Expansion (Claude Haiku) --> Additional SLDs
        |
        v
Subdomain Enumeration (all sources, per SLD)
        |
        v
L7 Recon (HTTP, TLS, CDN, WAF, Origin)
        |
        v
AI Platform Classification (Claude Haiku)
        |
        v
AI Resilience Analysis (Claude Haiku)
        |
        v
Results --> S3 --> Dashboard
```
What's Next
We're expanding the pipeline in three directions:
- Continuous monitoring: Re-scanning at intervals to detect new subdomains, expired certificates, and configuration drift
- Mobile and desktop app recon: Traffic interception labs that discover API endpoints apps connect to (not visible in DNS or CT logs)
- Active validation: After passive discovery, optionally testing each asset's DDoS resilience with controlled load from our 24-platform bot fleet
The goal is a complete loop: discover the attack surface, assess its resilience, harden the gaps, and continuously verify.
Try it yourself: Run a free scan at ddactic.net/free-scan. No account required. See what your attack surface actually looks like across 13 intelligence sources.