Security Research · March 2026 · 12 min read

How We Find Every Subdomain: 13 Sources, AI Validation, and the Discovery Pipeline

Most attack surface scanners query one or two data sources and call it a day. DDactic queries 13, validates every result with AI, expands coverage by suggesting what sources missed, and checks breach databases for credential exposure. Here's exactly how the pipeline works, why single-source discovery fails, and what AI actually contributes to reconnaissance.
13 Intelligence Sources · 4 AI Analysis Stages · <$0.02 AI Cost Per Scan · 3x More Assets Found

The Single-Source Problem

If you search for your company on crt.sh right now, you'll get a list of subdomains drawn from Certificate Transparency logs. It's one of the most widely used subdomain discovery tools, and security teams rely on it daily.

The problem: crt.sh only sees domains that have had TLS certificates issued for them. It misses:

Every single data source has blind spots like these. Shodan finds internet-facing services but misses CDN-proxied domains. SecurityTrails has historical DNS but gaps in certain TLDs. VirusTotal aggregates malware reports but doesn't cover domains that were never flagged.

No single source gives you the full picture. The only way to approach completeness is to query many sources and merge the results.
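A minimal sketch of that merge step, assuming each source has already returned a plain list of hostnames. The source names, the sample hostnames, and the helper names `normalize` and `merge_sources` are illustrative, not DDactic's actual code. Keeping track of which sources reported each hostname makes the blind spots visible: anything reported by exactly one source would have been missed by a scanner that skipped it.

```python
from collections import defaultdict


def normalize(hostname: str) -> str:
    """Lowercase, trim whitespace and a trailing dot, drop a leading wildcard label."""
    name = hostname.strip().lower().rstrip(".")
    if name.startswith("*."):
        name = name[2:]
    return name


def merge_sources(results: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map every unique hostname to the set of sources that reported it."""
    merged: dict[str, set[str]] = defaultdict(set)
    for source, hostnames in results.items():
        for hostname in hostnames:
            name = normalize(hostname)
            if name:
                merged[name].add(source)
    return dict(merged)


# Illustrative input only; real results would come from the individual source clients.
results = {
    "crt.sh": ["www.example.com", "*.example.com", "API.example.com."],
    "shodan": ["api.example.com", "vpn.example.com"],
    "passive_dns": ["legacy.example.com", "vpn.example.com"],
}

merged = merge_sources(results)

# Hostnames that only one source knew about.
single_source = sorted(host for host, seen_by in merged.items() if len(seen_by) == 1)
print(single_source)
```

Normalizing before merging matters: without it, `API.example.com.` and `api.example.com` would be counted as two different assets.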

The 13 Sources

DDactic's discovery pipeline queries these sources for every scan:

| Category | Source | What It Finds |
| --- | --- | --- |
| Certificate Transparency | crt.sh | Every TLS certificate ever issued for a domain |
| Certificate Transparency | CertSpotter | Real-time CT log monitoring with different coverage than crt.sh |
| Threat Intelligence | VirusTotal | Subdomains seen in malware analysis, URL scans, passive DNS |
| Threat Intelligence | SecurityTrails | Historical DNS records, WHOIS history, subdomain enumeration |
| Threat Intelligence | CIRCL Passive DNS | European passive DNS collection from network sensors |
| Search & OSINT | Google CSE | Indexed subdomains from web crawling |
| Search & OSINT | WHOISXML | Domain registration data, reverse WHOIS lookups |
| Search & OSINT | Shodan | Internet-wide port scanning, service banners, SSL certificates |
| Network | RIPE Atlas | BGP routing data, IP prefix announcements |
| Network | BGP Routing Tables | ASN mapping, IP range ownership |
| Discovery | DNS Brute-force | Common subdomain patterns (dev, staging, api, vpn, mail) |
| Discovery | Reverse WHOIS | Other domains registered by the same organization |
| Credential | Breach Databases | Employee credentials exposed in known data breaches |

Rate limit bypass: Most of these APIs rate-limit or block requests from cloud IPs and known scanner ranges. DDactic routes discovery through residential IP infrastructure, bypassing bot detection, JS challenges, and IP reputation filters. This surfaces results that scanners running from AWS or GCP never see.

The Pipeline: Three Stages

SLD Discovery --> AI Validation --> Subdomain Enumeration --> L7 Recon --> AI Analysis

Stage 1: SLD Discovery

The first stage identifies all Second-Level Domains (SLDs) that belong to the target organization. If we're scanning "Migdal Insurance," we need to find every domain they operate: migdal.co.il, migdalins.co.il, migdal-capital.co.il, and potentially dozens more.

We query crt.sh by organization name (not just domain), which returns certificates issued to "Migdal Insurance" regardless of domain. Then we cross-reference with reverse WHOIS, SecurityTrails, and other sources to find domains registered by the same entity.
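As a rough illustration of this step, the sketch below builds an organization-name query URL and reduces certificate rows to candidate SLDs. It makes several assumptions: that crt.sh accepts an `O` (organization) parameter together with `output=json`, that each returned row carries `common_name` and `name_value` fields, and that a tiny hand-picked suffix set stands in for the full Public Suffix List. The function names are hypothetical, and the sample rows are made up to mirror the Migdal example.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: a tiny suffix set for illustration. A real pipeline would load
# the complete Public Suffix List instead of hard-coding entries.
MULTI_LABEL_SUFFIXES = {"co.il", "org.il", "muni.il", "co.uk"}


def crtsh_org_url(organization: str) -> str:
    """Build an organization-name search URL (parameter names are assumed)."""
    return "https://crt.sh/?" + urlencode({"O": organization, "output": "json"})


def registrable_domain(hostname: str) -> Optional[str]:
    """Reduce a hostname to its registrable domain, e.g. www.a.co.il -> a.co.il."""
    name = hostname.strip().lower().rstrip(".")
    if name.startswith("*."):
        name = name[2:]
    labels = name.split(".")
    if len(labels) < 2 or "" in labels or "@" in name or " " in name:
        return None
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 1
    if len(labels) <= suffix_len:
        return None  # the name is itself a public suffix
    return ".".join(labels[-(suffix_len + 1):])


def slds_from_rows(rows: list[dict]) -> set[str]:
    """Collect registrable domains from certificate rows."""
    found: set[str] = set()
    for row in rows:
        names = (row.get("name_value") or "").splitlines()
        names.append(row.get("common_name") or "")
        for name in names:
            domain = registrable_domain(name)
            if domain:
                found.add(domain)
    return found


# Made-up rows in the assumed response shape.
sample_rows = [
    {"common_name": "www.migdal.co.il", "name_value": "www.migdal.co.il\nmigdal.co.il"},
    {"common_name": "*.migdal.de", "name_value": "*.migdal.de\nmigdal.de"},
]

print(crtsh_org_url("Migdal Insurance"))
print(sorted(slds_from_rows(sample_rows)))
```

Note that `migdal.de` survives this step. Extraction only collects candidates; deciding which ones actually belong to the target is left to the validation stage.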

Stage 2: AI Validation and Expansion

This is where DDactic diverges from every other scanner we've seen.

Raw SLD discovery produces noise. When we search for "Migdal," crt.sh also returns migdal.de (a German company), migdal-emek.muni.il (a municipality), and various unrelated domains.

AI Validation: Every discovered SLD is sent to Claude Haiku with the company context. The AI scores each domain 0-100 for ownership confidence:

Input:  Company "Migdal Insurance" (Israel)
        Discovered SLD: migdal.de

AI Output:
  Score: 8/100
  Reason: German TLD (.de) for an Israeli insurance company.
          No evidence of international operations in Germany.
  Verdict: EXCLUDE
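To act on a response like this, the pipeline needs it in structured form. The following is a minimal parsing sketch, assuming the model replies in exactly the `Score` / `Reason` / `Verdict` layout shown above. The `SldVerdict` type, the `parse_verdict` function, and the fallback threshold of 50 are hypothetical; the article does not state what cutoff DDactic uses.

```python
import re
from dataclasses import dataclass

# Hypothetical cutoff, used only when the response carries no explicit verdict.
INCLUDE_THRESHOLD = 50


@dataclass
class SldVerdict:
    domain: str
    score: int
    reason: str
    include: bool


def parse_verdict(domain: str, text: str) -> SldVerdict:
    """Parse a 'Score / Reason / Verdict' response into a structured verdict."""
    score_match = re.search(r"Score:\s*(\d{1,3})\s*/\s*100", text)
    if not score_match:
        raise ValueError("response has no 'Score: N/100' line")
    score = int(score_match.group(1))
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")

    # The reason may wrap across lines; capture up to the Verdict line or end of text.
    reason_match = re.search(
        r"Reason:\s*(.*?)(?=^\s*Verdict:|\Z)", text, re.S | re.M
    )
    reason = " ".join(reason_match.group(1).split()) if reason_match else ""

    verdict_match = re.search(r"Verdict:\s*(INCLUDE|EXCLUDE)", text)
    if verdict_match:
        include = verdict_match.group(1) == "INCLUDE"
    else:
        include = score >= INCLUDE_THRESHOLD

    return SldVerdict(domain=domain, score=score, reason=reason, include=include)


response = """
  Score: 8/100
  Reason: German TLD (.de) for an Israeli insurance company.
          No evidence of international operations in Germany.
  Verdict: EXCLUDE
"""

verdict = parse_verdict("migdal.de", response)
print(verdict)
```

Rejecting malformed responses with an exception, rather than defaulting to a score, keeps a garbled model reply from silently including or excluding a domain.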

AI Expansion: After validation, a second AI call suggests SLDs that the 13 sources might have missed:

Input:  Company "Migdal Insurance" (Israel)
        Known SLDs: migdal.co.il, migdalins.co.il

AI Output:
  Suggestions:
    - migdal-capital.co.il (investment subsidiary)
    - migdalor.co.il (brand variant)
    - migdal-health.co.il (health insurance arm)

  Verified via DNS: 2 of 3 resolve to active servers
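The DNS check on suggested SLDs could look roughly like the sketch below. It uses the standard library's `socket.getaddrinfo` as the resolver and lets callers inject a different one, which is how the usage example runs without touching the network. The function names are hypothetical, and "resolves" here is the only test applied; how DDactic decides that a server is "active" is not described in the article.

```python
import socket
from typing import Callable, Iterable


def resolves(hostname: str) -> bool:
    """Return True if the system resolver returns at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False


def verify_suggestions(
    candidates: Iterable[str],
    known: Iterable[str],
    resolver: Callable[[str], bool] = resolves,
) -> list[str]:
    """Keep suggested domains that are new, unique, and resolve."""
    known_set = {name.strip().lower() for name in known}
    confirmed: list[str] = []
    for candidate in candidates:
        name = candidate.strip().lower()
        if not name or name in known_set or name in confirmed:
            continue
        if resolver(name):
            confirmed.append(name)
    return confirmed


# Usage with a stand-in resolver so the example runs offline.
# Which two of the three suggestions resolve is invented for illustration.
pretend_live = {"migdal-capital.co.il", "migdal-health.co.il"}
suggested = ["migdal-capital.co.il", "migdalor.co.il", "migdal-health.co.il"]

confirmed = verify_suggestions(
    suggested,
    known=["migdal.co.il", "migdalins.co.il"],
    resolver=lambda host: host in pretend_live,
)
print(confirmed)
```

A successful lookup only shows that the name exists. It does not prove ownership, so confirmed suggestions are reasonable candidates for another pass through validation.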

Country context detection: If any confirmed SLD ends in .co.il, the AI automatically flags domains with foreign TLDs (.de, .fr, .it) as likely false positives. This single heuristic eliminated 25% of false positives in our Israeli financial sector scans.
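Expressed as plain code rather than a prompt instruction, the heuristic might look like this. It is a sketch under two assumptions that go beyond the article: the "home" country is inferred from the final label of the confirmed SLDs, and generic TLDs such as .com are exempt from flagging. The function names are hypothetical.

```python
# Assumption: generic TLDs say nothing about country, so they are never flagged.
GENERIC_TLDS = frozenset({"com", "net", "org", "io", "info", "biz"})


def tld(domain: str) -> str:
    """Return the final label of a domain, e.g. 'il' for migdal.co.il."""
    return domain.strip().lower().rstrip(".").rsplit(".", 1)[-1]


def flag_foreign_tlds(confirmed: list[str], candidates: list[str]) -> list[str]:
    """Return candidates whose country TLD differs from every confirmed SLD's TLD."""
    home_tlds = {tld(domain) for domain in confirmed}
    return [
        domain
        for domain in candidates
        if tld(domain) not in home_tlds and tld(domain) not in GENERIC_TLDS
    ]


flagged = flag_foreign_tlds(
    confirmed=["migdal.co.il"],
    candidates=["migdal.de", "migdal-capital.co.il", "migdal.com", "migdal.fr"],
)
print(flagged)
```

Flagged domains are only likely false positives. A company with real foreign subsidiaries would need them confirmed through another signal before they are dropped.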

Stage 3: Deep Enumeration and L7 Recon

For every validated SLD, we enumerate all subdomains across our 13 sources, then perform L7 reconnaissance on every discovered asset:

AI Analysis: Four Stages

AI isn't just used for SLD validation. Four distinct Claude Haiku calls happen during every scan:

| Stage | Purpose | Input | Output |
| --- | --- | --- | --- |
| SLD Validation | Filter false positives | Company name + discovered SLDs | Score 0-100 per SLD, include/exclude verdict |
| SLD Expansion | Find missing domains | Company name + confirmed SLDs | Suggested SLDs, verified via DNS |
| Platform Classification | Map hosting topology | Assets with HTTP headers | Platform labels (Cloudflare, Vercel, AWS, etc.) |
| Resilience Analysis | Assess DDoS risk | All assets + findings | Risk scores, attack vectors, remediation priority |
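To make the Platform Classification row concrete, here is a rule-based stand-in that maps a handful of well-known response headers to platform labels. This is not the Claude Haiku call the pipeline makes; it is a deterministic baseline of the kind one might use to sanity-check model output. The marker lists are deliberately short and the function name is hypothetical.

```python
# Headers whose mere presence identifies a platform.
HEADER_MARKERS = [
    ("cf-ray", "Cloudflare"),
    ("x-vercel-id", "Vercel"),
    ("x-amz-cf-id", "AWS CloudFront"),
]

# Substrings of the Server header value, checked when no marker header is present.
SERVER_MARKERS = [
    ("cloudflare", "Cloudflare"),
    ("vercel", "Vercel"),
    ("cloudfront", "AWS CloudFront"),
    ("amazons3", "AWS S3"),
]


def classify_platform(headers: dict[str, str]) -> str:
    """Return a platform label for one asset's HTTP response headers."""
    # Header names are case-insensitive, so compare in lowercase.
    lowered = {key.lower(): value.lower() for key, value in headers.items()}
    for header, label in HEADER_MARKERS:
        if header in lowered:
            return label
    server = lowered.get("server", "")
    for needle, label in SERVER_MARKERS:
        if needle in server:
            return label
    return "unknown"


print(classify_platform({"CF-RAY": "abc123-TLV", "Server": "cloudflare"}))
print(classify_platform({"Server": "AmazonS3"}))
print(classify_platform({"Server": "nginx"}))
```

Fixed rules like these break down when origins strip or spoof headers, which is one plausible reason to hand the ambiguous cases to a model.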

Total cost: under $0.02 per scan. Four API calls to Claude Haiku. The model is fast enough that all four stages add less than 30 seconds to a scan that runs 5-15 minutes.

Real-World Results

Case: Israeli Financial Group

Single-source scan (crt.sh only): 47 domains found. 12 were false positives (foreign companies with similar names). 35 real domains.

DDactic full pipeline: 58 unique domains found across 13 sources. AI excluded 14 false positives. AI suggested 4 additional SLDs, 3 confirmed via DNS. Final count: 47 verified domains (58 - 14 + 3). That is 12 more than the 35 real domains from the single-source scan, with zero false positives.

47 crt.sh Raw Results · 12 False Positives Removed · +12 Assets Other Sources Found · 47 Verified Final Count

What the Extra 12 Domains Revealed

None of these appeared in crt.sh. All of them represent real attack surface.

Why Breach Databases Matter for DDoS

This might seem counterintuitive: why does a DDoS resilience platform check breach databases?

Because attack surface is not just infrastructure. If 200 employee credentials from your organization appeared in a breach dump, an attacker can:

We found that companies with the strongest perimeter defense often had the highest credential exposure. The investment went to CDN/WAF/scrubbing, while credential hygiene and monitoring were deprioritized.

The Residential Proxy Advantage

Roughly half of our 13 sources rate-limit or block requests from cloud IP ranges. This is a real problem for scanners running on AWS, GCP, or any other cloud provider.

DDactic routes all discovery through residential IP infrastructure. The requests look like a regular user browsing from a home connection, not a scanner running from us-east-1.

This matters because:

The difference is measurable: residential-routed queries return 30-40% more results than direct cloud queries for the same targets.

Architecture

The pipeline runs on dedicated infrastructure (not serverless, not Lambda). Single-company scans execute on our Dedibox backend server, where the full Python pipeline has access to all 13 sources, AI validation, and breach database lookups.

For industry-wide scans (scanning hundreds of companies simultaneously), we use AWS Batch with a Go-based scanner binary that parallelizes across companies.

Single Company Scan Flow:

Dashboard (Cloudflare Pages)
    |
    v
Backend API (Dedibox :8080)
    |
    +---> crt.sh (residential proxy) --+
    +---> VirusTotal API              |
    +---> SecurityTrails API          |
    +---> Shodan API                  +--> SLD List
    +---> CertSpotter API             |
    +---> CIRCL Passive DNS           |
    +---> 7 more sources ...         --+
    |
    v
AI Validation (Claude Haiku) --> Filtered SLDs
    |
    v
AI Expansion (Claude Haiku) --> Additional SLDs
    |
    v
Subdomain Enumeration (all sources, per SLD)
    |
    v
L7 Recon (HTTP, TLS, CDN, WAF, Origin)
    |
    v
AI Platform Classification (Claude Haiku)
    |
    v
AI Resilience Analysis (Claude Haiku)
    |
    v
Results --> S3 --> Dashboard
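The fan-out at the top of this flow can be sketched with a thread pool. This assumes the sources are queried concurrently, which the diagram suggests but the article does not state, and it uses stub functions in place of real API clients. The `fan_out` helper and the stub names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable

# A source is any callable that takes a target and returns a list of names.
Source = Callable[[str], list[str]]


def fan_out(
    target: str, sources: dict[str, Source], max_workers: int = 8
) -> tuple[dict[str, list[str]], dict[str, str]]:
    """Query every source concurrently; collect results and per-source errors."""
    results: dict[str, list[str]] = {}
    errors: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn, target): name for name, fn in sources.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failing source should not abort the whole scan.
                errors[name] = repr(exc)
    return results, errors


# Stub sources standing in for real API clients.
def stub_ct_logs(target: str) -> list[str]:
    return [f"www.{target}", f"api.{target}"]


def stub_passive_dns(target: str) -> list[str]:
    return [f"vpn.{target}"]


def stub_rate_limited(target: str) -> list[str]:
    raise RuntimeError("HTTP 429: rate limited")


results, errors = fan_out(
    "example.com",
    {
        "ct_logs": stub_ct_logs,
        "passive_dns": stub_passive_dns,
        "rate_limited": stub_rate_limited,
    },
)
print(results)
print(errors)
```

Recording failures per source instead of raising keeps a scan usable when one provider is down, and the error map shows which blind spots that particular run had.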

What's Next

We're expanding the pipeline in three directions:

The goal is a complete loop: discover the attack surface, assess its resilience, harden the gaps, and continuously verify.


Try it yourself: Run a free scan at ddactic.net/free-scan. No account required. See what your attack surface actually looks like across 13 intelligence sources.