Phishing domain detection (techniques, data sources, and triage) - Typosquatting

What it is

Phishing domain detection is the practice of identifying domain names that are being used, or are likely to be used, in phishing campaigns before they reach end users. No single signal reliably separates phishing infrastructure from legitimate registrations, so detection systems layer multiple data sources and analytical techniques into a pipeline that narrows millions of daily domain events down to actionable alerts.

The challenge is fundamentally asymmetric. Attackers register domains cheaply and quickly; defenders must evaluate every new registration, certificate issuance, and DNS change against a set of protected brands. The techniques described below form the standard detection pipeline, roughly ordered from least to most expensive in computational cost.

String similarity and lexical analysis

The first filter in most detection pipelines is name-based. Comparing candidate domains against a set of protected brand strings using metrics like Levenshtein distance or Jaro-Winkler similarity catches typosquats, homoglyphs, and close permutations. Have I Been Squatted's twistrs library generates permutations (addition, omission, transposition, bitsquatting, hyphenation, TLD swaps, etc.) and scores each variant against the original. The Unicode skeleton algorithm extends this to IDN homograph attacks, where visually identical characters from different scripts produce domain names that appear legitimate in a browser address bar but resolve to attacker infrastructure.
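As a rough illustration of this first filter, the sketch below computes plain Levenshtein distance and generates two of the permutation classes mentioned above (omission and transposition). It is not the twistrs implementation; the function names, the `max_distance` default, and the brand label `example` are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def simple_permutations(label: str) -> set[str]:
    """Omission and transposition variants of a single domain label."""
    variants = set()
    for i in range(len(label)):
        variants.add(label[:i] + label[i + 1:])  # drop one character
    for i in range(len(label) - 1):
        variants.add(label[:i] + label[i + 1] + label[i] + label[i + 2:])  # swap neighbours
    variants.discard(label)  # swapping identical neighbours yields the original
    return variants


def close_matches(brand: str, candidates: list[str], max_distance: int = 2) -> list[tuple[str, int]]:
    """Candidates within max_distance edits of the brand, nearest first."""
    scored = [(c, levenshtein(brand, c)) for c in candidates]
    return sorted((p for p in scored if 0 < p[1] <= max_distance),
                  key=lambda p: (p[1], p[0]))
```

Note that plain Levenshtein counts a transposition as two edits, so a threshold of 1 would miss swapped-letter typosquats; a Damerau-style variant or a threshold of 2 is a common way around that.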

Keyword matching provides a second lexical layer, scanning domain strings for brand names embedded as substrings. This catches combosquats and brand-in-subdomain patterns that edit-distance metrics miss: the brand string is present in full, but the surrounding characters push the distance past any useful threshold. Suspicious structural features, such as high character entropy, excessive hyphens, and keywords like "login", "verify", or "secure", further increase a domain's threat score.
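A minimal sketch of these lexical features follows. The keyword list comes from the paragraph above; the function names and the returned field names are hypothetical, and the host extraction simply drops the last label rather than consulting the Public Suffix List, which a production system would need.

```python
import math
from collections import Counter

SUSPICIOUS_KEYWORDS = ("login", "verify", "secure")  # keywords named in the text


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def lexical_features(domain: str, brand: str) -> dict:
    """Extract simple name-based signals from a domain string."""
    name = domain.lower().rstrip(".")
    labels = name.split(".")
    # Simplification: treat everything except the last label as the host part.
    host = ".".join(labels[:-1]) if len(labels) > 1 else name
    return {
        "contains_brand": brand.lower() in host,
        "hyphens": host.count("-"),
        "keywords": [k for k in SUSPICIOUS_KEYWORDS if k in host],
        "entropy": shannon_entropy(host),
    }
```

How these features are weighted into a threat score is a tuning decision; the sketch only extracts them.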

Length-based pre-filtering can reduce the computational cost of similarity searches by roughly 90% by scanning only domains within a narrow character tolerance of the target. At the scale of newly registered domain (NRD) feeds, which publish hundreds of thousands of registrations per day across all TLDs, this optimization is essential. Even with filtering, the volume of candidates that pass string-similarity thresholds is too large for manual review, which is why downstream enrichment and signal stacking are necessary.
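The pre-filter is safe because edit distance can never be smaller than the difference in length between two strings, so a label whose length differs from the brand by more than the distance threshold cannot match. One way to sketch it is to bucket the feed by label length once and then read only the nearby buckets per brand. The helper names and the default tolerance below are assumptions.

```python
from collections import defaultdict


def index_by_length(labels):
    """Bucket domain labels by character length (one pass over the feed)."""
    buckets = defaultdict(list)
    for label in labels:
        buckets[len(label)].append(label)
    return buckets


def candidates_near(buckets, brand, tolerance=2):
    """Return only labels whose length is within `tolerance` of the brand's."""
    selected = []
    shortest = max(1, len(brand) - tolerance)
    for length in range(shortest, len(brand) + tolerance + 1):
        selected.extend(buckets.get(length, []))
    return selected
```

Only the labels returned by `candidates_near` need to go through the more expensive similarity scoring; the actual saving depends on the length distribution of the feed.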

Registration and infrastructure signals

Domain registration metadata provides context that pure string analysis cannot. WHOIS and RDAP records reveal creation dates, registrar identity, and (where available) registrant information. Privacy-protected registrations are the norm, but registrar patterns and creation timestamps remain useful. Research from ICANN found that each dollar reduction in TLD registration fees correlates with a 49% increase in maliciously registered domains, and that free hosting services drive an 88% surge in phishing activity.

Infrastructure clustering groups domains by shared hosting IP, name server, registrar, or certificate issuer to surface bulk-registration campaigns. Phishing operators frequently register dozens of lookalike domains through the same registrar within a short window, and clustering makes these campaigns visible even when individual domains look innocuous. ASN reputation and hosting-provider reputation data enrich this clustering.
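The grouping step can be sketched as a simple inverted index from each shared attribute to the domains that use it. The record layout (`domain`, `registrar`, `nameserver`, `ip`) and the `min_size` threshold are hypothetical, and the sketch leaves out the registration time window for brevity.

```python
from collections import defaultdict


def cluster_by(records, keys=("registrar", "nameserver", "ip")):
    """Map each (attribute, value) pair to the set of domains sharing it."""
    clusters = defaultdict(set)
    for record in records:
        for key in keys:
            value = record.get(key)
            if value:  # skip missing or empty attributes
                clusters[(key, value)].add(record["domain"])
    return clusters


def bulk_clusters(records, min_size=3, keys=("registrar", "nameserver", "ip")):
    """Keep only clusters large enough to suggest a bulk campaign."""
    return {
        pair: sorted(domains)
        for pair, domains in cluster_by(records, keys).items()
        if len(domains) >= min_size
    }
```

A large registrar will naturally produce big clusters on its own, so in practice this is usually applied to domains that already passed the name-based filter.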

Passive DNS records reveal when a domain begins resolving, what IP addresses it points to, and what other domains share the same infrastructure. Applying graph neural networks to passive DNS data has increased phishing classification accuracy from 85% to 90% by capturing clustering patterns among domains controlled by the same threat actors.

Certificate Transparency monitoring

Certificate Transparency (CT) logs are public, append-only records of TLS certificates. When a phishing domain obtains a certificate (often through automated certificate authorities like Let's Encrypt), the certificate can appear in CT logs before the phishing page is live. This creates a lead-time advantage. Monitoring CT logs for certificates containing brand names or confusable strings can surface threats during the gap between registration and weaponization.

CT monitoring tools match incoming certificates against known brand strings using the same similarity and keyword techniques applied to NRD feeds. A domain that appears in both a CT log match and an NRD feed match, within a short time window, is far more likely to be phishing than one matching only a single source. Certificate metadata also provides enrichment signals. The issuing certificate authority, certificate validity period, Subject Alternative Name (SAN) structure, and whether the certificate covers a single domain or a wildcard all contribute to risk scoring.
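The cross-source correlation described above can be sketched as a join on domain name with a time-window check. The input shape (a mapping from domain to first-seen timestamp for each source) and the 24-hour default window are assumptions, not values from any particular tool.

```python
from datetime import datetime, timedelta


def correlate(ct_seen: dict[str, datetime],
              nrd_seen: dict[str, datetime],
              window: timedelta = timedelta(hours=24)) -> list[str]:
    """Domains matched in both CT and NRD feeds within `window` of each other."""
    hits = []
    for domain, ct_time in ct_seen.items():
        nrd_time = nrd_seen.get(domain)
        if nrd_time is not None and abs(ct_time - nrd_time) <= window:
            hits.append(domain)
    return sorted(hits)
```

Domains returned here would typically be promoted ahead of single-source matches in the enrichment queue.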

DNS and HTTP enrichment

Raw domain lists become actionable through enrichment. DNS lookups confirm whether a domain resolves, reveal its hosting infrastructure, and expose MX records that indicate mail capability. HTTP banner analysis identifies web server software, redirect behavior, and response codes. Fetching page content and checking for login forms, brand logos, cloned HTML, or redirect chains to credential-harvesting pages provides strong evidence of intent.
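For the content check, a small offline sketch using Python's standard `html.parser` is shown below. It looks only for password inputs and form action targets in HTML that has already been fetched; the class and function names are hypothetical, and real phishing kits often build forms with JavaScript, which this approach would not see.

```python
from html.parser import HTMLParser


class LoginFormFinder(HTMLParser):
    """Collect form actions and count password fields in an HTML document."""

    def __init__(self):
        super().__init__()
        self.password_inputs = 0
        self.form_actions = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # attribute names arrive lowercased
        if tag == "form":
            self.form_actions.append(attributes.get("action") or "")
        elif tag == "input" and (attributes.get("type") or "").lower() == "password":
            self.password_inputs += 1


def inspect_html(html: str) -> dict:
    """Summarise login-form evidence found in already-fetched page content."""
    finder = LoginFormFinder()
    finder.feed(html)
    finder.close()
    return {
        "has_password_field": finder.password_inputs > 0,
        "form_actions": finder.form_actions,
    }
```

A form whose action points at a different domain from the page itself is one of the features an analyst would want surfaced alongside the alert.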

Visual comparison, rendering the page and comparing screenshots against known legitimate sites, is the highest-confidence content signal. Phishing kits frequently clone login pages pixel-for-pixel, making visual similarity a reliable indicator. It is also the most resource-intensive technique and is typically reserved for domains that already score highly on other signals. DNS filtering services can act on enrichment data in near-real-time, blocking resolution of confirmed phishing domains before users reach the page.

Machine learning classification

Machine learning classifiers trained on features extracted from domain strings, DNS records, WHOIS data, and web content score the likelihood that a domain is phishing. Random forest classifiers have achieved 99.7% accuracy in brand-domain identification tasks when using a small set of optimized features (most common link domain, logo domain, and form action domain). Deep learning architectures have reached 92% zero-day detection rates with 3.5% false positive rates.

These models improve detection rates but require ongoing retraining as attacker techniques evolve. The feature set matters more than the algorithm. Domain-string features alone produce moderate accuracy, but combining lexical, DNS, WHOIS, and content features into a single feature vector significantly improves both precision and recall.

Emerging research on language-model-based detection (such as PhishReplicant) targets generated squatting domains that employ multiple evasion techniques and lack direct brand name inclusion, a class of threats that traditional string similarity misses entirely.

As attackers adopt domain generation algorithms (DGAs) and AI-assisted naming, detection systems must evolve beyond static permutation lists toward models that can generalize to previously unseen naming patterns.

The detection timeline challenge

Phishing campaigns are increasingly short-lived. Research analyzing phishing-site lifespans found a median operational duration of 5.46 hours, with logistics-themed campaigns (impersonating carriers like USPS) compressing to 1.76 hours. Phishing sites that undergo extensive visual modifications (cloning multiple brand pages or rotating lures) persist longer, with a median lifespan of 17 days, but simpler campaigns are often abandoned within hours. Roughly 66% of phishing domains are maliciously registered rather than compromised, and 75% of victims encounter the malicious site before any blocklist provides protection.

Google Safe Browsing detects only about 18% of phishing websites, taking an average of 4.5 days; by that point, 84% of the sites have already been taken down. Detection systems that run on daily batch cycles miss the majority of short-lived campaigns. Adversary-in-the-middle phishing infrastructure is a particularly high-value detection target because the reverse proxy relays traffic to the real service, making the page indistinguishable from the genuine login except for the domain name.

Near-real-time CT log monitoring and streaming NRD analysis reduce the detection gap but increase infrastructure costs and alert volume. The goal is to surface threats during the window between registration and weaponization, before the first phishing email is sent. For organizations targeted by sophisticated campaigns, even a few hours of advance warning can be the difference between proactive protection and reactive incident response.

False positive management

Phishing domain detection operates in a high-noise environment. Newly registered domains are overwhelmingly benign. String similarity matching flags legitimate businesses, fan sites, news domains, and parked pages alongside actual threats. Keyword-based matching surfaces brand impersonation attempts but also authorized partners, resellers, and affiliates.

Operational triage workflows address this by stacking signals. A domain that is newly registered, has a brand-confusable name, obtained a TLS certificate, and hosts a login form is far more likely to be phishing than a domain matching only one criterion. Each additional signal narrows the candidate set and increases confidence.
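Signal stacking can be sketched as a weighted sum over boolean signals with two cut-offs. The four signals mirror the example in the paragraph above; the weights and thresholds are purely illustrative assumptions and would need tuning against labelled data.

```python
# Hypothetical weights: stronger evidence of intent counts for more.
SIGNAL_WEIGHTS = {
    "newly_registered": 1,
    "confusable_name": 2,
    "has_certificate": 1,
    "login_form": 3,
}


def stacked_score(signals: dict[str, bool]) -> int:
    """Sum the weights of every signal that fired for a domain."""
    return sum(weight for name, weight in SIGNAL_WEIGHTS.items() if signals.get(name))


def triage(signals: dict[str, bool], alert_at: int = 5, review_at: int = 3) -> str:
    """Map a composite score to a triage bucket."""
    score = stacked_score(signals)
    if score >= alert_at:
        return "alert"
    if score >= review_at:
        return "review"
    return "ignore"
```

With these illustrative weights, a confusable name alone never raises an alert, while a confusable name plus a login form does.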

Domain threat intelligence platforms formalize this stacking into scoring models that prioritize alerts by composite risk rather than any single indicator. Effective malicious domain detection requires balancing recall (catching threats) against precision (avoiding false alarms that erode analyst trust). When false positive rates are too high, analysts begin ignoring alerts entirely, which defeats the purpose of the detection pipeline. Tuning thresholds, weighting signals appropriately, and providing enrichment context alongside each alert are essential to maintaining operational value.
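The recall and precision trade-off can be made concrete by evaluating a threshold against analyst-labelled alerts. The sketch below assumes a list of hypothetical `(score, is_phishing)` pairs; it is a way to compare candidate thresholds, not a description of any platform's scoring model.

```python
def precision_recall_at(scored, threshold):
    """Precision and recall if every domain scoring >= threshold raised an alert.

    `scored` is a list of (score, is_phishing) pairs from labelled history.
    """
    tp = sum(1 for score, is_phish in scored if score >= threshold and is_phish)
    fp = sum(1 for score, is_phish in scored if score >= threshold and not is_phish)
    fn = sum(1 for score, is_phish in scored if score < threshold and is_phish)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative labelled history, not real data.
history = [(7, True), (5, True), (3, False), (3, True), (2, False), (1, False)]
low_threshold = precision_recall_at(history, 3)   # catches everything, one false alarm
high_threshold = precision_recall_at(history, 5)  # no false alarms, misses one threat
```

Raising the threshold from 3 to 5 in this toy data removes the false alarm but misses one real phishing domain, which is the balance the paragraph above describes.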

Applying detection to brand protection

Systematic phishing domain detection is a core component of domain monitoring and brand protection strategy. Organizations that rely solely on user reports or blocklist feeds miss the majority of short-lived campaigns. Proactive detection through CT monitoring, NRD analysis, and infrastructure enrichment shifts the timeline from reactive cleanup to early intervention, enabling faster DMARC policy enforcement and takedown requests. The most effective programs combine automated detection with defensive domain registration for high-risk permutations, reducing the attack surface before adversaries can exploit it.

Have I Been Squatted combines permutation generation with registration checks, Certificate Transparency extended search, and DNS, HTTP, RDAP, and screenshot enrichment to detect phishing domains targeting a monitored brand as early as possible, bridging the gap between registration and weaponization.
