Lookalike domains (Types, detection and protection) - Typosquatting

What lookalike domains are#

"Lookalike domain" is a catch-all term for any domain name crafted to be confused with a legitimate target. The confusion may exploit a typing error, a visual resemblance between characters, or a structural trick that places the target brand in an unexpected position within the URL. Each technique abuses a different aspect of how humans read, type, or skim domain names, and each demands a different detection strategy.

The concept is not new, but the scale is. Over 200,000 new domains are registered daily, and monitoring services routinely identify hundreds of thousands of lookalike domains targeting major brands each year. Understanding the taxonomy is the first step toward closing detection gaps.

Taxonomy of techniques#

Lookalike domain techniques fall into three broad families: character-level mutations, structural manipulations, and semantic extensions.

Character-level mutations#

Typosquatting covers keyboard-adjacent errors and simple misspellings. Subtypes include addition (googles.com), omission (gogle.com), transposition (googel.com), and vowel swap (goagle.com). These variants can be generated deterministically from the target string, making enumeration straightforward.
Homoglyph substitution replaces characters with visually identical glyphs from other Unicode scripts. Cyrillic а (U+0430) is indistinguishable from Latin a (U+0061) in most fonts, so аpple.com looks identical to apple.com wherever the Unicode form is displayed. These domains are stored in DNS in their Punycode form (labels prefixed with xn--) and form the basis of IDN homograph attacks.
Bitsquatting registers domains that differ by a single bit in ASCII representation, exploiting hardware memory errors rather than human mistakes. The permutation set is small and deterministic.
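
Because these mutations are deterministic, they can be enumerated directly from the target label. The following is a minimal sketch, not a complete permutation engine: the function names `typo_variants` and `bitsquat_variants` are illustrative, the addition rule only appends a letter at the end, and homoglyph generation is left out because it needs a confusables table.

```python
import string

VOWELS = "aeiou"
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


def typo_variants(label: str) -> set[str]:
    """Return addition, omission, transposition and vowel-swap variants."""
    variants = set()
    # Addition: append one extra letter (google -> googles).
    for ch in string.ascii_lowercase:
        variants.add(label + ch)
    # Omission: drop one character (google -> gogle).
    for i in range(len(label)):
        variants.add(label[:i] + label[i + 1:])
    # Transposition: swap two neighbouring characters (google -> googel).
    for i in range(len(label) - 1):
        variants.add(label[:i] + label[i + 1] + label[i] + label[i + 2:])
    # Vowel swap: replace one vowel with another (google -> goagle).
    for i, ch in enumerate(label):
        if ch in VOWELS:
            for v in VOWELS:
                variants.add(label[:i] + v + label[i + 1:])
    # Some rules can reproduce the original label; drop it.
    variants.discard(label)
    return variants


def bitsquat_variants(label: str) -> set[str]:
    """Flip each bit of each ASCII character; keep valid hostname characters."""
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            # Skip flips that leave the hostname alphabet or only change case.
            if flipped in ALLOWED and flipped != ch:
                variants.add(label[:i] + flipped + label[i + 1:])
    return variants
```

Calling `typo_variants("google")` yields the four examples above among others, and `bitsquat_variants("google")` includes labels such as `foogle`, where the low bit of `g` (0x67) has flipped to give `f` (0x66).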

Structural manipulations#

Hyphenation inserts hyphens into a brand name (pay-pal.com), creating a domain that reads naturally but points elsewhere.
TLD squatting registers the same label under a different top-level domain (amazon.co instead of amazon.com), betting that users will not notice the suffix difference.
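
Both structural variants are simple to enumerate. The sketch below assumes a plain two-part domain such as amazon.com (no subdomains, no multi-part suffixes like co.uk); the helper names and the TLD list passed in are hypothetical.

```python
def hyphen_variants(label: str) -> list[str]:
    """Insert one hyphen at every interior position (paypal -> pay-pal)."""
    return [label[:i] + "-" + label[i:] for i in range(1, len(label))]


def tld_variants(domain: str, tlds: list[str]) -> list[str]:
    """Place the same label under each alternative TLD."""
    # Assumes "label.tld"; everything after the first dot is the suffix.
    label, _, original_tld = domain.partition(".")
    return [f"{label}.{tld}" for tld in tlds if tld != original_tld]
```

For example, `hyphen_variants("paypal")` contains `pay-pal`, and `tld_variants("amazon.com", ["com", "co", "net"])` returns `amazon.co` and `amazon.net`.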

Semantic extensions#

Combosquatting appends or prepends keywords to a brand name (amazon-login.com, paypal-security.net). Longitudinal DNS analysis shows that combosquatting abuse grows year over year, with almost 60% of abusive combosquatting domains persisting for over 1,000 days.
Keyword squatting targets generic industry terms rather than a specific brand, capturing traffic from users searching for a category rather than a company.
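
Combosquats cannot be enumerated exhaustively, since the keyword set is open-ended, but a seed list covers the common cases. The sketch below generates candidates from a hypothetical keyword list and adds a naive substring check for screening newly observed domains. Both function names are illustrative, and the check assumes its input is a registrable domain without subdomains.

```python
from itertools import product


def combosquat_candidates(brand: str, keywords: list[str], tlds: list[str]) -> list[str]:
    """Append and prepend each keyword, with and without a hyphen."""
    labels = set()
    for keyword, sep in product(keywords, ("", "-")):
        labels.add(f"{brand}{sep}{keyword}")  # appended: amazon-login
        labels.add(f"{keyword}{sep}{brand}")  # prepended: login-amazon
    return sorted(f"{label}.{tld}" for label in labels for tld in tlds)


def is_combosquat(domain: str, brand: str) -> bool:
    """True when the first label contains the brand plus extra characters."""
    label = domain.lower().split(".")[0]
    return brand in label and label != brand
```

Substring matching of this kind is deliberately broad: it will also flag unrelated words that happen to contain the brand string, so its output is a candidate list for triage rather than a verdict.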

The categories are not mutually exclusive. Attackers routinely combine techniques, registering a combosquat that also uses a homoglyph substitution, for instance. Research on "generated squatting domains" (GSDs) has shown that multi-technique domains are increasingly common and harder for single-method detectors to catch.

Detection approaches#

String distance metrics#

The simplest approach measures the edit distance between a candidate domain and a protected target. Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into another. Jaro-Winkler similarity gives extra weight to a shared prefix, reflecting how users tend to read left to right. These metrics are fast and effective for typosquats, but they have two blind spots. Combosquats contain the full brand string yet add several characters, so their edit distance is large and they score as low risk. Homoglyphs are handled poorly because the metrics compare codepoints, not appearance: a visually identical substitution counts the same as any other substitution, and in Punycode form the candidate barely resembles the target at all.
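
To make the behaviour concrete, here is a minimal sketch of Levenshtein distance using the standard two-row dynamic programming formulation. It is an illustration, not a tuned implementation; production systems would normally use an optimized library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions from a to b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(levenshtein("google", "gogle"))         # omission typosquat: 1
print(levenshtein("google", "googel"))        # transposition: 2
print(levenshtein("amazon", "amazon-login"))  # combosquat: 6
```

A threshold of one or two edits catches the typosquats but lets the combosquat through, even though it contains the brand verbatim. Plain Levenshtein also counts a transposition as two edits; the Damerau-Levenshtein variant treats it as one.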

Unicode confusable analysis#

Unicode Technical Standard #39 defines a skeleton algorithm that maps visually similar characters to a canonical form. If skeleton("paypal") equals skeleton("𝔭𝒶ỿ𝕡𝕒ℓ"), the two strings are confusable. Applying this algorithm to Punycode-decoded internationalized domain names catches homoglyph substitutions that string distance metrics overlook. The Unicode Consortium maintains a confusables data file with mappings across scripts, updated with each Unicode release.
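
The sketch below illustrates the idea with a small hand-picked mapping. It is not the UTS #39 data file: `CONFUSABLES` here is a hypothetical seven-entry subset chosen for the example, and the function follows only the normalize, map, normalize outline of the skeleton algorithm. A real deployment would load the full confusables data published by the Unicode Consortium.

```python
import unicodedata

# Illustrative subset only; the real confusables file has thousands of entries.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}


def skeleton(text: str) -> str:
    """Simplified skeleton: NFD, map each character to its prototype, NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def decode_label(label: str) -> str:
    """Decode a Punycode label (xn--...) to Unicode; pass others through."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


def is_confusable(candidate: str, target: str) -> bool:
    """Different strings with the same skeleton are treated as confusable."""
    return candidate != target and skeleton(candidate) == skeleton(target)


# Build the Punycode form of a label with a leading Cyrillic "a", then check it.
homoglyph = "\u0430pple"
a_label = "xn--" + homoglyph.encode("punycode").decode("ascii")
print(is_confusable(decode_label(a_label), "apple"))  # True
```

Decoding before comparison matters: the skeleton has to be computed on the Unicode form of the label, not on the ASCII-encoded form that appears in DNS data.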

Visual similarity models#

Rendering two domain strings as bitmap images and comparing them pixel-by-pixel captures what no character-level metric can, namely whether two strings actually look the same to a human eye. Siamese convolutional neural networks trained on rendered string pairs have improved confusable detection by 13 to 45% (measured by ROC AUC) over edit-distance baselines. Raycasting through font vector outlines has discovered over 249,000 unique single-character confusable pairs across 245 fonts, more than three times the count identified by prior methods.

Machine learning at scale#

Large language model approaches have achieved over 94% accuracy on curated squatting datasets and detected tens of thousands of squatting domains from millions of new registrations, with detection rates roughly 2.5 times higher than baseline methods. Complementary approaches focusing on linguistic similarity rather than exact brand-name matching have identified thousands of attacker-acquired domains within weeks of monitoring.

Triage signals#

Detection generates candidates; triage separates confirmed threats from noise. Effective triage layers multiple signals:

Registration age. Domains registered within the past 30 days carry higher risk. WHOIS and RDAP creation dates provide this signal.
DNS resolution. A domain with active A or AAAA records is more likely to be weaponized than one that does not resolve. Passive DNS data reveals resolution history.
Mail infrastructure. The presence of MX records suggests potential use in email-based phishing.
TLS certificates. A recently issued certificate from a public CA, observable through Certificate Transparency logs, indicates the domain is preparing to serve HTTPS content.
Web content analysis. Fetching the page and checking for cloned logos, login forms, or brand assets provides the strongest signal of active brand impersonation.
Hosting context. Shared infrastructure with known malicious domains, low-reputation ASNs, or bulletproof hosting providers increases the risk score.
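
One way to layer the signals listed above is a weighted score. The sketch below is an assumption-heavy illustration: the `Signals` fields, the weights, and the 30-day cutoff applied in code are all hypothetical choices, with web content weighted highest only because it is described above as the strongest signal. The inputs are expected to come from separate WHOIS/RDAP, DNS, Certificate Transparency, and crawling steps that are not shown.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Signals:
    created: date             # creation date from WHOIS or RDAP
    resolves: bool            # active A or AAAA record
    has_mx: bool              # MX records present
    recent_certificate: bool  # recently issued certificate seen in CT logs
    brand_content: bool       # cloned logo, login form or brand assets found
    risky_hosting: bool       # shared with known-bad infrastructure

# Hypothetical weights summing to 100; tune against labelled data.
WEIGHTS = {
    "new_registration": 15,
    "resolves": 15,
    "has_mx": 15,
    "recent_certificate": 15,
    "brand_content": 30,
    "risky_hosting": 10,
}


def triage_score(signals: Signals, today: date) -> int:
    """Sum the weights of every signal that fires for a candidate domain."""
    score = 0
    if (today - signals.created).days <= 30:
        score += WEIGHTS["new_registration"]
    if signals.resolves:
        score += WEIGHTS["resolves"]
    if signals.has_mx:
        score += WEIGHTS["has_mx"]
    if signals.recent_certificate:
        score += WEIGHTS["recent_certificate"]
    if signals.brand_content:
        score += WEIGHTS["brand_content"]
    if signals.risky_hosting:
        score += WEIGHTS["risky_hosting"]
    return score
```

A parked domain registered years ago with no records scores 0, while a week-old domain that resolves, accepts mail, holds a fresh certificate, and serves a cloned login page scores near the maximum. Sorting candidates by such a score is one way to keep analyst attention on the top of the queue.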

False-positive rates remain a central challenge. Many domains that resemble a brand are benign, including fan sites, legitimate businesses with similar names, and parked domains with no active content. Alert fatigue from high false-positive rates is one of the primary reasons that domain monitoring programs fail in practice.

Closing gaps#

A monitoring program that implements only one detection method will miss entire categories. Typosquat enumeration does not find combosquats. Substring matching does not catch homoglyphs. Visual similarity models do not detect subdomain takeover. The taxonomy exists precisely to help security teams audit their coverage and identify blind spots.

Defensive registration can preemptively block high-risk variants, but the economics only scale for a small, prioritized subset. For the long tail, continuous monitoring and automated triage remain essential, and enforcement actions provide recourse when malicious registrations are identified.

Have I Been Squatted generates permutations across all categories in this taxonomy, from character-level mutations through combosquats and TLD variants. Each match is enriched with DNS, certificate, WHOIS, and hosting signals so that domain threat intelligence teams can focus on confirmed threats rather than raw candidate lists.
