Lookalike domains
Lookalike domains are an umbrella category covering any domain designed to visually resemble a legitimate target, including typosquats, homoglyphs, combosquats, and TLD variants. This guide maps the full taxonomy, surveys detection techniques from edit distance to neural visual similarity, and outlines practical triage strategies.
What lookalike domains are
"Lookalike domain" is a catch-all term for any domain name crafted to be confused with a legitimate target. The confusion may exploit a typing error, a visual resemblance between characters, or a structural trick that places the target brand in an unexpected position within the URL. Each technique abuses a different aspect of how humans read, type, or skim domain names, and each demands a different detection strategy.
The concept is not new, but the scale is. Over 200,000 new domains are registered daily, and monitoring services routinely identify hundreds of thousands of lookalike domains targeting major brands each year. Understanding the taxonomy is the first step toward closing detection gaps.
Taxonomy of techniques
Lookalike domain techniques fall into three broad families: character-level mutations, structural manipulations, and semantic extensions.
Character-level mutations
- Typosquatting covers keyboard-adjacent errors and simple misspellings. Subtypes include addition (googles.com), omission (gogle.com), transposition (googel.com), and vowel swap (goagle.com). These variants can be generated deterministically from the target string, making enumeration straightforward.
- Homoglyph substitution replaces characters with visually identical glyphs from other Unicode scripts. Cyrillic а (U+0430) is indistinguishable from Latin a (U+0061) in most fonts, so аpple.com rendered in a browser looks identical to apple.com. These domains are encoded as Punycode in DNS and form the basis of IDN homograph attacks.
- Bitsquatting registers domains that differ by a single bit in ASCII representation, exploiting hardware memory errors rather than human mistakes. The permutation set is small and deterministic.
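Because typosquats and bitsquats are deterministic, both can be enumerated directly from the target label. The following is a minimal sketch, not a complete permutation engine: the function names are hypothetical, the input is assumed to be a lowercase ASCII label without its TLD, and addition is limited to appending one letter.

```python
import string

VOWELS = "aeiou"
HOSTNAME_CHARS = set(string.ascii_lowercase + string.digits + "-")


def typo_variants(label: str) -> set[str]:
    """Single-edit typosquat candidates for one label (no TLD)."""
    out: set[str] = set()
    for i, ch in enumerate(label):
        # Omission: drop the character at position i (google -> gogle).
        out.add(label[:i] + label[i + 1:])
        # Transposition: swap positions i and i+1 (google -> googel).
        if i + 1 < len(label):
            out.add(label[:i] + label[i + 1] + ch + label[i + 2:])
        # Vowel swap: replace a vowel with each other vowel (google -> goagle).
        if ch in VOWELS:
            out.update(label[:i] + v + label[i + 1:] for v in VOWELS)
    # Addition: append one letter (google -> googles).
    out.update(label + c for c in string.ascii_lowercase)
    out.discard(label)  # swapping identical characters reproduces the input
    out.discard("")     # a one-character label has no omission variant
    return out


def bitsquat_variants(label: str) -> set[str]:
    """Labels that differ from the input by exactly one flipped bit."""
    out: set[str] = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            # Keep only lowercase hostname characters; an uppercase result
            # names the same domain because DNS is case-insensitive.
            if flipped in HOSTNAME_CHARS:
                candidate = label[:i] + flipped + label[i + 1:]
                if not candidate.startswith("-") and not candidate.endswith("-"):
                    out.add(candidate)
    return out
```

For the label google, the first function yields the four examples named above (googles, gogle, googel, goagle) among others, and the second yields candidates such as foogle, where the low bit of the first byte has flipped.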
Structural manipulations
- Hyphenation inserts hyphens into a brand name (pay-pal.com), creating a domain that reads naturally but points elsewhere.
- TLD squatting registers the same label under a different top-level domain (amazon.co instead of amazon.com), betting that users will not notice the suffix difference.
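Both structural manipulations are also easy to enumerate. The sketch below assumes the input is a registrable domain with a single label before the suffix (no subdomains); the function names and the TLD list passed in are illustrative, not taken from any particular tool.

```python
def hyphen_variants(label: str) -> list[str]:
    """Insert one hyphen at every interior position of the label."""
    return [
        label[:i] + "-" + label[i:]
        for i in range(1, len(label))
        # Skip positions that would create a double hyphen.
        if label[i - 1] != "-" and label[i] != "-"
    ]


def tld_variants(domain: str, tlds: list[str]) -> list[str]:
    """Pair the same label with each alternative TLD."""
    label, _, current = domain.partition(".")
    return [f"{label}.{tld}" for tld in tlds if tld != current]
```

Calling hyphen_variants("paypal") includes pay-pal, and tld_variants("amazon.com", ["co", "com", "net"]) returns amazon.co and amazon.net.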
Semantic extensions
- Combosquatting appends or prepends keywords to a brand name (amazon-login.com, paypal-security.net). Longitudinal DNS analysis shows that combosquatting abuse grows year over year, with almost 60% of abusive combosquatting domains persisting for over 1,000 days.
- Keyword squatting targets generic industry terms rather than a specific brand, capturing traffic from users searching for a category rather than a company.
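Unlike typosquats, combosquats cannot be enumerated exhaustively, because the keyword set is open-ended. A rough sketch of both directions follows: generating candidates from a keyword list and flagging observed domains that embed the brand. The keyword list and function names are hypothetical, and the detector assumes its input is a registrable domain without subdomains.

```python
def combosquat_variants(brand: str, keywords: list[str], tld: str = "com") -> list[str]:
    """Prepend and append each keyword, with and without a hyphen."""
    out = []
    for kw in keywords:
        out.extend([
            f"{brand}-{kw}.{tld}",
            f"{brand}{kw}.{tld}",
            f"{kw}-{brand}.{tld}",
            f"{kw}{brand}.{tld}",
        ])
    return out


def is_combosquat(domain: str, brand: str) -> bool:
    """True when the first label contains the brand plus extra characters."""
    label = domain.lower().split(".")[0]
    return brand in label and label != brand
```

Plain substring matching like this is deliberately crude: it would also flag an unrelated label such as amazonia, which is one source of the false positives that triage has to absorb.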
The categories are not mutually exclusive. Attackers routinely combine techniques, registering a combosquat that also uses a homoglyph substitution, for instance. Research on "generated squatting domains" (GSDs) has shown that multi-technique domains are increasingly common and harder for single-method detectors to catch.
Detection approaches
String distance metrics
The simplest approach measures the edit distance between a candidate domain and a protected target. Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into another. Jaro-Winkler similarity rewards a shared prefix, reflecting how users tend to read left to right. These metrics are fast and effective for typosquats, but they have two blind spots. Combosquats contain the full brand string, yet the added keyword puts them many edits away from the target, so they score as low risk. Homoglyphs are handled poorly because the metrics compare codepoints, not appearance: a substitution that is invisible to the eye counts the same as any other one-character change, and in Punycode form the label shares little with the target.
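To make the trade-off concrete, here is a small dynamic-programming sketch of Levenshtein distance using only the standard library. It is one common formulation (two rolling rows), offered as an illustration and not as the implementation any particular monitoring product uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep when equal
            ))
        prev = curr
    return prev[-1]
```

With this function, google and gogle are 1 edit apart, google and googel are 2 apart (plain Levenshtein counts a transposition as two substitutions), while amazon and amazon-login are 6 apart, which is why a threshold tuned for typosquats lets combosquats through.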
Unicode confusable analysis
Unicode Technical Standard #39 defines a skeleton algorithm that maps visually similar characters to a canonical form. If skeleton("paypal") equals skeleton("𝔭𝒶ỿ𝕡𝕒ℓ"), the two strings are confusable. Applying this algorithm to Punycode-decoded internationalized domain names catches homoglyph substitutions that string distance metrics overlook. The Unicode Consortium maintains a confusables data file with mappings across scripts, updated with each Unicode release.
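The skeleton comparison can be sketched in a few lines. The version below is simplified: it follows the NFD, map, NFD shape of the algorithm, but the mapping table is a tiny hand-picked subset of Cyrillic lookalikes chosen for illustration, where a real deployment would load the full confusables data file. The function names are hypothetical; the IDN decoding relies on Python's built-in idna codec.

```python
import unicodedata

# Illustrative subset only; the full table lives in the Unicode confusables file.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}


def skeleton(s: str) -> str:
    """Simplified skeleton: NFD, map each character to its prototype, NFD again."""
    decomposed = unicodedata.normalize("NFD", s)
    mapped = "".join(CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def is_confusable(a: str, b: str) -> bool:
    """Two distinct strings are confusable when their skeletons match."""
    return a != b and skeleton(a) == skeleton(b)


def decode_idn(ascii_domain: str) -> str:
    """Turn a Punycode (xn--) domain back into its Unicode form."""
    return ascii_domain.encode("ascii").decode("idna")
```

A typical pipeline would call decode_idn on each newly observed xn-- domain and then test is_confusable against every protected name.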
Visual similarity models
Rendering two domain strings as bitmap images and comparing them pixel-by-pixel captures what no character-level metric can, namely whether two strings actually look the same to a human eye. Siamese convolutional neural networks trained on rendered string pairs have improved confusable detection by 13 to 45% (measured by ROC AUC) over edit-distance baselines. Raycasting through font vector outlines has discovered over 249,000 unique single-character confusable pairs across 245 fonts, more than three times the count identified by prior methods.
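The core idea of comparing rendered pixels can be shown with a toy example. This is a much simpler stand-in for the neural and raycasting methods described above: it scores two equally sized binary bitmaps by the overlap of their inked pixels (Jaccard index). The 5x5 glyphs are hand-drawn for illustration and do not come from any real font; a practical system would rasterize the strings with an actual font renderer.

```python
Bitmap = list[str]  # rows of "#" (ink) and "." (background)

# Hypothetical hand-drawn glyphs, not taken from a real font.
GLYPH_LOWER_L = ["..#..", "..#..", "..#..", "..#..", "..#.."]
GLYPH_UPPER_I = [".###.", "..#..", "..#..", "..#..", ".###."]
GLYPH_LOWER_O = [".###.", "#...#", "#...#", "#...#", ".###."]


def inked(bitmap: Bitmap) -> set[tuple[int, int]]:
    """Coordinates of all inked pixels."""
    return {
        (r, c)
        for r, row in enumerate(bitmap)
        for c, px in enumerate(row)
        if px == "#"
    }


def pixel_similarity(a: Bitmap, b: Bitmap) -> float:
    """Jaccard overlap of inked pixels; 1.0 means pixel-identical."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have identical dimensions")
    ink_a, ink_b = inked(a), inked(b)
    union = ink_a | ink_b
    if not union:
        return 1.0  # two blank images are trivially identical
    return len(ink_a & ink_b) / len(union)
```

Even in this toy setting, the lowercase l scores closer to the uppercase I than to the o, a relationship that no codepoint comparison would reveal.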
Machine learning at scale
Large language model approaches have achieved over 94% accuracy on curated squatting datasets and detected tens of thousands of squatting domains from millions of new registrations, with detection rates roughly 2.5 times higher than baseline methods. Complementary approaches focusing on linguistic similarity rather than exact brand-name matching have identified thousands of attacker-acquired domains within weeks of monitoring.
Triage signals
Detection generates candidates; triage separates confirmed threats from noise. Effective triage layers multiple signals:
- Registration age. Domains registered within the past 30 days carry higher risk. WHOIS and RDAP creation dates provide this signal.
- DNS resolution. A domain with active A or AAAA records is more likely to be weaponized than one that does not resolve. Passive DNS data reveals resolution history.
- Mail infrastructure. The presence of MX records suggests potential use in email-based phishing.
- TLS certificates. A recently issued certificate from a public CA, observable through Certificate Transparency logs, indicates the domain is preparing to serve HTTPS content.
- Web content analysis. Fetching the page and checking for cloned logos, login forms, or brand assets provides the strongest signal of active brand impersonation.
- Hosting context. Shared infrastructure with known malicious domains, low-reputation ASNs, or bulletproof hosting providers increases the risk score.
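One way to layer these signals is a simple additive score. The sketch below assumes the signals have already been collected by separate enrichment steps; the field names, the weights, and the 30-day window parameter are all hypothetical choices for illustration and would need tuning against real data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical weights; web content is weighted highest because it is
# described above as the strongest signal.
WEIGHTS = {
    "new_registration": 2,
    "resolves": 1,
    "has_mx": 1,
    "recent_cert": 1,
    "brand_content": 4,
    "bad_hosting": 2,
}


@dataclass
class Candidate:
    domain: str
    created: Optional[date]  # WHOIS/RDAP creation date, if available
    resolves: bool           # active A or AAAA records
    has_mx: bool             # MX records present
    recent_cert: bool        # recent certificate seen in CT logs
    brand_content: bool      # cloned logos, login forms, brand assets
    bad_hosting: bool        # low-reputation or shared malicious hosting


def triage_score(c: Candidate, today: date, new_days: int = 30) -> int:
    """Sum the weights of every signal that fires for this candidate."""
    score = 0
    if c.created is not None and (today - c.created).days <= new_days:
        score += WEIGHTS["new_registration"]
    for name in ("resolves", "has_mx", "recent_cert", "brand_content", "bad_hosting"):
        if getattr(c, name):
            score += WEIGHTS[name]
    return score
```

Candidates can then be sorted by score so that analysts review the highest-scoring matches first and parked or dormant lookalikes sink to the bottom of the queue.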
False-positive rates remain a central challenge. Many domains that resemble a brand are benign, including fan sites, legitimate businesses with similar names, and parked domains with no active content. Alert fatigue from high false-positive rates is one of the primary reasons that domain monitoring programs fail in practice.
Closing gaps
A monitoring program that implements only one detection method will miss entire categories. Typosquat enumeration does not find combosquats. Substring matching does not catch homoglyphs. Visual similarity models do not detect subdomain takeover. The taxonomy exists precisely to help security teams audit their coverage and identify blind spots.
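A coverage audit of this kind can be expressed as set arithmetic. In the sketch below, the category names follow the taxonomy in this guide, while the detector names and the assignment of categories to detectors are assumptions made for illustration; a real audit would fill in the mapping from what each deployed detector has been verified to catch.

```python
TAXONOMY = {
    "typosquat", "homoglyph", "bitsquat",
    "hyphenation", "tld_swap",
    "combosquat", "keyword_squat",
}

# Hypothetical mapping from detector to the categories it is assumed to cover.
DETECTOR_COVERAGE = {
    "typo_enumeration": {"typosquat", "bitsquat", "hyphenation"},
    "substring_match": {"combosquat"},
    "confusable_skeleton": {"homoglyph"},
}


def blind_spots(enabled: list[str]) -> set[str]:
    """Taxonomy categories that no enabled detector covers."""
    covered: set[str] = set()
    for name in enabled:
        covered |= DETECTOR_COVERAGE[name]
    return TAXONOMY - covered
```

Under these assumed mappings, enabling only typo_enumeration leaves combosquats and homoglyphs uncovered, and even all three detectors together leave TLD swaps and keyword squats as gaps.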
Defensive registration can preemptively block high-risk variants, but the economics only scale for a small, prioritized subset. For the long tail, continuous monitoring and automated triage remain essential, and enforcement actions provide recourse when malicious registrations are identified.
Have I Been Squatted generates permutations across all categories in this taxonomy, from character-level mutations through combosquats and TLD variants. Each match is enriched with DNS, certificate, WHOIS, and hosting signals so that domain threat intelligence teams can focus on confirmed threats rather than raw candidate lists.