What are homoglyphs?
Homoglyphs are characters from different scripts or encodings that look identical or nearly identical to each other. This guide explains Unicode confusable characters, the UTS #39 skeleton algorithm, browser defenses against homoglyph abuse, and detection strategies for identifying deceptive domain names.
What it is
A homoglyph is a character that looks identical or very similar to another character from a different script, encoding, or code point. The term comes from the Greek homo (same) and glyphē (carving). The Latin letter a (U+0061) and the Cyrillic letter а (U+0430) render identically in most typefaces, yet they are entirely different Unicode code points with different semantic meanings.
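The difference between the two letters is easy to confirm programmatically. The following minimal sketch uses only Python's standard `unicodedata` module; the helper name `describe` is made up for this illustration.

```python
import unicodedata

latin_a = "\u0061"     # Latin small letter a
cyrillic_a = "\u0430"  # Cyrillic small letter a


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(latin_a))
print(describe(cyrillic_a))

# The glyphs may render identically, but string comparison sees two
# unrelated code points.
print(latin_a == cyrillic_a)
```

Any exact-match check (allow-lists, database lookups, `==`) therefore treats the two spellings as unrelated strings, even though a reader cannot tell them apart.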
The characters themselves are not malicious. They exist because Unicode encodes each writing system independently, and visual overlap between scripts is an unavoidable consequence of how human alphabets evolved from common ancestors.
A homoglyph attack is any use of lookalike characters to deceive, whether in usernames, file names, email addresses, code, or rendered text. IDN homograph attacks are a specific subtype focused on domain names. They exploit the fact that internationalized domain names (IDNs) allow non-Latin characters in web addresses, so a fake domain registered with lookalike Cyrillic characters can render identically to a real one in a browser's address bar. Where a general homoglyph attack targets any surface where text appears, an IDN homograph attack targets the domain name specifically, with the added complication of Punycode encoding and browser display rules.
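To make the Punycode point concrete, the sketch below encodes a lookalike label with Python's built-in `idna` codec. That codec implements the older IDNA 2003 rules, so treat it as an illustration rather than a reference for current registry or browser behavior. The domain strings are examples chosen for this sketch.

```python
# "apple.com" with the first letter replaced by Cyrillic а (U+0430).
spoofed = "\u0430pple.com"
genuine = "apple.com"

# The built-in "idna" codec converts each non-ASCII label to its
# ASCII-compatible form, which always starts with the "xn--" prefix.
encoded = spoofed.encode("idna")

print(encoded)                 # ASCII form of the lookalike domain
print(genuine.encode("idna"))  # pure ASCII labels pass through unchanged

# Decoding restores the Unicode form that a browser might display.
print(encoded.decode("idna") == spoofed)
```

The ASCII form is what DNS actually resolves, which is why showing it in place of the Unicode form is a common fallback for suspicious labels.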
Unicode confusables and UTS #39
The Unicode Consortium catalogs visually confusable characters in Unicode Technical Standard #39 (UTS #39, "Unicode Security Mechanisms"). The standard defines three core mechanisms:
- Confusable mappings. Character pairs or sets that are visually similar. The current `confusables.txt` dataset maps approximately 6,565 characters to their visual equivalents, spanning Latin, Cyrillic, Greek, Armenian, Cherokee, and dozens of other scripts.
- Skeleton algorithm. A normalization function that maps confusable characters to a canonical form. Two strings that look the same to a human produce the same skeleton, enabling automated detection. For example, `skeleton("paypal")` and `skeleton("ρ⍺у𝓅𝒂ן")` return identical output despite the second string mixing Greek, Cyrillic, and Hebrew characters.
- Mixed-script detection. Rules for identifying strings that combine characters from multiple scripts, a strong signal of intentional deception in contexts like domain names.
The confusable data is updated with each Unicode release. Up to 20% of common English words have at least one confusable representation in another script, illustrating the scale of the problem.
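The skeleton comparison described above can be sketched in a few lines. This is a simplified illustration, not the full UTS #39 algorithm: `TOY_CONFUSABLES` is a hand-picked, hypothetical subset of mappings, whereas a real implementation would load the complete `confusables.txt` data and also handle details such as default-ignorable characters.

```python
import unicodedata

# Hypothetical, hand-picked subset of confusable mappings
# (lookalike -> Latin prototype). The real data file has thousands of entries.
TOY_CONFUSABLES = {
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u03bf": "o",  # Greek ο
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u0441": "c",  # Cyrillic с
    "\u0445": "x",  # Cyrillic х
    "\u0456": "i",  # Cyrillic і
    "\u04cf": "l",  # Cyrillic palochka ӏ
}


def toy_skeleton(text: str) -> str:
    """Decompose, map each character to its prototype, decompose again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(TOY_CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def looks_alike(a: str, b: str) -> bool:
    """Two strings are treated as confusable if their skeletons match."""
    return toy_skeleton(a) == toy_skeleton(b)


spoof = "\u0440\u0430ypal"  # Cyrillic р and а followed by Latin "ypal"
print(spoof == "paypal")              # plain comparison: different strings
print(looks_alike(spoof, "paypal"))   # skeleton comparison: same skeleton
```

Note that the skeleton is only a comparison key. It is not meant to be displayed or stored as a cleaned-up version of the input.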
Common homoglyph pairs
Some of the most frequently exploited homoglyph pairs in domain abuse include:
| Latin | Lookalike | Script | Unicode | Notes |
|---|---|---|---|---|
| a | а | Cyrillic | U+0430 | Visually identical in most fonts |
| e | е | Cyrillic | U+0435 | Identical rendering |
| o | ο | Greek | U+03BF | Greek omicron, identical to Latin o |
| o | о | Cyrillic | U+043E | Three-way confusable with Greek omicron |
| p | р | Cyrillic | U+0440 | Cyrillic "er", identical to Latin p |
| c | с | Cyrillic | U+0441 | Cyrillic "es", identical to Latin c |
| x | х | Cyrillic | U+0445 | Cyrillic "kha", identical to Latin x |
| i | і | Cyrillic | U+0456 | Ukrainian/Belarusian i |
| l | ӏ | Cyrillic | U+04CF | Cyrillic palochka |
Beyond cross-script pairs, homoglyphs exist within a single script. The digit 0 and the letter O, the digit 1 and lowercase l, are classic confusables that use only ASCII (American Standard Code for Information Interchange, the long-standing 7-bit character set for English letters, digits, and common punctuation). They predate Unicode by decades. These intra-script pairs overlap with simpler typosquatting techniques like addition and vowel swap, but cross-script homoglyphs create a qualitatively different threat because the resulting strings can be character-for-character identical to the target in rendered output.
Whole-script confusables
A whole-script confusable replaces every character in a string with a visually identical character from a single non-Latin script. For domain names, this is the most dangerous form of homoglyph abuse because early browser heuristics only flagged labels that mixed scripts. A label written entirely in Cyrillic passed those checks even when it was character-for-character identical to an English word. The domain-specific mechanics of this technique, including Punycode encoding, browser display rules, certificate issuance, and registry controls, are covered in IDN homograph attacks.
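The gap in a mixed-script-only check can be illustrated with a short sketch. Python's standard library does not expose the Unicode Script property, so `approx_script` below guesses the script from the first word of each character's Unicode name. That is a heuristic assumed for this example; a production check would use the actual Script property data from the Unicode Character Database.

```python
import unicodedata


def approx_script(ch: str) -> str:
    """Guess a script from the first word of the character's Unicode name.

    Heuristic only: digits, hyphens, and other non-letters count as "Common".
    """
    if not ch.isalpha():
        return "Common"
    return unicodedata.name(ch, "UNKNOWN").split()[0].title()


def scripts_in(label: str) -> set:
    """Collect the scripts used by the letters of a label."""
    return {approx_script(ch) for ch in label} - {"Common"}


def is_mixed_script(label: str) -> bool:
    return len(scripts_in(label)) > 1


mixed = "\u0430pple"                   # Cyrillic а + Latin "pple"
whole = "\u0430\u0440\u0440\u04cf\u0435"  # every letter is Cyrillic

print(scripts_in(mixed), is_mixed_script(mixed))
print(scripts_in(whole), is_mixed_script(whole))
```

The all-Cyrillic label uses a single script, so a rule that only looks for script mixing lets it through. Catching it requires comparing its skeleton against the protected name as well.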
Attack surfaces beyond the address bar
Browser address-bar protections for homoglyph domains are specific to the domain-name context and are explained in IDN homograph attacks. Outside of domain names, homoglyph abuse applies to any surface where text is rendered and read by a human:
- Email display names and headers. A sender display name using a Cyrillic `а` in place of a Latin `a` passes most mail filters unmodified, since the underlying envelope address may be legitimate.
- Usernames and handles. Platforms that allow Unicode in usernames can host accounts that impersonate real users or brands at a character level rather than a spelling level.
- Source code. Variable names, string literals, and comments can contain confusable characters that compile and lint cleanly while deceiving a human reviewer during code review.
- Document and chat links. Hyperlink text in PDFs, slide decks, and messaging platforms displays whatever Unicode the author supplied; the underlying URL may differ entirely from what the rendered text implies.
- Log output and terminals. Confusable characters in process names, file paths, or log entries can mislead analysts inspecting system activity.
In all of these contexts, no equivalent of the browser's Punycode fallback exists. Detection relies on the same skeleton-based and visual similarity analysis used for domain monitoring, applied to the relevant data source.
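For surfaces such as source code, usernames, or log lines, a cheap first pass is to flag tokens that mix ASCII letters with non-ASCII letters. The sketch below is one possible approach under that assumption; the function name `suspicious_tokens` and the sample line are invented for the example.

```python
import re
import unicodedata

WORD = re.compile(r"\w+")  # \w matches Unicode letters, digits, and underscore


def suspicious_tokens(text: str) -> list:
    """Return (offset, token, details) for tokens mixing ASCII and non-ASCII letters."""
    findings = []
    for match in WORD.finditer(text):
        token = match.group()
        non_ascii = [ch for ch in token if ch.isalpha() and not ch.isascii()]
        has_ascii_letter = any(ch.isalpha() and ch.isascii() for ch in token)
        if non_ascii and has_ascii_letter:
            details = [
                f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
                for ch in non_ascii
            ]
            findings.append((match.start(), token, details))
    return findings


# The identifier below contains Cyrillic а (U+0430) instead of Latin a.
line = "is_\u0430dmin = check(user)"
for offset, token, details in suspicious_tokens(line):
    print(offset, token, details)
```

This pass is deliberately crude. It also flags legitimate accented words, and it misses tokens written entirely in one non-Latin script, so its output is best treated as input to skeleton comparison against known names rather than as a verdict.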
Detection
Identifying homoglyph-based lookalike domains requires purpose-built detection, since standard string comparison treats visually identical cross-script characters as completely different. The skeleton algorithm described in the UTS #39 section above is the standard baseline for automated detection, but it has known limitations. It does not normalize case (PayPal and paypal produce different skeletons), does not strip diacritics, and cannot catch confusables that depend on specific font rendering.
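One possible way to work around the case and diacritic gaps is a pre-processing pass applied to both strings before their skeletons are compared. The sketch below is an assumption about how such a pass might look and is not part of UTS #39. Stripping marks in particular trades missed detections for more false positives against legitimately accented names.

```python
import unicodedata


def fold_for_matching(text: str) -> str:
    """Strip combining marks and fold case before skeleton comparison."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop nonspacing marks (category Mn), e.g. the grave accent in "à".
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # casefold() is a more aggressive, comparison-oriented form of lower().
    return without_marks.casefold()


print(fold_for_matching("PayPal"))
print(fold_for_matching("p\u00e0ypal"))  # "pàypal" with a precomposed à
```

Font-dependent confusables are not addressed by any amount of string folding, which is the gap that visual similarity models aim to cover.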
Visual similarity models. Research has explored rendering domain strings as images and comparing them with convolutional neural networks. Siamese CNNs trained on rendered text have achieved 13 to 45% improvement over traditional string-comparison algorithms like Levenshtein distance for detecting visually deceptive domains.
Layered enrichment. Homoglyph detection produces false positives against legitimate non-Latin domain registrations. Effective triage combines confusable matching with additional signals: WHOIS and RDAP registration recency, TLS certificate issuance via Certificate Transparency logs, DNS resolution patterns, and web content analysis. This layered approach mirrors the methodology used for other typosquatting permutation categories.
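The layered triage idea can be sketched as a simple additive score. Everything in this example is hypothetical: the `Candidate` fields, the weights, and the 30-day recency window are placeholders chosen for illustration, and the enrichment data is assumed to have been collected elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    domain: str
    skeleton_match: bool                    # confusable with a protected name
    days_since_registration: Optional[int]  # None when WHOIS/RDAP gave no date
    has_ct_certificate: bool                # seen in Certificate Transparency logs
    resolves_in_dns: bool
    serves_login_form: bool                 # from web content analysis


def triage_score(c: Candidate) -> int:
    """Combine independent signals; weights are illustrative, not calibrated."""
    if not c.skeleton_match:
        return 0
    score = 1
    if c.days_since_registration is not None and c.days_since_registration <= 30:
        score += 2
    if c.has_ct_certificate:
        score += 1
    if c.resolves_in_dns:
        score += 1
    if c.serves_login_form:
        score += 3
    return score


fresh = Candidate("lookalike-one.example", True, 3, True, True, True)
dormant = Candidate("lookalike-two.example", True, 2000, False, True, False)

for candidate in sorted([dormant, fresh], key=triage_score, reverse=True):
    print(candidate.domain, triage_score(candidate))
```

The point of the design is that a confusable match alone yields a low score, so long-standing legitimate registrations in non-Latin scripts sink to the bottom of the review queue instead of triggering alerts.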
Monitoring and response
Unlike bitsquatting, which produces a small deterministic set of variants, the homoglyph permutation space for a given domain can be extremely large. A domain composed entirely of characters that have cross-script confusables (like apple, where every letter has a Cyrillic equivalent) produces a combinatorial explosion of mixed and whole-script variants. Prioritization typically focuses on whole-script confusables and high-frequency character substitutions rather than exhaustive enumeration.
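The growth in variants is easy to quantify. The sketch below uses a small, hypothetical `LOOKALIKES` table containing only the cross-script pairs listed earlier in this guide; with the full confusables data the counts would be far larger.

```python
from itertools import product
from math import prod

# Hypothetical lookup limited to the pairs from the table in this guide.
LOOKALIKES = {
    "a": ["\u0430"],            # Cyrillic а
    "e": ["\u0435"],            # Cyrillic е
    "o": ["\u03bf", "\u043e"],  # Greek ο, Cyrillic о
    "p": ["\u0440"],            # Cyrillic р
    "c": ["\u0441"],            # Cyrillic с
    "x": ["\u0445"],            # Cyrillic х
    "i": ["\u0456"],            # Cyrillic і
    "l": ["\u04cf"],            # Cyrillic palochka ӏ
}


def options(ch: str) -> list:
    """The original character plus every lookalike known for it."""
    return [ch] + LOOKALIKES.get(ch, [])


def variant_count(label: str) -> int:
    """Number of distinct variants, excluding the original label."""
    return prod(len(options(ch)) for ch in label) - 1


def variants(label: str):
    """Lazily enumerate every mixed and whole-script variant."""
    for combo in product(*(options(ch) for ch in label)):
        candidate = "".join(combo)
        if candidate != label:
            yield candidate


print(variant_count("apple"))
print(variant_count("cocacola"))
```

Because the count is a product over character positions, each additional substitutable letter multiplies the total. That is why enumeration is usually restricted to the highest-risk substitutions rather than run exhaustively.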
Defensive registration of homoglyph variants is possible but less practical than for categories with smaller permutation counts. Domain monitoring and brand protection enforcement through registrar abuse processes and UDRP proceedings are more scalable responses, particularly for phishing campaigns that leverage homoglyph domains for brand impersonation.
Have I Been Squatted generates homoglyph variants for every monitored domain and checks whether they are registered. Matches are enriched with DNS, certificate, and hosting data to help distinguish active phishing infrastructure from legitimate internationalized registrations.