IDN homograph attacks (How they work and defenses) - Typosquatting

What it is

An IDN homograph attack is a specific form of homoglyph attack that targets domain names. Where the broader homoglyph attack category covers any context in which lookalike characters deceive (usernames, file names, rendered text, code), the IDN variant exploits internationalized domain names (IDNs) specifically. IDNs permit a large repertoire of non-ASCII Unicode characters, which DNS carries as ASCII through Punycode encoding. An attacker can therefore register a domain composed of non-Latin characters that displays identically to a legitimate brand domain in a browser's address bar but resolves to completely different infrastructure. The concept was first described in the early 2000s, when researchers registered a homoglyph variant of "microsoft.com" using Cyrillic characters to demonstrate the risk.

A simple example is the Cyrillic letter а (U+0430), which is visually indistinguishable from the Latin letter a (U+0061) in most fonts. Replacing every Latin character in "apple" with a Cyrillic lookalike (а, р, р, ӏ, е, where the stand-in for "l" is the palochka, U+04CF) produces a label that looks identical to apple.com in an address bar but exists in DNS as xn--80ak6aa92e.com. No rendering trick or font manipulation is involved; the characters are genuinely different code points that happen to be drawn with the same glyph.
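The relationship between the displayed label and its DNS form can be checked with Python's built-in `punycode` codec. This is a minimal sketch: the helper name `to_a_label` is made up for illustration, and it applies raw Punycode plus the `xn--` prefix only, without the mapping and validity checks a full IDNA implementation performs.

```python
# Cyrillic lookalikes for "apple": U+0430, U+0440, U+0440, U+04CF (palochka), U+0435.
spoof_label = "\u0430\u0440\u0440\u04cf\u0435"
real_label = "apple"


def to_a_label(label):
    """Return the ASCII (xn--) form of a single label.

    Simplification: raw Punycode plus the ACE prefix, no IDNA validation.
    """
    if label.isascii():
        return label  # plain ASCII labels are left untouched
    return "xn--" + label.encode("punycode").decode("ascii")


# The two labels render alike but share no code points.
print([f"U+{ord(ch):04X}" for ch in spoof_label])
# ['U+0430', 'U+0440', 'U+0440', 'U+04CF', 'U+0435']
print(spoof_label == real_label)        # False
print(to_a_label(spoof_label) + ".com")  # xn--80ak6aa92e.com
```

Comparing code points rather than rendered text is the only reliable way to tell the two labels apart.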

Confusable scripts

The attack is most effective when every character in the label comes from a single non-Latin script, because mixed-script labels trigger browser defenses. The most dangerous script pairings for Latin-alphabet brands are:

Cyrillic. Shares visual equivalents for a, c, e, o, p, x, y, and several other Latin letters, making it possible to construct entire English words in Cyrillic alone. This is the script used in nearly all documented real-world IDN homograph attacks.
Greek. Shares omicron (ο, identical to Latin o) and several uppercase forms. The overlap is narrower, limiting whole-word construction, but single-character substitutions can still evade casual inspection.
Armenian. Some lowercase characters resemble Latin equivalents. The confusable set is smaller than Cyrillic but remains relevant for targeted brand impersonation.

Whole-script confusable attacks, where every character in the label belongs to one non-Latin script, are the hardest to detect visually. Unless the raw Punycode is displayed or the user inspects the certificate details, the Unicode rendering of xn--80ak6aa92e.com (all-Cyrillic "apple") cannot be told apart from apple.com.
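The difference between a mixed-script label and a whole-script one can be sketched in a few lines. Python's standard library does not expose the Unicode Script property, so this example approximates it from the first word of each character's Unicode name; that shortcut, and the function names `scripts_in` and `classify`, are assumptions made for illustration rather than how any browser does it.

```python
import unicodedata

# Name prefixes treated as script-neutral in this sketch (digits and hyphen).
NEUTRAL = {"DIGIT", "HYPHEN-MINUS"}


def scripts_in(label):
    """Approximate the set of scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        prefix = unicodedata.name(ch, "UNKNOWN").split()[0]
        if prefix not in NEUTRAL:
            scripts.add(prefix)
    return scripts


def classify(label):
    """Label a string as latin, mixed-script, or whole-script <script>."""
    scripts = scripts_in(label)
    if scripts <= {"LATIN"}:
        return "latin"
    if len(scripts) > 1:
        return "mixed-script"
    return "whole-script " + scripts.pop().lower()


print(classify("apple"))                           # latin
print(classify("\u0430pple"))                      # mixed-script
print(classify("\u0430\u0440\u0440\u04cf\u0435"))  # whole-script cyrillic
```

A production check would rely on real script data (for example from ICU) instead of character names.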

The apple.com proof of concept

The most widely cited demonstration came in 2017 from security researcher Xudong Zheng, who registered xn--80ak6aa92e.com, a domain composed entirely of Cyrillic characters that renders as "apple.com" when decoded to Unicode. Because every character belonged to a single script, the mixed-script filters then used by Chrome, Firefox, and Opera did not trigger, and the address bar displayed what appeared to be the legitimate Apple domain, complete with a valid TLS certificate.

After disclosure, Google incorporated a fix into Chrome that flags whole-script confusable labels. Firefox initially declined to change its default behavior, though users could force Punycode display through a manual configuration flag (network.IDN_show_punycode). Safari and Internet Explorer were not affected; both already applied stricter IDN display rules at the time.

The demonstration made the abstract risk concrete. A domain that passed browser address-bar checks, held a valid HTTPS certificate, and could host a pixel-perfect clone of apple.com was available for the cost of a standard registration. It also highlighted a limit of automated domain validation: Let's Encrypt and similar services verify control of a domain, not its visual similarity to other names, so xn--80ak6aa92e.com received a certificate and the attacker got a green padlock in the address bar.

How the attack works

Character selection. The attacker identifies homoglyph characters from Cyrillic, Greek, or other scripts that are visually identical to the Latin characters in the target domain.
Domain registration. The attacker registers the Punycode-encoded form (e.g., xn--80ak6aa92e.com) through any registrar that supports IDN registrations.
Infrastructure setup. The attacker configures DNS, obtains a TLS certificate (often via automated certificate authorities), and builds a credential-harvesting or malware-delivery page.
Delivery. The domain is distributed via phishing emails, ads, QR codes, or messaging platforms.

A single-character substitution in a mixed-script label (Cyrillic а replacing Latin a in an otherwise Latin string) is relatively easy for browsers to flag. The more dangerous variant uses an all-Cyrillic label that happens to look like an English word. Detecting that case requires the browser to compare the rendered label against a dictionary or list of known high-value domains.

Browser mitigations

Major browsers have implemented layered defenses that have evolved significantly since the 2017 disclosure:

Mixed-script blocking. If a label contains characters from multiple scripts (e.g., Latin + Cyrillic), the browser displays the raw xn-- Punycode instead of decoded Unicode. This catches partial substitutions but not whole-script attacks.
Whole-script confusable detection. Chrome and Firefox apply the Unicode Consortium's UTS #39 skeleton algorithm, which normalizes visually confusable characters to a canonical form for comparison. If a single-script IDN label produces a skeleton that matches a known Latin domain, the browser displays Punycode.
Top-domain matching. Some browsers maintain lists of high-value domains and force Punycode display for any IDN label confusable with an entry on the list.
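The three layers above can be combined into a single display decision. The sketch below is a rough approximation, not any browser's actual policy: `TO_PROTOTYPE` is a hand-picked confusable table standing in for the much larger UTS #39 data (which may map some characters differently), `TOP_LABELS` is a hypothetical protected list, and script detection is approximated from Unicode character names.

```python
import unicodedata

# Illustrative confusable table (an assumption, NOT the UTS #39 data file).
TO_PROTOTYPE = {
    "\u0430": "a", "\u0441": "c", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0445": "x", "\u0443": "y", "\u04cf": "l",
    "\u03bf": "o",  # Greek omicron
}

# Hypothetical list of protected high-value labels.
TOP_LABELS = {"apple", "paypal"}


def scripts_in(label):
    """Approximate script detection from the first word of each character name."""
    names = (unicodedata.name(ch, "UNKNOWN").split()[0] for ch in label)
    return {n for n in names if n not in {"DIGIT", "HYPHEN-MINUS"}}


def skeleton(label):
    """Simplified skeleton: NFD, map confusables to a prototype, NFD again."""
    decomposed = unicodedata.normalize("NFD", label)
    mapped = "".join(TO_PROTOTYPE.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def display_form(label):
    """Return what the address bar would show for one label in this sketch."""
    if label.isascii():
        return label
    a_label = "xn--" + label.encode("punycode").decode("ascii")
    if len(scripts_in(label)) > 1:
        return a_label  # mixed-script blocking
    if skeleton(label) in TOP_LABELS:
        return a_label  # whole-script confusable with a protected label
    return label        # ordinary IDN: show decoded Unicode


print(display_form("\u0430\u0440\u0440\u04cf\u0435"))  # xn--80ak6aa92e
```

An ordinary single-script label that does not collide with the protected list is still shown in Unicode, which is why the quality of the confusable table and the protected list determines how much this layer catches.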

These defenses work in address bars but coverage is not universal. Independent security research has found that all tested browsers retain gaps in their IDN display rules, and that some rules have even been relaxed over time. Email clients, chat applications, PDFs, QR codes, and mobile apps frequently display decoded Unicode without any confusability check, leaving significant attack surface outside the browser.

Registry and ICANN restrictions

ICANN's IDN Implementation Guidelines require registries to publish permitted Unicode code point repertoires and to prevent registrations that mix scripts in ways that violate the guidelines. In practice, enforcement varies:

Some ccTLD registries restrict labels to a single language or script and block registrations confusable with existing domains in other scripts.
A smaller number require proof of linguistic connection to the requested label.
Many gTLD registries impose minimal IDN restrictions beyond the ICANN baseline.
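A registry-side repertoire check can be as simple as testing each code point against the published table. The following is a sketch for an imaginary registry: the permitted set (ASCII letters, digits, hyphen, and four German letters) and the function name `registry_accepts` are hypothetical and do not reproduce any real registry's IDN table.

```python
import string

# Hypothetical code point repertoire for an imaginary single-language registry.
PERMITTED = set(string.ascii_lowercase + string.digits + "-") | set(
    "\u00e4\u00f6\u00fc\u00df"  # a-umlaut, o-umlaut, u-umlaut, sharp s
)


def registry_accepts(label):
    """Accept a label only if every code point is in the permitted repertoire."""
    if not label or label.startswith("-") or label.endswith("-"):
        return False
    return all(ch in PERMITTED for ch in label)


print(registry_accepts("m\u00fcller"))                     # True
print(registry_accepts("\u0430\u0440\u0440\u04cf\u0435"))  # False: Cyrillic not permitted
```

A restrictive table like this blocks the all-Cyrillic label outright, which is why attackers gravitate toward registries with broader repertoires.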

The fragmented enforcement across thousands of TLDs means motivated attackers can usually find a registry that accepts a confusable domain. Registry-level controls reduce bulk abuse but do not eliminate the threat for targeted attacks.

Why IDN homographs are especially dangerous

Several properties make IDN homograph attacks harder to defend against than other typosquatting categories:

Visual perfection. Unlike addition, omission, or transposition variants, a well-constructed IDN homograph is indistinguishable from the target in standard fonts. There is no misspelling to notice.
Certificate availability. Automated certificate authorities issue certificates based on domain control, not visual similarity. An attacker obtains a valid TLS certificate for the Punycode domain as easily as for any other.
Surface area beyond browsers. Address-bar defenses do not extend to email headers, chat links, PDF hyperlinks, QR codes, or mobile push notifications. In these contexts the decoded Unicode appears without any Punycode fallback.
Low permutation count. The number of viable whole-script homograph variants for a given brand is small (often single digits), making defensive registration feasible but also making each variant high-impact if missed.
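The low permutation count is easy to see when enumerating candidates for defensive registration. This sketch assumes a small, hand-picked table of Cyrillic lookalikes (`CYRILLIC_LOOKALIKES`, hypothetical and far from complete); a brand only has a whole-script variant if every one of its letters has an entry.

```python
from itertools import product

# Hand-picked Latin -> Cyrillic lookalikes (illustrative, not exhaustive).
CYRILLIC_LOOKALIKES = {
    "a": ["\u0430"], "c": ["\u0441"], "e": ["\u0435"], "i": ["\u0456"],
    "l": ["\u04cf"], "o": ["\u043e"], "p": ["\u0440"], "s": ["\u0455"],
    "x": ["\u0445"], "y": ["\u0443"],
}


def whole_script_variants(brand):
    """List all-Cyrillic lookalike labels for a brand, or [] if none exist."""
    choices = []
    for ch in brand:
        options = CYRILLIC_LOOKALIKES.get(ch)
        if not options:
            return []  # one letter without a lookalike rules out a whole-script label
        choices.append(options)
    return ["".join(combo) for combo in product(*choices)]


for brand in ("apple", "google"):
    variants = whole_script_variants(brand)
    a_labels = ["xn--" + v.encode("punycode").decode("ascii") for v in variants]
    print(brand, len(variants), a_labels)
# apple 1 ['xn--80ak6aa92e']
# google 0 []
```

With this table, "apple" yields a single candidate and "google" yields none, because "g" has no entry; a fuller confusable table would change the counts but typically not by orders of magnitude.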

Detection and monitoring

Detecting IDN homograph attacks requires analysis that goes beyond ASCII string comparison:

Punycode decoding. Decode xn-- labels from Certificate Transparency logs and zone file data to reveal the Unicode characters behind each registration.
Skeleton normalization. Apply the UTS #39 skeleton algorithm to compare candidate domains against protected brand names. Visually identical characters from different scripts reduce to the same skeleton, surfacing matches that byte-level string comparison would miss entirely.
Enrichment and triage. Cross-reference matches with DNS records, hosting infrastructure, WHOIS/RDAP registration data, and page content to distinguish active threats from parked or defensive registrations.
Defensive registration. For high-value brands, registering the most dangerous Punycode variants preemptively removes them from attacker inventory.
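The first two steps, Punycode decoding and skeleton normalization, can be sketched as a small matching function. Everything named here is an assumption for illustration: the hostnames are a made-up sample standing in for Certificate Transparency or zone file data, `BRANDS` is a hypothetical protected list, and `CONFUSABLE_TO_LATIN` is a tiny stand-in for the UTS #39 confusables data.

```python
# Tiny illustrative confusable table (an assumption, not the UTS #39 data).
CONFUSABLE_TO_LATIN = {
    "\u0430": "a", "\u0441": "c", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0445": "x", "\u0443": "y", "\u04cf": "l",
}
BRANDS = {"apple", "paypal"}  # hypothetical protected labels


def decode_label(label):
    """Decode an xn-- label to Unicode; return it unchanged if malformed."""
    if not label.startswith("xn--"):
        return label
    try:
        return label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        return label


def find_homographs(hostnames):
    """Return (hostname, decoded label, matched brand) for confusable IDN labels."""
    hits = []
    for host in hostnames:
        for label in host.lower().split("."):
            if not label.startswith("xn--"):
                continue  # only IDN labels are of interest here
            decoded = decode_label(label)
            skel = "".join(CONFUSABLE_TO_LATIN.get(ch, ch) for ch in decoded)
            if skel in BRANDS and decoded != skel:
                hits.append((host, decoded, skel))
    return hits


# Made-up sample input; the third entry is an ordinary Cyrillic IDN.
sample = ["www.apple.com", "xn--80ak6aa92e.com", "xn--e1afmkfd.xn--p1ai", "example.org"]
for host, decoded, brand in find_homographs(sample):
    print(f"{host} decodes to a lookalike of {brand!r}")
```

Hits from a matcher like this are only candidates; the enrichment and triage step still has to separate live threats from parked or defensive registrations.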

IDN homograph permutations overlap with broader homoglyph and lookalike domain categories. Effective domain monitoring treats them as one layer in a pipeline that also covers typosquatting, combosquatting, TLD squatting, and other permutation techniques.

Have I Been Squatted includes IDN homograph permutations in its domain permutation analysis, decoding Punycode labels from Certificate Transparency logs and applying confusable-character normalization to surface IDN-based impersonation alongside standard typosquatting variants.
