What is Punycode?
Punycode is the ASCII-Compatible Encoding that allows Unicode characters in domain names. This guide explains how the Bootstring algorithm works, the xn-- prefix convention, browser display policies, and how attackers abuse Punycode to create deceptive lookalike domains.
What it is
Punycode is a transfer encoding syntax that converts Unicode strings into the ASCII character set permitted by DNS. The Domain Name System restricts labels to a-z, 0-9, and hyphens, so any domain containing characters outside that range needs a representation that fits within those constraints. Punycode provides that representation, and it is the mechanism that makes internationalized domain names (IDNs) function.
A domain label containing non-ASCII characters is converted to Punycode and prefixed with xn--, the ACE prefix (ASCII-Compatible Encoding). The domain münchen.de becomes xn--mnchen-3ya.de in DNS. Browsers and applications perform this conversion transparently, so users see the Unicode form while DNS infrastructure resolves the ASCII form.
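The conversion described above can be reproduced with Python's built-in `idna` codec. This is a minimal sketch: the helper names `to_ascii` and `to_unicode` are illustrative, and the standard-library codec follows the older IDNA 2003 rules, so its handling of some characters may differ from what current browsers do.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain to its ASCII form, label by label."""
    # The codec encodes each non-ASCII label and adds the xn-- prefix.
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (xn--) domain back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


print(to_ascii("münchen.de"))           # xn--mnchen-3ya.de
print(to_unicode("xn--mnchen-3ya.de"))  # münchen.de
print(to_ascii("example.com"))          # example.com (ASCII labels are unchanged)
```

Only the label containing ü changes; the `de` label is already ASCII and passes through untouched.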
Punycode operates within a broader standards stack called IDNA (Internationalizing Domain Names in Applications). IDNA governs how applications convert Unicode labels to their ASCII form and back, with Punycode handling the encoding step itself.
Punycode is not an attack technique. It is legitimate infrastructure. But the gap between what humans read in a decoded domain and what DNS actually resolves creates a surface that attackers exploit through IDN homograph attacks.
How the encoding works
Punycode implements a general algorithm called Bootstring. The algorithm encodes Unicode code points as a compact ASCII sequence through four steps:
- Basic code point segregation. All ASCII characters in the label are copied to the output first. For café, the output begins with caf.
- Delta encoding. The positions and values of non-ASCII characters are expressed as numeric deltas, capturing each character's code point and its insertion position relative to the previous one.
- Variable-length integer encoding. Each delta is converted into a base-36 integer using the characters a-z (values 0-25) and 0-9 (values 26-35). A variable-length scheme allows arbitrarily large code points to be represented.
- Bias adaptation. A dynamic bias adjusts the encoding thresholds after each character, reducing output length by adapting to patterns in the input.
A hyphen delimiter separates the literal ASCII portion from the encoded non-ASCII portion. The domain café.com becomes xn--caf-dma.com. The ASCII characters caf appear before the delimiter, and dma encodes the position and value of é. Labels with no ASCII characters at all produce output where everything after xn-- is encoded content, which is why a fully Cyrillic label can produce a string like xn--80ak6aa92e that bears no resemblance to the decoded Unicode.
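The four steps can be sketched in a few dozen lines of Python. This is an illustrative encoder, not a production implementation: the constants are the Punycode parameters from RFC 3492, the function names are my own, and it omits the overflow checks the RFC requires in languages with fixed-width integers. It encodes a single label and does not add the xn-- prefix.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"  # values 0-25, then 26-35


def adapt(delta: int, numpoints: int, first: bool) -> int:
    """Bias adaptation: recompute the bias after each encoded delta."""
    delta = delta // DAMP if first else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def punycode_encode(label: str) -> str:
    # Step 1: copy the basic (ASCII) code points, then the delimiter.
    basic = [c for c in label if ord(c) < 128]
    output = basic.copy()
    if basic:
        output.append("-")

    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    handled = len(basic)
    while handled < len(label):
        # Step 2: advance to the next-smallest unhandled code point.
        m = min(ord(c) for c in label if ord(c) >= n)
        delta += (m - n) * (handled + 1)
        n = m
        for c in label:
            cp = ord(c)
            if cp < n:
                delta += 1
            elif cp == n:
                # Step 3: emit delta as a variable-length base-36 integer.
                q, k = delta, BASE
                while True:
                    t = TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias
                    if q < t:
                        break
                    output.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                output.append(DIGITS[q])
                # Step 4: adapt the bias for the next delta.
                bias = adapt(delta, handled + 1, handled == len(basic))
                delta = 0
                handled += 1
        delta += 1
        n += 1
    return "".join(output)


print(punycode_encode("café"))     # caf-dma
print(punycode_encode("münchen"))  # mnchen-3ya
```

Python's standard library ships its own `punycode` codec, so the sketch can be checked against it with `"café".encode("punycode")`.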
The algorithm is deterministic; the same input always produces identical output, and decoding always recovers the exact original Unicode string. This reversibility is essential for DNS, where every resolver and application must agree on the mapping between a Unicode label and its wire-format representation.
The apple.com demonstration
In 2017, security researcher Xudong Zheng disclosed a proof-of-concept that exposed a fundamental weakness in how browsers handled Punycode display. He registered the domain xn--80ak6aa92e.com, which decoded to a string composed entirely of Cyrillic characters visually identical to the Latin letters in apple.com. When visited in Chrome, Firefox, or Opera, the address bar displayed what appeared to be apple.com, with no visible indication that the domain was a Punycode-encoded Cyrillic string.
The attack succeeded because every character came from a single script. Browsers at the time checked for mixed-script labels (combining, say, Latin and Cyrillic characters in one label) and displayed Punycode when mixing was detected. An all-Cyrillic label bypassed that check entirely. Cyrillic а (U+0430) is a pixel-perfect match for Latin a (U+0061), Cyrillic р (U+0440) matches Latin p, and so on. See homoglyphs for the character pairs that enable this class of deception.
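Decoding the label and listing its code points makes the deception visible. The snippet below is a small sketch using Python's standard `punycode` codec and `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

label = "xn--80ak6aa92e"
# Strip the ACE prefix, then decode the remainder as Punycode.
decoded = label[len("xn--"):].encode("ascii").decode("punycode")

# Print each character's code point and official Unicode name.
for ch in decoded:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The decoded string renders like "apple" but is a different string.
print(decoded == "apple")  # False
```

Every name printed begins with CYRILLIC, which is why a check that only looks for mixed scripts within a label finds nothing to flag.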
After disclosure, Google expanded Chrome's IDN display policy to flag whole-script lookalike domains. Firefox and Opera initially declined to implement equivalent restrictions, with Mozilla arguing that displaying Punycode for non-Latin scripts risked making those scripts "second-class citizens" on the web. The disagreement illustrated a genuine tension. Aggressive Punycode display protects Latin-script users from homograph attacks but degrades the browsing experience for the billions of internet users who read Cyrillic, Greek, Arabic, or CJK scripts natively.
Browser display policies
Modern browsers apply layered heuristics to decide whether to show a domain's Unicode form or its raw xn-- Punycode string:
- Mixed-script detection. If a single label combines characters from multiple scripts (Latin and Cyrillic, for example), the browser displays the Punycode form.
- Whole-script confusable blocking. Even single-script labels are shown as Punycode if every character has a Latin lookalike. This rule catches all-Cyrillic or all-Greek strings that mimic Latin words.
- ICU skeleton checks. Chrome applies Unicode confusable mappings (from Unicode Technical Standard #39) to detect whether a label's "skeleton" matches a known brand or common word.
- TLD whitelisting. Firefox maintains a list of trusted TLDs where Unicode display is permitted unconditionally, combined with a "Moderately Restrictive" profile for all other TLDs.
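The first two heuristics in the list above can be approximated in a few lines. This is a rough sketch, not how any browser implements its policy: Python's standard library does not expose the Unicode Script property, so the script is guessed from the first word of each character's Unicode name, and `CYRILLIC_LATIN_LOOKALIKES` is a hypothetical, deliberately tiny sample rather than a real confusables table.

```python
import unicodedata


def scripts(label: str) -> set[str]:
    """Approximate the scripts in a label via Unicode character names."""
    found = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue  # digits and hyphens are shared across scripts
        found.add(unicodedata.name(ch).split()[0])  # e.g. "LATIN", "CYRILLIC"
    return found


# Hypothetical sample: Cyrillic letters that resemble a, c, e, o, p, x, y, i, j, s, l.
CYRILLIC_LATIN_LOOKALIKES = set(
    "\u0430\u0441\u0435\u043e\u0440\u0445\u0443\u0456\u0458\u0455\u04cf"
)


def display_form(label: str) -> str:
    """Return 'punycode' or 'unicode' for a decoded label."""
    found = scripts(label)
    if len(found) > 1:
        return "punycode"  # mixed-script label
    letters = [ch for ch in label if ch.isalpha()]
    if found == {"CYRILLIC"} and all(ch in CYRILLIC_LATIN_LOOKALIKES for ch in letters):
        return "punycode"  # whole-script confusable
    return "unicode"


print(display_form("\u0430pple"))                          # Cyrillic a + Latin "pple"
print(display_form("\u0430\u0440\u0440\u04cf\u0435"))      # all-Cyrillic lookalike of "apple"
print(display_form("\u043f\u0440\u0438\u043c\u0435\u0440"))  # ordinary Cyrillic word
```

The third label stays in Unicode because it contains letters with no Latin lookalike, which is the behavior that keeps legitimate non-Latin domains readable.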
These defenses are not comprehensive. Security research testing over 9,000 cases across Chrome, Firefox, Safari, Edge, and mobile browsers has found exploitable gaps in every browser tested, with some browsers reversing their own rules over time, re-allowing certain homograph IDNs that were previously blocked.
Safari takes one of the stricter approaches, displaying Punycode in the address bar for a broader set of suspect labels than Chrome or Firefox. The variation between browsers means a domain that appears as Punycode in one browser may render as convincing Unicode in another, complicating user awareness.
Email clients, messaging platforms, and mobile apps often lack equivalent protections entirely. A Punycode domain that browsers flag correctly may still appear as a convincing Unicode string in a phishing email or a chat message, making non-browser contexts the primary remaining attack surface.
Security implications beyond browsers
The security concern with Punycode is not the encoding itself but the mismatch between the human-readable decoded form and the machine-resolved ASCII form. This mismatch enables several threat categories:
- IDN homograph attacks. Registering domains with visually confusable characters from scripts like Cyrillic, Greek, or Armenian to impersonate Latin-script brands.
- Brand impersonation. Combining Punycode domains with valid TLS certificates to create convincing credential-harvesting pages.
- Phishing domain detection evasion. ASCII-only pattern matching against threat feeds misses xn-- encoded domains unless the monitoring system decodes them first.
Attackers can obtain certificates for xn-- domains through standard certificate authorities, since the Punycode form is a syntactically valid domain name. Certificate Transparency logs record these issuances, but the entries appear in their xn-- encoded form, requiring decoding to identify visual impersonation. Registry-level restrictions help reduce exposure (many ccTLDs and gTLDs limit labels to a single script or require characters from an approved set), but enforcement varies widely, and generic TLDs impose fewer constraints.
Punycode in domain monitoring
Security teams monitoring Certificate Transparency logs, WHOIS records, or RDAP feeds encounter xn-- domains regularly. Effective domain monitoring decodes Punycode labels and applies confusable-character analysis rather than relying on ASCII string matching alone. A domain that looks like random characters in its encoded form may decode to a near-perfect visual replica of a protected brand name.
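A decoding step of this kind might look like the following sketch. The function name `decode_domain` and the sample list are illustrative, not taken from any particular monitoring product; the decoding itself uses Python's standard `punycode` codec.

```python
def decode_domain(domain: str) -> str:
    """Decode every xn-- label in a domain, leaving other labels as they are."""
    labels = []
    for label in domain.lower().split("."):
        if label.startswith("xn--"):
            try:
                label = label[4:].encode("ascii").decode("punycode")
            except UnicodeError:
                pass  # keep malformed labels in their encoded form
        labels.append(label)
    return ".".join(labels)


# Hypothetical names as they might appear in a log or feed.
observed = ["xn--caf-dma.com", "example.org", "xn--mnchen-3ya.de"]
for name in observed:
    print(name, "->", decode_domain(name))
```

Decoding per label matters because a single domain can mix encoded and plain labels, for example an xn-- second-level label under an ASCII TLD.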
Automated detection should cross-reference decoded labels against known homoglyph mappings and Unicode confusable tables to flag domains warranting investigation. This approach catches threats that conventional typosquatting permutation methods, which operate on ASCII characters, inherently miss. The Zheng demonstration proved that a single xn-- domain can be a high-fidelity replica of a globally recognized brand; systematic decoding and confusable analysis are the only reliable way to surface these registrations before they are weaponized.
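As a simplified illustration of that cross-referencing, the sketch below reduces each label to a "skeleton" and compares it with the skeletons of protected names. The `CONFUSABLES` dictionary is a hypothetical excerpt covering a handful of characters; a real system would load the full confusables data published with Unicode Technical Standard #39, and the function names here are my own.

```python
import unicodedata

# Hypothetical excerpt of a confusables table: lookalike -> Latin prototype.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
    "\u04cf": "l",  # Cyrillic palochka
    "\u03bf": "o",  # Greek omicron
}


def skeleton(label: str) -> str:
    """Map each character to its prototype after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", label)
    return "".join(CONFUSABLES.get(ch, ch) for ch in decomposed)


def impersonated_names(label: str, protected: list[str]) -> list[str]:
    """Return the protected names that a label visually imitates."""
    decoded = label.lower()
    if decoded.startswith("xn--"):
        decoded = decoded[4:].encode("ascii").decode("punycode")
    target = skeleton(decoded)
    # An exact match is the genuine name, not an imitation, so exclude it.
    return [name for name in protected if name != decoded and skeleton(name) == target]


print(impersonated_names("xn--80ak6aa92e", ["apple", "example"]))  # ['apple']
print(impersonated_names("apple", ["apple", "example"]))           # []
```

Comparing skeletons rather than raw strings is what lets a Cyrillic label match a Latin brand name that ASCII permutation lists would never generate.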
Have I Been Squatted decodes xn-- labels across Certificate Transparency logs and registration data, applies confusable-character analysis, and flags IDN domains that visually impersonate monitored names. This detection runs alongside ASCII-based permutation checks for lookalike domains, omission, transposition, and other squatting categories, covering both Latin-character and Unicode-based brand impersonation in a single monitoring pipeline.