What are internationalized domain names (IDNs)?
Internationalized Domain Names (IDNs) allow domain labels to contain non-ASCII characters, serving billions of users worldwide. This guide explains how IDNs work through the IDNA standard and Punycode encoding, and why they raise security concerns around homograph attacks.
What they are
Internationalized Domain Names (IDNs) are domain names containing characters outside the traditional ASCII set, including letters from Cyrillic, Arabic, Chinese, Devanagari, and hundreds of other scripts defined in Unicode. The original DNS protocol restricted domain labels to the 37 characters of LDH (letters, digits, hyphen), locking out most of the world's writing systems. IDNs remove that barrier. A user in Moscow can type пример.рф, a user in Tokyo can type 例え.jp, and the domain system resolves both without requiring Latin transliteration.
IDNs are legitimate infrastructure, not an attack technique. They exist because the internet serves billions of people whose languages do not use the Latin alphabet, and a naming system limited to a-z is exclusionary by design. The security considerations they introduce (discussed below) are a side effect of that inclusivity, not its purpose.
The IDNA standard and its evolution
The mechanism that makes IDNs possible is IDNA (Internationalized Domain Names in Applications). The first version, IDNA2003, relied on a preprocessing step called Nameprep, which used general-purpose Unicode normalization (NFKC) and case-folding to map input labels before encoding. That approach proved fragile. It mapped characters in ways that did not always align with linguistic expectations, and it could not adapt to new Unicode versions without risking backward-incompatible changes.
The current version, IDNA2008, replaced IDNA2003. The revision dropped Nameprep entirely and introduced a purpose-built character-property table that classifies every Unicode code point as PVALID, DISALLOWED, CONTEXTJ, CONTEXTO, or UNASSIGNED. Applications derive character validity algorithmically from Unicode properties rather than from a static mapping table, allowing the standard to accommodate new Unicode versions without protocol revisions.
The two generations of the standard disagree on certain characters. The German sharp-s (ß) and the Greek final sigma (ς) were mapped away by IDNA2003's normalization step (ß became ss, ς became σ) but are valid characters in labels under IDNA2008. The same visible name can therefore resolve to two different A-labels depending on which version the client implements. To bridge this gap, the Unicode Consortium published UTS #46 (Unicode IDNA Compatibility Processing), a compatibility layer that lets browsers and other client software handle domains registered under either version. Each new Unicode release triggers a review of UTS #46 to identify and resolve emerging incompatibilities.
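The divergence can be observed from Python's standard library, whose built-in `idna` codec implements the older IDNA2003 rules. The sketch below contrasts it with a raw Punycode encoding of the same label, which is the form an IDNA2008 processor would be expected to produce once ß has passed validation. Validation itself is skipped here, so this is an illustration, not a conforming IDNA2008 implementation.

```python
label = "straße"

# IDNA2003 (stdlib "idna" codec): Nameprep maps ß to "ss" before encoding,
# so the result is a plain ASCII label with no ACE prefix.
idna2003 = label.encode("idna").decode("ascii")

# IDNA2008 treats ß as a valid character, so the label is Punycode-encoded as-is.
# Assumption: validation and context checks are skipped; only encoding is shown.
idna2008 = "xn--" + label.encode("punycode").decode("ascii")

print(idna2003)  # strasse
print(idna2008)  # xn--strae-oqa
```

The two outputs are different DNS names, which is why a compatibility layer is needed for clients that must handle both generations.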
How encoding works
When a user enters a domain containing non-ASCII characters (such as münchen.de with a German umlaut), the application performs a sequence of steps defined by IDNA2008:
- Validate each label against the IDNA permitted-character table, rejecting code points that are disallowed or unassigned
- Check context rules for characters that are valid only in specific positions, such as zero-width joiners in Arabic or Indic scripts
- Encode each non-ASCII label into Punycode and prepend the xn-- ACE prefix
- Resolve the resulting ASCII-only string through standard DNS
The two representations have formal names. The Unicode form visible to users is the U-label; the Punycode-encoded form that DNS processes is the A-label. For пример.рф, the A-label is xn--e1afmkfd.xn--p1ai. The user never needs to see the A-label under normal conditions, but it is the form stored in zone files, WHOIS records, and Certificate Transparency logs.
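The conversion between U-labels and A-labels can be sketched with Python's built-in `punycode` codec. The helper names below are illustrative, and the sketch deliberately omits the validation and context-rule steps, so it shows only the encoding step and the ACE prefix handling.

```python
import unicodedata

ACE_PREFIX = "xn--"

def to_a_label(label: str) -> str:
    """Encode one U-label as an A-label (validation and context checks omitted)."""
    # Assumption: NFC normalization plus lowercasing stands in for the
    # application-level input mapping; real software may map differently.
    label = unicodedata.normalize("NFC", label).lower()
    if label.isascii():
        return label  # ASCII labels pass through unchanged
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

def to_u_label(label: str) -> str:
    """Decode one A-label back to its U-label; other labels pass through."""
    lowered = label.lower()
    if lowered.startswith(ACE_PREFIX):
        return lowered[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label

def domain_to_ascii(domain: str) -> str:
    return ".".join(to_a_label(part) for part in domain.split("."))

def domain_to_unicode(domain: str) -> str:
    return ".".join(to_u_label(part) for part in domain.split("."))

print(domain_to_ascii("münchen.de"))           # xn--mnchen-3ya.de
print(domain_to_unicode("xn--mnchen-3ya.de"))  # münchen.de
```

Each label is processed independently, which is why a domain can mix ASCII labels and A-labels freely.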
A critical design property is that the DNS, which only processes ASCII labels, is never modified. IDNA operates entirely at the application layer. Resolvers, authoritative servers, and caching infrastructure all see ordinary ASCII A-labels. This kept the deployment path simple but shifted all Unicode-handling responsibility to client software, a fact with significant implications for security consistency across applications.
Adoption
IDN adoption is substantial and growing. ICANN reports over 150 TLDs delegated as IDNs, representing dozens of languages across more than 20 scripts. Several TLDs are themselves entirely non-Latin: .рф (Russia), .中国 (China), .भारत (India), .السعودية (Saudi Arabia), and dozens more. gTLD registries hold well over a million IDN registrations, with Chinese script accounting for nearly half the total and Latin script (including accented characters) at roughly 28%.
Registry policies vary widely. Some TLD operators accept broad Unicode character sets; others restrict labels to characters from a single script or a single language. ICANN's Label Generation Rules (LGR) project defines which code points are permitted in each script for root-zone labels, covering 27 scripts across multiple versions. At the second level, registries publish IDN tables specifying their own permitted repertoires and must deposit those tables in the IANA Repository for IDN Practices.
The Universal Acceptance Steering Group (UASG) works to ensure that software handles IDNs correctly, though compatibility gaps persist in practice:
- Email. Many mail transfer agents and validation libraries reject non-ASCII local parts or domain labels, even though later standards added SMTP support for internationalized email
- Enterprise software. Internal tools, CRMs, and ticketing systems often truncate or reject domain names containing xn-- prefixes or non-Latin characters
- Programming libraries. URL parsers in older language standard libraries may fail on IDN input unless explicitly configured for IDNA-aware processing
These gaps limit the practical reach of IDN infrastructure and create inconsistencies in how internationalized domains are handled across the software ecosystem.
Security implications
IDNs introduce a class of abuse known as IDN homograph attacks. Unicode contains thousands of characters that are visually identical or near-identical across scripts, known as homoglyphs. Latin a (U+0061) and Cyrillic а (U+0430) are indistinguishable in most typefaces. An attacker can register an IDN composed entirely of Cyrillic look-alikes that renders identically to a Latin-script brand domain, producing a lookalike domain that casual inspection cannot tell apart from the original.
The classic demonstration: аррӏе.com, spelled entirely with Cyrillic characters (а, р, е, and the palochka ӏ standing in for a lowercase l), displays as apple.com in fonts that render these glyphs identically. In Punycode, the domain is encoded as xn--80ak6aa92e.com, a string that bears no resemblance to the name it imitates. The disconnect between visual presentation and underlying DNS identity is the core of the threat.
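One way to see past the rendering is to decode the A-label and list the code points behind it. The sketch below uses only Python's standard library; `describe_label` is a hypothetical helper name introduced for illustration.

```python
import unicodedata

def describe_label(label: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode character name) pairs for a DNS label."""
    lowered = label.lower()
    if lowered.startswith("xn--"):
        # Strip the ACE prefix and decode the Punycode payload
        lowered = lowered[len("xn--"):].encode("ascii").decode("punycode")
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in lowered]

for code_point, name in describe_label("xn--80ak6aa92e"):
    print(code_point, name)
```

Every character name printed for this label identifies a non-Latin letter, even though the rendered string reads as a familiar Latin-script brand.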
The risk is not theoretical. ICANN analysis of IDN abuse has found that phishing accounts for nearly 99% of reported IDN security threats. Latin and Chinese scripts comprise the majority of reported problematic registrations. That said, IDNs have a significantly lower overall abuse ratio than ASCII domains (by a factor of roughly 5 to 1), indicating that the vast majority of IDN registrations serve legitimate purposes.
Defenses at the browser and registry level
Modern browsers apply display policies that show the Punycode xn-- form instead of decoded Unicode when a label triggers suspicion. Chrome uses a multi-layered checker. It flags mixed-script labels, whole-script confusable domains, and labels whose ICU skeletons match known top domains. Firefox applies a similar approach, validating labels against allowed script combinations (such as Han + Hiragana + Katakana for Japanese) and falling back to Punycode when confusable detection triggers. Safari takes a stricter stance, displaying Punycode for any label that mixes scripts.
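The simplest of these heuristics, the mixed-script check, can be sketched as follows. This is a rough approximation and not any browser's actual algorithm: it assumes the first word of a character's Unicode name is a usable stand-in for its script, whereas real implementations rely on proper Unicode script properties. The allowed-combination list is a hypothetical single entry for Japanese.

```python
import unicodedata

# Assumption: one illustrative allowed combination (Han + Hiragana + Katakana).
ALLOWED_COMBINATIONS = [{"CJK", "HIRAGANA", "KATAKANA"}]

def rough_scripts(label: str) -> set[str]:
    """Approximate the set of scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue  # digits and hyphens are shared across scripts
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def should_show_punycode(label: str) -> bool:
    """Flag labels that mix scripts outside an allowed combination."""
    scripts = rough_scripts(label)
    if len(scripts) <= 1:
        return False
    return not any(scripts <= allowed for allowed in ALLOWED_COMBINATIONS)

print(should_show_punycode("p\u0430ypal"))  # True: Latin mixed with Cyrillic а
print(should_show_punycode("例え"))          # False: allowed Japanese combination
print(should_show_punycode("example"))      # False: single script
```

A label written entirely in one non-Latin script passes this check, which is why confusable detection is layered on top of it.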
At the registry level, ICANN's IDN Implementation Guidelines require operators to publish code-point repertoires and prohibit registrations containing unlisted characters. Many registries enforce single-script policies, blocking labels that combine characters from multiple writing systems. The Maximal Starting Repertoire (MSR) framework addresses cross-script homoglyphs, script-internal homoglyphs, and ASCII lookalikes as distinct categories of confusability risk.
These layered defenses reduce but do not eliminate the threat. Not all applications apply the same display heuristics, and contexts such as email headers, messaging apps, and mobile notification bars may render IDNs in their decoded Unicode form without any confusable warning. Whole-script homograph domains (where every character belongs to a single non-Latin script) are particularly difficult to catch at the browser level, because single-script labels are generally considered legitimate. The defense gap is widest in non-browser contexts: QR code scanners, command-line tools, PDF viewers, and API clients rarely implement any confusable-character checking at all.
Monitoring IDNs for brand protection
For domain monitoring and phishing domain detection teams, IDNs expand the attack surface that must be watched. A brand name written in Latin characters may have confusable equivalents in Cyrillic, Greek, Armenian, or other scripts that produce visually identical IDN registrations. Standard ASCII string matching misses these variants entirely, and simple substring filters on zone-file data will not detect a Punycode A-label that decodes to a confusable U-label.
Effective detection requires decoding Punycode labels from CT logs and RDAP data, then applying Unicode confusable-character analysis to identify brand impersonation attempts. The confusable-character mappings maintained by the Unicode Consortium provide the reference dataset, but operationalizing them at scale demands automated permutation generation and continuous scanning.
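A minimal sketch of that decode-then-compare step is shown below. The `CONFUSABLES` table is a tiny hand-picked subset written for illustration, not the Unicode Consortium's dataset, and the function names are hypothetical. A production system would load the full published mappings instead.

```python
# Hypothetical, hand-picked subset of confusable mappings (illustration only).
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
    "\u04cf": "l",  # Cyrillic palochka
    "\u03bf": "o",  # Greek omicron
}

def decode_label(label: str) -> str:
    """Decode an A-label to its U-label; plain ASCII labels pass through."""
    lowered = label.lower()
    if lowered.startswith("xn--"):
        return lowered[len("xn--"):].encode("ascii").decode("punycode")
    return lowered

def skeleton(label: str) -> str:
    """Map each character to its confusable prototype, if one is listed."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in label.lower())

def impersonates(label: str, brand: str) -> bool:
    """True if the label differs from the brand but shares its skeleton."""
    u_label = decode_label(label)
    return u_label != brand and skeleton(u_label) == skeleton(brand)

print(impersonates("xn--80ak6aa92e", "apple"))  # True
print(impersonates("apple", "apple"))           # False
```

Comparing skeletons rather than raw strings means one check covers every combination of substituted characters, at the cost of depending on how complete the mapping table is.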
Useful monitoring signals include:
- Certificate Transparency logs. IDN domains that obtain TLS certificates are preparing to serve HTTPS content. Decoding Punycode labels in CT log entries and comparing the resulting U-labels against confusable mappings surfaces impersonation attempts before they reach users.
- WHOIS and RDAP registration data. New IDN registrations whose decoded labels match confusable permutations of a monitored brand indicate potential abuse, regardless of whether the domain has been activated.
- Passive DNS resolution. Active resolution of an IDN homograph domain confirms that it is in use and likely serving content or receiving traffic.
- HTTP banner analysis. Checking the content served by a suspected homograph domain can distinguish parked pages from active phishing infrastructure.
Have I Been Squatted generates IDN homograph permutations for monitored domains and checks them against certificate issuance, registration data, and DNS resolution. Because homograph variants are structurally predictable, they can be enumerated and tracked alongside other lookalike domain categories such as omission, transposition, and keyword squatting, providing continuous visibility into a threat class that manual inspection cannot reliably catch.
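The enumeration idea can be sketched in a few lines. This is not the implementation of any particular product: `LOOKALIKES` is a hypothetical hand-picked table, and the generator simply takes the Cartesian product of each character's alternatives.

```python
from itertools import product

# Hypothetical subset of look-alike substitutions (illustration only).
LOOKALIKES = {
    "a": ["\u0430"],            # Cyrillic a
    "c": ["\u0441"],            # Cyrillic es
    "e": ["\u0435"],            # Cyrillic ie
    "o": ["\u043e", "\u03bf"],  # Cyrillic o, Greek omicron
    "p": ["\u0440"],            # Cyrillic er
}

def homograph_permutations(label: str):
    """Yield every variant of label with one or more characters substituted."""
    choices = [[ch] + LOOKALIKES.get(ch, []) for ch in label]
    for combo in product(*choices):
        candidate = "".join(combo)
        if candidate != label:
            yield candidate

def to_a_label(label: str) -> str:
    """Punycode-encode a label so it can be matched against A-labels."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

variants = list(homograph_permutations("ace"))
print(len(variants))  # 7
watchlist = {to_a_label(v) for v in variants}
```

The count grows multiplicatively with label length and table size, and many generated variants mix scripts in ways registries may refuse to register, so real pipelines typically filter or rank the output before monitoring it.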