Levenshtein distance (edit distance for security) - Threat intelligence

What is Levenshtein distance?#

Levenshtein distance (also called edit distance) is the minimum number of single-character edits required to transform one string into another. The three permitted operations are insertion, deletion, and substitution. It was introduced by Vladimir Levenshtein in 1965 and is one of the most widely used string similarity metrics in computer science, and a foundational tool in typosquatting detection.

Worked example#

Consider transforming "kitten" into "sitting":

kitten → sitten (substitute k → s)
sitten → sittin (substitute e → i)
sittin → sitting (insert g)

The Levenshtein distance is 3; three edits are required, and no shorter sequence of edits exists.

For domain names, the distance between google.com and gogle.com is 1 (one deletion), while google.com and g00gle.com is 2 (two substitutions).

How it works#

The standard algorithm builds a matrix of size (n+1) × (m+1) where n and m are the lengths of the two strings. Each cell represents the minimum edits needed to transform a prefix of one string into a prefix of the other. The algorithm fills the matrix row by row, and the final cell gives the distance. Time and space complexity are both O(n × m), which is efficient enough for comparing domain names (typically under 63 characters) but can be expensive for very long strings.

Use in domain similarity scoring#

In typosquatting permutation detection, Levenshtein distance quantifies how close a candidate domain is to a target. A distance of 1 typically indicates a single typo, omission, insertion, or substitution, making the candidate highly suspicious. Detection systems often filter candidates by edit distance thresholds. Domains within distance 1-2 of a monitored brand get the highest scrutiny.

Some implementations normalize the distance by string length (e.g., distance / max(n, m)) to compare domains of different lengths on a consistent scale.

Limitations#

Levenshtein distance treats all characters equally and operates on code points, not visual appearance. Two critical blind spots in domain security:

Homoglyphs may have low edit distance but high visual impact, or the reverse. Replacing Latin a with Cyrillic а is an edit distance of 1 and visually undetectable. But replacing rn with m (which looks similar in many fonts) is actually a deletion and a substitution, producing a distance of 2 despite the strong visual similarity.

Keyboard proximity is ignored. Substituting a with s (adjacent keys) is more likely as a typo than substituting a with x, but both count as one edit.

Complementary measures#

Because of these gaps, production systems combine Levenshtein distance with other metrics. Jaro-Winkler similarity gives extra weight to matching prefixes, which aligns with how users read domain names left to right. Visual similarity scoring compares rendered glyphs rather than character codes, catching homoglyph attacks that edit distance misses. Using multiple metrics together, as part of a broader phishing domain detection pipeline, produces more robust similarity assessments than any single measure.

Levenshtein distance remains a foundational building block for scoring and prioritizing lookalike domains, but it must be paired with visual and contextual signals to cover the cases where edit distance alone falls short.

What is Levenshtein distance?