data:image/s3,"s3://crabby-images/a2743/a2743fed0a7f4e5bed2490d3c97575df5f49e748" alt="Edit distance"
In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.
Different definitions of an edit distance use different sets of like operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.
Types of edit distance
Different types of edit distance allow different sets of string operations. For instance:
Algorithm | Operations Allowed | |||
---|---|---|---|---|
Insertions | Deletions | Substitutions | Transposition | |
Levenshtein Distance | ✓ | ✓ | ✓ | |
Longest Common Subsequence (LCS) | ✓ | ✓ | ||
Hamming Distance | ✓ | |||
Damerau–Levenshtein Distance | ✓ | ✓ | ✓ | ✓ |
Jaro distance | ✓ |
Some edit distances are defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.
Formal definition and properties
Given two strings a and b on an alphabet Σ (e.g. the set of ASCII characters, the set of bytes [0..255], etc.), the edit distance d(a, b) is the minimum-weight series of edit operations that transforms a into b. One of the simplest sets of edit operations is that defined by Levenshtein in 1966:
- Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted ε→x, using ε to denote the empty string.
- Deletion of a single symbol changes uxv to uv (x→ε).
- Substitution of a single symbol x for a symbol y ≠ x changes uxv to uyv (x→y).
In Levenshtein's original definition, each of these operations has unit cost (except that substitution of a character by itself has zero cost), so the Levenshtein distance is equal to the minimum number of operations required to transform a to b. A more general definition associates non-negative weight functions wins(x), wdel(x) and wsub(x, y) with the operations.
Additional primitive operations have been suggested. Damerau–Levenshtein distance counts as a single edit a common mistake: transposition of two adjacent characters, formally characterized by an operation that changes uxyv into uyxv. For the task of correcting OCR output, merge and split operations have been used which replace a single character into a pair of them or vice versa.
Other variants of edit distance are obtained by restricting the set of operations. Longest common subsequence (LCS) distance is edit distance with insertion and deletion as the only two edit operations, both at unit cost.: 37 Similarly, by only allowing substitutions (again at unit cost), Hamming distance is obtained; this must be restricted to equal-length strings.Jaro–Winkler distance can be obtained from an edit distance where only transpositions are allowed.
Example
The Levenshtein distance between "kitten" and "sitting" is 3. A minimal edit script that transforms the former into the latter is:
- kitten → sitten (substitute "s" for "k")
- sitten → sittin (substitute "i" for "e")
- sittin → sitting (insert "g" at the end)
LCS distance (insertions and deletions only) gives a different distance and minimal edit script:
- kitten → itten (delete "k" at 0)
- itten → sitten (insert "s" at 0)
- sitten → sittn (delete "e" at 4)
- sittn → sittin (insert "i" at 4)
- sittin → sitting (insert "g" at 6)
for a total cost/distance of 5 operations.
Properties
Edit distance with non-negative cost satisfies the axioms of a metric, giving rise to a metric space of strings, when the following conditions are met:: 37
- Every edit operation has positive cost;
- for every operation, there is an inverse operation with equal cost.
With these properties, the metric axioms are satisfied as follows:
- d(a, b) = 0 if and only if a=b, since each string can be trivially transformed to itself using exactly zero operations.
- d(a, b) > 0 when a ≠ b, since this would require at least one operation at non-zero cost.
- d(a, b) = d(b, a) by equality of the cost of each operation and its inverse.
- Triangle inequality: d(a, c) ≤ d(a, b) + d(b, c).
Levenshtein distance and LCS distance with unit cost satisfy the above conditions, and therefore the metric axioms. Variants of edit distance that are not proper metrics have also been considered in the literature.
Other useful properties of unit-cost edit distances include:
- LCS distance is bounded above by the sum of lengths of a pair of strings.: 37
- LCS distance is an upper bound on Levenshtein distance.
- For strings of the same length, Hamming distance is an upper bound on Levenshtein distance.
Regardless of cost/weights, the following property holds of all edit distances:
- When a and b share a common prefix, this prefix has no effect on the distance. Formally, when a = uv and b = uw, then d(a, b) = d(v, w). This allows speeding up many computations involving edit distance and edit scripts, since common prefixes and suffixes can be skipped in linear time.
Computation
The first algorithm for computing minimum edit distance between a pair of strings was published by Damerau in 1964.
Common algorithm
Using Levenshtein's original operations, the (nonsymmetric) edit distance from to
is given by
, defined by the recurrence
This algorithm can be generalized to handle transpositions by adding another term in the recursive clause's minimization.
The straightforward, recursive way of evaluating this recurrence takes exponential time. Therefore, it is usually computed using a dynamic programming algorithm that is commonly credited to Wagner and Fischer, although it has a history of multiple invention. After completion of the Wagner–Fischer algorithm, a minimal sequence of edit operations can be read off as a backtrace of the operations used during the dynamic programming algorithm starting at .
This algorithm has a time complexity of Θ(mn) where m and n are the lengths of the strings. When the full dynamic programming table is constructed, its space complexity is also Θ(mn); this can be improved to Θ(min(m,n)) by observing that at any instant, the algorithm only requires two rows (or two columns) in memory. However, this optimization makes it impossible to read off the minimal series of edit operations. A linear-space solution to this problem is offered by Hirschberg's algorithm.: 634 A general recursive divide-and-conquer framework for solving such recurrences and extracting an optimal sequence of operations cache-efficiently in space linear in the size of the input is given by Chowdhury, Le, and Ramachandran.
Improved algorithms
Improving on the Wagner–Fisher algorithm described above, Ukkonen describes several variants, one of which takes two strings and a maximum edit distance s, and returns min(s, d). It achieves this by only computing and storing a part of the dynamic programming table around its diagonal. This algorithm takes time O(s×min(m,n)), where m and n are the lengths of the strings. Space complexity is O(s2) or O(s), depending on whether the edit sequence needs to be read off.
Further improvements by Landau, Myers, and Schmidt [1] give an O(s2 + max(m,n)) time algorithm.
For a finite alphabet and edit costs which are multiples of each other, the fastest known exact algorithm is of Masek and Paterson having worst case runtime of O(nm/logn).
Applications
Edit distance finds applications in computational biology and natural language processing, e.g. the correction of spelling mistakes or OCR errors, and approximate string matching, where the objective is to find matches for short strings in many longer texts, in situations where a small number of differences is to be expected.
Various algorithms exist that solve problems beside the computation of distance between a pair of strings, to solve related types of problems.
- Hirschberg's algorithm computes the optimal alignment of two strings, where optimality is defined as minimizing edit distance.
- Approximate string matching can be formulated in terms of edit distance. Ukkonen's 1985 algorithm takes a string p, called the pattern, and a constant k; it then builds a deterministic finite state automaton that finds, in an arbitrary string s, a substring whose edit distance to p is at most k (cf. the Aho–Corasick algorithm, which similarly constructs an automaton to search for any of a number of patterns, but without allowing edit operations). A similar algorithm for approximate string matching is the bitap algorithm, also defined in terms of edit distance.
- Levenshtein automata are finite-state machines that recognize a set of strings within bounded edit distance of a fixed reference string.
Language edit distance
A generalization of the edit distance between strings is the language edit distance between a string and a language, usually a formal language. Instead of considering the edit distance between one string and another, the language edit distance is the minimum edit distance that can be attained between a fixed string and any string taken from a set of strings. More formally, for any language L and string x over an alphabet Σ, the language edit distance d(L, x) is given by, where
is the string edit distance. When the language L is context free, there is a cubic time dynamic programming algorithm proposed by Aho and Peterson in 1972 which computes the language edit distance. For less expressive families of grammars, such as the regular grammars, faster algorithms exist for computing the edit distance.
Language edit distance has found many diverse applications, such as RNA folding, error correction, and solutions to the Optimum Stack Generation problem.
See also
- Graph edit distance
- String-to-string correction problem
- String metric
- Time Warp Edit Distance
References
- Daniel Jurafsky; James H. Martin. Speech and Language Processing. Pearson Education International. pp. 107–111.
- Esko Ukkonen (1983). On approximate string matching. Foundations of Computation Theory. Springer. pp. 487–495. doi:10.1007/3-540-12689-9_129.
- Schulz, Klaus U.; Mihov, Stoyan (2002). "Fast string correction with Levenshtein automata". International Journal of Document Analysis and Recognition. 5 (1): 67–85. CiteSeerX 10.1.1.16.652. doi:10.1007/s10032-002-0082-8. S2CID 207046453.
- Lei Chen; Raymond Ng (2004). On the marriage of Lp-norms and edit distance (PDF). Proc. 30th Int'l Conf. on Very Large Databases (VLDB). Vol. 30. doi:10.1016/b978-012088469-8.50070-x.
- Kukich, Karen (1992). "Techniques for Automatically Correcting Words in Text" (PDF). ACM Computing Surveys. 24 (4): 377–439. doi:10.1145/146370.146380. S2CID 5431215. Archived from the original (PDF) on 2016-09-27. Retrieved 2017-11-09.
- R. Wagner; M. Fischer (1974). "The string-to-string correction problem". J. ACM. 21: 168–178. doi:10.1145/321796.321811. S2CID 13381535.
- Skiena, Steven (2010). The Algorithm Design Manual (2nd ed.). Springer Science+Business Media. Bibcode:2008adm..book.....S. ISBN 978-1-849-96720-4.
- Chowdhury, Rezaul; Le, Hai-Son; Ramachandran, Vijaya (July 2010). "Cache-oblivious dynamic programming for bioinformatics". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 7 (3): 495–510. doi:10.1109/TCBB.2008.94. PMID 20671320. S2CID 2532039.
- Ukkonen, Esko (1985). "Algorithms for approximate string matching" (PDF). Information and Control. 64 (1–3): 100–118. doi:10.1016/S0019-9958(85)80046-2.
- Landau; Myers; Schmidt (1998). "Incremental String Comparison". SIAM Journal on Computing. 27 (2): 557–582. CiteSeerX 10.1.1.38.1766. doi:10.1137/S0097539794264810.
- Masek, William J.; Paterson, Michael S. (February 1980). "A faster algorithm computing string edit distances". Journal of Computer and System Sciences. 20 (1): 18–31. doi:10.1016/0022-0000(80)90002-1. hdl:1721.1/148933. ISSN 0022-0000.
- Esko Ukkonen (1985). "Finding approximate patterns in strings". J. Algorithms. 6: 132–137. doi:10.1016/0196-6774(85)90023-9.
- Bringmann, Karl; Grandoni, Fabrizio; Saha, Barna; Williams, Virginia Vassilevska (2016). "Truly Sub-cubic Algorithms for Language Edit Distance and RNA-Folding via Fast Bounded-Difference Min-Plus Product" (PDF). 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS). pp. 375–384. arXiv:1707.05095. doi:10.1109/focs.2016.48. ISBN 978-1-5090-3933-3. S2CID 17064578.
- Aho, A.; Peterson, T. (1972-12-01). "A Minimum Distance Error-Correcting Parser for Context-Free Languages". SIAM Journal on Computing. 1 (4): 305–312. doi:10.1137/0201022. ISSN 0097-5397.
- Wagner, Robert A. (1974). "Order-n correction for regular languages". Communications of the ACM. 17 (5): 265–268. doi:10.1145/360980.360995. S2CID 11063282.
- Saha, B. (2014-10-01). The Dyck Language Edit Distance Problem in Near-Linear Time. 2014 IEEE 55th Annual Symposium on Foundations of Computer Science. pp. 611–620. doi:10.1109/FOCS.2014.71. ISBN 978-1-4799-6517-5. S2CID 14806359.
In computational linguistics and computer science edit distance is a string metric i e a way of quantifying how dissimilar two strings e g words are to one another that is measured by counting the minimum number of operations required to transform one string into the other Edit distances find applications in natural language processing where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question In bioinformatics it can be used to quantify the similarity of DNA sequences which can be viewed as strings of the letters A C G and T Different definitions of an edit distance use different sets of like operations Levenshtein distance operations are the removal insertion or substitution of a character in the string Being the most common metric the term Levenshtein distance is often used interchangeably with edit distance Types of edit distanceDifferent types of edit distance allow different sets of string operations For instance Algorithm Operations AllowedInsertions Deletions Substitutions TranspositionLevenshtein Distance Longest Common Subsequence LCS Hamming Distance Damerau Levenshtein Distance Jaro distance Some edit distances are defined as a parameterizable metric calculated with a specific set of allowed edit operations and each operation is assigned a cost possibly infinite This is further generalized by DNA sequence alignment algorithms such as the Smith Waterman algorithm which make an operation s cost depend on where it is applied Formal definition and propertiesGiven two strings a and b on an alphabet S e g the set of ASCII characters the set of bytes 0 255 etc the edit distance d a b is the minimum weight series of edit operations that transforms a into b One of the simplest sets of edit operations is that defined by Levenshtein in 1966 Insertion of a single symbol If a u v then inserting the symbol x produces u x v This can also be denoted e x using e to denote the empty string Deletion of a single symbol changes u x v to u v x e Substitution of a single symbol x for a symbol y x changes u x v to u y v x y In Levenshtein s original definition each of these operations has unit cost except that substitution of a character by itself has zero cost so the Levenshtein distance is equal to the minimum number of operations required to transform a to b A more general definition associates non negative weight functions w ins x w del x and w sub x y with the operations Additional primitive operations have been suggested Damerau Levenshtein distance counts as a single edit a common mistake transposition of two adjacent characters formally characterized by an operation that changes u x y v into u y x v For the task of correcting OCR output merge and split operations have been used which replace a single character into a pair of them or vice versa Other variants of edit distance are obtained by restricting the set of operations Longest common subsequence LCS distance is edit distance with insertion and deletion as the only two edit operations both at unit cost 37 Similarly by only allowing substitutions again at unit cost Hamming distance is obtained this must be restricted to equal length strings Jaro Winkler distance can be obtained from an edit distance where only transpositions are allowed Example The Levenshtein distance between kitten and sitting is 3 A minimal edit script that transforms the former into the latter is kitten sitten substitute s for k sitten sittin substitute i for e sittin sitting insert g at the end LCS distance insertions and deletions only gives a different distance and minimal edit script kitten itten delete k at 0 itten sitten insert s at 0 sitten sittn delete e at 4 sittn sittin insert i at 4 sittin sitting insert g at 6 for a total cost distance of 5 operations Properties Edit distance with non negative cost satisfies the axioms of a metric giving rise to a metric space of strings when the following conditions are met 37 Every edit operation has positive cost for every operation there is an inverse operation with equal cost With these properties the metric axioms are satisfied as follows d a b 0 if and only if a b since each string can be trivially transformed to itself using exactly zero operations d a b gt 0 when a b since this would require at least one operation at non zero cost d a b d b a by equality of the cost of each operation and its inverse Triangle inequality d a c d a b d b c Levenshtein distance and LCS distance with unit cost satisfy the above conditions and therefore the metric axioms Variants of edit distance that are not proper metrics have also been considered in the literature Other useful properties of unit cost edit distances include LCS distance is bounded above by the sum of lengths of a pair of strings 37 LCS distance is an upper bound on Levenshtein distance For strings of the same length Hamming distance is an upper bound on Levenshtein distance Regardless of cost weights the following property holds of all edit distances When a and b share a common prefix this prefix has no effect on the distance Formally when a uv and b uw then d a b d v w This allows speeding up many computations involving edit distance and edit scripts since common prefixes and suffixes can be skipped in linear time ComputationThe first algorithm for computing minimum edit distance between a pair of strings was published by Damerau in 1964 Common algorithm Using Levenshtein s original operations the nonsymmetric edit distance from a a1 am displaystyle a a 1 ldots a m to b b1 bn displaystyle b b 1 ldots b n is given by dmn displaystyle d mn defined by the recurrence di0 k 1iwdel ak for1 i md0j k 1jwins bk for1 j ndij di 1 j 1forai bjmin di 1 j wdel ai di j 1 wins bj di 1 j 1 wsub ai bj forai bjfor1 i m 1 j n displaystyle begin aligned d i0 amp sum k 1 i w mathrm del a k amp amp quad text for 1 leq i leq m d 0j amp sum k 1 j w mathrm ins b k amp amp quad text for 1 leq j leq n d ij amp begin cases d i 1 j 1 amp text for a i b j min begin cases d i 1 j w mathrm del a i d i j 1 w mathrm ins b j d i 1 j 1 w mathrm sub a i b j end cases amp text for a i neq b j end cases amp amp quad text for 1 leq i leq m 1 leq j leq n end aligned This algorithm can be generalized to handle transpositions by adding another term in the recursive clause s minimization The straightforward recursive way of evaluating this recurrence takes exponential time Therefore it is usually computed using a dynamic programming algorithm that is commonly credited to Wagner and Fischer although it has a history of multiple invention After completion of the Wagner Fischer algorithm a minimal sequence of edit operations can be read off as a backtrace of the operations used during the dynamic programming algorithm starting at dmn displaystyle d mn This algorithm has a time complexity of 8 m n where m and n are the lengths of the strings When the full dynamic programming table is constructed its space complexity is also 8 m n this can be improved to 8 min m n by observing that at any instant the algorithm only requires two rows or two columns in memory However this optimization makes it impossible to read off the minimal series of edit operations A linear space solution to this problem is offered by Hirschberg s algorithm 634 A general recursive divide and conquer framework for solving such recurrences and extracting an optimal sequence of operations cache efficiently in space linear in the size of the input is given by Chowdhury Le and Ramachandran Improved algorithms Improving on the Wagner Fisher algorithm described above Ukkonen describes several variants one of which takes two strings and a maximum edit distance s and returns min s d It achieves this by only computing and storing a part of the dynamic programming table around its diagonal This algorithm takes time O s min m n where m and n are the lengths of the strings Space complexity is O s 2 or O s depending on whether the edit sequence needs to be read off Further improvements by Landau Myers and Schmidt 1 give an O s 2 max m n time algorithm For a finite alphabet and edit costs which are multiples of each other the fastest known exact algorithm is of Masek and Paterson having worst case runtime of O nm logn ApplicationsEdit distance finds applications in computational biology and natural language processing e g the correction of spelling mistakes or OCR errors and approximate string matching where the objective is to find matches for short strings in many longer texts in situations where a small number of differences is to be expected Various algorithms exist that solve problems beside the computation of distance between a pair of strings to solve related types of problems Hirschberg s algorithm computes the optimal alignment of two strings where optimality is defined as minimizing edit distance Approximate string matching can be formulated in terms of edit distance Ukkonen s 1985 algorithm takes a string p called the pattern and a constant k it then builds a deterministic finite state automaton that finds in an arbitrary string s a substring whose edit distance to p is at most k cf the Aho Corasick algorithm which similarly constructs an automaton to search for any of a number of patterns but without allowing edit operations A similar algorithm for approximate string matching is the bitap algorithm also defined in terms of edit distance Levenshtein automata are finite state machines that recognize a set of strings within bounded edit distance of a fixed reference string Language edit distanceA generalization of the edit distance between strings is the language edit distance between a string and a language usually a formal language Instead of considering the edit distance between one string and another the language edit distance is the minimum edit distance that can be attained between a fixed string and any string taken from a set of strings More formally for any language L and string x over an alphabet S the language edit distance d L x is given byd L x miny Ld x y displaystyle d L x min y in L d x y where d x y displaystyle d x y is the string edit distance When the language L is context free there is a cubic time dynamic programming algorithm proposed by Aho and Peterson in 1972 which computes the language edit distance For less expressive families of grammars such as the regular grammars faster algorithms exist for computing the edit distance Language edit distance has found many diverse applications such as RNA folding error correction and solutions to the Optimum Stack Generation problem See alsoGraph edit distance String to string correction problem String metric Time Warp Edit DistanceReferencesNavarro Gonzalo 1 March 2001 A guided tour to approximate string matching PDF ACM Computing Surveys 33 1 31 88 CiteSeerX 10 1 1 452 6317 doi 10 1145 375360 375365 S2CID 207551224 Retrieved 19 March 2015 Daniel Jurafsky James H Martin Speech and Language Processing Pearson Education International pp 107 111 Esko Ukkonen 1983 On approximate string matching Foundations of Computation Theory Springer pp 487 495 doi 10 1007 3 540 12689 9 129 Schulz Klaus U Mihov Stoyan 2002 Fast string correction with Levenshtein automata International Journal of Document Analysis and Recognition 5 1 67 85 CiteSeerX 10 1 1 16 652 doi 10 1007 s10032 002 0082 8 S2CID 207046453 Lei Chen Raymond Ng 2004 On the marriage of Lp norms and edit distance PDF Proc 30th Int l Conf on Very Large Databases VLDB Vol 30 doi 10 1016 b978 012088469 8 50070 x Kukich Karen 1992 Techniques for Automatically Correcting Words in Text PDF ACM Computing Surveys 24 4 377 439 doi 10 1145 146370 146380 S2CID 5431215 Archived from the original PDF on 2016 09 27 Retrieved 2017 11 09 R Wagner M Fischer 1974 The string to string correction problem J ACM 21 168 178 doi 10 1145 321796 321811 S2CID 13381535 Skiena Steven 2010 The Algorithm Design Manual 2nd ed Springer Science Business Media Bibcode 2008adm book S ISBN 978 1 849 96720 4 Chowdhury Rezaul Le Hai Son Ramachandran Vijaya July 2010 Cache oblivious dynamic programming for bioinformatics IEEE ACM Transactions on Computational Biology and Bioinformatics 7 3 495 510 doi 10 1109 TCBB 2008 94 PMID 20671320 S2CID 2532039 Ukkonen Esko 1985 Algorithms for approximate string matching PDF Information and Control 64 1 3 100 118 doi 10 1016 S0019 9958 85 80046 2 Landau Myers Schmidt 1998 Incremental String Comparison SIAM Journal on Computing 27 2 557 582 CiteSeerX 10 1 1 38 1766 doi 10 1137 S0097539794264810 Masek William J Paterson Michael S February 1980 A faster algorithm computing string edit distances Journal of Computer and System Sciences 20 1 18 31 doi 10 1016 0022 0000 80 90002 1 hdl 1721 1 148933 ISSN 0022 0000 Esko Ukkonen 1985 Finding approximate patterns in strings J Algorithms 6 132 137 doi 10 1016 0196 6774 85 90023 9 Bringmann Karl Grandoni Fabrizio Saha Barna Williams Virginia Vassilevska 2016 Truly Sub cubic Algorithms for Language Edit Distance and RNA Folding via Fast Bounded Difference Min Plus Product PDF 2016 IEEE 57th Annual Symposium on Foundations of Computer Science FOCS pp 375 384 arXiv 1707 05095 doi 10 1109 focs 2016 48 ISBN 978 1 5090 3933 3 S2CID 17064578 Aho A Peterson T 1972 12 01 A Minimum Distance Error Correcting Parser for Context Free Languages SIAM Journal on Computing 1 4 305 312 doi 10 1137 0201022 ISSN 0097 5397 Wagner Robert A 1974 Order n correction for regular languages Communications of the ACM 17 5 265 268 doi 10 1145 360980 360995 S2CID 11063282 Saha B 2014 10 01 The Dyck Language Edit Distance Problem in Near Linear Time 2014 IEEE 55th Annual Symposium on Foundations of Computer Science pp 611 620 doi 10 1109 FOCS 2014 71 ISBN 978 1 4799 6517 5 S2CID 14806359