Levenstein edit distance is the defacto algorithm for fuzzy string matching. It's often used in information retrival system (i.e. search engine) and data management system. For example, in a search engine, if the user enter 'DiMarco', the system should be able to not only bring back records that exactly match 'DiMarco', it should also consider records that are similiar to the input. If there are no records that matches exactly to the input, then the system should offer these similar records as alternatives. In our example, they could be
- 'De Marcro'
- 'Di Marco'
- 'De Merco'
- 'Dy Marrcro'
- and so on...
Edit distance is also a very popular algorithm in implementing spell checkers. In which the input is compare to a dictionary of words, and the cloest matching word will be returned.
Levenstein distance in its most basic form is quite rigid and it might not be a drop in solution for many applications that requires fuzzy matching. In a previous post, I had discussed about a list of possible improvements to the algorithm. Here is one more which involves mixing levenstein distance with a phonetic algorithm to achieve more natural fuzzy match algorithm.
Consider the above example about the word 'DiMarco'. Notice that the list of possible similar matches I had listed all starts with the character 'D'. Why? Because to a normal human being, the first character is the most visual significant character when considering whether 2 strings are similar. Other strings that do not share the same first characters often are considered different strings. 'DiMarco' V.S. 'CiMarco' or 'TiMarco' for instances.
In fuzzy record matching, should records be considered similar ONLY if they share the first character? In most cases, the answer is yes. However, here's the catch. Edit distance only considered how a word is spelled, but human often also consider how the word is pronounced. If that case, even if the first character of 2 words is different, but the first character (or two) has the same pronounciation, then a human might also considered the 2 words to be similar.
These are very common in names, because people often uses spelling variations as their names. For examples, the following pairs of words should be also considered similar even tho they differ in the first character:
- Ellan, Allan
- Iga, Eega
- Zandy, Sandy
- Phonda, Fonda
- Phat Joe Burger, Fat Joe Burger
- Kandy Kane, Candy Cane
and so on...
Here is where Phonetic algorithm like Metaphone comes in. The idea is to compare words' pronounciations by first encoding them into phonetic codes. The following is the encoding rule for metaphone:
RULES = [ # Regexp, replacement [ /([bcdfhjklmnpqrstvwxyz])\1+/, '\1' ],
# Remove doubled consonants except g. # [PHP] remove c from regexp.
[ /^ae/, 'E' ], [ /^[gkp]n/, 'N' ], [ /^wr/, 'R' ], [ /^x/, 'S' ],
[ /^wh/, 'W' ], [ /mb$/, 'M' ], # [PHP] remove $ from regexp.
[ /(?!^)sch/, 'SK' ], [ /th/, '0' ], [ /t?ch|sh/, 'X' ],
[ /c(?=ia)/, 'X' ], [ /[st](?=i[ao])/, 'X' ], [ /s?c(?=[iey])/, 'S' ],
[ /[cq]/, 'K' ], [ /dg(?=[iey])/, 'J' ], [ /d/, 'T' ],
[ /g(?=h[^aeiou])/, '' ], [ /gn(ed)?/, 'N' ], [ /([^g]|^)g(?=[iey])/, '\1J' ],
[ /g+/, 'K' ], [ /ph/, 'F' ], [ /([aeiou])h(?=\b|[^aeiou])/, '\1' ],
[ /[wy](?![aeiou])/, '' ], [ /z/, 'S' ], [ /v/, 'F' ], [ /(?!^)[aeiou]+/, '' ], ]
They are a bunch of regular expression replacement rules which replace the same sounding characters with the same encoding character. For instance,
[ /^[gkp]n/, 'N' ] means to replace gn, kn, or pn that are at the beginning of the string to the character 'N'
[ /^x/, 'S' ] means to replace x of the beginning of the string to the character 'S'
of course, you can replace ph with 'F' (see rule [ /ph/, 'F' ]).
Combining Visual and Phonetic Similarity
So, if you want your similarity metrics to align closer to how human think, your similarity metric should take both look alike and sound alike features into considerations. Given the above 2 observations about edit distance and metaphone, you can simply do the following in your fuzzy string metrics:
def isSimilar(s1, s2)
1. encode the first few characters of s1 and s2
2. false if the 1st encoded character differs
3. false if edit distance (normalized to the length s1 and s2) is greater than some threshold
4. true otherwise