Beyond StringUtils.getLevensteinDistance
Levenstein Distance (aka. Edit Distance)
In Apache commons-lang, there's an implementation of Levenstein Distance algorithm (aka. Edit Distance in many textbooks). That implementation is based from an article on edit distance by Michael Gilleland. Edit distance is a farily well understood algorithm (case and point: it's in commons-lang). I am not going to discuss the basic algorithm here. If you need more details about the basics, read and understand Gilleland's article first. The article is about how to improve from the basics and creating something better than the one in commons-lang.
A few days ago, I was involved in a discussion about how to apply fuzzy matching in our search algorithms. One of the topics that came up was about how to apply different weightings on different character edits. For example, using StringUtils.getLevensteinDistance , the string 'newyork' and 'newyonk' will both come out to be of distance 1 from the normal spelling of 'new york'. This is fine but we understand that 'newyork' is semantically closer to 'new york' than 'newyonk' is, therefore edit distance is more accurate if it returns a distance of <1 for the 'newyork' case. The idea is to lower the weighing on certain character edits such as edits on space characters, edits on punctuation characters at the end of the string ('new york' and 'new york!' for examples). To do that, we need to modify commons-lang's implementation to allow us inject different weighting on its distance calculation.
We can do better
Since we are looking to modify the implementation, why not throw in a few more improvements while we are at it. Below is a list of 'Possible Improvements' from wikipedia's levestein distance article, we are going to address some of them in our improved edit distance algorithm.
(From wikipedia)
Possible improvements to this algorithm include:
- We can adapt the algorithm to use less space, O(m) instead of O(mn), since it only requires that the previous row and current row be stored at any one time.
- We can store the number of insertions, deletions, and substitutions
separately, or even the positions at which they occur, which is alwaysj
. - We can normalize the distance to the interval
[0,1]
. - If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.[2]
- We can give different penalty costs to insertion, deletion and
substitution. We can also give penalty costs that depend on which
characters are inserted, deleted or substituted. - The initialization of
d[i,0]
can be moved inside the main outer loop. - By initializing the first row of the matrix with 0, the algorithm can be used for fuzzy string search of a string in a text [3].
This modification gives the end-position of matching substrings of the
text. To determine the start-position of the matching substrings, the
number of insertions and deletions can be stored separately and used to
compute the start-position from the end-position [4] - This algorithm parallelizes poorly, due to a large number of data dependencies. However, all the
cost
values can be computed in parallel, and the algorithm can be adapted to perform theminimum
function in phases to eliminate dependencies. - By examining diagonals instead of rows, and by using lazy evaluation, we can find the Levenshtein distance in O(m (1 + d)) time (where d is the Levenshtein distance), which is much faster than the regular dynamic programming algorithm if the distance is small. [5]
Introducing Brew Edit Distance
Before we start hacking away tho, let's look at what are already available.
On CPAN, there's a variation of Edit Distance algorithm named 'Text::Brew' , This is a variation of edit distance algorithm based from an article by Chris Brew at Ohio State University. (read his article before reading on if you don't already know what Brew's edit distance algorithm is). The idea there is that instead of just maintaining a matrix of distances, we maitain (at each cell) a TraceBack data structure which is just a tuple of {cost, previous edit move} . This allows us to construct the edit path at the end by following the final TraceBack back to the intial point. For example, after the edit distance calculation is completed for string 'abc' and 'abcd', you will get the edit path of [ match, match, match, insert] and the edit distance of 1 because to transform 'abc' to 'abcd' we only need to insert the character 'd'. This is great because we can review exactly what edit path was used in the edit distance calculation.
The Text::Brew implementation (in perl) and the implementation on Brew's article (in ruby) uses a full matrix, that is not necessary because at any point the algorithm only requires the previous row and the current row for distance calculation (see improve item #1 above). In commons-lang, the StringUtils.getLevensteinDistance function maintains 2 rows and swap them at each outer loop iteration therefore it only uses O(min(m, n)) memory. (it also always transform from the shorter string to the longer string so that the row is as short as the the length of the shorter string). We will be use a hybrid of Text::Brew and StringUtils.getLevensteinDistance as a starting point.
#d1 and d2 are arrays of length n
for i below m
d1(0) = (delCost, DEL, nil) #initialize the first cell here
for j below n
if string1(i) == string2(j)
subst = 0
else
subst = substCost
endif
d1(j+1) = best ((subst,MATCH,d2(j))
(insCost,INS,d1(j))
(delCost,DEL,d2(j+1)))
endfor
swap(d1, d2); #swp the 2 rows
endfor
The Cost Function
In 'Text::Brew', you can adjust the insert, delete, substitution costs, but these costs do not take the character being edited into account. For example, inserting 'a' is just weighted the same as inserting 's' which we know is not the case in English because people often leave out 's' when mis-spelled but rarely leave out vowels. Further more, the cost of edit does not take into the account of where the edit is taking place. For example, the insertions at the end of the strings are generally considered less significant as insertions at the beginning or middle of the strings because the use of abbreviations and acronyms. I.e. 'Jonathan' and 'Jon' V.S. 'Jonathan' and 'han'. (Be careful with this tho because some abbreviations happen at the beginning of the strings, such as 'drew' and 'Andrew' and 'New jersey' and 'Jersey' so inserting at the beginning may also be insignificant. Isn't English wonderful?) Therefore, we need modify the code such that we can inject arbitrary cost functions base on our application needs.
This is easy to do, instead hard-conding the cost values, we introduce a cost function into our algorithm. Notice that the cost function takes 2 arguments into considerations, the edit operation (which is a tuple of {edit type, the character being edited}) and the location of which the edit occurs.
enum Move{
DELETE, INSERT, SUBSTITUTE, MATCH
}
class Edit{
public static final Edit INITIAL = new Edit(null, '0');
Move move;
char character;
char subCharacter; //only available in Substitute case
}
abstract class CostFunction{
public abstract float getEditCost(Edit edit, int index);
}
This way we can assign whatever cost function we desire for any give edit operation base on the specific application needs. For example, if we want to ignore edits on space characters, we can create a cost function like this:
CostFunction ignoreSpace =
new CostFunction(){
@Override
public float getEditCost(Edit edit, int index) {
return edit.move == Move.MATCH || edit.character == ' '? 0 : 1.0f;
}
}
With the cost function object, we can use it in our distance calculations:
for (i = 1; i<=m; i++) {
from_i = fromString.charAt(i-1);
edit = new Edit(Move.DELETE, from_i);
d[0] = new TraceBack(p[0].cost+costFunction.getEditCost(edit, i), edit, null);
for (j=1; j<=n; j++) {
to_j = toString.charAt(j-1);
edit = new Edit(to_j==from_i ? Move.MATCH:Move.SUBSTITUTE, from_i, to_j);
TraceBack sub = new TraceBack(costFunction.getEditCost(edit, i), edit, p[j-1]);
edit = new Edit(Move.DELETE, from_i);
TraceBack del = new TraceBack(costFunction.getEditCost(edit, i), edit, p[j]);
edit = new Edit(Move.INSERT, to_j);
TraceBack insert = new TraceBack(costFunction.getEditCost(edit, j), edit, d[j-1]);
d[j] = best(sub, insert, del);
}
// copy current distance counts to 'previous row' distance counts
_d = p; p = d; d = _d;
}
By the way, the TraceBack object is the same trace back data structure described in Brew's article:
class TraceBack {
float cost;
Edit edit;
TraceBack prevTraceBack;
}
With the modified Brew edit distance algorithm and the cost function, we had just covered the possible improvement items #1, #2, #5 and #6 in the above list. Now we are going to takle a few more (they will be going down fast).
Improvements that Involves the use of Thresholds.
When using edit distance algorithms, people usually do not care about what the actual distance come out to be. Rather, they are interested in whether the distance is greater than some threshold values. For example, in fuzzy string matching, it's suffice to know that 'new york' has a edit distance of greater than 3 comparing to 'new jersey' so that they can be considered as not match. If your applications have a threshold and your edit distance algorithm only needs to return a 'Yes' or 'No' answer about whether the 2 input strings are within some distance threshold, you can add a few improvements to the algorithm to improve its time efficiency by leveraging the threshold value.
In the TraceBack object, the member field 'cost' is actually holding the current minimum cost. So if at any point after the inner loop scan, the field holds a value that is greater than your threshold, then we know immediately the final edit distance will be greater than the threshold and can stop right away (because keep scanning downward will not reduce this value).
There are a few ways to implement this, here I am trying to use as little code as possible, so I am just going to assume that there's a variable named 'threshold' which has the threshold value and have the algorithm throw a runtime exception as soon as the current minimum cost is greater than the threshold.
private float threshold = Float.MAX_VALUE; //change it if you have a threshold
for (i = 1; i<=m; i++) {
from_i = fromString.charAt(i-1);
edit = new Edit(Move.DELETE, from_i);
d[0] = new TraceBack(p[0].cost+costFunction.getEditCost(edit, i), edit, null);
for (j=1; j<=n; j++) {
to_j = toString.charAt(j-1);
edit = new Edit(to_j==from_i ? Move.MATCH:Move.SUBSTITUTE, from_i, to_j);
TraceBack sub = new TraceBack(costFunction.getEditCost(edit, i), edit, p[j-1]);
edit = new Edit(Move.DELETE, from_i);
TraceBack del = new TraceBack(costFunction.getEditCost(edit, i), edit, p[j]);
edit = new Edit(Move.INSERT, to_j);
TraceBack insert = new TraceBack(costFunction.getEditCost(edit, j), edit, d[j-1]);
d[j] = best(sub, insert, del);
}
boolean stillInRange = false;
for(int x =0; x<= n; x++){ //check to see if there still exist a path that is within the threshold range
if(d[x].cost < threshold){
stillInRange=true; break;
}
}
if(!stillInRange){
throw new RuntimeException("Threshold reached");
}
// copy current distance counts to 'previous row' distance counts
_d = p; p = d; d = _d;
}
This improvement is somewhat related to possible improvement item #4. The above uses the threshold to prevent us from scanning down the matrix, while the idea in item #4 is about avoiding the need to scan all the way across during the inner loop. Let's say you have a threshold of k (integer) and are only interested in whether the distance is greater than the threshold. The inner loop ONLY needs to calculate the distance in the cells that are within k cells away from the diagonal line. The rest of the cells we can just fill them with Float.MAX_VALUE because they are for sure have a greater distance than k (because k away from the diagonal line means k delete edits away). The following picture should illustrates the idea:
Please note that item #4 doesn't fit too well with our implementation because we allow customized cost function, therefore we cannot tell whether cells that are k away from the diagonal are in fact have a distance greater than k (do you see why?). So if your edit distance implementation has a static delete cost then you can add this improvement. In our case, we cannot.
All rights, there goes item #4, only a few more to go.
Normalized Distance
Item #3 says we can normalized the distance to the interval of [0,1]. This is a simple idea. It's very common that people will normalize the edit distance based off the length of the input strings. For example, (assuming all edit cost are 1). strings 'ab' and 'cd' are of distance 2. However, strings 'john' and 'johnny' are also of distance of 2. This does not make senses because we know 'ab' and 'cd' are two totally different strings while 'john' are 'johnny' are very close. So, instead of using the edit distance, we use a so called normalized distance when comparing strings:
normalized distance = edit distance / average(length(a), length(b))
depending on your applications, you might want to instead use the following function, which generally gives a lower normalized distance than the above (because max(a,b) >= avg(a,b))
normalized distance = edit distance / max(length(a), length(b))
Of course, you can have any other functions to normalize the edit distance depending on the specifics of your applications.
Everything Else
We just addressed items #1 to #6 off the list of possible improvements. For items #7 to #9, I understand what they are but haven't tried them. I think parallelizing the algorithm is a very good thing to try although I question whether the parallelism justify the parallelization overheads. I can only talk about them after I find some times to test them out.
The source codes
Here is the full source codes of everything that was discussed above. The main function prints
[SUBSTITUTE N with n, MATCH e, MATCH w, DELETE , MATCH y, MATCH o, MATCH r, MATCH k]
1.0
import java.util.LinkedList;
import java.util.List;
public class BrewEditDistance{
public static void main(String[] args) {
String s = "New york", t = "newyork";
BrewEditDistance ed = new BrewEditDistance(new CostFunction(){
@Override
public float getEditCost(Edit edit, int index) {
return edit.move == Move.MATCH || edit.character == ' '? 0 : 1.0f;
}
});
TraceBack trace = ed.brewEditDistance(s, t);
EditResults res = constructResultFromTraceBack(trace);
System.out.println(res.getEditPath());
System.out.println(res.totalCost);
}
static enum Move{
DELETE, INSERT, SUBSTITUTE, MATCH
}
static class Edit{
public static final Edit INITIAL = new Edit(null, '0');
Move move;
char character;
char subCharacter; //only available in Substitute case
public Edit(Move move, char character) {
this.move = move;
this.character = character;
}
public Edit(Move move, char character, char subCharacter) {
this.move = move;
this.character = character;
this.subCharacter = subCharacter;
}
@Override
public String toString() {
return move.name() + " " + character+ (move==Move.SUBSTITUTE? " with "+subCharacter:"");
}
}
abstract static class CostFunction{
public abstract float getEditCost(Edit edit, int index);
}
static class EditResults{
private float totalCost;
private List<Edit> editPath = new LinkedList<Edit>();
public float getTotalCost() {
return totalCost;
}
public List<Edit> getEditPath() {
return editPath;
}
public void setTotalCost(float totalCost) {
this.totalCost = totalCost;
}
}
static class TraceBack {
float cost;
Edit edit;
TraceBack prevTraceBack;
public TraceBack(float cost, Edit edit, TraceBack nextTraceBack) {
this.cost = cost;
this.edit = edit;
this.prevTraceBack = nextTraceBack;
}
}
static EditResults constructResultFromTraceBack(TraceBack traceBack){
EditResults ret =new EditResults();
ret.setTotalCost(traceBack.cost);
TraceBack t = traceBack;
while(t.edit != Edit.INITIAL){
ret.editPath.add(0, t.edit);
t = t.prevTraceBack;
}
return ret;
}
public BrewEditDistance(CostFunction costFunction){
this.costFunction = costFunction;
}
private CostFunction costFunction;
private float threshold = Float.MAX_VALUE;
public TraceBack brewEditDistance(String fromString, String toString) {
if (toString == null || fromString == null) {
throw new IllegalArgumentException("Strings must not be null");
}
//Apache common-langs's implementation always transform from t to s , which is very counter intuitive.
//In their case it doesnt matter because it doesnt track edits and all edit costs are the same
//but why the heck would anyone in the right mind want to call editDistance(s, t) and have it compute as
//transform t to s? here I just substitute them back to what they are supposed to be meant
int n = toString.length();
int m = fromString.length();
if (n == 0) { //toString is empty, so we are doing all deletes
return constructTraceBack(fromString, costFunction, Move.DELETE);
} else if (m == 0) {//fromString is empty, so we are doing all inserts
return constructTraceBack(toString, costFunction, Move.INSERT);
}
//(see original apache common-lang getLevensteinDistance())
//we cannot do swap strings memory optimization any more because insert/delete cost can be different
//swapping the strings will temper the final edit cost. Swapping the strings will also screw up the edit path
//however, in many applications your should have 2 similar length strings, otherwise
//you can skip edit distance and just consider 2 strings that varies greatly in length
//to be |s1.length - s2.length| which should be a good enough approximation
TraceBack p[] = new TraceBack[n+1]; //'previous' cost array, horizontally
TraceBack d[] = new TraceBack[n+1]; // cost array, horizontally
TraceBack _d[]; //placeholder to assist in swapping p and d
// indexes into strings toString and fromString
int j; // iterates through toString
int i; // iterates through fromString
char from_i; // jth character of fromString
char to_j; // ith character of toString
Edit edit;
p[0] = new TraceBack(0, Edit.INITIAL, null);
for (j = 0; j<n; j++) {
TraceBack prev = p[j];
edit = new Edit(Move.INSERT, toString.charAt(j));
p[j+1] = new TraceBack(prev.cost+costFunction.getEditCost(edit, j), edit, prev);
}
for (i = 1; i<=m; i++) {
from_i = fromString.charAt(i-1);
edit = new Edit(Move.DELETE, from_i);
d[0] = new TraceBack(p[0].cost+costFunction.getEditCost(edit, i), edit, null);
for (j=1; j<=n; j++) {
to_j = toString.charAt(j-1);
edit = new Edit(to_j==from_i ? Move.MATCH:Move.SUBSTITUTE, from_i, to_j);
TraceBack sub = new TraceBack(costFunction.getEditCost(edit, i), edit, p[j-1]);
edit = new Edit(Move.DELETE, from_i);
TraceBack del = new TraceBack(costFunction.getEditCost(edit, i), edit, p[j]);
edit = new Edit(Move.INSERT, to_j);
TraceBack insert = new TraceBack(costFunction.getEditCost(edit, j), edit, d[j-1]);
d[j] = best(sub, insert, del);
}
boolean stillInRange = false;
for(int x =0; x<= n; x++){ //check to see if there still exist a path that is within the threshold range
if(d[x].cost < threshold){
stillInRange=true; break;
}
}
if(!stillInRange){
throw new RuntimeException("Threshold reached");
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
// our last action in the above loop was to switch d and p, so p now
// actually has the most recent cost counts
return p[n];
}
private static TraceBack constructTraceBack(String s, CostFunction costFunction, Move move){
TraceBack trace = new TraceBack(0f, Edit.INITIAL, null);
for(int i =0;i<s.length();i++){
Edit edit = new Edit(move, s.charAt(i));
trace = new TraceBack(costFunction.getEditCost(edit, i), edit, trace);
}
return trace;
}
private static TraceBack best(TraceBack substitute, TraceBack insert, TraceBack delete){
float subCost = substitute.cost + substitute.prevTraceBack.cost;
float intCost = insert.cost + insert.prevTraceBack.cost;
float delCost = delete.cost + delete.prevTraceBack.cost;
TraceBack ret = substitute;
ret.cost = subCost;
float bestCost = subCost;
if(intCost < bestCost){
bestCost = intCost;
ret = insert;
}
if(delCost < bestCost){
bestCost = delCost;
ret = delete;
}
ret.cost = bestCost;
return ret;
}
}
Performance Considerations
After I finished implementing the above algorithm, I couldn't stop thinking about the performance difference between my implementation against the commons-lang implementation. It's obvious that the BrewEditDistance algorithm will be slower because it keeps track of more variables and perform more calculations, but exactly how much slower?
So I wrote a little function to time the 2 implementations:
public static void main(String[] args) {
BrewEditDistance ed = new BrewEditDistance(new CostFunction(){
@Override
public float getEditCost(Edit edit, int index) {
return edit.move == Move.MATCH ? 0 : 1.0f;
}
});
int N = 1000000;
long start = System.currentTimeMillis();
for(int i =0; i<N; i++){
StringUtils.getLevenshteinDistance("fromString", "toString");
}
System.out.println((System.currentTimeMillis() - start)/1000.0);
start = System.currentTimeMillis();
for(int i =0; i<N; i++){
ed.brewEditDistance("fromString", "toString");
}
System.out.println((System.currentTimeMillis() - start)/1000.0);
}
The result on my quad-core Q6600@2.40GHz is
getLevenshteinDistance = 0.98 SECONDS
brewEditDistance = 9.639 SECONDS
As you can see, brewEditDistance is about 10 times slower. This really bothers me because I was expecting it to be about 4 times slower because brewEditDistance keeps track of about 4 times as many variables ({cost, edit type, character, prevTraceback}) as getLevenshteinDistance ( just {cost}) does.
So what went wrong? Well, I couldn't really tell so I did some profiling on brewEditDistance using the YourKit profiler. Here's the result
Okay... Looks like the algorithm was spending a lot of time constructing the TraceBack and Edit objects (most of them were throw away quickly after constructions).
We can fix this if we initialize the set of TraceBack objects up front and re-use them as if they are normal memory cells in an array. In addition, we need to flatten out the TraceBack object so that it's easier to work with.
abstract static class CostFunction{
public abstract float getEditCost(Move move, char character, int index);
}
static class TraceBack {
float cost = 0;
Move move = null;
char character = '0';
char subCharacter = '0';
TraceBack prevTraceBack = null;
}
Since we are re-using TraceBack objects, we can no longer use the row swapping trick (see item #1). Instead, we have to have the full matrix. We also make it big enough so that we only need to initialize it once.
private TraceBack d[][] = new TraceBack[100][100];
{ //per instance initialization
for(int i =0; i< 100; i++){
for(int j =0; j<100; j++){
d[i][j] = new TraceBack();
}
}
}
Now with these 2 modifications, we also need to modify the algorithm back to its very basic form.
// Step 2
for (i = 1; i <= n; i++) {
TraceBack prev = d[i-1][0];
char c = s.charAt(i-1);
d[i][0].cost = prev.cost + costFunction.getEditCost(Move.DELETE, c, i);
d[i][0].prevTraceBack = prev;
d[i][0].character = c;
d[i][0].move = Move.DELETE;
}
for (j = 1; j <= m; j++) {
TraceBack prev = d[0][j-1];
char c = t.charAt(j-1);
d[0][j].cost = prev.cost + costFunction.getEditCost(Move.INSERT, c, j);
d[i][0].prevTraceBack = prev;
d[i][0].character = c;
d[i][0].move = Move.INSERT;
}
// Step 3
TraceBack sub = new TraceBack();
TraceBack insert = new TraceBack();
TraceBack del = new TraceBack();
TraceBack tmp = null;
for (i = 1; i <= n; i++) {
s_i = s.charAt (i - 1);
// Step 4
for (j = 1; j <= m; j++) {
t_j = t.charAt (j - 1);
// Step 5
sub.move = s_i == t_j?Move.MATCH:Move.SUBSTITUTE;
sub.prevTraceBack = d[i-1][j-1];
sub.cost = sub.prevTraceBack.cost + costFunction.getEditCost(sub.move, s_i, i);
insert.move = Move.INSERT;
insert.prevTraceBack = d[i][j-1];
insert.cost = insert.prevTraceBack.cost + costFunction.getEditCost(insert.move, t_j, j);
del.move = Move.DELETE;
del.prevTraceBack = d[i-1][j];
del.cost = del.prevTraceBack.cost + costFunction.getEditCost(del.move, s_i, i);
// Step 6
tmp = sub;
if(insert.cost < tmp.cost ){ tmp = insert; }
if(del.cost < tmp.cost ){ tmp = del; }
d[i][j].prevTraceBack = tmp.prevTraceBack;
d[i][j].move = tmp.move;
d[i][j].subCharacter = t_j;
d[i][j].character = s_i;
}
}
// Step 7
return d[n][m];
With these changes, I timed the same test once again and I got
getLevenshteinDistance = 0.993 SECONDS
brewEditDistance = 4.416 SECONDS
This time brewEditDistance is only 5 times slower which is in line with what I was expecting.
So, if you want good object-oriented design and can take some performance hit, use the normal implementation. Otherwise if performance is a concern for you, update the algorithm such that it re-use TrackBack objects in stead of constructing new ones during the computations. Use your own judgment to decide which design tradeoff you want to make.
References:
http://en.wikipedia.org/wiki/Levenshtein_distance
http://www.ling.ohio-state.edu/~cbrew/2002/winter/684.02/string-distance.html
http://search.cpan.org/src/KCIVEY/Text-Brew-0.02/lib/Text/Brew.pm
http://svn.apache.org/viewvc/commons/proper/lang/trunk/src/java/org/apache/commons/lang/StringUtils.java?view=markup
Also many thanks to the smart folks at Health Market Science