This has probably been blogged more than a few times in the blogosphere, but it doesn't hurt to add another entry about it.

In Java 1.6, the specification for the @Override annotation was changed to allow @Override on implementation methods that 'override' interface methods. That syntax was not allowed in 1.5.

interface Foo{ void foo(); }

class Bar implements Foo{
    @Override //this line is valid in 1.6, but gives a compilation error in 1.5
    public void foo() {}
}
So, if you are switching back and forth between JDK 1.5 and JDK 1.6 thinking the syntaxes are compatible, you may run into compilation problems if you have @Override on your implementation class methods. IDEs like Eclipse, trying to be helpful, will automatically add @Override to methods when generating implementation classes, which makes your source code non-portable to 1.5.


Mercurial = Good

By Jay Liang

- good Eclipse support
- good integration on both Linux & Windows
- distributed repository: you carry the repository everywhere you go (awesome!)
- easy to set up, easy to use
- broad tool support (I think ...)


Here is another thing in Android that kept me scratching my head for a day before I finally figured out how to do it properly, given the lack of documentation and books for Android development at this time.

The Android app that I am currently working on has a search feature which fires an HTTP request and shows the results in a list view. The search can be slow depending on the network speed, so I wanted to put a loading dialog on the screen to indicate that the app is processing.

There's a ProgressDialog class in Android's SDK, so the idea is to first build and show a progress dialog before the search starts, and then dismiss the dialog when the search completes:

ProgressDialog pd =, "", "Searching...", true, false); //title/message strings here are placeholders; the originals were lost
makeHttpRequest();
pd.dismiss();

The above snippet will not work because the makeHttpRequest() method is executed on the UI thread (Android has a threading model similar to Swing's), so the dialog will not appear on the screen until makeHttpRequest() returns.

That's understandable, so how about putting the method in a background thread like this:

final ProgressDialog pd =, "", "Searching...", true, false); //title/message strings here are placeholders; the originals were lost

new Thread(new Runnable(){
    public void run(){
        makeHttpRequest();
        pd.dismiss();
    }
}).start();

This works as we would expect: the progress dialog is shown, and it disappears when makeHttpRequest() is finished.

However, here's the difficult bit. The above code will cause your application to crash if the user changes the phone's orientation while the progress dialog is not yet dismissed.

W/dalvikvm( 292): threadid=3: thread exiting with uncaught exception
E/AndroidRuntime( 292): Uncaught handler: thread main exiting due to uncaught exception
E/AndroidRuntime( 292): java.lang.IllegalArgumentException: View not attached to window manager
E/AndroidRuntime( 292): at android.view.WindowManagerImpl.findViewLocked(
E/AndroidRuntime( 292): at android.view.Window$LocalWindowManager.removeView(
E/AndroidRuntime( 292): at$000(
E/AndroidRuntime( 292): at$
E/AndroidRuntime( 292): at com.yellowbook.android2.SearchHelper$3.handleMessage(
E/AndroidRuntime( 292): at android.os.Handler.dispatchMessage(
E/AndroidRuntime( 292): at android.os.Looper.loop(
E/AndroidRuntime( 292): at java.lang.reflect.Method.invokeNative(Native Method)
E/AndroidRuntime( 292): at dalvik.system.NativeStart.main(Native Method)

That is because, by default, activities in Android are destroyed and recreated on orientation change. Since the progress dialog is no longer 'attached' to the window after the activity is recreated, dismissing it blows up once the original activity is destroyed. The problem has to do with Android's internal UI management, which I won't attempt to explain here.

Long story short, I figured out a working solution to the above problem. Rather than creating and showing the dialog manually, you need to override the onCreateDialog() method in your activity to let Android manage the dialogs for you.

Example code:

protected Dialog onCreateDialog(int id) {
    if (id == LOADING_DIALOG_ID) { //LOADING_DIALOG_ID is an int constant you define
        ProgressDialog loadingDialog = new ProgressDialog(this);
        loadingDialog.setMessage("Searching..."); //placeholder message
        return loadingDialog;
    }
    return super.onCreateDialog(id);
}

private void search(){
    showDialog(LOADING_DIALOG_ID); //ask Android to create and show the dialog
    new Thread(new Runnable(){
        public void run(){
            makeHttpRequest();
            dismissDialog(LOADING_DIALOG_ID);
        }
    }).start();
}
In addition to the above, my app's makeHttpRequest() forwards the UI to another screen using the startActivityForResult() method, like so:
Result res = search();
Intent i = new Intent(activity, ResultActivity.class);
i.putExtra("result", res);
activity.startActivityForResult(i, 1);
If you have similar code, you will find that the progress dialog is still shown when you press the back button to return to the search activity screen. After a few more hours of trial and error, I found that if I override onSaveInstanceState() as follows, the progress dialog is gone when I go back to the previous screen using the phone's back button.

//Do this at your own risk because I don't know whether that's the correct way of handling this.
protected void onSaveInstanceState(Bundle outState) {
    super.onSaveInstanceState(outState);
    try {
        dismissDialog(LOADING_DIALOG_ID);
    } catch (Exception e) {
        //ignore error
    }
}

By Jay Liang

Posted via Pixelpipe.


Update (2008-10-29):

I just saw this on the android-developers mailing list. It looks like an easier way to achieve this effect (I haven't had time to test it myself).


In res/drawable, create a file called, for instance, mybutton_background.xml
and put something like this inside:

<?xml version="1.0" encoding="utf-8"?>
<selector xmlns:android="">
<item android:state_focused="true" android:state_pressed="false"
android:drawable="@drawable/button_background_focus" />
<item android:state_focused="true" android:state_pressed="true"
android:drawable="@drawable/button_background_pressed" />
<item android:state_focused="false" android:state_pressed="true"
android:drawable="@drawable/button_background_pressed" />
<item android:drawable="@drawable/button_background_normal" />
</selector>

Then set this drawable as the background of your button with android:background="@drawable/mybutton_background".

This is not documented anywhere on the Android site.

The standard Button widget in Android has an onpress effect which paints the button with an orangish color when it is pressed. However, if you change the background of the button widget to create your own custom button, the onpress effect is lost.

To create the same effect with your custom button image, you need to extend the Button class and override its onDraw() method as follows:

public class OnPressButton extends Button{

    public OnPressButton(Context context) {
        super(context);
    }

    public OnPressButton(Context context, AttributeSet attrs){
        super(context, attrs);
    }

    protected void onDraw(Canvas canvas) {
        //set the button image based on whether the button is in its pressed state
        setBackgroundDrawable(getResources().getDrawable(isPressed()? R.drawable.btn_on : R.drawable.btn_off));
        super.onDraw(canvas);
    }
}
Then, instead of using the Button widget, use the above implementation of Button in your layout xml file (the package prefix below is a placeholder):

<com.yourpackage.OnPressButton
    android:layout_width="wrap_content"
    android:layout_height="wrap_content"
    android:background="@color/transparent"
    android:text="My Button" />

Observe that the background is set to "color/transparent".

That's all you need to do to implement the onpress effect. For more details about implementing custom widgets, see also


(Scraping Google Maps may be in violation of their terms of service.)

Google Maps provides a set of Ajax APIs that allow systems to query its data programmatically. However, the search results from this API do not include the business URL. So what if your application needs this URL? Well, you have no choice but to bring back the good old screen-scraping technique.

Rather than using the Ajax API link, the easiest way is to just use the 'maps' link directly, as shown below, and scrape the returned html page for the business URL:
,++King+Of+Prussia,+PA+19406&mrt=yp&view=text

This method should only be used for toy problems, of course (as opposed to production systems).


Lucene Spell Checker

By Jay Liang

Here's a regular expression gotcha. I was writing a regex pattern this morning to capture the zip code from a US address. A zip code is a 5-digit number followed by an optional 3-4 digit zip4 extension, so here's the regex I wrote:

"\\b\\d{5}(?:[ -+]?\\d{3,4})?\\b"
As it turned out, this regex will NOT match a zip code like "12323-1234". Can you spot the problem?

Well, the title of this post actually gave away the answer. This regex fails to match '12345-1234' because the dash character ('-') in the character class does not mean a literal dash. With the character class [ -+] I meant to say the character ' ' or '-' or '+', but the regex engine interprets it as the range ' ' TO '+', which does not include '-'.

To fix the problem, simply put the '-' character at the very end of the character class, where it is interpreted as a literal dash character:
"\\b\\d{5}(?:[ +-]?\\d{3,4})?\\b"
The moral of the story is that you need to be careful when putting the '-' character in a character class, because it can mean 'range to' or a literal dash depending on its position.
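To make the gotcha concrete, here is a small test harness (the class name and sample inputs are mine, not from the original post). It checks both patterns against a zip+4 string:

```java
import java.util.regex.Pattern;

public class ZipRegexGotcha {
    // Buggy: inside the class, ' -+' is the RANGE from ' ' (0x20) to '+' (0x2B),
    // which does not contain '-' (0x2D).
    public static final Pattern BUGGY = Pattern.compile("\\b\\d{5}(?:[ -+]?\\d{3,4})?\\b");
    // Fixed: a trailing '-' is a literal dash, so the class means ' ', '+', or '-'.
    public static final Pattern FIXED = Pattern.compile("\\b\\d{5}(?:[ +-]?\\d{3,4})?\\b");

    public static boolean fullMatch(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(fullMatch(BUGGY, "12345-1234")); // false: '-' not in the class
        System.out.println(fullMatch(FIXED, "12345-1234")); // true
        System.out.println(fullMatch(FIXED, "12345"));      // true: zip4 part is optional
    }
}
```

(The \b anchors are redundant under matches(), which already requires a full-string match, but they matter when the pattern is used with find() on a whole address line.)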


I just got my hands on Microsoft's latest toy, called Live Mesh Remote Desktop. It turned out to be a very impressive tool!

I installed it on my workstation at work, and right away I could access my work desktop from anywhere through the Live Mesh website (you need to install an ActiveX plugin in IE on the client side). Apparently this thing also goes through firewalls, so there's no need to log in to my company's VPN to access the desktop. I didn't spend too much time with it, but from what I've seen so far, it feels just like a remote desktop session.

The biggest drawback is that it's Windows- and IE-only and requires the ActiveX plugin, which means it's not yet truly accessible from anywhere. The plugin also requires admin rights to install, so you can't use it unless you have admin rights on the client machine or the plugin is already installed there.

Overall, I am impressed by this tool. It's one of the best things Microsoft has put out in a while, and it's FREE! I can't ask for too much at that price :-)


Visual Similarity

Levenshtein edit distance is the de facto algorithm for fuzzy string matching. It's often used in information retrieval systems (i.e. search engines) and data management systems. For example, in a search engine, if the user enters 'DiMarco', the system should not only bring back records that exactly match 'DiMarco', it should also consider records that are similar to the input. If no records match the input exactly, the system should offer these similar records as alternatives. In our example, they could be
- 'De Marcro'
- 'Di Marco'
- 'D'Marco'
- 'De Merco'
- 'Dy Marrcro'
- 'DiMecrro'
- and so on...

Edit distance is also a very popular algorithm for implementing spell checkers, in which the input is compared against a dictionary of words and the closest matching word is returned.

Levenshtein distance in its most basic form is quite rigid, and it might not be a drop-in solution for many applications that require fuzzy matching. In a previous post, I discussed a list of possible improvements to the algorithm. Here is one more, which involves mixing Levenshtein distance with a phonetic algorithm to achieve a more natural fuzzy-match algorithm.

Consider the above example about the word 'DiMarco'. Notice that all the possible similar matches I listed start with the character 'D'. Why? Because to a normal human being, the first character is the most visually significant character when considering whether 2 strings are similar. Strings that do not share the same first character are often considered different strings: 'DiMarco' vs. 'CiMarco' or 'TiMarco', for instance.

Phonetic Similarity

In fuzzy record matching, should records be considered similar ONLY if they share the first character? In most cases, the answer is yes. However, here's the catch. Edit distance only considers how a word is spelled, but humans often also consider how the word is pronounced. In that case, even if the first characters of 2 words differ, if the first character (or two) has the same pronunciation, a human might still consider the 2 words similar.

This is very common in names, because people often use spelling variations as their names. For example, the following pairs of words should also be considered similar even though they differ in the first character:
- Ellan, Allan
- Iga, Eega
- Zandy, Sandy
- Phonda, Fonda
- Phat Joe Burger, Fat Joe Burger
- Kandy Kane, Candy Cane
and so on...

Here is where a phonetic algorithm like Metaphone comes in. The idea is to compare words' pronunciations by first encoding them into phonetic codes. The following are the encoding rules for Metaphone:

RULES = [ # [Regexp, replacement]
  # Remove doubled consonants except g. # [PHP] remove c from regexp.
  [ /([bcdfhjklmnpqrstvwxyz])\1+/, '\1' ],
  [ /^ae/, 'E' ], [ /^[gkp]n/, 'N' ], [ /^wr/, 'R' ], [ /^x/, 'S' ],
  [ /^wh/, 'W' ], [ /mb$/, 'M' ], # [PHP] remove $ from regexp.
  [ /(?!^)sch/, 'SK' ], [ /th/, '0' ], [ /t?ch|sh/, 'X' ],
  [ /c(?=ia)/, 'X' ], [ /[st](?=i[ao])/, 'X' ], [ /s?c(?=[iey])/, 'S' ],
  [ /[cq]/, 'K' ], [ /dg(?=[iey])/, 'J' ], [ /d/, 'T' ],
  [ /g(?=h[^aeiou])/, '' ], [ /gn(ed)?/, 'N' ], [ /([^g]|^)g(?=[iey])/, '\1J' ],
  [ /g+/, 'K' ], [ /ph/, 'F' ], [ /([aeiou])h(?=\b|[^aeiou])/, '\1' ],
  [ /[wy](?![aeiou])/, '' ], [ /z/, 'S' ], [ /v/, 'F' ], [ /(?!^)[aeiou]+/, '' ],
]

These are a bunch of regular-expression replacement rules that replace same-sounding characters with the same encoding character. For instance,

[ /^[gkp]n/, 'N' ] means to replace gn, kn, or pn at the beginning of the string with the character 'N'

[ /^x/, 'S' ] means to replace x at the beginning of the string with the character 'S'

and of course, ph is replaced with 'F' (see rule [ /ph/, 'F' ]).
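The quoted rules are Ruby regexes, but the same substitutions are easy to try from Java. This sketch applies three of the rules above in isolation; note that a real Metaphone encoder applies the whole rule list in order, so these outputs are only intermediate forms:

```java
public class MetaphoneRuleDemo {
    public static void main(String[] args) {
        // Rule [ /^[gkp]n/, 'N' ]: leading gn/kn/pn collapses to 'N'
        System.out.println("knight".replaceFirst("^[gkp]n", "N")); // Night
        // Rule [ /^x/, 'S' ]: a leading x sounds like 'S'
        System.out.println("xavier".replaceFirst("^x", "S"));      // Savier
        // Rule [ /ph/, 'F' ]: ph sounds like 'F'
        System.out.println("phone".replaceAll("ph", "F"));         // Fone
    }
}
```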

Combining Visual and Phonetic Similarity

So, if you want your similarity metric to align more closely with how humans think, it should take both look-alike and sound-alike features into consideration. Given the above 2 observations about edit distance and Metaphone, you can simply do the following in your fuzzy string metric:

def isSimilar(s1, s2)
  1. phonetically encode the first few characters of s1 and s2
  2. return false if the 1st encoded characters differ
  3. return false if the edit distance (normalized to the lengths of s1 and s2) is greater than some threshold
  4. return true otherwise
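Here is a minimal Java sketch of that recipe. The phoneticFirst() encoding below is a toy stand-in for Metaphone (just enough rules to cover the example pairs above), the inputs are assumed non-empty, and the 0.4 threshold is arbitrary:

```java
public class SimilarSketch {
    // Toy "phonetic first character" encoder: maps ph->F, c/q->K, z->S, and
    // collapses leading vowels. A real implementation would run full Metaphone.
    public static char phoneticFirst(String s) {
        String w = s.toUpperCase();
        if (w.startsWith("PH")) return 'F';
        switch (w.charAt(0)) {
            case 'C': case 'Q': return 'K';
            case 'Z': return 'S';
            case 'E': case 'I': return 'A'; // treat leading vowels alike
            default:  return w.charAt(0);
        }
    }

    // Plain unit-cost Levenshtein distance, two-row version.
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1], cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int subst = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(subst, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            int[] t = prev; prev = cur; cur = t;
        }
        return prev[b.length()];
    }

    public static boolean isSimilar(String s1, String s2, double threshold) {
        if (phoneticFirst(s1) != phoneticFirst(s2)) return false;  // step 2
        double avgLen = (s1.length() + s2.length()) / 2.0;
        return editDistance(s1, s2) / avgLen <= threshold;         // steps 3-4
    }

    public static void main(String[] args) {
        System.out.println(isSimilar("Phonda", "Fonda", 0.4));    // true: same sound, small distance
        System.out.println(isSimilar("DiMarco", "TiMarco", 0.4)); // false: first sounds differ
    }
}
```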


Use of EnumMap

By Jay Liang

In JDK 1.5, there's a new collection class called EnumMap. As its name suggests, this is a specialized map designed specifically for enum keys. It seems to be a very little known class; at least I haven't seen it used in any of the 1.5 codebases I have worked with over the last 2-3 years. I have, however, seen many, many places where a HashMap was used instead to store a collection of {enum -> value} mappings.

I didn't study the implementation of EnumMap, but I imagine it uses enum.ordinal() to index the elements, so it's probably faster than indexing by hashcode as in HashMap. Further, unlike HashMap, its iteration order is guaranteed to be the natural order of the enum constants. This is often useful because chances are you will want to print the map, and having a predictable order makes the output more readable.

In short, EnumMap is in many ways a better map data structure than HashMap when the keys are enum constants.

enum ABC{ A, B, C }
Map<ABC, String> map = new EnumMap<ABC, String>(ABC.class);
//is better than
Map<ABC, String> map = new HashMap<ABC, String>();
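A quick demonstration of the iteration-order point (the class and method names here are mine):

```java
import java.util.EnumMap;
import java.util.Map;

public class EnumMapDemo {
    enum ABC { A, B, C }

    public static String keyOrder() {
        Map<ABC, String> m = new EnumMap<ABC, String>(ABC.class);
        m.put(ABC.C, "third");   // inserted out of order on purpose
        m.put(ABC.A, "first");
        m.put(ABC.B, "second");
        // EnumMap iterates in the enum's declaration order,
        // regardless of insertion order.
        return m.keySet().toString();
    }

    public static void main(String[] args) {
        System.out.println(keyOrder()); // [A, B, C]
    }
}
```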


Recently I needed to build something that is able to recognize and resolve US city aliases, i.e. 'Phila, PA' should resolve to 'PHILADELPHIA, PA', and 'MANAYUNK, PA' should also resolve to 'PHILADELPHIA, PA' (Manayunk is a neighborhood in Philadelphia, PA). The USPS website has a web lookup tool where you can put in a zip code and it will show you the zip's corresponding city and state, plus a list of alternative city names. For example, the outputs of are

Actual City name in 19148
Acceptable City names in 19148
Not Acceptable

By harvesting this information from USPS, you can build a map of alternative city name -> actual city name; with this map, it takes one lookup to resolve a city alias to the actual city. I needed this data, but USPS does not have it available for download. So I wrote a little script that basically executes 'curl$code' for every zip and extracts the information from the returned html.

I put the data up at, feel free to download it if you are also interested. The data is in the following format:

$zip $actual_city $actual_state $pipe_delimited_alternative_cities

Now I need to build the city alias map. The map needs to be a map of
{ state -> { city alias -> actual city } }
as such:

PA -> { ... }
NY -> { ... }
(other states...)

The data set is not very large, so I wanted to put the entire map in memory for fast lookups. However, since Java will create a new string for each key and value in the map, and since there are many repeated values, it would be very wasteful space-wise to simply throw everything into a hash map as is.

To reduce the memory consumption of a dictionary data structure, people usually use a trie to compress the data into a tree-like form. The city alias data set is pretty small (about 1MB raw), so I think using a trie would be overkill.

There's another quick and easy way to reduce the memory footprint, which happened to work very well for my case: simply intern the key and value strings before putting them in the hash map. Equal strings intern to the same instance, so the map uses the same amount of memory no matter how many occurrences of 'PHILADELPHIA' we need to put in it.

I ran a quick profiling session just to see how big the difference is. Here's the result for the city alias dataset:

with interning - 450KB
without interning - 1.55MB
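The identity behind the savings is easy to check: String.intern() is guaranteed to return the single pooled instance for any given contents.

```java
public class InternDemo {
    public static void main(String[] args) {
        // Two distinct String objects with equal contents
        String a = new String("PHILADELPHIA");
        String b = new String("PHILADELPHIA");
        System.out.println(a == b);                   // false: separate heap objects
        // After intern(), both refer to the one pooled instance, so a map
        // holding thousands of repeated city names stores each name once.
        System.out.println(a.intern() == b.intern()); // true
    }
}
```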


In the 2nd edition of his book Effective Java, Joshua Bloch suggests a new preferred way to implement singletons in Java, replacing the traditional approach of using a private constructor, which he says has some problems (I am not going to spoil all the fun here; you need to read his book to find out what they are):

public class Foo{
    private static final Foo INSTANCE = new Foo();
    private Foo(){}
    public static Foo getInstance(){ return INSTANCE; }
}

the new preferred way to implement singletons in Java is now (since 1.5) to use an enum:

public enum Foo {
    INSTANCE;
    public void bar() { ... }
}
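A self-contained sketch of the enum singleton in use; the class name and counter field are mine, not from the book:

```java
public enum ElvisSingleton {
    INSTANCE;

    private int counter;

    public int increment() { return ++counter; }

    public static void main(String[] args) {
        // Every reference to INSTANCE is the one object the JVM created,
        // so state changes through one reference are visible through another.
        ElvisSingleton a = ElvisSingleton.INSTANCE;
        ElvisSingleton b = ElvisSingleton.INSTANCE;
        System.out.println(a == b);        // true
        a.increment();
        System.out.println(b.increment()); // 2: shared state
    }
}
```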

So there: we know that when Joshua Bloch says this is the preferred way, people will start preaching about it in code reviews and other techno discussions. With this knowledge you can now be that (annoying :-D) guy who advocates it.


Levenshtein Distance (aka Edit Distance)

In Apache commons-lang, there's an implementation of the Levenshtein distance algorithm (aka edit distance in many textbooks). That implementation is based on an article on edit distance by Michael Gilleland. Edit distance is a fairly well understood algorithm (case in point: it's in commons-lang), so I am not going to discuss the basics here. If you need more details, read and understand Gilleland's article first. This article is about improving on the basics and creating something better than the implementation in commons-lang.

A few days ago, I was involved in a discussion about how to apply fuzzy matching in our search algorithms. One of the topics that came up was how to apply different weightings to different character edits. For example, using StringUtils.getLevenshteinDistance, the strings 'newyork' and 'newyonk' both come out at distance 1 from the normal spelling 'new york'. This is fine, but we understand that 'newyork' is semantically closer to 'new york' than 'newyonk' is, so edit distance would be more accurate if it returned a distance of <1 for the 'newyork' case. The idea is to lower the weighting on certain character edits, such as edits on space characters or edits on punctuation characters at the end of the string ('new york' vs. 'new york!', for example). To do that, we need to modify commons-lang's implementation to let us inject different weightings into its distance calculation.

We can do better

Since we are looking to modify the implementation, why not throw in a few more improvements while we are at it. Below is the list of 'Possible Improvements' from Wikipedia's Levenshtein distance article; we are going to address some of them in our improved edit distance algorithm.

(From wikipedia)

Possible improvements to this algorithm include:

  • We can adapt the algorithm to use less space, O(m) instead of O(mn), since it only requires that the previous row and current row be stored at any one time.
  • We can store the number of insertions, deletions, and substitutions
    separately, or even the positions at which they occur, which is always j.
  • We can normalize the distance to the interval [0,1].
  • If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.[2]
  • We can give different penalty costs to insertion, deletion and
    substitution. We can also give penalty costs that depend on which
    characters are inserted, deleted or substituted.
  • The initialization of d[i,0] can be moved inside the main outer loop.
  • By initializing the first row of the matrix with 0, the algorithm can be used for fuzzy string search of a string in a text [3].
    This modification gives the end-position of matching substrings of the
    text. To determine the start-position of the matching substrings, the
    number of insertions and deletions can be stored separately and used to
    compute the start-position from the end-position [4]
  • This algorithm parallelizes poorly, due to a large number of data dependencies. However, all the cost values can be computed in parallel, and the algorithm can be adapted to perform the minimum function in phases to eliminate dependencies.
  • By examining diagonals instead of rows, and by using lazy evaluation, we can find the Levenshtein distance in O(m (1 + d)) time (where d is the Levenshtein distance), which is much faster than the regular dynamic programming algorithm if the distance is small. [5]

Introducing Brew Edit Distance

Before we start hacking away, though, let's look at what is already available.

On CPAN, there's a variation of the edit distance algorithm named 'Text::Brew'. This is a variation based on an article by Chris Brew at Ohio State University (read his article before reading on if you don't already know Brew's edit distance algorithm). The idea there is that instead of just maintaining a matrix of distances, we maintain (at each cell) a TraceBack data structure, which is just a tuple of {cost, previous edit move}. This allows us to reconstruct the edit path at the end by following the final TraceBack back to the initial point. For example, after the edit distance calculation is completed for the strings 'abc' and 'abcd', you get the edit path [match, match, match, insert] and the edit distance 1, because to transform 'abc' into 'abcd' we only need to insert the character 'd'. This is great because we can review exactly what edit path was used in the edit distance calculation.

The Text::Brew implementation (in Perl) and the implementation in Brew's article (in Ruby) use a full matrix. That is not necessary, because at any point the algorithm only requires the previous row and the current row for the distance calculation (see improvement item #1 above). In commons-lang, the StringUtils.getLevenshteinDistance function maintains 2 rows and swaps them at each outer loop iteration, so it only uses O(min(m, n)) memory. (It also always transforms from the shorter string to the longer string, so the rows are only as long as the shorter string.) We will use a hybrid of Text::Brew and StringUtils.getLevenshteinDistance as a starting point.

#d1 and d2 are arrays of length n+1
for i below m
    d1(0) = (delCost, DEL, nil) #initialize the first cell of the row here
    for j below n
        if string1(i) == string2(j)
            subst = 0
        else
            subst = substCost
        d1(j+1) = best((subst, MATCH, d2(j)),
                       (delCost, DEL, d2(j+1)),
                       (insCost, INS, d1(j)))
    swap(d1, d2) #swap the 2 rows

The Cost Function

In 'Text::Brew', you can adjust the insert, delete, and substitution costs, but these costs do not take the character being edited into account. For example, inserting 'a' is weighted the same as inserting 's', which we know is not right for English, because people often leave out 's' in misspellings but rarely leave out vowels. Furthermore, the cost of an edit does not take into account where the edit takes place. For example, insertions at the end of a string are generally considered less significant than insertions at the beginning or middle, because of abbreviations and acronyms, i.e. 'Jonathan' vs. 'Jon' compared with 'Jonathan' vs. 'han'. (Be careful with this, though, because some abbreviations happen at the beginning of strings, such as 'drew' for 'Andrew' and 'Jersey' for 'New Jersey', so inserting at the beginning may also be insignificant. Isn't English wonderful?) Therefore, we need to modify the code so that we can inject arbitrary cost functions based on our application's needs.

This is easy to do: instead of hard-coding the cost values, we introduce a cost function into our algorithm. Notice that the cost function takes 2 arguments into consideration: the edit operation (a tuple of {edit type, the character being edited}) and the location at which the edit occurs.
enum Move{
    MATCH, INSERT, DELETE, SUBSTITUTE
}

class Edit{
    public static final Edit INITIAL = new Edit(null, '0');
    Move move;
    char character;
    char subCharacter; //only set in the Substitute case
    //constructors omitted here; see the full source at the end
}

abstract class CostFunction{
    public abstract float getEditCost(Edit edit, int index);
}

This way we can assign whatever cost function we desire for any given edit operation, based on the specific application's needs. For example, if we want to ignore edits on space characters, we can create a cost function like this:

CostFunction ignoreSpace =
    new CostFunction(){
        public float getEditCost(Edit edit, int index) {
            return edit.move == Move.MATCH || edit.character == ' ' ? 0 : 1.0f;
        }
    };
With the cost function object, we can use it in our distance calculations:

    for (i = 1; i <= m; i++) {
        from_i = fromString.charAt(i-1);
        edit = new Edit(Move.DELETE, from_i);
        d[0] = new TraceBack(p[0].cost + costFunction.getEditCost(edit, i), edit, null);
        for (j = 1; j <= n; j++) {
            to_j = toString.charAt(j-1);
            edit = new Edit(to_j == from_i ? Move.MATCH : Move.SUBSTITUTE, from_i, to_j);
            TraceBack sub = new TraceBack(p[j-1].cost + costFunction.getEditCost(edit, i), edit, p[j-1]);
            edit = new Edit(Move.DELETE, from_i);
            TraceBack del = new TraceBack(p[j].cost + costFunction.getEditCost(edit, i), edit, p[j]);
            edit = new Edit(Move.INSERT, to_j);
            TraceBack insert = new TraceBack(d[j-1].cost + costFunction.getEditCost(edit, j), edit, d[j-1]);
            d[j] = best(sub, insert, del);
        }

        // swap the current row and the 'previous row' for the next iteration
        _d = p; p = d; d = _d;
    }

By the way, the TraceBack object is the same trace back data structure described in Brew's article:

class TraceBack {
    float cost;
    Edit edit;
    TraceBack prevTraceBack;

    TraceBack(float cost, Edit edit, TraceBack prevTraceBack) {
        this.cost = cost;
        this.edit = edit;
        this.prevTraceBack = prevTraceBack;
    }
}

With the modified Brew edit distance algorithm and the cost function, we have just covered possible improvement items #1, #2, #5 and #6 from the list above. Now we are going to tackle a few more (they will be going down fast).

Improvements that Involve the Use of Thresholds

When using edit distance algorithms, people usually do not care what the actual distance comes out to be. Rather, they are interested in whether the distance is greater than some threshold value. For example, in fuzzy string matching, it suffices to know that 'new york' has an edit distance greater than 3 from 'new jersey' for the pair to be considered a non-match. If your application has a threshold and your edit distance algorithm only needs to return a yes/no answer about whether the 2 input strings are within that threshold, you can add a few improvements that leverage the threshold value to improve time efficiency.

In the TraceBack object, the member field 'cost' holds the current minimum cost along that path. So if, after any inner loop scan, every cell in the current row holds a value greater than your threshold, we know immediately that the final edit distance will be greater than the threshold and we can stop right away (because continuing to scan downward will not reduce those values).

There are a few ways to implement this. Here I am trying to use as little code as possible, so I am just going to assume that there's a variable named 'threshold' holding the threshold value, and have the algorithm throw a runtime exception as soon as every remaining path exceeds the threshold.

  private float threshold = Float.MAX_VALUE; //change it if you have a threshold

    for (i = 1; i <= m; i++) {
        from_i = fromString.charAt(i-1);
        edit = new Edit(Move.DELETE, from_i);
        d[0] = new TraceBack(p[0].cost + costFunction.getEditCost(edit, i), edit, null);
        for (j = 1; j <= n; j++) {
            to_j = toString.charAt(j-1);
            edit = new Edit(to_j == from_i ? Move.MATCH : Move.SUBSTITUTE, from_i, to_j);
            TraceBack sub = new TraceBack(p[j-1].cost + costFunction.getEditCost(edit, i), edit, p[j-1]);
            edit = new Edit(Move.DELETE, from_i);
            TraceBack del = new TraceBack(p[j].cost + costFunction.getEditCost(edit, i), edit, p[j]);
            edit = new Edit(Move.INSERT, to_j);
            TraceBack insert = new TraceBack(d[j-1].cost + costFunction.getEditCost(edit, j), edit, d[j-1]);
            d[j] = best(sub, insert, del);
        }

        boolean stillInRange = false;
        for (int x = 0; x <= n; x++) { //check whether some path is still within the threshold
            if (d[x].cost < threshold) {
                stillInRange = true; break;
            }
        }
        if (!stillInRange)
            throw new RuntimeException("Threshold reached");

        // swap the current row and the 'previous row' for the next iteration
        _d = p; p = d; d = _d;
    }

This improvement is somewhat related to possible improvement item #4. The above uses the threshold to stop us from scanning down the matrix, while the idea in item #4 is to avoid scanning all the way across during the inner loop. Say you have an integer threshold k and are only interested in whether the distance is greater than it. The inner loop only needs to calculate the distance for cells within k cells of the diagonal; the rest of the cells can simply be filled with Float.MAX_VALUE, because they are guaranteed to have a distance greater than k (being k cells away from the diagonal means k delete edits away). The following picture should illustrate the idea:

Please note that item #4 doesn't fit well with our implementation, because we allow customized cost functions, so we cannot tell whether cells that are k away from the diagonal in fact have a distance greater than k (do you see why?). If your edit distance implementation has a static delete cost, you can add this improvement; in our case, we cannot.

All right, there goes item #4; only a few more to go.

Normalized Distance

Item #3 says we can normalize the distance to the interval [0,1]. This is a simple idea. It's very common to normalize the edit distance by the lengths of the input strings. For example (assuming all edit costs are 1), the strings 'ab' and 'cd' are at distance 2. However, the strings 'john' and 'johnny' are also at distance 2. This does not make sense, because we know 'ab' and 'cd' are two totally different strings while 'john' and 'johnny' are very close. So, instead of using the raw edit distance, we use a so-called normalized distance when comparing strings:

normalized distance = edit distance / average(length(a), length(b))

Depending on your application, you might instead want to use the following function, which generally gives a lower normalized distance than the above (because max(a,b) >= avg(a,b)):

normalized distance = edit distance / max(length(a), length(b))

Of course, you can use any other function to normalize the edit distance, depending on the specifics of your application.
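As a quick sanity check of the two formulas (the method names here are mine, and unit edit costs are assumed):

```java
public class NormalizedDistance {
  // edit distance divided by the average of the two string lengths
  static float byAverage(float dist, int lenA, int lenB) {
    return dist / ((lenA + lenB) / 2.0f);
  }

  // edit distance divided by the max of the two string lengths
  static float byMax(float dist, int lenA, int lenB) {
    return dist / Math.max(lenA, lenB);
  }

  public static void main(String[] args) {
    System.out.println(byAverage(2, 2, 2)); // 'ab' vs 'cd' -> 1.0, totally different
    System.out.println(byAverage(2, 4, 6)); // 'john' vs 'johnny' -> 0.4, quite close
    System.out.println(byMax(2, 4, 6));     // lower than the average-based value
  }
}
```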

Everything Else

We just addressed items #1 to #6 off the list of possible improvements. As for items #7 to #9, I understand what they are but haven't tried them. I think parallelizing the algorithm is a very good thing to try, although I question whether the gain justifies the parallelization overhead. I can only talk about them after I find some time to test them out.

The source code

Here is the full source code of everything discussed above; the main function runs a small example.


import java.util.LinkedList;
import java.util.List;

public class BrewEditDistance{

  public static void main(String[] args) {
    String s = "New york", t = "newyork";
    BrewEditDistance ed = new BrewEditDistance(new CostFunction(){
      public float getEditCost(Edit edit, int index) {
        return edit.move == Move.MATCH || edit.character == ' ' ? 0 : 1.0f;
      }
    });
    TraceBack trace = ed.brewEditDistance(s, t);
    EditResults res = constructResultFromTraceBack(trace);
    System.out.println(res.getTotalCost());
    System.out.println(res.getEditPath());
  }

  static enum Move{
    MATCH, SUBSTITUTE, INSERT, DELETE
  }

  static class Edit{
    public static final Edit INITIAL = new Edit(null, '0');
    Move move;
    char character;
    char subCharacter; //only available in Substitute case
    public Edit(Move move, char character) {
      this.move = move;
      this.character = character;
    }
    public Edit(Move move, char character, char subCharacter) {
      this.move = move;
      this.character = character;
      this.subCharacter = subCharacter;
    }
    public String toString() {
      return move + " " + character + (move==Move.SUBSTITUTE ? " with "+subCharacter : "");
    }
  }

  abstract static class CostFunction{
    public abstract float getEditCost(Edit edit, int index);
  }

  static class EditResults{
    private float totalCost;
    private List<Edit> editPath = new LinkedList<Edit>();
    public float getTotalCost() {
      return totalCost;
    }
    public List<Edit> getEditPath() {
      return editPath;
    }
    public void setTotalCost(float totalCost) {
      this.totalCost = totalCost;
    }
  }

  static class TraceBack {
    float cost;
    Edit edit;
    TraceBack prevTraceBack;
    public TraceBack(float cost, Edit edit, TraceBack prevTraceBack) {
      this.cost = cost;
      this.edit = edit;
      this.prevTraceBack = prevTraceBack;
    }
  }

  static EditResults constructResultFromTraceBack(TraceBack traceBack){
    EditResults ret = new EditResults();
    ret.setTotalCost(traceBack.cost);
    TraceBack t = traceBack;
    while(t.edit != Edit.INITIAL){
      ret.editPath.add(0, t.edit);
      t = t.prevTraceBack;
    }
    return ret;
  }

  public BrewEditDistance(CostFunction costFunction){
    this.costFunction = costFunction;
  }

  private CostFunction costFunction;
  private float threshold = Float.MAX_VALUE;

  public TraceBack brewEditDistance(String fromString, String toString) {
    if (toString == null || fromString == null) {
      throw new IllegalArgumentException("Strings must not be null");
    }
    //Apache commons-lang's implementation always transforms from t to s, which is very counter-intuitive.
    //In their case it doesn't matter because it doesn't track edits and all edit costs are the same,
    //but why the heck would anyone in the right mind want to call editDistance(s, t) and have it compute as
    //transform t to s? Here I just substitute them back to what they are supposed to mean.
    int n = toString.length();
    int m = fromString.length();

    if (n == 0) { //toString is empty, so we are doing all deletes
      return constructTraceBack(fromString, costFunction, Move.DELETE);
    } else if (m == 0) { //fromString is empty, so we are doing all inserts
      return constructTraceBack(toString, costFunction, Move.INSERT);
    }

    //(see original apache commons-lang getLevenshteinDistance())
    //we cannot do the string-swapping memory optimization any more because insert/delete costs can be different;
    //swapping the strings will tamper with the final edit cost and also screw up the edit path.
    //However, in many applications you should have 2 similar-length strings; otherwise
    //you can skip edit distance and just consider 2 strings that vary greatly in length
    //to be |s1.length - s2.length| apart, which should be a good enough approximation
    TraceBack p[] = new TraceBack[n+1]; //'previous' cost array, horizontally
    TraceBack d[] = new TraceBack[n+1]; // cost array, horizontally
    TraceBack _d[]; //placeholder to assist in swapping p and d

    // indexes into strings toString and fromString
    int j; // iterates through toString
    int i; // iterates through fromString

    char from_i; // ith character of fromString
    char to_j; // jth character of toString

    Edit edit;
    p[0] = new TraceBack(0, Edit.INITIAL, null);
    for (j = 0; j < n; j++) {
      TraceBack prev = p[j];
      edit = new Edit(Move.INSERT, toString.charAt(j));
      p[j+1] = new TraceBack(prev.cost + costFunction.getEditCost(edit, j), edit, prev);
    }

    for (i = 1; i <= m; i++) {
      from_i = fromString.charAt(i-1);
      edit = new Edit(Move.DELETE, from_i);
      d[0] = new TraceBack(p[0].cost + costFunction.getEditCost(edit, i), edit, p[0]); //chain to p[0] so the trace back terminates at INITIAL
      for (j = 1; j <= n; j++) {
        to_j = toString.charAt(j-1);
        edit = new Edit(to_j==from_i ? Move.MATCH : Move.SUBSTITUTE, from_i, to_j);
        TraceBack sub = new TraceBack(costFunction.getEditCost(edit, i), edit, p[j-1]);
        edit = new Edit(Move.DELETE, from_i);
        TraceBack del = new TraceBack(costFunction.getEditCost(edit, i), edit, p[j]);
        edit = new Edit(Move.INSERT, to_j);
        TraceBack insert = new TraceBack(costFunction.getEditCost(edit, j), edit, d[j-1]);
        d[j] = best(sub, insert, del);
      }
      boolean stillInRange = false;
      for(int x = 0; x <= n; x++){ //check to see if there still exists a path that is within the threshold range
        if(d[x].cost < threshold){
          stillInRange = true; break;
        }
      }
      if(!stillInRange){
        throw new RuntimeException("Threshold reached");
      }
      // copy current distance counts to 'previous row' distance counts
      _d = p;
      p = d;
      d = _d;
    }

    // our last action in the above loop was to switch d and p, so p now
    // actually has the most recent cost counts
    return p[n];
  }

  private static TraceBack constructTraceBack(String s, CostFunction costFunction, Move move){
    TraceBack trace = new TraceBack(0f, Edit.INITIAL, null);
    for(int i = 0; i < s.length(); i++){
      Edit edit = new Edit(move, s.charAt(i));
      trace = new TraceBack(trace.cost + costFunction.getEditCost(edit, i), edit, trace); //accumulate the cost along the chain
    }
    return trace;
  }

  private static TraceBack best(TraceBack substitute, TraceBack insert, TraceBack delete){
    float subCost = substitute.cost + substitute.prevTraceBack.cost;
    float intCost = insert.cost + insert.prevTraceBack.cost;
    float delCost = delete.cost + delete.prevTraceBack.cost;

    TraceBack ret = substitute;
    float bestCost = subCost;
    if(intCost < bestCost){
      bestCost = intCost;
      ret = insert;
    }
    if(delCost < bestCost){
      bestCost = delCost;
      ret = delete;
    }
    ret.cost = bestCost;
    return ret;
  }
}

Performance Considerations

After I finished implementing the above algorithm, I couldn't stop thinking about the performance difference between my implementation and the commons-lang implementation. It's obvious that the BrewEditDistance algorithm will be slower because it keeps track of more variables and performs more calculations, but exactly how much slower?

So I wrote a little function to time the 2 implementations:

  public static void main(String[] args) {
    BrewEditDistance ed = new BrewEditDistance(new CostFunction(){
      public float getEditCost(Edit edit, int index) {
        return edit.move == Move.MATCH ? 0 : 1.0f;
      }
    });
    int N = 1000000;
    long start = System.currentTimeMillis();
    for(int i = 0; i < N; i++){
      StringUtils.getLevenshteinDistance("fromString", "toString"); //org.apache.commons.lang.StringUtils
    }
    System.out.println((System.currentTimeMillis() - start)/1000.0);
    start = System.currentTimeMillis();
    for(int i = 0; i < N; i++){
      ed.brewEditDistance("fromString", "toString");
    }
    System.out.println((System.currentTimeMillis() - start)/1000.0);
  }

The result on my quad-core Q6600@2.40GHz is

getLevenshteinDistance = 0.98 SECONDS
brewEditDistance = 9.639 SECONDS

As you can see, brewEditDistance is about 10 times slower. This really bothered me because I was expecting it to be about 4 times slower, since brewEditDistance keeps track of about 4 times as many variables ({cost, edit type, character, prevTraceBack}) as getLevenshteinDistance (just {cost}) does.

So what went wrong? Well, I couldn't really tell, so I did some profiling on brewEditDistance using the YourKit profiler. Here's the result:

Okay... It looks like the algorithm was spending a lot of time constructing the TraceBack and Edit objects (most of them were thrown away quickly after construction).

We can fix this by initializing the set of TraceBack objects up front and re-using them as if they were normal memory cells in an array. In addition, we need to flatten out the TraceBack object so that it's easier to work with.

  abstract static class CostFunction{
    public abstract float getEditCost(Move move, char character, int index);
  }

  static class TraceBack {
    float cost = 0;
    Move move = null;
    char character = '0';
    char subCharacter = '0';
    TraceBack prevTraceBack = null;
  }

Since we are re-using TraceBack objects, we can no longer use the row swapping trick (see item #1). Instead, we have to have the full matrix. We also make it big enough so that we only need to initialize it once.

  private TraceBack d[][] = new TraceBack[100][100];
  { //per instance initialization
    for(int i = 0; i < 100; i++){
      for(int j = 0; j < 100; j++){
        d[i][j] = new TraceBack();
      }
    }
  }

Now, with these two modifications in place, we also need to change the algorithm back to its very basic form.

    // Step 2
    for (i = 1; i <= n; i++) {
      TraceBack prev = d[i-1][0];
      char c = s.charAt(i-1);
      d[i][0].cost = prev.cost + costFunction.getEditCost(Move.DELETE, c, i);
      d[i][0].prevTraceBack = prev;
      d[i][0].character = c;
      d[i][0].move = Move.DELETE;
    }

    for (j = 1; j <= m; j++) {
      TraceBack prev = d[0][j-1];
      char c = t.charAt(j-1);
      d[0][j].cost = prev.cost + costFunction.getEditCost(Move.INSERT, c, j);
      d[0][j].prevTraceBack = prev;
      d[0][j].character = c;
      d[0][j].move = Move.INSERT;
    }

    // Step 3
    TraceBack sub = new TraceBack();
    TraceBack insert = new TraceBack();
    TraceBack del = new TraceBack();
    TraceBack tmp = null;
    for (i = 1; i <= n; i++) {

      s_i = s.charAt(i - 1);

      // Step 4
      for (j = 1; j <= m; j++) {

        t_j = t.charAt(j - 1);

        // Step 5
        sub.move = s_i == t_j ? Move.MATCH : Move.SUBSTITUTE;
        sub.prevTraceBack = d[i-1][j-1];
        sub.cost = sub.prevTraceBack.cost + costFunction.getEditCost(sub.move, s_i, i);

        insert.move = Move.INSERT;
        insert.prevTraceBack = d[i][j-1];
        insert.cost = insert.prevTraceBack.cost + costFunction.getEditCost(insert.move, t_j, j);

        del.move = Move.DELETE;
        del.prevTraceBack = d[i-1][j];
        del.cost = del.prevTraceBack.cost + costFunction.getEditCost(del.move, s_i, i);

        // Step 6
        tmp = sub;
        if(insert.cost < tmp.cost){ tmp = insert; }
        if(del.cost < tmp.cost){ tmp = del; }
        d[i][j].cost = tmp.cost;
        d[i][j].prevTraceBack = tmp.prevTraceBack;
        d[i][j].move = tmp.move;
        d[i][j].subCharacter = t_j;
        d[i][j].character = s_i;
      }
    }

    // Step 7
    return d[n][m];

With these changes, I timed the same test once again and I got

getLevenshteinDistance = 0.993 SECONDS
brewEditDistance = 4.416 SECONDS

This time brewEditDistance is only about 4.5 times slower, which is in line with what I was expecting.

So, if you want a good object-oriented design and can take some performance hit, use the normal implementation. Otherwise, if performance is a concern for you, update the algorithm so that it re-uses TraceBack objects instead of constructing new ones during the computation. Use your own judgment to decide which design tradeoff you want to make.


Also many thanks to the smart folks at Health Market Science


If you use subversion, you probably know that you can enable a commit email hook on the server so that svn sends emails out on every commit. You can also (I think, haven't tried this) configure it by setting hook:commit-email properties on the folders of interest.

Here at my new job, we also use subversion for source control. However, the subversion server is not configured to send out commit emails. I went to talk to our svn admin and he told me that our svn server is too old, so we are not gonna get that any time soon.

Okay... fine, that's not a problem. I figured I could simulate the same feature on the client side with a little bit of scripting. The general idea is this: check out a project from svn onto the local drive, have the script do a diff against HEAD every 5 minutes, and then have it send out an email whenever there's a non-empty diff.

I googled around a little bit but couldn't find such a script, so I went ahead and wrote one myself in groovy. The script works as follows:

1. Check out the project you want to add the hook to. Then, every 5 minutes, execute 'svn log -r BASE:HEAD' on that project to get a list of revisions and logs. The output of svn log (with the --incremental option) looks something like:

r183 | zl25-drexel | 2008-06-18 22:44:32 -0400 (Wed, 18 Jun 2008) | 1 line

that's good enough for now
r184 | zl25-drexel | 2008-06-18 22:57:06 -0400 (Wed, 18 Jun 2008) | 1 line

r185 | zl25-drexel | 2008-06-18 22:59:11 -0400 (Wed, 18 Jun 2008) | 1 line

2. Parse these logs to get a set of revision numbers (and the author, date, etc). Then execute 'svn diff -r $rev1:$rev2' for each pair of adjacent revisions. For example, if the revisions were r182, r183, and r184, then the script needs to do a diff for r182:r183 and r183:r184. For each diff, send out an email containing the diff output.

3. After all the diffs are performed, execute 'svn up -r $lastrev' to bring your local copy to the latest revision that you checked against (note: do not update to HEAD because there might have been commits while the script was sending out emails).
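The adjacent-revision pairing in step 2 can be sketched like this (a standalone Java illustration with made-up revision numbers; the actual script below does this in groovy):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AdjacentDiffs {
  // Build one 'svn diff' command per adjacent pair of revisions, oldest first.
  static List<String> diffCommands(List<String> revs, String path) {
    List<String> cmds = new ArrayList<String>();
    for (int i = 0; i < revs.size() - 1; i++) {
      cmds.add("svn diff -r " + revs.get(i) + ":" + revs.get(i + 1) + " " + path);
    }
    return cmds;
  }

  public static void main(String[] args) {
    for (String cmd : diffCommands(Arrays.asList("182", "183", "184"), "myproject")) {
      System.out.println(cmd);
    }
    // svn diff -r 182:183 myproject
    // svn diff -r 183:184 myproject
  }
}
```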

Voila! With the above 3 simple steps you get exactly the same functionality as if a commit email hook were enabled on the server.

Here is the script for your viewing pleasure

#!/usr/bin/env groovy
import java.text.SimpleDateFormat
import javax.mail.internet.InternetAddress
import javax.mail.Message
import javax.mail.internet.MimeMessage
import javax.mail.Transport
import javax.mail.Session
import groovy.text.SimpleTemplateEngine

def chill = 5 //min
def emailconfig = [protocol: 'smtps',
                   host    : '',
                   port    : 465, //must be int
                   user    : 'XXX', password: 'XXX']
def projects = [:] //map of project name -> local working copy path (entries elided)

def subject = "[<%=rev%>] [SVN:<%=name%>] [Author:<%=author%>]"
def body = """<%=name%> revision <%=rev%> report
Author: <%=author%>
Date: <%=date%>

Log Message:
<%=log%>

<%=fulldiff%>
"""

def engine = new SimpleTemplateEngine()
def BODY = engine.createTemplate(body)
def SUBJECT = engine.createTemplate(subject)
def ps = new PrintStream(
    new BufferedOutputStream(new FileOutputStream('svndiff.log')))

def printLog(def msg, def ps){
  def df = new SimpleDateFormat()
  println "[${df.format(new Date())}] $msg"
  ps.println "[${df.format(new Date())}] $msg"
}

def parseLog(def log){
  try{
    def lines = []
    new StringReader(log).eachLine{ line ->
      lines << line
    }
    if(lines.size < 2) return null
    def ret = [:]
    def items = lines[1].split(/[|]/)
    ret['rev'] = items[0].trim().substring(1)
    ret['author'] = items[1]
    ret['date'] = items[2]
    ret['log'] = log
    return ret
  }catch (Exception e) {
    return null
  }
}

def getBASErev(def path){
  def base = 'BASE'
  "svn info -r BASE $path".execute().in.eachLine{
    def m = it =~ /Revision:\s+(\d+)/
    if(m.matches()){ base = m[0][1] }
  }
  return base
}

while(true){
  try{
    projects.each{name, path ->
      printLog("processing $name", ps)
      def revisions =
          "svn log -r BASE:HEAD --limit 30 --incremental $path"
              .execute().text.split(/-{72}/) //split on the dashed separator lines
      def BASErev = getBASErev(path)
      def lastrev = null
      def logs = [['rev':'BASE']]
      revisions.each{rev ->
        def parsed = parseLog(rev)
        if (parsed != null) logs << parsed
      }
      if(logs.size == 1){
        printLog('already up to date, nothing to process', ps); return
      }
      for(int i = 0; i < logs.size - 1; i++){
        def rev1 = logs[i], rev2 = logs[i+1]
        if(rev2['rev'] == BASErev){ continue } //no need to diff BASE
        def diff =
            "svn diff -r ${rev1['rev']}:${rev2['rev']} $path".execute().text
        def s = SUBJECT.make([
            rev: rev2['rev'], name: name,
            author: rev2['author'] ]).toString()
        def b = BODY.make([
            name: name, rev: rev2['rev'],
            author: rev2['author'], date: rev2['date'],
            log: logs[i+1]['log'], fulldiff: diff ]).toString()
        printLog("sending $s ...", ps)
        sendEmail(s, b, emailconfig)
        lastrev = rev2['rev']
      }
      // bring the local copy to the last rev we have diffed
      if(lastrev != null){
        printLog("svn up -r $lastrev $path", ps)
        "svn up -r $lastrev $path".execute()
      }
    }
  }catch(Exception e){
    e.printStackTrace()
  }
  printLog('\nChilling out for 5 min ...\n', ps)
  //chill out for 5 min
  sleep chill*60*1000
}

def sendEmail(def subject, def body, def config){
  Properties props = new Properties()
  props.put("mail.transport.protocol", config.protocol)
  props.put("mail.smtps.auth", "true")

  Session session = Session.getDefaultInstance(props)
  Transport transport = session.getTransport()

  MimeMessage message = new MimeMessage(session)
  message.setSubject(subject)
  message.setContent(body, 'text/plain')
  message.addRecipient(Message.RecipientType.TO,
      new InternetAddress(''))

  transport.connect(, config.port,
      config.user, config.password)
  transport.sendMessage(message, message.getAllRecipients())
  transport.close()
}

A few things should be noted. First of all, the script uses JavaMail, which is not part of the JDK, so make sure you have it available in groovy's classpath (just throw the jars into $GROOVY_HOME/lib). I am using gmail's smtp server just for illustration purposes; it's probably not a good idea to have your company's svn commit messages all go through gmail, as it might get you into trouble. You should use your company's internal smtp server for internal projects. The script works as long as you have read access to a repository. So if you are not a developer of a project, say apache commons, but still want to track its revisions through commit emails, you can point this script at their repository and receive emails for every commit!

There are many features in Groovy (this language makes me so happy :-) ) that make scripting extremely easy, and the above script showcases quite a few of them. Imagine how much code you would need to write if you had to do this in Java. I would never want to write this in Java.


Eclipse Code Template

Eclipse's Java editor allows you to define code templates. You can use them to code-complete things that you type all the time, such as getting a logger, logger.debug calls, etc.

Go to Window->Preferences->Java->Editor->Templates, and put in the following templates:


private static final Log logger =
LogFactory.getLog( ${enclosing_type}.class );


if (logger.isDebugEnabled()) {
logger.debug(${Message}, ${exception});
}


logger.error(${Message}, ${exception});

${Message}, ${exception});


logger.warn(${Message}, ${exception});


private static final ${type} ${name} = new ${type} ${cursor};

After you have these templates set up, type the first few characters of a template's name and hit ctrl-space, and the editor will complete the template. For example, typing getlog, ctrl-space, and enter will declare a logger variable named logger.


If you use Cygwin and you happened to also include the svn package when you installed Cygwin, you won't be able to release your project, because the Cygwin svn is broken.

You will get error messages similar to

[INFO] ------------------------------------------------------------------------
[INFO] Unable to commit files
Provider message:
The svn command failed.
Command output:
svn: '......' is not a working copy

The solution is to just use a Windows svn client instead of the one that comes with Cygwin. The Cygwin svn is essentially the Linux svn client rebuilt against Cygwin's emulation of the Linux API (I think), which in this case has trouble resolving Windows paths.

The problem is also documented on