Recently I needed to build something that is able to recognize and resolve US city alias, i.e. 'Phila, PA' should be resolved to 'PHILADELPHIA, PA' and 'MANAYUNK, PA' will be also 'PHILADELPHIA, PA' (Manayunk is a neighborhood in Philadelphia, PA). The USPS website has a web lookup tool in which you can put in a zip code and it will show you the zip's corresponding city state and also a list of alternative city names. For example, the outputs of http://zip4.usps.com/zip4/zcl_3_results.jsp?zip5=19148 are


Actual City name in 19148
PHILADELPHIA, PA
Acceptable City names in 19148
PHILA, PA
Not Acceptable
PASSYUNK, PA


By harvesting these information from USPS, you can build a map of alternative city name -> actual city name, then using this map it takes one lookup to resolve a city alias to the actual city. I needed these data but USPS does not have them available for download. So I wrote a little script that basically executes 'curl http://zip4.usps.com/zip4/zcl_3_results.jsp?zip5=$code' for every zip and extracted the information from the return html.

I put the data up at http://drop.io/cityalias, feel free to download it if you are also interested in these data. The data is in the following format:

$zip $actual_city $actual_state $pipe_delimited_alternative_cities
19113 PHILADELPHIA, PA=PHILA, PA|LESTER, PA


Now I need to build the city alias map. The map will need to be a map of
{ state -> map of
{city alias -> actual city} }
as such:

PA -> {
PHILA -> PHILADELPHIA
LESTER -> PHILADELPHIA
...
},
NY -> {
MANHATTAN -> NEW YORK
NYC -> NEW YORK
...
},
(other states...)



The data set is not very large so I wanted to put the entire map in memory for fast lookups. However, since Java will create a new string for each key and value in the map and also because they are man repeated values, it will be very wasteful space wise to simply throw everything in a hash map as is.

To reduce memory consumption of dictionary data structure, people usually use a trie to compress the data in tree-like form. The city alias data set is pretty small (about 1MB in raw) so I think using a trie would be an overkill.

There's another quick and easy way to reduce the memory footprint which happened to work very well for my case. That is to just first intern the key and values strings before putting them in the hash map. The same string will be interned to the same memory location therefore the map uses the same amount of memory no matter how many instances of 'PHILADELPHIA' we need to put in the map.

I ran a quick profiling session just to see how big is the difference, here's the result for the city alias dataset:

with interning - 450KB
without interning - 1.55MB