Hi recently had to match two sets of data by names, where the names were similar but not quite the same. There were 3,500 of them so I needed to do it automatically. I managed to do a good job using Microsoft Excel and I thought others might benefit from hearing how I did it. For those of you that want a quick answer I:
For those of you who want a blow-by-blow explanation here is what I did:
1. Setting the Scene
To set the scene I had a list of 3,500 hospitals and clinics in a clean form with lots of useful data including latitude and longitude. However it didn’t have information on their usage of certain service, in this case blood tests. That was data was in another older form with rather random names, some spelt differently and some in a different order. It also had many items that I didn’t need (about 1,500) and didn’t want in my data. The table below shows you two entries that should be the same.
|New form Name||Old Form Name|
|ec Duncan Village CHC||Duncan Village Day Hospital|
2. Producing sub-categories (you can skip this step if you don’t have sub-categories in data)
I couldn’t simply match the names as there were often multiple entries with the same name, e.g. MyTown Hospital, MyTown Day Centre, MyTown Clinic, MyTown Prison Hospital etc. Therefore I needed to classify them into different sub-categories otherwise the match would pick the wrong one.
In my case it was quite difficult but hopefully yours is different. However the general rules are:
|Original Old Format Name||Subcategory||Cleaned Name|
|Duncan Village Day Hospital||Day Hospital||Duncan Village|
I then used a Pivot table to see what items fitted the list of sub-categories I had produced and kept adding sub-categories until I had enough. I then used a Excel Table which allowed me to filter by sub-category.
3. Using Microsoft Fuzzy Lookup
First I needed to install Fuzzy Lookup, which was pretty easy, see link to download (note: only works with Excel 2010). You access Fuzzy Lookup by clicking the top menu item of that name, click the square icon and you get a menu rather like a pivot table on the right hand side.
Some things I found out on how to using Fuzzy Lookup:
I produced different input tables for the different sub-categories, e.g. Large Hospitals, District Hospitals, Community Care Centre and Clinics and matched each separately. This stopped the problem of places with the same name getting confused.
Then I ran a fuzzy lookup on each sub-category which gave me a list of data with the matches (or blank for the second table output if the similarity wasn’t above my set threshold) and a very useful similarity column that shows how close it thinks the match is.
I found that playing with the similarity threshold was very important to get the right level of matching. I preferred no match if could be wrong, but your problem might be different to mine. The end result for me was 98% matching for the bigger hospitals going down to 75% for the small clinics, but that was because I didn’t want bad matches.