How to match names in Excel where spelling differ

Hi recently had to match two sets of data by names, where the names were similar but not quite the same. There were 3,500 of them so I needed to do it automatically. I managed to do a good job using Microsoft Excel and I thought others might benefit from hearing how I did it. For those of you that want a quick answer I:

  • Found a way to  produce sub-categories to make the matching better (you may not need this)
  • Then matched names using a Microsoft Fuzzy Lookup Add-In to Excel 2010 (link to download)

For those of you who want a blow-by-blow explanation here is what I did:

1. Setting the Scene

To set the scene I had a list of 3,500 hospitals and clinics in a clean form with lots of useful data including latitude and longitude. However it didn’t have information on their usage of certain service, in this case blood tests. That was data was in another older form with rather random names, some spelt differently and some in a different order. It also had many items that I didn’t need (about 1,500) and didn’t want in my data. The table below shows you two entries that should be the same.

New form Name Old Form Name
ec Duncan Village CHC Duncan Village Day Hospital

2. Producing sub-categories (you can skip this step if you don’t have sub-categories in data)

I couldn’t simply match the names as there were often multiple entries with the same name, e.g. MyTown Hospital, MyTown Day Centre, MyTown Clinic, MyTown Prison Hospital etc. Therefore I needed to classify them into different sub-categories otherwise the match would pick the wrong one.

In my case it was quite difficult but hopefully yours is different. However the general rules are:

  1. Find where the category part is and use VLOOKUP with a wildcard “*” format  to match part of the text and put it in a new column called “subcategory”. (see Microsoft documentation on using wildcards)
  2. Use Excel SUBSTITUTE to remove the category from the name and make a new column called “cleaned name”,
    e.g. SUBSTITUTE([Original Name],[String to remove, e.g. Day Hospital],””)

For example:

Original Old Format Name Subcategory Cleaned Name
Duncan Village Day Hospital Day Hospital Duncan Village

I then used a Pivot table to see what items fitted the list of sub-categories I had produced and kept adding sub-categories until I had enough. I then used a Excel Table which allowed me to filter by sub-category.

3. Using Microsoft Fuzzy Lookup

First I needed to install Fuzzy Lookup, which was pretty easy, see link to download (note: only works with Excel 2010). You access Fuzzy Lookup by clicking the top menu item of that name, click the square icon and you get a menu rather like a pivot table on the right hand side.

Some things I found out on how to using Fuzzy Lookup:

  • It only works with Excel Tables. Not really a problem as you can change data into a table using the menu commands Insert->table, but I wondered why it didn’t work until I found that!
  • You select the item(s) to match from the tables using the “Left column” and “Right Column” lists and then press the button between the lists to link them. An entry appears in the Match Columns list below the left/right column lists. Think of the “Left Column” as the original data and the “Right Column” as the possible matching data.
  • You can choose which columns you want in your match output from both input tables, plus some extra columns like similarity (see next bullet point).
  • After you have run it includes a very useful ‘similarity’ number column. This shows how close the match is. A value of 1.0 means they are exactly the same while a lower number shows how far the that the best match it could find is from the original name. I found that very useful for spotting poor matches.
  • You can control quality of the match it makes with the ‘similarity threshold’ control (bottom right). Any match which is below the threshold you set then doesn’t include the data from the second/right table, e.g. you get the original data but blank columns for the possible match.
  • Minor point, but be warned. When you press the Go button on the Fuzzy Lookup the output goes in the cell you have selected – it is easy overwrite existing data if you aren’t careful, but there is an Undo function.

I produced different input tables for the different sub-categories, e.g. Large Hospitals, District Hospitals, Community Care Centre and Clinics and matched each separately. This stopped the problem of places with the same name getting confused.

Then I ran a fuzzy lookup on each sub-category which gave me a list of data with the matches (or blank for the second table output if the similarity wasn’t above my set threshold) and a very useful similarity column that shows how close it thinks the match is.

I found that playing with the similarity threshold was very important to get the right level of matching. I preferred no match if could be wrong, but your problem might be different to mine. The end result for me was 98% matching for the bigger hospitals going down to 75% for the small clinics, but that was because I didn’t want bad matches.

3 Comments

  1. Charlie says:

    The PowerUps add-in for Excel works in Excel 2003 thru 2013. It has a function called pwrVLOOKUP that does fuzzy vlookups and operates basically the same as vlookup, just fuzzy if you want. An example is posted on the page here: http://officepowerups.com/2013/10/23/fuzzy-vlookup-in-excel/.

    • Jon Smith says:

      Just to say that I have not used this add-in and it isn’t free. However this might help someone who doesn’t have Excel 2010 or above.

  2. Anonimous says:

    There is an official Microsoft add-inn (Fuzzy Lookup Add-In for Excel) with similar funcioning and is free of charge:

    https://www.microsoft.com/en-us/download/details.aspx?id=15011

Leave A Reply