Fuzzy Matching

Often you have two datasets to merge but the variable you need to merge on does not relate perfectly in the two datasets.

This is especially common when you have people's names, school names, or business names. One data set will have "F. Hernandez" and another "Flor Hernandez" or "Flor Hermamdez." One will have "Princeton Elementary" and another "Princeton Elem". "Microsoft Corp" or "Microsoft Corporation."

These sorts of issues require a "fuzzy match" by which you iteratively make and remove matches based on incrementally less stringent matching requirements.

There's some good discussion of how to write this in Stata here.

[...] based on a string field that contains organization names. > Unfortunately, the names are not listed equivalently in both databases (e.g. > "The Miller Corporation" in one vs. "Miller Corp." in the other). What Brendan wants is a "fuzzy/approximate string matching function" that will do what he is thinking. I know of no such function and, even if it existed, I would not recommend he trust it. Instead, I recommend Brendan do the match himself, tailoring the rules to his particular problem. The basic method ---------------- At each step of the way, we are going to have three datasets: resid1.dta resid2.dta: matched.dta: ----------------- ----------------- --------------- Unmatched records Unmatched records Matched from 1st dataset from 2nd dataset records ----------------- ----------------- ---------------- At each step, we are going to look at resid1.dta and resid2.dta, spot some matching rule (such as change "Corp" to Corporation" and then match), and turn the crank: matching what we can from resid1 and resid2, adding the matches to matched.dta, and updating resid1.dta and resid2.dta to contain the new leftovers. One of the advantages of this method is that, once we are down to small numbers of unmatched observations, we can apply rules highly specific to our problem. Getting Started --------------- Our first step will be an exact match. Let's assume the original datasets to be matched are called left.dta and right.dta: . use right.dta . sort name . save, replace . use left.dta . sort name . merge name using right . sort name . save result /* <- just temporarily */ . keep if _merge==3 . drop _merge . save matched . use result . keep if _merge==1 . keep name . merge name using left . keep if _merge==3 . drop _merge . save resid1 . use result . keep if _merge==2 . keep name . mege name using right . keep if merge==3 . drop _merge . save resid1 . erase result We are going to ahve to do something like that each iteration, and that is too much typing for us every to get right time after time. Moreover, the next time, we are going to have to add to matched.dta, and that is just going to complicate the issue. So let's write a program: program merge3 confirm file matched.dta confirm file resid1.dta confirm file resid2.dta use resid1, clear merge name using resid2 if _N==0 { exit } save result keep if _merge==3 drop _merge append using matched save matched, replace use result, clear keep if _merge==1 keep name merge name using resid1 keep if _merge==3 drop _merge sort name save resid1, replace emptyok use result, clear keep if _merge==2 keep name merge name using resid2 keep if _merge==3 drop _merge sort name save resid2, replace emptyok erase result.dta end Starting over ------------- With that, let's start over. First, we need to do the setup to run our program: . clear . save matched, emptyok . use left . sort name . save resid1 . use right . sort name . save resid2 Now let's do the exact match: . merge3 Step 2 ------ At this point, we have the three datasets. Look at resid1.dta and resid2.dta. Spot a problem. Let's pretend we see things like "Stata Corp" and Stata Corp.", and "ABC, Inc." and "ABC Inc". Let's get rid of periods and commads in the name: . use resid1, clear . replace name = subinstr(name, ".", "", .) . replace name = subinstr(name, ",", "", .) . sort name . save, replace . use resid2, clear . replace name = subinstr(name, ".", "", .) . replace name = subinstr(name, ",", "", .) . sort name . save, replace Now we can perform the second step of the merge: . merge3 Basically, we keep working like that until we have merged all the observations we think reasonable. -- Bill

Keywords: Andrew Johnston, Andrew, Johnston, economics, applied economics, economist, microeconomics, empirics, empirical economics, Wharton, Andrew Johnston, Economics, Andrew Johnston economics, Andrew Johnston, Andrew, Johnston, economics, applied economics, economist, microeconomics, empirics, empirical economics, Wharton, Andrew Johnston, Economics, Andrew Johnston economics