Fuzzy Match FAQ
There is no standard answer to this question. Consider matching fields that should be different between records and could denote the record as unique. For example, in a standard contact database, the name, address, and phone number should identify a unique person. Many people can have the same city and state, so these would be less meaningful.
It is important to understand the relationship between using multiple fields and how much importance, or weight, is to be given to each field being considered in the matching process. For example, Name might not be as important as Address and ZIP, so weighting Name less than Address and ZIP can result in more matches where the Address and ZIP are exact, but the Name has scored less than an exact match.
Purge Mode (All Records Compared) finds matches within each individual dataset as well as matches between two datasets. Purge mode can be used on one dataset to remove duplicates from, or de-dupe, the database. This can be a prep-step before a two-database merge is performed.
Merge (Only Records from a Different Source are Compared) compares records from two different data sources. Choosing merge only find matches between two datasets.
A database should be de-duped prior to using Merge mode because:
- Merge mode does not detect duplicate records within the same source.
- The matching process is faster without duplicate records.
Dataset 1 has 5 duplicates. Dataset 2 has 10. If Merge is run without purging these duplicates, the match will check 50 match pairs. If the duplicates are purged, the match will check 1 match pair.
The Fuzzy Matching tool uses an identifier (ID) to label matches, either from one file to another or from one row to another in a single file. The tool uses the ID to report which records match.
The ID must be unique for each record, including records from different datasets, to ensure accurate output from the tool. Adhere to the following best practices for unique IDs:
- Know the size of your datasets to better understand the necessary starting value for each RecordID column.
- Add a Record ID tool to both dataset streams.
- Set the "Starting Value” of different dataset streams several magnitudes off from each other to ensure all records have a unique value assigned.
Assign 100000000 as the initial value for the RecordID tool for the master file and 200000000 as the initial value for the customer file. Consistently using this practice allows you to easily identify the sources of the match records.
In Purge mode, the data in RecordID1 and RecordID2 are the row identifiers from your dataset.
In Merge mode, RecordID1 and RecordID2 correspond to matched IDs, one from each dataset. Setting the Record IDs at starting values of different magnitudes allows you to more easily recognize which dataset is being referenced.
RecordID1 is always the "first" value in the matched pair if the two IDs are sorted alphanumerically.
Fuzzy Match match pair IDs are sorted alphanumerically by row. Numeric RecordID fields sort Record ID1 to RecordID2, smallest to largest respectively, but string RecordIDs can sort in unexpected ways.
Record 101 matches with record 11. If the fields are stored as numbers, RecordID1 would be 11 and RecordID2 would be 101. If the fields are stored as strings, RecordID1 would be 101 and RecordID2 would be 11.
Switch to a numeric RecordID field, or verify that strings with prepended RecordIDs have a standardized format between records.
In most address match scenarios, where the address database is consistently populated with data, city and state fields are not needed in matching. Name, Address, and Zip Code are more commonly used Match Style options. Examine your data to determine if city or state fields may be relevant.
Use Double Metaphone if:
- The city and state fields are not abbreviated.
- The fields may contain spelling errors.
Use Whole Field or Whole Field - Case Insensitive if:
- The state field is abbreviated and requires and exact match.
An exact match is typically required if moving on to a more granular process of matching.
In many address match scenarios, suite fields are not needed in matching. Name, Address, and Zip Code are more commonly used Match Style options. Examine your data to determine if suite fields may be relevant.
Double Metaphone w/Digits is the preferred match style for any address field, regardless of whether the address includes suite numbers. Also consider using the Strip punctuation and Remove Units From US Address option under Preprocess.
In most cases, parsing a name field into individual component fields is not necessary and should not result in a better match. Use the Generate Keys for Each Word option with the Soundex algorithm to generate name field keys. This ensures that the word order is not considered, so both "Cindy Smith” or "Smith, Cindy” are considered a match.
Parsing the name field is advantageous when you want to put different weights on each value.
For Rosey Smith to match R Smith, the last name is weighted at 80% and the first name is weighted at 20%.
Under Edit... > Preprocess, use Strip Punctuation & Salutations to ignore these words while performing a match.