Fuzzy Match FAQ
Docs are available before the release of Designer Cloud so you can get a sneak peek. This content might change between now and the official release.
There is no standard answer to this question. Consider matching columns that should be different between rows and could denote the row as unique. For example, in a standard contact database, the name, address, and phone number should identify a unique person. Many people can have the same city and state, so these would be less meaningful.
It is important to understand the relationship between using multiple columns and how much importance, or weight, to give to each column being considered in the matching process. For example, Name might not be as important as Address and ZIP, so weighting Name less than Address and ZIP can result in more matches where the Address and ZIP are exact, but the Name has scored less than an exact match.
Find Values to Purge finds matches within each individual dataset, as well as matches between 2 datasets. Find Values to Purge can be used on 1 dataset to remove duplicates from, or de-dupe, the database. This can be a prep-step before a 2-database merge is performed.
Find Values to Merge compares rows from 2 different data sources (rows with different source IDs). Choosing merge only finds matches between 2 datasets.
A database should be de-duped prior to using Find Values to Merge because:
- Find Values to Merge doesn't detect duplicate rows within the same source.
- The matching process is faster without duplicate rows. For example, Dataset 1 has 5 duplicates. Dataset 2 has 10. If a merge runs without purging those duplicates, the match will check 50 match pairs. If the duplicates are purged, the match checks 1 match pair.
The Fuzzy Match tool uses an identifier (ID) to label matches, either from 1 file to another or from 1 row to another, in a single file. The tool uses the ID to report which rows match.
The ID have to be unique for each row, including rows from different datasets, to make sure output from the tool is accurate. Follow these best practices for unique IDs:
- Know the size of your datasets to better understand the necessary starting value for each Row ID column.
- Add a Record ID tool to both dataset streams.
- Set the Starting Value of different dataset streams several magnitudes off from each other to ensure all rows have a unique value assigned.
Assign 100000000 as the initial value for the Record ID tool for the primary file and 200000000 as the initial value for the customer file. Consistently using this practice allows you to easily identify the sources of the match rows.
In Purge mode, the data in RecordID1 and RecordID2 are the row identifiers from your dataset.
In Merge mode, RecordID1 and RecordID2 correspond to matched IDs, 1 from each dataset. Setting the Record IDs at starting values of different magnitudes allows you to more easily recognize which dataset is being referenced.
RecordID1 is always the "first" value in the matched pair if the 2 IDs are sorted alphanumerically.
Fuzzy Match match pair IDs are sorted alphanumerically by row. Numeric RecordID columns sort Record ID1 to RecordID2, smallest to largest respectively, but string RecordIDs can sort in unexpected ways.
Take a scenario where row 101 matches with row 11. If the columns are stored as numbers, RecordID1 would be 11 and RecordID2 would be 101. If the columns are stored as strings, RecordID1 would be 101 and RecordID2 would be 11.
Switch to a numeric RecordID column, or verify that strings with prepended RecordIDs have a standardized format between rows.
In most address match scenarios, where the address database is consistently populated with data, city and state columns are not needed in matching. Name, Address, and Zip Code are more commonly used Match Style options. Examine your data to determine if city or state columns may be relevant.
Use Double Metaphone if:
- The city and state columns are not abbreviated.
- The columns may contain spelling errors.
Use Whole Field or Whole Field - Case Insensitive if:
- The state column is abbreviated and requires and exact match. An exact match is typically required if moving on to a more granular process of matching.
In many address match scenarios, suite columns are not needed in matching. Name, Address, and Zip Code are more commonly used Match Style options. Examine your data to determine if suite columns may be relevant.
Double Metaphone w/Digits is the preferred match style for any address column, regardless of whether the address includes suite numbers. Also consider using the Strip Punctuation and Remove Units From US Address option under Pre-process.
In most cases, parsing a Name column into individual component columns is not necessary and should not result in a better match. Use the Generate Keys for Each Word option with the Soundex algorithm to generate name column keys. This ensures that the word order is not considered, so both "Cindy Smith” or "Smith, Cindy” are considered a match.
Parsing the Name column is advantageous when you want to put different weights on each value. For Rosey Smith to match R Smith, the last name is weighted at 80% and the first name is weighted at 20%.
Under Edit > Pre-process, use Strip Punctuation & Salutations to ignore these words while performing a match.
Designer Cloud automatically applies custom match styles in a workflow's XML when you select them. Unlike Designer on the desktop, there's no need for you to select Save.