Fuzzy Match FAQ

Version:
2019.3
Last modified: October 09, 2019

The following topics are common questions related to the Fuzzy match tool and the related Fuzzy match edit match options.

How many fields should I configure in a single Fuzzy match tool?

There is no standard answer to this question. Consider matching fields that should be different between records and could denote the record as unique. For example, in a standard contact database, the name, address, and phone number should identify a unique person. Many people can have the same city and state, so these would be less meaningful.

It is important to understand the relationship between using multiple fields and how much importance, or weight, is to be given to each field being considered in the matching process. For example, Name might not be as important as Address and ZIP, so weighting Name less than Address and ZIP can result in more matches where the Address and ZIP are exact, but the Name has scored less than an exact match.

What is the difference between Merge and Purge Mode?

Purge Mode (All Records Compared) finds matches within each individual dataset as well as matches between two datasets. Purge mode can be used on one dataset to remove duplicates from, or de-dupe, the database. This can be a prep-step before a two-database merge is performed.

Merge (Only Records from a Different Source are Compared) compares records from two different data sources. Choosing merge only find matches between two datasets.

Why do I need to de-dupe my database prior to sending it through Merge Mode?

A database should be de-duped prior to using Merge mode because:

  • Merge mode does not detect duplicate records within the same source.
  • The matching process is faster without duplicate records.

    Dataset 1 has 5 duplicates. Dataset 2 has 10. If Merge is run without purging these duplicates, the match will check 50 match pairs. If the duplicates are purged, the match will check 1 match pair.

Why do I need to have a unique ID per record?

The Fuzzy Matching tool uses an identifier (ID) to label matches, either from one file to another or from one row to another in a single file. The tool uses the ID to report which records match.

The ID must be unique for each record, including records from different datasets, to ensure accurate output from the tool. Adhere to the following best practices for unique IDs:

  • Know the size of your datasets to better understand the necessary starting value for each RecordID column.
  • Add a Record ID tool to both dataset streams.
  • Set the "Starting Value” of different dataset streams several magnitudes off from each other to ensure all records have a unique value assigned.

Best practice

Assign 100000000 as the initial value for the RecordID tool for the master file and 200000000 as the initial value for the customer file. Consistently using this practice allows you to easily identify the sources of the match records.

 

The RecordID1 and RecordID2 field naming convention is confusing coming out of the Fuzzy Match tool. Is there a trick to keep these fields straight?

In Purge mode, the data in RecordID1 and RecordID2 are the row identifiers from your dataset.

In Merge mode, RecordID1 and RecordID2 correspond to matched IDs, one from each dataset. Setting the Record IDs at starting values of different magnitudes allows you to more easily recognize which dataset is being referenced.

RecordID1 is always the "first" value in the matched pair if the two IDs are sorted alphanumerically.

The records from my sources are split between the RecordID fields. Why aren't they in the same column?

Fuzzy Match match pair IDs are sorted alphanumerically by row. Numeric RecordID fields sort Record ID1 to RecordID2, smallest to largest respectively, but string RecordIDs can sort in unexpected ways.

Record 101 matches with record 11. If the fields are stored as numbers, RecordID1 would be 11 and RecordID2 would be 101. If the fields are stored as strings, RecordID1 would be 101 and RecordID2 would be 11.

Switch to a numeric RecordID field, or verify that strings with prepended RecordIDs have a standardized format between records.

If I want to use city or state in my match configuration, which match style should I use?

In most address match scenarios, where the address database is consistently populated with data, city and state fields are not needed in matching. NameAddress, and Zip Code are more commonly used Match Style options. Examine your data to determine if city or state fields may be relevant.

Use Double Metaphone if:

  • The city and state fields are not abbreviated.
  • The fields may contain spelling errors.  

Use Whole Field or Whole Field - Case Insensitive if:

  • The state field is abbreviated and requires and exact match.

    An exact match is typically required if moving on to a more granular process of matching.

If I have multiple address configurations in my database (i.e. some address with suite numbers, some with not, and some with suite numbers in an additional field), which match style should I use?

In many address match scenarios, suite fields are not needed in matching. NameAddress, and Zip Code are more commonly used Match Style options. Examine your data to determine if suite fields may be relevant.

Double Metaphone w/Digits is the preferred match style for any address field, regardless of whether the address includes suite numbers. Also consider using the Strip punctuation and Remove Units From US Address option under Preprocess.

If I parse out a name field into multiple fields (i.e. First Name, Last Name, Middle Initial) will I get better match results?

In most cases, parsing a name field into individual component fields is not necessary and should not result in a better match. Use the Generate Keys for Each Word option with the Soundex algorithm to generate name field keys. This ensures that the word order is not considered, so both "Cindy Smith” or "Smith, Cindy” are considered a match.

Parsing the name field is advantageous when you want to put different weights on each value.

For Rosey Smith to match R Smith, the last name is weighted at 80% and the first name is weighted at 20%.

What if the name field contains Mr., Mrs., Miss etc. Will this affect the match rate for this field?

Under Edit... > Preprocess, use Strip Punctuation & Salutations to ignore these words while performing a match.

Was This Helpful?

Need something else? Visit the Alteryx Community or contact support.