Skip to main content

Fuzzy Match Tool

Use Fuzzy Match to identify non-identical duplicates in a dataset by specifying match columns and similarity thresholds. Match scores only need to fall within the user-specified or default thresholds established in the configuration properties.

The most effective way to create a fuzzy match is to perform the match process on multiple columns within the input file. You should individually configure each column using either a predefined or custom Match Style, configured through the Fuzzy Match Edit Match Options.

Note

Fuzzy matching only works with Latin character sets, and some of the match capabilities are only compatible with the English language.

Configure the Tool

The Fuzzy Match tool requires a unique identifier for each data row. Inspect your data—if there is no key column, add a Row ID Tool 1 step upstream.

  1. Choose the preferred Find Method:

    • Purge Values: Compare all rows from a single source to identify duplicates.

    • Merge Value: Compare rows from different sources are compared, with the intent to identify duplicates across different input files. When you use this setting, also select the Column with Source ID.

  2. Select the unique Row ID Column.

  3. Enter the Match Threshold as a percentage. The default value is 80%. If the Match score generated from the Fuzzy Match tool is less than the specified threshold, the record won't qualify as a match. The Match score takes into consideration each specification within the configuration properties of the Fuzzy Match tool: Each field, the match style, the match weight, and the resulting field match score are considered in calculating the score, which is then against the specified Match Threshold.

  4. Select the Match Column. To add more than 1 matching column, select Add Column.

  5. Select the Match Style. These are the default Match Style options...

    • Address: A predefined match style configured to find address matches. This style incorporates Double Metaphone algorithms combined with a digit match to identify matching addresses. Apply this style to commercial addresses. For more information, go to Double Metaphone.

    • Address No Suite: A predefined match style configured to find address matches where the input data has no suite information in the Address column. This style incorporates Double Metaphone algorithms combined with a digit match to identify matching addresses. Apply this style to residential addresses.

    • AddressPart: A predefined match style configured to find address matches. This style incorporates Double Metaphone algorithms combined with a digit match to identify matching addresses. AddressPart differs from a traditional address match style in that it does not use word frequency analysis and the match threshold is 5% lower.

    • Company Name: A predefined match style configured to find company name matches. This style identifies matches based on Double Metaphone algorithms.

    • Exact: This field has to match exactly.

    • Name: A predefined match style configured to find name matches. This style incorporates Double Metaphone algorithms.

    • Name w/ Nicknames: A predefined match style configured to find name matches. This style incorporates Double Metaphone algorithms. Additionally, this style utilizes a Nicknames table to check against to further identify duplicates. For example, the name Andrew might match Andy and/or Drew.

    • Phone: A predefined match style configured to find phone matches. This style looks at the digits only in a phone field and matches on the reverse 10 digits, ignoring dashes, parenthesis, and leading 1s that may be contained within the field.

    • ZIP Code: A predefined match style configured to find ZIP code matches. This style looks at the 5 digits of a ZIP field and assigns a match accordingly.

    • Custom: Define your own match parameters.

  6. To edit the Match Style, select Edit under Match Options. This opens the Fuzzy Match Edit Match Options configuration window.

  7. Specify the Output Options:

    • Match Score: The Match Score appears in an additional output column. The Match Score is a percentage that is generated for matched columns.

    • Generated Keys: Output the key from the resulting match styles as an additional column.

    • Unmatched Rows: Rows that don't match any others are output as additional rows. The Ignore If Empty checkbox from the Fuzzy Match Edit Match Options is prioritized over this option.

    • Don't Compare Rows in Existing Group: Matched rows aren't compared to other rows, reducing computational expense and time. For example, if row 1 matches to row 2 and row 3, row 2 isn't matched against row 3. Use a Make Group Tool downstream to link groups together.

    • Only Generate Keys: Returns all rows with the generated keys as an additional column. No matching takes place.

For additional information regarding Fuzzy Match use, go to the Fuzzy Match FAQ.