Fuzzy Match Tool
The Fuzzy Match tool identifies non-identical duplicates in a dataset based on the match fields and similarity thresholds you specify. Records qualify as matches when their Match Scores meet the user-specified or default thresholds established in the configuration properties.
The most effective way to build a fuzzy match is to perform the match process on multiple fields within the input file. Each field should be individually configured using either a predefined or custom Match Style, set up through the Fuzzy Match Edit Match Options dialog.
Fuzzy matching only works with Latin character sets, and some of the match capabilities are only compatible with the English language.
Configure the tool
A unique identifier for each data record is necessary for the Fuzzy Match tool to work. Inspect your data; if there is no such key field, add a Record ID Tool one step upstream.
- Choose the preferred match mode:
- Purge Mode (All Records Compared): All records from a single source are compared to identify duplicates.
- Merge Mode (Only Records from a Different Source are Compared): Records from different sources are compared, with the intent to identify duplicates across different input files.
When using Merge mode, each source must contain a Source ID Field. A source ID field can be easily appended by choosing the Output File Name as Field option within each Input Data tool. This setting will append to each record a field with either the File Name or the entire File Path.
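The difference between the two modes comes down to which record pairs are compared. The following is a minimal sketch of that idea, not the tool's implementation; the function name `candidate_pairs` and the dictionary keys `record_id` and `source_id` are hypothetical names chosen for illustration.

```python
from itertools import combinations


def candidate_pairs(records, mode="purge"):
    """Yield (record_id, record_id) pairs that would be compared.

    records: list of dicts with the hypothetical keys
             'record_id' and 'source_id'.
    mode:    'purge' compares every pair of records;
             'merge' skips pairs that share a source ID.
    """
    for a, b in combinations(records, 2):
        if mode == "merge" and a["source_id"] == b["source_id"]:
            continue  # merge mode only compares across sources
        yield a["record_id"], b["record_id"]


records = [
    {"record_id": 1, "source_id": "a.csv"},
    {"record_id": 2, "source_id": "a.csv"},
    {"record_id": 3, "source_id": "b.csv"},
]

purge_pairs = list(candidate_pairs(records, "purge"))
merge_pairs = list(candidate_pairs(records, "merge"))
print(purge_pairs)  # [(1, 2), (1, 3), (2, 3)]
print(merge_pairs)  # [(1, 3), (2, 3)]
```

In this sketch, records 1 and 2 come from the same file, so merge mode never compares them with each other.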
- Specify the unique Record ID Field.
- Specify the Match Threshold as a percentage. The default value is 80%. If the Match Score generated by the Fuzzy Match tool is less than the specified threshold, the record will not qualify as a match.
The Match Score takes into account each specification within the configuration properties of the Fuzzy Match tool: each field, its match style, its match weight, and the resulting field match score are all considered in calculating the score, which is then compared against the specified Match Threshold.
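To make the relationship between field scores, weights, and the threshold concrete, here is a small sketch. It assumes the overall score is a weighted average of the per-field scores; the tool's actual formula is not stated in this document, so treat the combination rule, the function names, and the sample numbers as illustrative assumptions only.

```python
def overall_match_score(field_scores, weights):
    """Combine per-field scores (0-100) into one score.

    ASSUMPTION: a plain weighted average. The Fuzzy Match tool's
    real scoring formula may differ.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one field needs a non-zero weight")
    weighted_sum = sum(field_scores[name] * weights[name] for name in weights)
    return weighted_sum / total_weight


def is_match(score, threshold=80.0):
    """A score below the threshold does not qualify as a match."""
    return score >= threshold


# Hypothetical per-field scores and weights for one pair of records.
scores = {"name": 90.0, "address": 70.0, "zip": 100.0}
weights = {"name": 50, "address": 30, "zip": 20}

overall = overall_match_score(scores, weights)
print(overall, is_match(overall))
```

With these sample numbers the weak address score is outweighed by the name and ZIP scores, so the pair still clears the default 80% threshold.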
- Set up your Match Fields. Use Up and Down to arrange them in order of matching. Use Delete to remove unneeded matches.
- Select the Field Name to match on. Any field already in the input connection is available from this drop-down list.
- Select the Match Style from the drop-down list. Choices include:
- Address: A predefined match style configured to find address matches. This style incorporates Double Metaphone algorithms combined with a digit match to identify matching addresses.
Apply this style to commercial addresses.
- Address No Suite: A predefined match style configured to find address matches where the input data has no suite information in the Address field. This style incorporates Double Metaphone algorithms combined with a digit match to identify matching addresses.
Apply this style to residential addresses.
- AddressPart: A predefined match style configured to find address matches. This style incorporates Double Metaphone algorithms combined with a digit match to identify matching addresses. AddressPart differs from a traditional address match style in that it does not use word frequency analysis and the match threshold is 5% lower.
- Company Name: A predefined match style configured to find company name matches. This style identifies matches based on Double Metaphone algorithms.
- Phone: A predefined match style configured to find phone matches. This style looks only at the digits in a phone field and matches on the last 10 digits (read from the end of the field), ignoring dashes, parentheses, and leading 1s that may be contained within the field.
- ZIP Code: A predefined match style configured to find ZIP code matches. This style looks at the 5 digits of a ZIP field and assigns a match accordingly.
- Exact: This field must match exactly to be considered a match. This logic is not fuzzy at all.
- Name: A predefined match style configured to find name matches. This style incorporates Double Metaphone algorithms.
- Name with Nicknames: A predefined match style configured to find name matches. This style incorporates Double Metaphone algorithms. Additionally, this style checks against a Nicknames table to further identify duplicates.
For example, the name Andrew may match Andy and/or Drew.
- Custom: Allows the user to define their own match parameters, so that the match can be run repeatedly without having to reconfigure the match properties. Custom match styles can also be reconfigured and overwritten, or new custom styles can be created.
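The Phone and ZIP Code styles described above reduce a field to a digit-only key before comparing. The sketch below illustrates that kind of normalization; it is an approximation written for this explanation, not the tool's key generator, and the function names `phone_key` and `zip_key` are hypothetical. It assumes the phone field contains no trailing extension digits.

```python
import re


def phone_key(raw):
    """Keep digits only, then take the last 10.

    Taking the last 10 digits also drops a leading country code 1.
    ASSUMPTION: the field has no extension digits after the number.
    """
    digits = re.sub(r"\D", "", raw)
    return digits[-10:]


def zip_key(raw):
    """Keep digits only, then take the first 5 (ignores a ZIP+4 suffix)."""
    digits = re.sub(r"\D", "", raw)
    return digits[:5]


print(phone_key("1 (303) 555-0142"))  # 3035550142
print(phone_key("303-555-0142"))      # 3035550142
print(zip_key("80301-1234"))          # 80301
```

Both phone spellings produce the same key, which is why differently formatted numbers can still be treated as a match.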
- Edit the Match Style as necessary by clicking the Edit button. The Fuzzy Match Edit Match Options dialog displays.
- Specify Advanced Options:
- Output Match Score: The match score will be present in an additional output field.
- Output Generated Keys: Outputs the key from the resulting match styles as an additional field.
- Output Unmatched Records: Records that do not match any other records are output as additional records. Occasionally, an unmatched record in the output reports a match score; this score should be ignored. This may be addressed in a future release.
- Don't Compare Records already in a Group: Records that have already been placed in the same group are not compared against each other, reducing processing effort and time.
For example, if record 1 matches record 2 and record 3, then record 2 is not compared against record 3. Use a Make Group tool downstream to link these groups together.
- Generate Keys Only: All records are returned with the generated keys as an additional field. No matching takes place.
The Ignore if empty option in the Fuzzy Match Edit Match Options dialog takes priority over this option.
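When Don't Compare Records already in a Group is enabled, the output contains pairs such as 1-2 and 1-3 but no 2-3 pair, so a downstream grouping step is what links records 1, 2, and 3 together. The following sketch shows one common way to do that kind of linking, a union-find over the matched pairs. It is an illustration of the concept only, not how the Make Group tool is implemented, and `make_groups` is a hypothetical name.

```python
def make_groups(pairs):
    """Link matched record-ID pairs into groups (connected components)."""
    parent = {}

    def find(x):
        # Follow parent links to the group's root, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as the group's representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for record in parent:
        groups.setdefault(find(record), set()).add(record)
    return sorted(sorted(group) for group in groups.values())


# Record 1 matched 2 and 3; 2 and 3 were never compared directly.
matched_pairs = [(1, 2), (1, 3), (4, 5)]
print(make_groups(matched_pairs))  # [[1, 2, 3], [4, 5]]
```

Records 2 and 3 end up in the same group even though no 2-3 pair was ever produced, which is the effect the downstream grouping step is meant to achieve.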
For additional information regarding Fuzzy Match use, see the Fuzzy Match FAQ.