Fuzzy Match Edit Match Options
Any predefined or custom, user-defined match styles will appear in this list. The subsequent specifications in the dialog box will be selected based on the match style chosen.
If you edit a predefined match style, it changes to "Custom" in the drop-down list. The settings specified in this custom match style are saved with the workflow.
Add new custom match styles rather than deleting or editing default options.
You can delete a match style by selecting it from the drop-down list and clicking Delete. You can add a match style by typing in a new name and clicking OK.
Preprocess describes a procedure that runs before Generate Keys and the Fuzzy Match function. The Preprocess should result in better matches. The choices from this list include:
- None: No Preprocess is run.
- Strip Punctuation: Any punctuation characters within the specified data field will be ignored while the tool is determining matches.
- Strip Punctuation & Salutations: Any punctuation characters as well as any titles such as "MR", "MS", and "MRS" within the specified data field are ignored while the tool is determining matches.
- Strip Punctuation & AND, OF & THE: Any punctuation characters as well as any instances of the words "AND", "OF", and "THE" within the specified data field are ignored while the tool is determining matches.
- Strip Punctuation & Remove Units from US Addresses: Any punctuation characters as well as any unit numbers within the specified data field are ignored while the tool is determining matches.
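As a rough illustration of what the first three Preprocess choices do to a field, here is a minimal Python sketch. It is not the Alteryx implementation: the function names, the style labels, and the choice to replace punctuation with a space are all assumptions made for this example. The actual rules live in FuzzyMatchStyles.xml.

```python
import re

# Words named in the option descriptions above.
SALUTATIONS = {"MR", "MS", "MRS"}
AND_OF_THE = {"AND", "OF", "THE"}

def strip_punctuation(text: str) -> str:
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    cleaned = re.sub(r"[^\w\s]|_", " ", text)
    return " ".join(cleaned.split())

def strip_words(text: str, words: set) -> str:
    """Drop whole words (compared case-insensitively) that appear in `words`."""
    return " ".join(w for w in text.split() if w.upper() not in words)

def preprocess(text: str, style: str = "none") -> str:
    """Hypothetical style labels: 'none', 'punctuation', 'salutations', 'and_of_the'."""
    if style == "none":
        return text
    text = strip_punctuation(text)
    if style == "salutations":
        text = strip_words(text, SALUTATIONS)
    elif style == "and_of_the":
        text = strip_words(text, AND_OF_THE)
    return text
```

With these assumptions, "Mr. John Smith" and "John Smith" preprocess to the same string under the salutations style, so later key generation and scoring see identical input.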
Manual edits to preprocessing
The preprocess can be user-defined by editing FuzzyMatchStyles.xml. This file is located in the Alteryx Runtime directory: \Program Files\Alteryx\bin\RuntimeData\FuzzyMatch.
This file should only be edited by a user who is familiar with XML and regular expressions.
Generate Keys is the method by which a potential match is identified.
Alteryx reads through the specified field and assigns Keys to the components of that field. Once all keys are generated, Alteryx compares the concatenated keys for every match field. If the keys generated are equal for two records, a potential match is identified and the pair will proceed to the next phase of the match process. Function choices are:
- None: Keys for this field are considered when deciding which records match.
- Digits Only: Only records with the same digits in the specified field will be matched.
1-(303)440-8896 would match 303-440-8896.
Non-digit characters are ignored and numbers are matched from last (6) to first (3 or 1). For this record to match, specify that the Maximum Key Length = 10 to ignore the leading 1.
- Double Metaphone: An algorithm that codes English words (and foreign words often heard in the English language) phonetically by reducing them to 12 consonant sounds. This reduces matching problems caused by misspellings. Double Metaphone is the preferred method for matching based on sound. It returns two keys if a word has two feasible pronunciations, such as a foreign word. For more information, see Double Metaphone.
- Double Metaphone w/ Digits: Uses the same Double Metaphone algorithm but includes digits as well. When there are digits in the string, the digits in the first token will be the key.
1234 5th St.
The "1234" would be the key.
- Soundex: An algorithm to code surnames phonetically by reducing them to the first letter and up to three digits, where each digit is one of six consonant sounds. This reduces matching problems from different spellings.
The algorithm was devised to code names recorded in US census records. The standard algorithm works best on European names. Variants have been devised for names from other cultures. For more information, see Soundex.
Leading letter replacements
Alteryx automatically replaces the following leading letters and letter combinations prior to generating the match key:
| Leading letter(s) | Replacement |
| --- | --- |
| AV | AF |
| AH | A |
| AW | A |
| CAAN | TAAN |
| DG | G |
| D | G |
| HA | A |
| KN | K |
| K | C |
| MAC | MC |
| M | N |
| NST | NS |
| PF | F |
| PH | F |
| Q | G |
| SCH | SH |
| Z | S |

- Soundex w/ Digits: Uses the same Soundex algorithm but includes digits as well. When there are digits in the string, the digits in the first token will be the key.
- Whole Field (Case Insensitive): Only records where the entire field matches will be matched. Case is ignored.
- Alphanumeric Only (Case Insensitive): Looks only at alphanumeric characters to make a match. Case is ignored.
- Address Number + Soundex: Removes the address number from a string and applies the Soundex algorithm to the remainder of the field. The Soundex code is then appended to the address number to create a unique key.
- Generate Keys for Each Word: Generates a separate key for each word.
"john smith" and "smith john" will be able to line up as a potential match even though words are out of order.
- Don't Generate Keys for the following words: Specify or select words to exclude from key assignment. This can reduce processing time by limiting the number of potential matches.
- Don't Generate Keys for Single Letter Words: Select to exclude single letter words from key assignment. This can reduce processing time by limiting the number of potential matches.
- Ignore if Empty: Ignores an empty value of the specified match field. If the field is empty, no key is generated and the record is thrown out.
- Maximum Key Length: Specify the maximum length of the key to consider for the match.
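The key-generation ideas above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Alteryx's code: `digits_key`, `soundex`, and `word_keys` are hypothetical names, the Soundex shown is the standard American variant padded with zeros (it does not apply the leading letter replacements listed earlier), and the Maximum Key Length is modeled as keeping the trailing digits, as the phone number example describes.

```python
import re

def digits_key(value: str, max_key_length: int = 0) -> str:
    """Digits Only: drop non-digits; if a maximum length is set, keep the trailing digits."""
    digits = re.sub(r"\D", "", value)
    return digits[-max_key_length:] if max_key_length else digits

# Standard Soundex consonant groups (six digit classes).
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word: str) -> str:
    """Standard American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in word.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = digit
    return (code + "000")[:4]

def word_keys(field: str, skip_words=frozenset(), skip_single_letters=False) -> set:
    """Generate Keys for Each Word: one Soundex key per word, with optional exclusions."""
    keys = set()
    for word in field.split():
        if word.upper() in skip_words:
            continue
        if skip_single_letters and len(word) == 1:
            continue
        key = soundex(word)
        if key:
            keys.add(key)
    return keys
```

Under these assumptions, `digits_key("1-(303)440-8896", 10)` and `digits_key("303-440-8896", 10)` produce the same key, and `word_keys("john smith")` equals `word_keys("smith john")`, which is why records with reordered words can still line up as potential matches.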
The Match function is a more granular process by which a match is identified, and a score is applied. This differs from keys, which must match exactly. Choices are:
- None - Key Match Only: Looks only at the Key Generation specifications.
- Levenshtein Distance: The smallest number of insertions, deletions, and substitutions required to change one string or tree into another. With Levenshtein Distance selected, differences between the strings lower the match score significantly. For more information, see Levenshtein Distance.
- Jaro Distance: A measure of similarity between two strings. The Jaro measure is the weighted sum of percentage of matched characters and necessary transpositions. The Jaro Distance is more forgiving than the Levenshtein Distance with respect to difference in strings. For more information, see Jaro-Winkler.
- Best of Jaro & Levenshtein: Both match types are analyzed and the best score is taken.
- Word-based functions (the Match Function name begins with "Words:") look at any words within the specified field, regardless of the order the words are in.
- Non-word-based functions match against the entire string as a whole.
- For word & digit functions, all tokens that have digits in them must be in both sides to consider a match. These would typically be used for addresses.
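To make the difference between the two character-based measures concrete, here is a small Python sketch of textbook Levenshtein distance and Jaro similarity. How Alteryx turns these into a percentage score is not stated in this document, so `levenshtein_score` (distance divided by the longer length) and `best_of` are assumptions for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Smallest number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_score(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def jaro(a: str, b: str) -> float:
    """Textbook Jaro similarity in the range 0.0 to 1.0."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3

def best_of(a: str, b: str) -> float:
    """Assumed reading of 'Best of Jaro & Levenshtein': take the higher score."""
    return max(jaro(a, b), levenshtein_score(a, b))
```

For "MARTHA" and "MARHTA", the Levenshtein distance is 2 (a normalized score of about 0.67 under the assumption above), while Jaro treats the swapped letters as one transposition and scores about 0.94. That gap is what "more forgiving" means in practice.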
Word-based function options
- When Using Word Based Match, also use: You can specify an additional match method that produces a second score. The best score is used, which eliminates the need to run two instances of the Fuzzy Match tool:
- None: Uses the word based score only.
- Character: Uses the word-based match score in addition to a character match function. Two scores are generated and the best match score is used to identify the match.
- Character (No Spaces): Same as above, but spaces are ignored when generating the character-based match.
- Word Frequency Statistics (Word Match Only): You can specify a Word Frequency table based on predefined statistics. When specified, the words that appear in the database carry less importance when they are present in the incoming data, and the match score will be adjusted accordingly. Options include:
- [None]: No Word Frequency Statistics are used.
- Name: Contains frequent words in a name field. The frequency inversely relates to how important those words are in the match score.
- US Address: Contains frequent words in a US Address field. The frequency inversely relates to how important those words are in the match score.
- US Company: Contains frequent words in a Company Name field. The frequency inversely relates to how important those words are in the match score.
Match "Albert Commette" to "Albert Commette MD."
The Word Frequency Statistics table for "Name" includes the word "MD." When Word Frequency: Name is specified, the resulting match score is roughly 5 points higher than if Word Frequency: Name is not specified.
Word frequency statistics location
Word Frequency Statistics are contained within Alteryx Database files (*.yxdb) and can be located in the RunTime Data Directory:
You can also create your own Word Frequency Statistics by editing the workflow CollectStats.yxmd located in the same directory.
- Nickname/Abbreviation Table (Word Match Only): Use a common Nickname table to check against and further identify duplicates. Use this option on fields containing either only the first name or both the first and last names.
Add additional nicknames and abbreviations:
- Update the Common Nicknames.yxdb database found at
- Any .yxdb files placed in this directory will become available from the drop down box in the Nicknames section of the Fuzzy Match tool.
- Penalty: Set the penalty percentage applied when a match is made with data from the Nickname table. The default value is 15%. A penalty is recommended as a nickname match is another potential source of error. The penalty percent will be subtracted from the match score prior to comparison with the match threshold.
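The penalty arithmetic can be sketched as follows. This is a minimal example that assumes the penalty is subtracted as percentage points (so a score of 80 with a 15% penalty becomes 65) and that a score equal to the threshold passes. Neither detail is confirmed by this document, and `passes_threshold` is a hypothetical helper name.

```python
def passes_threshold(score: float, threshold: float,
                     used_nickname: bool = False, penalty: float = 15.0) -> bool:
    """Subtract the nickname penalty (if any) before comparing to the match threshold."""
    adjusted = score - penalty if used_nickname else score
    return adjusted >= threshold  # assumption: a score equal to the threshold passes

# A pair scoring 80 against a threshold of 70:
direct = passes_threshold(80, 70)                       # no nickname lookup involved
via_nickname = passes_threshold(80, 70, used_nickname=True)  # 80 - 15 = 65
```

In this sketch the same raw score passes when the words match directly but fails when the match depended on the Nickname table, which reflects the intent of treating nickname matches as a further source of error.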
Match Threshold: Set the minimum match percentage a particular field must reach to return a match.
If the threshold for field 1 is 60% and the field only matches with 55% confidence, the record will be thrown out.
Match Weight: Apply importance to the field, causing the field to be considered more or less strongly during a match.
If "Company Name" is twice as important as "Contact Name," set the Match Weight for Company Name to twice the value of the Match Weight for Contact Name. This weight is used when calculating the overall Match Score.
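This document does not give the formula that combines per-field scores into the overall Match Score. One plausible reading is a weighted average, sketched below purely as an assumption. `overall_score` and the sample scores are hypothetical.

```python
def overall_score(field_scores: dict, weights: dict) -> float:
    """Weighted average of per-field scores (an assumed formula, not a documented one)."""
    total_weight = sum(weights[name] for name in field_scores)
    if total_weight == 0:
        raise ValueError("at least one field needs a non-zero weight")
    weighted_sum = sum(score * weights[name] for name, score in field_scores.items())
    return weighted_sum / total_weight

# Company Name counts twice as much as Contact Name.
scores = {"Company Name": 90.0, "Contact Name": 60.0}
weights = {"Company Name": 2, "Contact Name": 1}
combined = overall_score(scores, weights)  # (90*2 + 60*1) / 3 = 80.0
```

Under this assumption, only the ratio between weights matters: weights of 2 and 1 give the same result as 50 and 25.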
For additional information regarding Fuzzy Match use, see the Fuzzy match FAQ.