Fuzzy Match Edit Match Options
Any predefined or custom, user-defined match styles appear in this list. The subsequent specifications in the dialog box are selected based on the match style chosen.
If you edit a predefined match style, the name changes to "Custom" in the dropdown. The settings specified in this custom match style save with the workflow.
Add new custom match styles rather than deleting or editing default options.
- To delete a match style, select it from the dropdown, and select Delete.
- To add a match style, enter a new name, and select OK.
Preprocess describes a procedure that runs before Generate Keys and the Fuzzy Match function. The Preprocess should result in better matches. The choices from this list include...
- None: No Preprocess is run.
- Strip Punctuation: Any punctuation characters within the specified data field are ignored while the tool determines matches.
- Strip Punctuation & Salutations: Any punctuation characters, as well as any titles like
Mrswithin the specified data field, are ignored while the tool determines a match.
- Strip Punctuation & AND, OF & THE: Any punctuation characters, as well as any instances of the words
thewithin the specified data field, are ignored while the tool determines matches.
- Strip Punctuation & Remove Units from US Addresses: Any punctuation characters, as well as any unit numbers within the specified data field, are ignored while the tool determines matches.
Manual Edits to Preprocessing
The preprocess can be user-defined by editing FuzzyMatchStyles.xml. This file is located in the Alteryx Runtime directory: \Program Files\Alteryx\bin\RuntimeData\FuzzyMatch. You should only edit this file if you are familiar with XML and Regular Expressions.
Generate Keys is the method by which a potential match is identified.
Alteryx reads through the specified field and assigns keys to the components of that field. Once all keys are generated, Alteryx compares the concatenated keys for every match field. If the keys generated are equal for 2 records, a potential match is identified and the pair proceeds to the next phase of the match process. Function choices are...
- None: Keys for this field are considered when deciding which records match.
- Digits Only: Only records with the same digits in the specified field are matched. For example, 1-(800)555-1234 matches 800-555-1234.
Non-digit characters are ignored and numbers are matched from last (4) to 1st (8 or 1). For this record to match, specify that the Maximum Key Length = 10 to ignore the leading 1.
- Double Metaphone*: Double Metaphone is the preferred algorithm. This is an algorithm to code English words (and foreign words often heard in the English language) phonetically by reducing them to 12 consonant sounds. This reduces matching problems caused by spelling errors. The Double Metaphone is the preferred method for matching based on sound. It returns 2 keys if a word has 2 feasible pronunciations, like a foreign word. For more information, go to Double Metaphone.
*The Double Metaphone method lets you specify a Maximum Key Length, however, this setting is not functional. The Double Metaphone Maximum Key Length is hard-coded to 8 regardless of what you specify via this setting.
- Double Metaphone w/ Digits: Uses the same Double Metaphone algorithm but includes digits as well. When there are digits in the string, digits in the first token are the key. For example, in the string "1234 5th St" the "1234" is the key.
Soundex: An algorithm to code surnames phonetically by reducing them to the 1st letter and up to 3 digits, where each digit is one of six consonant sounds. This reduces matching problems from different spellings. The algorithm was devised to code names recorded in US census records. The standard algorithm works best on European names. Variants have been devised for names from other cultures. For more information, go to Soundex.Leading letter replacements: Alteryx automatically replaces these leading letters and letter combinations prior to generating the match key:
Leading Letters Replacement AV AF AH A AW A CAAN TAAN DG G D G HA A KN K K C MAC MC M N NST NS PF F PH F Q G SCH SH Z S
- Soundex w/ Digits: Uses the same Soundex algorithm but includes digits as well. When there are digits in the string, digits in the 1st token are the key.
- Whole Field (Case Insensitive): Only records where the entire field matches are matched. Case is ignored.
- Alphanumeric Only (Case Insensitive): Looks only at alphanumeric characters to make a match. Case is ignored.
- Address Number + Soundex: Removes the address number from a string and applies the Soundex algorithm to the remainder of the field. The Soundex code is then appended to the address number to create a unique key.
Generate Keys for Each Word: Generates a separate key for each word.
"john smith" and "smith john" are able to line up as a potential match even though the words are out of order.
- Don't Generate Keys for the following words: Specify or select words to exclude from key assignment. This can reduce the processing time by limiting the number of potential matches.
- Don't Generate Keys for Single Letter Words: Select to exclude single letter words from key assignment. This can reduce the processing time by limiting the number of potential matches.
- Ignore if Empty: Ignores an empty value of the specified match field. If the field is empty, then no key is generated and the record is thrown out.
- Maximum Key Length: Specify the maximum length of the key to consider for the match.
The Match function is a more granular process by which a match is identified, and a score is applied. This differs from keys, which must match exactly. Choices are:
- None - Key Match Only: Looks only at the Key Generation specifications.
- Levenshtein Distance: The smallest number of insertions, deletions, and substitutions required to change 1 string or tree into another. When the Levenshtein Distance is selected, the match score is significantly lower due to differences. For more information, go to Levenshtein Distance. Levenshtein Distance options include...
- Character: Levenshtein Distance
- Character (No Spaces): Levenshtein Distance
- Words: Levenshtein Distance
- Words & Digits: Levenshtein Distance
- Jaro Distance: A measure of similarity between 2 strings. The Jaro measure is the weighted sum of the percentage of matched characters and necessary transpositions. The Jaro Distance is more forgiving than the Levenshtein Distance with respect to differences in strings. For more information, go to Jaro-Winkler. Jaro-Winkler options include...
- Character: Jaro Distance
- Character (No Spaces): Jaro Distance
- Words: Jaro Distance
- Words & Digits: Jaro Distance
- Best of Jaro & Levenshtein: Both match types are analyzed and the score is taken. Best of Jaro & Levenshtein options include...
- Character: Best of Jaro & Levenshtein Distance
- Character (No Spaces): Best of Jaro & Levenshtein Distance
- Words: Best of Jaro & Levenshtein Distance
- Words & Digits: Best of Jaro & Levenshtein Distance
- Word-based (Match Function begins with Words:) functions look at any words within the specified field, regardless of the order the words are in.
- Non-word-based functions match against the entire string as a whole.
- For Words & Digits functions, all tokens that have digits in them must be in both sides to consider a match. These would typically be used for addresses.
Word-based Function Options
- When Using Word Based Match, also use: You can specify an additional match method that will produce an additional score (taking the best one) and eliminate the need for running 2 instances of a Fuzzy Match tool:
- None: Uses the word-based score only.
- Character: Uses the word-based match score in addition to a character match function. 2 scores are generated and the best match score is used to identify the match.
- Character (No Spaces): Same as above, but spaces are ignored when generating the character-based match.
- Word Frequency Statistics (Word Match Only): You can specify a Word Frequency table based on predefined statistics. When specified, the words that appear in the database carry less importance when they are present in the incoming data, and the match score is adjusted accordingly. Options include...
- [None]: No Word Frequency Statistics are used.
- Name: Contains frequent words in a name field. The frequency inversely relates to how important those words are in the match score.
- US Address: Contains frequent words in a US Address field. The frequency inversely relates to how important those words are in the match score.
- US Company: Contains frequent words in a Company Name field. The frequency inversely relates to how important those words are in the match score.
Match "Albert Commette" to "Albert Commette MD."
The Word Frequency Statistics table for "Name" includes the word "MD." When Word Frequency: Name is specified, the resulting match score is roughly 5 points higher than if Word Frequency: Name is not specified.
Word Frequency Statistics Location
Word Frequency Statistics are contained within Alteryx Database files *yxdb and can be located in the RunTime Data Directory:
You can also create your own Word Frequency Statistics by editing the workflow CollectStats.yxmd located in the same directory.
- Nickname/Abbreviation Table (Word Match Only): Use a common Nickname table to check against and further identify duplicates. Use this option on fields containing either only the first name or both the first and last names. Add additional nicknames and abbreviations:
- Update the Common Nicknames.yxdb database found at:
- Any .yxdb files placed in this directory will become available from the dropdown box in the Nicknames section of the Fuzzy Match tool.
- Update the Common Nicknames.yxdb database found at:
- Penalty: Set the penalty percentage applied when a match is made with data from the Nickname table. The default value is 15%. A penalty is recommended as a nickname match is another potential source of error. The penalty percent will be subtracted from the match score prior to comparison with the match threshold.
Match Threshold: Set the allowable uncertainty percentage to return a match for a particular field. If the threshold for field 1 is 60% and the field only matches with 55% confidence, the record is ignored.
Match Weight: Apply importance to the field, causing the field to be considered more or less strongly during a match.
If "Company Name" is twice as important as "Contact Name," you can set the importance here. So the Match Weight for Company Name should be twice the value of the Match Weight for Contact Name. This weight is used when calculating the overall Match Score.
For additional information regarding Fuzzy Match use, see the Fuzzy Match FAQ.