Fuzzy Match Edit Match Options
Use the Edit button of the Fuzzy Match Tool Configuration window to access the Edit Match Options.
Match Style is a predetermined method of finding an appropriate match between records of an input file. The individual match style choices are defined on the Fuzzy Match Tool page.
Any predefined or custom, user-defined match styles will appear in this list. The subsequent specifications in the dialog box will be selected based on the match style chosen.
If you edit a predefined match style, it will change to "Custom" in the drop down list. The settings specified in this custom match style will save with the workflow.
Add new custom match styles rather than deleting or editing default options.
You can delete a match style by selecting it from the drop down and clicking Delete. You can add a match style by typing in a new name and clicking OK.
Preprocess describes a procedure that runs before Generate Keys and the Fuzzy Match function. The Preprocess should result in better matches. The choices from this list include:
- None: No Preprocess is run.
- Strip Punctuation: Any punctuation characters within the specified data field will be ignored while the tool is determining matches.
- Strip Punctuation & Salutations: Any punctuation characters as well as any titles such as "MR", "MS", and "MRS" within the specified data field are ignored while the tool is determining matches.
- Strip Punctuation & AND, OF & THE: Any punctuation characters as well as any instances of the words "AND" "OF" and "THE" within the specified data field are ignored while the tool is determining matches.
- Strip Punctuation & Remove Units from US Addresses: Any punctuation characters as well as any unit numbers within the specified data field are ignored while the tool is determining matches.
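The effect of the first few Preprocess choices can be pictured with a minimal sketch. This is not the shipped implementation (the real rules live in FuzzyMatchStyles.xml as regular expressions); the function names and word sets below are hypothetical, and the sketch assumes punctuation is simply deleted rather than replaced with a space.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT = str.maketrans("", "", string.punctuation)

# Hypothetical word sets, taken from the examples in the list above.
SALUTATIONS = {"MR", "MS", "MRS"}
STOP_WORDS = {"AND", "OF", "THE"}


def strip_punctuation(text):
    """Remove punctuation so that it is ignored during matching."""
    return text.translate(_PUNCT)


def strip_words(text, words):
    """Remove punctuation, then drop any token found in `words` (case-insensitive)."""
    tokens = strip_punctuation(text).split()
    return " ".join(t for t in tokens if t.upper() not in words)
```

For example, `strip_words("Mr. John Smith", SALUTATIONS)` gives `"John Smith"`, and `strip_words("The Bank of New York", STOP_WORDS)` gives `"Bank New York"`.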
Manual edits to preprocessing
The preprocess can be user-defined by editing the FuzzyMatchStyles.xml. This file is located in the Alteryx Runtime directory: \Program Files\Alteryx\bin\RuntimeData\FuzzyMatch. This file should only be edited by a user who is familiar with XML and Regular Expressions.
Generate Keys is the method by which a potential match is identified.
Alteryx reads through the specified field and assigns Keys to the components of that field. Once all keys are generated, Alteryx compares the concatenated keys for every match field. If the keys generated are equal for two records, a potential match is identified and the pair will proceed to the next phase of the match process. Function choices are:
- None: Keys for this field are not considered when deciding which records match.
- Digits Only: Only records with the same digits in the specified field will be matched.
- Digits Only - Reverse: Only records with the same digits (in the order from last to first) in the specified field will be matched.
- Double Metaphone: The preferred method for matching based on sound. This algorithm codes English words (and foreign words often heard in the English language) phonetically by reducing them to 12 consonant sounds, which reduces matching problems caused by misspellings. It returns two keys if a word has two feasible pronunciations, such as a foreign word. For more information, see Double Metaphone.
- Double Metaphone w/ Digits: Uses the same Double Metaphone algorithm but includes digits as well. When there are digits in the string, the digits in the first token will be the key.
- Soundex: An algorithm to code surnames phonetically by reducing them to the first letter and up to three digits, where each digit is one of six consonant sounds. This reduces matching problems from different spellings. The algorithm was devised to code names recorded in US census records. The standard algorithm works best on European names. Variants have been devised for names from other cultures. For more information, see Soundex.
- Soundex w/ Digits: Uses the same Soundex algorithm but includes digits as well. When there are digits in the string, the digits in the first token will be the key.
- Whole Field (Case Insensitive): Only records where the entire field matches will be matched. Case is ignored.
- Alphanumeric Only (Case Insensitive): Looks only at alphanumeric characters to make a match. Case is ignored.
- Address Number + Soundex: Removes the address number from a string and applies the Soundex algorithm to the remainder of the field. The Soundex code is then appended to the address number to create a unique key.
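To make the Soundex-based choices more concrete, here is a minimal sketch of the standard American Soundex code and of an address key built from it. It is an illustration only and may differ from what the tool does internally; `address_number_soundex` is a hypothetical name, and treating the first run of digits as the address number is an assumption.

```python
import re

# Standard Soundex consonant groups: each letter maps to one of six digits.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the first letter plus three digits, padded with zeros."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (first + "".join(digits))[:4].ljust(4, "0")


def address_number_soundex(field):
    """Assumed behavior: the first run of digits is the address number,
    and the Soundex code of the remaining text is appended to it."""
    m = re.search(r"\d+", field)
    if not m:
        return soundex(field)
    rest = field[:m.start()] + field[m.end():]
    return m.group(0) + soundex(rest)
```

With this sketch, "Robert" and "Rupert" both code to `R163`, and "1234 Main St" and "1234 Mane Street" both produce the key `1234M523`.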
Example: Digits Only
1-(303)440-8896 would not match 303-440-8896.
Even though non-digit characters are ignored, these phone numbers still do not match because there is a leading 1 in the first record.
Example: Digits Only - Reverse
1-(303)440-8896 would match 303-440-8896.
Non-digit characters are ignored and digits are compared from last (6) to first (3 or 1). For these records to match, set Maximum Key Length = 10 so that the leading 1 is ignored.
Example: Double Metaphone w/ Digits and Soundex w/ Digits
1234 5th St.
The "1234" would be the key.
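The phone-number and address examples above can be reproduced with a short sketch. The function names are hypothetical, and reading "digits in the first token" as "the digits of the first whitespace-separated token that contains any digit" is an assumption.

```python
def digits_key(value, reverse=False, max_len=None):
    """Keep only the digits, optionally reversed and truncated to max_len."""
    digits = [c for c in value if "0" <= c <= "9"]
    if reverse:
        digits.reverse()
    key = "".join(digits)
    return key if max_len is None else key[:max_len]


def first_digit_token_key(value):
    """Return the digits of the first token that contains a digit (assumed rule)."""
    for token in value.split():
        digits = "".join(c for c in token if "0" <= c <= "9")
        if digits:
            return digits
    return ""
```

Here `digits_key("1-(303)440-8896")` and `digits_key("303-440-8896")` differ because of the leading 1, while calling both with `reverse=True, max_len=10` yields the same key. `first_digit_token_key("1234 5th St.")` returns `"1234"`.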
Alteryx automatically replaces the following leading letters and letter combinations prior to generating the match key:
Leading letter(s) | Replacement |
---|---|
AV | AF |
AH | A |
AW | A |
CAAN | TAAN |
DG | G |
D | G |
HA | A |
KN | K |
K | C |
MAC | MC |
M | N |
NST | NS |
PF | F |
PH | F |
Q | G |
SCH | SH |
Z | S |
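The replacement table above can be expressed as a simple prefix lookup. The sketch below is illustrative only: it assumes that the longest matching prefix wins (so "MAC" is tried before "M") and that a single replacement is applied per word, neither of which is stated in the table.

```python
# Leading letter(s) -> replacement, copied from the table above.
REPLACEMENTS = {
    "AV": "AF", "AH": "A", "AW": "A", "CAAN": "TAAN", "DG": "G", "D": "G",
    "HA": "A", "KN": "K", "K": "C", "MAC": "MC", "M": "N", "NST": "NS",
    "PF": "F", "PH": "F", "Q": "G", "SCH": "SH", "Z": "S",
}

# Assumption: try longer prefixes first so "KN" is not shadowed by "K".
_PREFIXES = sorted(REPLACEMENTS, key=len, reverse=True)


def replace_leading(word):
    """Upper-case the word and replace its leading letters, if listed."""
    upper = word.upper()
    for prefix in _PREFIXES:
        if upper.startswith(prefix):
            return REPLACEMENTS[prefix] + upper[len(prefix):]
    return upper
```

For example, `replace_leading("Schmidt")` gives `"SHMIDT"` and `replace_leading("MacDonald")` gives `"MCDONALD"`.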
Generate Keys for Each Word: Generates a separate key for each word.
Ignore if Empty: Ignores an empty value of the specified match field. If the field is empty, no key is generated and the record is thrown out.
Maximum Key Length: Specify the maximum length of the key to consider for the match.
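The key comparison described in this section amounts to grouping records by their concatenated keys and treating records in the same group as potential matches. The following is a minimal sketch under that reading; `candidate_pairs`, the record layout, and the `|` separator are hypothetical, and the `ignore_if_empty` flag is only an approximation of the Ignore if Empty option.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_funcs, ignore_if_empty=True):
    """records: {id: {field: value}}; key_funcs: {field: key function}.

    Returns pairs of record ids whose concatenated keys are equal."""
    buckets = defaultdict(list)
    for rec_id, rec in records.items():
        values = [rec.get(field, "") for field in key_funcs]
        if ignore_if_empty and any(v == "" for v in values):
            continue  # no key is generated, so the record is dropped
        key = "|".join(fn(v) for fn, v in zip(key_funcs.values(), values))
        buckets[key].append(rec_id)
    pairs = []
    for ids in buckets.values():
        pairs.extend(combinations(ids, 2))
    return pairs
```

Only the pairs returned here would move on to the Match function for scoring; any record whose key is unique is never compared.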
The Match function is a more granular process by which a match is identified, and a score is applied. This differs from keys, which must match exactly. Choices are:
- None - Key Match Only: Looks only at the Key Generation specifications.
- Levenshtein Distance: The smallest number of insertions, deletions, and substitutions required to change one string or tree into another. When the Levenshtein Distance is selected, the match score will be significantly lower due to differences. For more information, see Levenshtein Distance.
- Jaro Distance: A measure of similarity between two strings. The Jaro measure is the weighted sum of percentage of matched characters and necessary transpositions. The Jaro Distance is more forgiving than the Levenshtein Distance with respect to difference in strings. For more information, see Jaro-Winkler.
- Best of Jaro & Levenshtein: Both match types are analyzed and the best score is taken.
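The two measures above, and the "best of" combination, can be sketched in a few lines. These are textbook versions of the algorithms; converting the Levenshtein distance to a 0 to 1 score by dividing by the longer string's length is an assumption, and the tool's actual scoring may differ.

```python
def levenshtein(a, b):
    """Smallest number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def levenshtein_score(a, b):
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaro(a, b):
    """Jaro similarity: matched characters and transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def best_score(a, b):
    """Best of Jaro and Levenshtein: take the higher of the two scores."""
    return max(jaro(a, b), levenshtein_score(a, b))
```

For the pair "MARTHA" and "MARHTA", the Jaro score is noticeably higher than the normalized Levenshtein score, which illustrates why Jaro is described as the more forgiving of the two.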
Function types
- Word-based (Match Function begins with "Words:") functions look at any words within the specified field, regardless of the order the words are in.
- Non-word-based functions match against the entire string as a whole.
- For word & digit functions, all tokens that have digits in them must be in both sides to consider a match. These would typically be used for addresses.
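As a rough illustration of the distinction above, the sketch below compares two fields as unordered sets of words and separately checks that every token containing a digit appears on both sides. The function names are hypothetical, and the set-overlap (Jaccard) score is a stand-in chosen for simplicity, not the scoring the tool uses.

```python
def word_overlap(a, b):
    """Order-independent word score: shared words over all distinct words."""
    wa, wb = set(a.upper().split()), set(b.upper().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def digit_tokens_agree(a, b):
    """True when the tokens containing digits are the same on both sides."""
    def digit_tokens(text):
        return {t for t in text.upper().split() if any(c.isdigit() for c in t)}
    return digit_tokens(a) == digit_tokens(b)
```

Under this sketch, "John Smith" and "Smith John" score 1.0 because word order is ignored, while "12 Main St Apt 4" and "12 Main St" fail the digit check because the token "4" appears on only one side.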
Word-based function options
- When Using Word Based Match, also use: Specify an additional match method that produces a second score. The best of the two scores is used, which eliminates the need to run two instances of the Fuzzy Match tool:
- None: Uses the word based score only.
- Character: Uses the word-based match score in addition to a character match function. Two scores are generated and the best match score is used to identify the match.
- Character (No Spaces): Same as above, but spaces are ignored when generating the character-based match.
- Word Frequency Statistics (Word Match Only): You can specify a Word Frequency table based on predefined statistics. When specified, the words that appear in the database carry less importance when they are present in the incoming data, and the match score will be adjusted accordingly. Options include:
- [None]: No Word Frequency Statistics are used.
- Name: Contains frequent words in a name field. The frequency inversely relates to how important those words are in the match score.
- US Address: Contains frequent words in a US Address field. The frequency inversely relates to how important those words are in the match score.
- US Company: Contains frequent words in a Company Name field. The frequency inversely relates to how important those words are in the match score.
- Nickname/Abbreviation Table (Word Match Only): Use a common Nickname table to check against and further identify duplicates. Use this option on fields containing either only the first name or both the first and last names.
Add additional nicknames and abbreviations:
- Update the Common Nicknames.yxdb database found at \Program Files\Alteryx\bin\RuntimeData\FuzzyMatch\Nicknames\
- Any .yxdb files placed in this directory will become available from the drop down box in the Nicknames section of the Fuzzy Match tool.
Match "Albert Commette" to "Albert Commette MD."
The Word Frequency Statistics table for "Name" includes the word "MD." When Word Frequency: Name is specified, the resulting match score is roughly 5 points higher than if Word Frequency: Name is not specified.
Word Frequency Statistics are contained within Alteryx Database files (*.yxdb) and can be located in the RuntimeData directory:
\Program Files\Alteryx\bin\RuntimeData\FuzzyMatch\
You can also create your own Word Frequency Statistics by editing the workflow CollectStats.yxmd located in the same directory.
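The idea that frequent words carry less importance can be sketched as a weighted word overlap. Everything specific here is an assumption made for illustration: the function name, the weight formula `1 / (1 + frequency)`, and the frequency value used in the example are hypothetical and are not taken from the shipped statistics files.

```python
def weighted_word_score(a, b, frequency=None):
    """Word overlap in which frequent words contribute less weight.

    frequency: {WORD: count}; words not listed count as 0 (full weight)."""
    frequency = frequency or {}

    def words(text):
        return {w.strip(".,").upper() for w in text.split()} - {""}

    def weight(word):
        return 1.0 / (1.0 + frequency.get(word, 0))

    wa, wb = words(a), words(b)
    total = sum(weight(w) for w in wa | wb)
    shared = sum(weight(w) for w in wa & wb)
    return shared / total if total else 0.0
```

With a made-up table such as `{"MD": 9}`, comparing "Albert Commette" to "Albert Commette MD." scores higher than with no table, because the unmatched word "MD" is down-weighted. The size of the increase depends entirely on the assumed weights and will not reproduce the figure quoted above.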
- Penalty: Set the penalty percentage applied when a match is made with data from the Nickname table. The default value is 15%. A penalty is recommended as a nickname match is another potential source of error. The penalty percent will be subtracted from the match score prior to comparison with the match threshold.
Match Threshold: Set the allowable uncertainty percentage to return a match for a particular field.
Match Weight: Apply importance to the field, causing the field to be considered more or less strongly during a match.
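The interaction between the nickname Penalty and the Match Threshold can be written as simple arithmetic. This sketch assumes scores and the penalty are both expressed in percentage points, so a 15% penalty turns a score of 90 into 75; the function name is hypothetical, and Match Weight is not modeled because its formula is not described here.

```python
def passes_threshold(score, threshold, used_nickname=False, penalty=15.0):
    """Subtract the penalty for nickname matches, then compare to the threshold."""
    adjusted = score - penalty if used_nickname else score
    return adjusted >= threshold
```

With a threshold of 80, a direct match scoring 90 passes, the same score reached through a nickname does not (90 - 15 = 75), and a nickname match scoring 96 still passes (96 - 15 = 81).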
For additional information regarding Fuzzy Match use, see the Fuzzy Match FAQ.