◄Show Table of Contents
The Regular Expression tool uses regular expression syntax to parse, match, or replace data.
Select the field to parse from the drop down list.
Enter in the codes to build the Regular Expression. Use the drop down list to make the building of the expression easier. Consult the Complete Perl RegEx syntax for help.
Note the Case Insensitive checkbox.
Select the Output Method (4 options) described in detail below. Each method will parse differently.
Replace: The Replace Method simply replaces Expression#1 with Expression#2
Build the Replacement Text expression by selecting items from the drop down list. The expression is built by identifying the Marked Group to replace the expression with.
Consult the complete Boost-Extended Format String Syntax for help in building this expression. Use the drop down list to make the building of the expression easier.
Tokenize: The Tokenize Method allows you to specify a regular expression to match on and that part of the string is parsed into separate columns (or rows). When using the Tokenize method, you want to match to the whole token, and if you have a marked group, only that part is returned.
This method behaves like the Text To Column tool, but instead of matching on what you don't want (such as a delimiter), you match on what you do want. Tokenize is much more flexible in that you can split on any pattern of data.
For more specific use cases of the Tokenize function, see the Tokenize section below.
There are two options for Tokenize, Split to Columns and Split to Rows.
Split to Columns: will split a single column of data at each instance of the specified regular expression pattern into multiple columns.
Tokenize Properties: Specify the New Field Root name and the number of New Fields to create. Alteryx will append a number sequentially to the new field root name. (i.e. New Field Root Name: NAME and Number of New Fields: 3 will result in 3 new fields added to the table: NAME1, NAME2, NAME3)
Specify the action to perform in the event there is additional data for the number of columns that you had specified in step 2. Options include:
Drop Extra with Warning: The extra data will be dropped but a warning will appear in the Results window (i.e.● RegEx (7): The Value "E & E SOCCER CAMP" had too many parts to fit # of output fields)
Drop Extra without Warning: The extra data will be dropped and you will receive no warning
Error: the workflow will fail to run and an error indicating as such will appear in the Results window (i.e.● RegEx (7): The Value "E & E SOCCER CAMP" had too many parts to fit # of output fields)
Split to Rows:will split a single column of data at each instance of the specified regular expression pattern into multiple rows.
Parse: The Parse Method separates the Expression into new fields, allowing you to specify the output field name, field type, size and expression.
For each expression, separated into a group, a new field is created in the Output Fields box.
Click in the Output field name to rename the field (defaults to RegExOut1, RegExOut2, etc)
Pull down the drop down list to specify the field type
Type in the Size column to specify the field size
The expression automatically appears in the expression box, consistent to the expression listed above. This Expression indicates what will be populated in this parsed field.
Match: The Match method assigns either a 1 or 0 based on whether or not the expression matched the string. When choosing this method, connect to a Filter tool to separate matched results from unmatched ones.
Type in the new field created by the Expression specified above. This field will be populated either with a "1" indicating a match or "0" indicating unmatched.
Note: there is a checkbox to throw an error if not Matched
To separate matched records from unmatched records, connect to a Filter tool downstream.
The following are three use cases for the Regex tool Tokenize method.
Parse a 9 character string (123456789) into 3 fields:
The regular expression is:
The resulting 3 fields will be 123 456 789
If we only wanted the second character back, we would mark that:
The resulting 3 fields would be 2 5 8
Parse a Ctrl-A delimited field
The regular expression for tokenizing a Ctrl-A delimited string is:
Now we will break down the regex:
[^...] The brackets specify a match to a single character in a set of characters. Starting the set with ^ changes it to match any character not in the set.
\cA This simply matches the Ctrl-A character.
+ This means match 1 or more of the previous.
Allowing blank tokens
Sometimes you want to ensure you are preserving blanks. Consider the following text:
In this case you want 3 fields out, not 2, so in this situation you need to include the delimiter in the match:
Breaking down the regex:
(...) Parenthesis create a marked group of the expression. The tokenize mode allows you to match a larger part of the input field, yet only return a subset that was marked. This way you avoid returning the delimiter. You may only have 1 marked expression.
[^,] Like the example above, match any character other thana ,
* Match 0 or more of the previous. This differs from the first example in that we allow an empty set. You cannot end here because the regex engine doesn't like a match of 0 characters since there is an infinite number of matches, so we have to terminate the match on something.
(?:....) This is an unmarked group. We need this for the or which we use a|for.
| This is saying match either the thing before or after, but not both. This almost always needs to be used with a marked or unmarked group.
$ Matches the end of the string. Hence(?:,|$) matches up toa ,or the end of the string.
Parsing HTML Links
Another common task is parsing HTML documents. This example will parse out HTML links from a home page. The expression is quite simple:
Breaking down the regex:
<a This is a literal match for the text<a
.*? . is any character, * is 0 or more, the ? is where it gets interesting. The ? modifies the * to make in non-greedy. What this means is to match to the shortest possible match. Without this, the expression might find a single token from the beginning of the 1st link to the end of the last.
> Just matches the literal > in the HTML data.
.*? The shortest possible match of any characters that still satisfies the entire regex.
</a> Just the literal characters </a>. This ends the match.
This pulls every link out of a large HTML document into a series of records (or fields).
For more information on writing regular expressions, consult www.regexlib.com. Note that Alteryx is not affiliated with this website, but there are useful resources and links contained there. Be aware however, that Alteryx employs Perl 5 syntax (the most common syntax) and not all syntaxes are specified on this site.
Alteryx also recommends the RegEx Coach, a graphical application for Windows which can be used to experiment with (Perl-compatible) regular expressions interactively. It can be downloaded from the following site: http://weitz.de/regex-coach/
©2018 Alteryx, Inc., all rights reserved. Allocate®, Alteryx®, Guzzler®, and Solocast® are registered trademarks of Alteryx, Inc.