The Regular Expression tool uses regular expression syntax to parse, match, or replace data.
Regular expression support
While regular expressions are supported in Designer, users are responsible for their own expressions and how the expressions impact their data.
For more resources on how to write regular expressions, see www.regexlib.com, a site unaffiliated with Alteryx, or the RegEx Coach, an unaffiliated graphical application for Windows that can be used to experiment interactively with (Perl-compatible) regular expressions.
For additional information about Boost RegEx, see the following:
Configure the tool
- Select the Field to Parse.
- Enter your Regular Expression. The drop-down provides common regular expressions that you may need while creating your expression. For more information, see the website Perl Regular Expression Syntax.
- Select the Output Method to use when parsing. Then configure the related Properties.
Case Insensitive: Selected by default. Searches do not distinguish between uppercase and lowercase letters.
- Replace: Replace the expression you searched for with a second expression.
- Replacement Text: Enter the expression that replaces the text matched by your original regular expression. Identify the Marked Group to use in the replacement. The drop-down provides common regular expressions that you may need while creating your expression.
- Copy Unmatched Text to Output
- Tokenize: Split the incoming data using a regular expression. This option works similarly to the Text To Columns Tool, except that instead of matching and removing what you do not want, you match what you want to keep. Write the expression to match the whole token; if it contains a marked group, only that part is returned.
- Parse a 9 character string 123456789 into 3 fields.
- Parse a 9 character string into 3 fields, returning only the second character.
- Parse a field with the delimiter Ctrl-A
- Allow blank tokens to preserve entries: abc, ,def
- Parsing HTML links from a home page
- Split to Columns: Split a single column of data at each instance of the specified delimiter into multiple columns.
- # of Columns: Set how many columns are created.
- Extra Columns: Select the behavior that is applied to extra columns.
- Drop Extra with Warning: Data that extends past the split is dropped and a warning is generated indicating that there was excess information.
- Drop Extra without Warning: Data that extends past the split is dropped and no warning is generated.
- Error: Data that extends past the split causes an error and the workflow stops processing.
- Output Root Name: Enter the root name for the newly generated columns. Each new column is named with the root name followed by a serially increasing integer.
- Split to Rows: Split a single column of data at each instance of the specified delimiter into multiple rows. Use a key column in your record so you don't lose track of which value came from which row.
- Parse: Separate the expression into new columns, and set the name, type, and size of the new columns.
- A new column is created in the Output Fields table. In the table:
- Click on column names to rename.
- Use the drop down to select the new data type.
- Enter the new size.
- Expression is populated automatically.
- Match: Append a column containing a number; 1 if the expression matched, 0 if it did not.
- Match Status Field Name: Provide a name for the appended column.
- Error if not Matched: Select to throw an error and stop workflow processing if the expression and the string do not match.
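The Match method's 1/0 status column can be approximated outside Designer. The following is a minimal Python sketch, not Designer's implementation: `match_status` and the sample rows are hypothetical, and the use of `re.search` (a match anywhere in the value) is an assumption, since this page does not state whether Designer requires the expression to match the whole field.

```python
import re

def match_status(value, pattern, case_insensitive=True):
    """Return 1 if the pattern is found in value, otherwise 0.

    Assumption: a match anywhere in the value counts. The default mirrors
    the tool's Case Insensitive option, which is selected by default.
    """
    flags = re.IGNORECASE if case_insensitive else 0
    return 1 if re.search(pattern, value, flags) else 0

rows = ["Order A-100", "order a-200", "no code here"]  # illustrative data

status = [match_status(r, r"A-\d+") for r in rows]
strict = [match_status(r, r"A-\d+", case_insensitive=False) for r in rows]

print(status)  # [1, 1, 0]
print(strict)  # [1, 0, 0]
```

Turning off case insensitivity changes the second row from 1 to 0, which is the kind of difference the Case Insensitive option controls.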
The following use cases rely on the Regex tool's Tokenize method.
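To parse the 9-character string 123456789 into 3 fields, the regular expression is three periods (...), each of which matches any single character.
To return only the second character of each field, the regular expression is .(.). with the middle period in a marked group, so only that character is returned.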
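Both expressions above can be tried outside Designer. The following is a minimal Python sketch, assuming that `re.findall` is a reasonable stand-in for the Tokenize method. Designer uses Boost RegEx, so details may differ, and the variable names are illustrative.

```python
import re

text = "123456789"

# Three periods: each token is any 3 consecutive characters.
fields = re.findall(r"...", text)
print(fields)  # ['123', '456', '789']

# With one marked group, findall returns only the group, which mirrors
# Tokenize returning only the marked part of each 3-character token.
middle = re.findall(r".(.).", text)
print(middle)  # ['2', '5', '8']
```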
The regular expression for tokenizing a Ctrl-A delimited string is [^\cA]+
[^...] The brackets specify a match to a single character in a set of characters. Starting the set with ^ changes it to match any character not in the set.
\cA This simply matches the Ctrl-A character.
+ This means match 1 or more of the previous.
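The Ctrl-A expression can be sketched in Python as follows. Python's `re` module has no `\cA` escape, so the sketch uses `\x01` (the same character, code 1) instead. The sample record is hypothetical, and `re.findall` is assumed to behave like Tokenize here.

```python
import re

CTRL_A = "\x01"  # Ctrl-A is the control character with code 1

# Illustrative record with three values separated by Ctrl-A.
record = CTRL_A.join(["alpha", "beta", "gamma"])

# [^\x01]+ matches 1 or more characters that are not Ctrl-A,
# so each token is one delimited value.
tokens = re.findall(r"[^\x01]+", record)
print(tokens)  # ['alpha', 'beta', 'gamma']
```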
To allow blank tokens, the regular expression is ([^,]*)(?:,|$)
(...) Parentheses create a marked group within the expression. Tokenize mode lets you match a larger part of the input field yet return only the marked subset, which is how you avoid returning the delimiter. You may have only 1 marked group.
[^,] Starting the set with ^ changes it to match any character not in the set, in this case a ,.
* Match 0 or more of the previous, which allows for an empty set. You cannot end the expression here: a match of 0 characters could be found an infinite number of times, so the match has to terminate on something.
(?:....) This is an unmarked group. It is needed to group the or, which is written with a |.
| This is saying match either the thing before or after, but not both. This almost always needs to be used with a marked or unmarked group.
$ Matches the end of the string. Hence (?:,|$) matches up to a , or the end of the string.
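The blank-token expression can be sketched in Python as follows. This is an approximation under stated assumptions: the function name `tokenize_keep_blanks` and the input `abc,,def` are illustrative, and Python's `re` module reports one extra zero-length match at the end of the string, which the sketch skips. Designer's Boost-based engine may not need that step.

```python
import re

# Marked group captures the token; the unmarked group consumes the
# delimiter (a comma) or the end of the string.
PATTERN = re.compile(r"([^,]*)(?:,|$)")

def tokenize_keep_blanks(text):
    """Split comma-delimited text, keeping empty entries."""
    tokens = []
    for m in PATTERN.finditer(text):
        at_end = m.start() == len(text)
        # Python finds one more zero-length match after the final token;
        # skip it unless the text really ends with a trailing comma.
        if at_end and text and not text.endswith(","):
            continue
        tokens.append(m.group(1))
    return tokens

print(tokenize_keep_blanks("abc,,def"))  # ['abc', '', 'def']
```

Compare this with `[^,]+`, which would return only `abc` and `def` and silently drop the blank entry.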
The regular expression is <a .*?>.*?</a>. This pulls every link out of a large HTML document into a series of records.
<a This is a literal match for the text <a
.*? The . is any character and the * is 0 or more. The ? modifies the * so that it makes the shortest possible match. Without it, the expression might find a single token that runs from the beginning of the first link to the end of the last.
> This is a literal match for the text >.
.*? The shortest possible match of any characters that still satisfies the entire regex.
</a> This is a literal match for the text </a>. This ends the match.
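The difference between the shortest-match and greedy forms can be seen in a small Python sketch. The HTML snippet is made up for illustration, and `re.findall` is assumed to stand in for Tokenize. Note that in Python the . does not match a newline by default, so the sample is kept on one line.

```python
import re

html = '<p><a href="/one">One</a> and <a href="/two">Two</a></p>'

# Shortest-match version from the text: one token per link.
lazy = re.findall(r"<a .*?>.*?</a>", html)
print(lazy)
# ['<a href="/one">One</a>', '<a href="/two">Two</a>']

# Without the ? modifiers, a single token spans from the start of the
# first link to the end of the last one.
greedy = re.findall(r"<a .*>.*</a>", html)
print(greedy)
# ['<a href="/one">One</a> and <a href="/two">Two</a>']
```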