Use the Regular Expression (RegEx) tool to build an expression to parse, match, or replace data.
Regular expression support
While regular expressions are supported in Designer, you are responsible for your own expressions and how the expressions impact your data.
For more information about Boost RegEx, visit the following:
- Boost-Supported RegEx Character Classes
- Collating Element Names
- Boost-Extended Format String Syntax
- POXIS Extended Regular Expression Syntax
Configure the tool
- Column to parse: Select the column you want to search and parse, match, or replace data.
- Regular Expression: Click the plus sign to start building an expression that will be used to find data to parse, match, or replace.
- Output Method: Which output method do you want to use after the search finds data matching the expression you built? Select a method and then configure the related output properties for the method.
Case Insensitive: Selected by default. Searches do not distinguish between uppercase and lowercase letters.
- Replace: Use this method to replace the expression you search for with a second expression.
- Replacement Text: Enter an expression you want the tool to use when replacing text that matches your original regular expression. The drop down provides common regular expressions that you may need while creating your expression.
- Copy unmatched text to output: Selected by default. All text that was not matched is included in the output.
- Tokenize: Split the incoming data using a regular expression. This options works similarly to the Text To Columns Tool, except instead of matching and removing what you do not want, you match for what you want to keep. You want to match to the whole token, and if you have a marked group, only that part is returned.
- Parse a 9 character string 123456789 into 3 fields.
- Parse a 9 character string into 3 fields, returning only the second character.
- Parse a field with the delimiter Ctrl-A
- Allow blank tokens to preserve entries: abc, ,def
- Parsing HTML links from a home page
- Split to columns: Split a single column of data at each instance of the specified delimiter into multiple columns.
- Number of columns: Set how many columns are created.
- Extra columns: Select the behavior that is applied to extra columns.
- Drop extra with warning: Data that extends past the split is dropped and a warning is generated indicating that there was excess information.
- Drop extra without warning: Data that extends past the split is dropped and no warning is generated.
- Error: Data the extends past the split causes an error and the workflow stops processing.
- Output Root Name: Enter the name for the newly generated columns to be based off of. The new columns are named as the root name with a serially increasing integer appended.
- Split to rows: Split a single column of data at each instance of the specified delimiter into multiple rows. Use a key column in your record so you don't lose track of which value came from which row.
- Parse: Separate the expression into new columns, and set the name, type, and size of the new columns.
- A new column is created in the Output columns table. In the table:
- Click on column names to rename.
- Use the drop down to select the new data type.
- Enter the new size.
- Expression is populated automatically.
- Match: Append a column containing a number; 1 if the expression matched, 0 if it did not.
- Column name for match status: Provide a name for the appended column.
- Error if not matched: This is not selected by default. Select if you want the system to stop processing and display an error when the expression and string do not match.
The following use cases rely on the Regex tool's Tokenize method.
The regular expression is ...
The regular expression is .(.).
The regular expression for tokenizing a Ctrl-A delimited string is [^\cA]+
[^...] The brackets specify a match to a single character in a set of characters. Starting the set with ^ changes it to match any character not in the set.
\cA This simply matches the Ctrl-A character.
+ This means match 1 or more of the previous.
The regular expression is ([^,]*) (?:,|$)
(...) Parenthesis create a marked group of the expression. The tokenize mode allows you to match a larger part of the input field, yet only return a subset that was marked. This way you avoid returning the delimiter. You may only have 1 marked expression.
[^,] Starting the set with ^ changes it to match any character not in the set, in this case a ,.
* Match 0 or more of the previous, which allows for an empty set. You cannot end here because the regex engine doesn't like a match of 0 characters since there is an infinite number of matches, so we have to terminate the match on something.
(?:....) This is an unmarked group. We need this for the or which we use a | for.
| This is saying match either the thing before or after, but not both. This almost always needs to be used with a marked or unmarked group.
$ Matches the end of the string. Hence(?:,|$) matches up toa ,or the end of the string.
The regular expression is <a .*?>.*?</a>. This pulls every link out of a large HTML document into a series of records.
<a This is a literal match for the text <a
.*?. is any character, * is 0 or more. The ? modifies the * to make it match to the shortest possible match. Without this, the expression might find a single token from the beginning of the first link to the end of the last.
> This is a literal match for the text >.
.*? The shortest possible match of any characters that still satisfies the entire regex.
</a> This is a literal match for the text </a>. This ends the match.