Split Transform
Anmerkung
Transforms are a part of the underlying language, which is not directly accessible to users. This content is maintained for reference purposes only. For more information on the user-accessible equivalent to transforms, see Transformation Reference.
Splits the specified column into separate columns of data based on the delimiters in the transform. Delimiters can be specified in a number of literal or pattern-based methods. Whitespace delimiters are supported.
This transform might be automatically applied as one of the first steps of your recipe.
Tipp
When this transform appears in a suggestion card, the maximum number of suggested columns to split is 250, which may prevent the browser from crashing. If your dataset requires additional column splits, you can edit the transformation and increase the maximum number of splits. Avoid creating datasets that are wider than 1000 columns.
When the split
transform is applied, the source column is dropped.
Before applying this transform, you can create a copy of the source column using the
derive
transform. See Derive Transform.To retain the source column, you can use the
extract
transform and pattern-based matching. See Extract Transform.
Basic Usage
split col: MyValues on: ',' limit: 3
Output: Splits the source MyValues
column into four separate columns. Values in the columns are determined based on the comma (,
) delimiter. If a row only has two commas in it, then the final generated column is null.
Syntax and Parameters
split col:column_ref [quote:'quoted_string'] [ignoreCase:true|false] [limit:int_num] [after:start_point | from: start_point] [before:end_point | to:end_point] [on:'exact_match'] [at:(start_index,end_index)] [delimiters:'string1','string2', 'string3'] [positions: int1,int2,int3] [every:int_num]
For more information on syntax standards, see Language Documentation Syntax Notes.
The split
transform supports the following general methods for specifying the delimiters by which to split the column. Depending on your use of the transform, different sets of parameters apply.
Delimiter Methods | Description | Operative Parameters |
---|---|---|
single-pattern delimiters | Column is split based on one of the following: 1) patterns used to describe the beginning and ending of the field delimiter(s), 2) single delimiter, which may be repeated, 3) index values of the start and end points of the delimiter | At least one of the following parameters must be specified: |
multi-pattern delimiters | Column is split based one of the following literal methods: 1) explicit sequence of delimiters, 2) explicit list of character index positions 3) every N characters | At least one of the following parameters must be specified: |
Shared parameters:
The following parameters are shared between the operating modes:
Token | Required? | Data Type | Description |
---|---|---|---|
split | Y | transform | Name of the transform |
col | Y | string | Source column name |
ignoreCase | N | boolean | If |
limit | N | integer (positive) | Specifies the maximum of columns to split from the source column |
Single-pattern delimiter parameters:
Tipp
For this method of matching, at least one of the following parameters must be used: at
, before
, from
, on
, or to
.
Token | Required? | Data Type | Description |
---|---|---|---|
after | N | string | String literal or pattern that precedes the pattern to match |
at | N | Array | Two-integer array identifying the character indexes of start and end characters to match |
before | N | string | String literal or pattern that appears after the pattern to match |
from | N | string | String literal or pattern that identifies the start of the pattern to match |
on | N | string | String literal or pattern that identifies the pattern to match. |
to | N | string | String literal or pattern that identifies the end of the pattern to match |
quote | N | string | Specifies a quoted object that is omitted from pattern matching |
Multi-pattern delimiter parameters:
Tipp
Use one of the following parameters for this method of matching. Do not use combinations of them.
Token | Required? | Data Type | Description |
---|---|---|---|
delimiters | N | array | Array of strings that list the explicit field delimiters in the order to apply them to the column. |
positions | N | array | Array of integers that identify the zero-based character index values where to split the column. |
every | N | integer | String literal or pattern that appears after the pattern to match |
col
Identifies the column to which to apply the transform. You can specify only one column.
split col: MyCol on: 'MyString'
Output: Splits the MyCol
column into two separate columns whose values are to the left and right of the MyString
value in each cell.
If a delimiter value is not detected, the cell value appears in the first of the new columns.
When the
limit
parameter is not specified, the default value of1
is applied.
Usage Notes:
Required? | Data Type |
---|---|
Yes | String (column name) |
ignoreCase
Indicates whether the match should ignore case or not.
Set to
true
to ignore case matching.(Default) Set to
false
to perform case-sensitive matching.
split col: MyCol on: 'My String' ignoreCase: true
Output: Splits the MyCol
column on case-insensitive versions of the on
parameter value, if they appear in cell values: My String
, my string
, My string
, etc.
Usage Notes:
Required? | Data Type |
---|---|
No | Boolean |
limit
The limit
parameter defines the maximum number of times that a pattern can be matched within a column.
Anmerkung
The limit
parameter cannot be used with the following parameters: at
, positions
, or delimiters
.
A set of new columns is generated, as defined by the limit
parameter. Each matched instance populates a separate column, until there are no more matches or all of the limit
-generated new columns are filled.
split col: MyCol on: 'z' limit: 3
Output: Splits the MyCol
column on each instance of the letter z
, generating 4 new columns. If there are fewer than 3 instances of z
in a cell, the corresponding columns after the split are blank.
Anmerkung
Avoid creating datasets that are wider than 2500 columns. Performance can degrade significantly on very wide datasets.
Usage Notes:
Required? | Data Type |
---|---|
No | Integer (positive) |
Defines the number of columns that can be created by the
split
transform.If not specified, exactly one column is created. See Pattern Clause Position Matching.
after
A pattern identifier that precedes the value or pattern to match. Define the after
parameter value using string literals, regular expressions, or Wrangle .
If this parameter is the only pattern describer:
The column is split into two. The first column contains the part of the source column before the
after
matching value. The second column is blank.If the value appears more than once, no additional splitting is made, since there is no other pattern parameter.
Anmerkung
For
after
,before
,from
, andto
, matching occurs only one time at most. Additional instances of the parameter's value in the cell do not cause another column split. For more predictable results, you should specify another pattern parameter.
If the
after
value does not appear in the column, the original column value is written to the first split column.This parameter is typically used with another to describe a field delimiting pattern. See below.
split col: MyCol after: '\' before:'|'
Output: Splits values in MyCol
based on value between the two characters. The first column contains the part of the MyCol
that appears before the backslash (\
), and the second column contains the part of MyCol
that appears after the pipe character (|
). The content between the delimiting characters is dropped.
Usage Notes:
Required? | Data Type |
---|---|
No | String or pattern |
The
after
andfrom
parameters are very similar.from
includes the matching value as part of the delimiter string.after
can be used with eitherto
orbefore
. See Pattern Clause Position Matching.
at
Identifies the start and end point of the pattern to interest.
Parameter inputs are in the form of x,y
where x
and y
are positive integers indicating the starting character and ending character, respectively, of the pattern of interest.
x
must be less thany
.If
y
is greater than the length of the value, the pattern is defined to the end of the value, and a match is made.
split col: MyCol at: 2,6
Output: Splits the MyCol
column on the value that begins at the second character in the column and extends to the sixth character of the column. Contents before the value are in the first column, and contents after the value are in the second column.
Usage Notes:
Required? | Data Type |
---|---|
No | Array of two Integers ( |
The at
parameter cannot be combined with any of the following: on
, after
, before
, from
, to
, and quote
. See Pattern Clause Position Matching.
before
A pattern identifier that occurs after the value or pattern to match. Define the pattern using string literals, regular expressions, or Wrangle .
If this parameter is the only pattern describer:
The column is split into two. The first column is blank. The second column contains the part of the source column after the
before
matching value.If the value appears more than once, no additional splitting is made, since there is no other pattern parameter.
Anmerkung
For
after
,before
,from
, andto
, matching occurs only one time at most. Additional instances of the parameter's value in the cell do not cause another column split. For more predictable results, you should specify another pattern parameter.
If the
before
value does not appear in the column, the original column value is written to the first split column.This parameter is typically used with another to describe a field delimiting pattern. See below.
split col: MyCol before: '/' from:'Go:'
Output: Splits contents of MyCol
into two columns. The first column contains the values that appear before the Go:
string, and the second column contains the values after the backslash.
Usage Notes:
Required? | Data Type |
---|---|
No | String or pattern |
The
before
andto
parameters are very similar.to
includes the matching value as part of the delimiter string.before
can be used with eitherfrom
orafter
. See Pattern Clause Position Matching.
from
Identifies the pattern that marks the beginning of the value to match. It can be a string literal, Wrangle , or regular expression. The from
value is included in the match.
If this parameter is the only pattern describer:
The column is split into two. The first column contains the part of the source column before the
from
matching value. The second column is blank.If the value appears more than once, no additional splitting is made, since there is no other pattern parameter.
Anmerkung
For
after
,before
,from
, andto
, matching occurs only one time at most. Additional instances of the parameter's value in the cell do not cause another column split. For more predictable results, you should specify another pattern parameter.
If the
from
value does not appear in the column, the output value is original column value.This parameter is typically used with another to describe a field delimiting pattern. See below.
split col: MyCol from: 'go:' to:'stop:'
Output: Splits contents of MyCol
from go:
, including go:
to stop:
, including stop:
. Contents before the string appear in the first column, contents after the string appear in the second one.
Usage Notes:
Required? | Data Type |
---|---|
No | String or pattern |
The
after
andfrom
parameters are very similar.from
includes the matching value as part of the delimiter string.from
can be used with eitherto
orbefore
. See Pattern Clause Position Matching.
on
Identifies the pattern to match, which can be a string literal, Wrangle , or regular expression.
If the value does not appear in the source column, the original value is written to the first column of the split columns.
split col: MyCol on: `###ERROR`
Output: Column into two columns. The first column contains values in the column appearing before ###ERROR
, and the second column contains the values appearing after this string.
Tipp
You can insert the Unicode equivalent character for this parameter value using a regular expression of the form /\uHHHH/
. For example, /\u0013/
represents Unicode character 0013
(carriage return). For more information, see Supported Special Regular Expression Characters.
Usage Notes:
Required? | Data Type |
---|---|
No | String or pattern |
to
Identifies the pattern that marks the ending of the value to match. Pattern can be a string literal, Wrangle , or regular expression. The to
value is included in the match.
If this parameter is the only pattern describer:
The column is split into two. The first column is blank. The second column contains the part of the source column after the
to
matching value.If the value appears more than once, no additional splitting is made, since there is no other pattern parameter.
Anmerkung
For
after
,before
,from
, andto
, matching occurs only one time at most. Additional instances of the parameter's value in the cell do not cause another column split. For more predictable results, you should specify another pattern parameter.
If the
to
value does not appear in the column, the original column value is written to the first split column.This parameter is typically used with another to describe a field delimiting pattern. See below.
split col:MyCol from:'note:' to: ` `
Output: Splits MyCol
column all contents that appear before note:
in the first column and all contents that appear after the first space after note:
in the second column.
Usage Notes:
Required? | Data Type |
---|---|
No | String or pattern |
The
before
andto
parameters are very similar.to
includes the matching value as part of the delimiter string.to
can be used with eitherfrom
orafter
. See Pattern Clause Position Matching.
quote
Can be used to specify a string as a single quoted object. This parameter value can be one or more characters.
split col: MyLog on: `|` limit:10 quote: '"'
Output: Splits the MyLog
column, on the pipe character (|
), while ignoring any pipe characters that are found between double-quote characters in the column. Based on the value in the limit
parameter, the transform is limited to creating a maximum of 10 splits.
Usage Notes:
Required? | Data Type |
---|---|
No | String |
Parameter value is the quoted object.
The
quote
value can appear anywhere in the column value. It is not limited by the constraints of any other parameters.
delimiters
The delimiters
parameter specifies a comma-separated list of string literals or patterns to identify the delimiters to use to split the data. Values can be string literals, regular expressions, or Wrangle .
The sequence of values defines the order in which the delimiters are applied.
Values do not need to be the same.
split col:myCol delimiters:'|',' ','|'
Output: Splits the myCol
column into four separate columns, as indicated by the sequence of delimiters.
Usage Notes:
Anmerkung
Do not use the limit
or quote
parameters with the delimiters
parameter.
Required? | Data Type |
---|---|
No | Array of Strings (literal, regular expression,Wrangle) |
positions
The positions
parameter specifies a comma-separated list of integers that identify zero-based character index values at which to split the column. Values must be Integers.
split col:myCol positions:20,55,80
Tipp
Numeric values for positions do not need to be in sorted order.
Output: Splits the myCol
column into four separate columns, where:
column1
= characters 0-20 from the source column,column2
= characters 21-55column3
= characters 56-80column4
= characters 80 to the end of the cell value
Usage Notes:
Anmerkung
Do not use the limit
or quote
parameters with the positions
parameter.
Required? | Data Type |
---|---|
No | Array of Integers (literal, regular expression,Wrangle) |
every
The every
parameter can be used to specify fixed-width splitting of the source column. This Integer value defines the number of characters in each column of the split output.
If needed, you can use the every
parameter with the limit
parameter to define the maximum number of output columns:
split col:myCol every:20 limit:5
Output: Splits the myCol
column every 20 characters, with a limit of five splits. The sixth column contains all characters after the 100th character in the cell value.
Usage Notes:
Required? | Data Type |
---|---|
No | Integer |
Pattern Groups
When you build or edit a split
transform step in the Transform Builder, you can select one of the following pattern groups to apply to your transform. A pattern group is a set of related patterns that define a method of matching in a cell's data. Some pattern groups apply to multiple transforms, and some apply to the split
transform only.
Examples
Tipp
For additional examples, see Common Tasks.
Example - Split with single pattern delimiters
Source:
ColA | ColB | ColC |
---|---|---|
This my String That | abXcdXefXgh | 01AA001 |
my string This That | ijXklXmnXop | 02BB002 |
This That My String | qrXstXuvXwy | 03CC003 |
Transformation:
ColA: You can use the following transformation to split on the variations of My String
: In this case, the ignoreCase
parameter ensures that all variations on capitalization are matched:
Transformation Name | |
---|---|
Parameter: Column | ColA |
Parameter: Option | On pattern |
Parameter: Match pattern | 'My String' |
Parameter: Ignore case | true |
ColB: For this column, the letter x
is the split marker, and the data is consistently formatted with three instances per row:
Transformation Name | |
---|---|
Parameter: Column | ColB |
Parameter: Option | On pattern |
Parameter: Match pattern | 'X' |
Parameter: Number of matches | 3 |
ColC: In this column, the double-letter marker varies between the rows. However, it is consistently in the same location in each row:
Transformation Name | |
---|---|
Parameter: Column | ColC |
Parameter: Option | Sequence of positions |
Parameter: Positions | 2,4 |
Results:
When the above transforms are added, the source columns are dropped, leaving the following columns:
ColA1 | ColA2 | ColB1 | ColB2 | ColB3 | ColB4 | ColC1 | ColC2 |
---|---|---|---|---|---|---|---|
This | That | ab | cd | ef | gh | 01 | 001 |
This That | ij | kl | mn | op | 02 | 002 | |
This That | qr | st | uv | wy | 03 | 003 |
Example - Split with quoted values
This example demonstrates how to use quote
parameter for more sophisticated splitting of columns of data using the split
transform.
Source:
In this example, the following CSV data, which contains contact information, is imported into the application:
LastName,FirstName,Role,Company,Address,Status Wagner,Melody,VP of Engineering,Example.com,"123 Main Street, Oakland, CA 94601",Prospect Gruber,Hans,"Director, IT",Example.com,"456 Broadway, Burlingame, CA, 94401",Customer Franks,Mandy,"Sr. Manager, Analytics",Tricorp,"789 Market Street, San Francisco, CA, 94105",Customer
Transformationn:
When this data is pulled into the application, some initial parsing is performed for you:
column2 | column3 | column4 | column5 | column6 | column7 |
---|---|---|---|---|---|
LastName | FirstName | Role | Company | Address | Status |
Wagner | Melody | VP of Engineering | Example.com | "123 Main Street, Oakland, CA 94601" | Prospect |
Gruber | Hans | "Director, IT" | Example.com | "456 Broadway, Burlingame, CA, 94401" | Customer |
Franks | Mandy | "Sr. Manager, Analytics" | Tricorp | "789 Market Street, San Francisco, CA, 94105" | Customer |
When you open the Recipe Panel, you should see the following transforms:
Transformation Name | |
---|---|
Parameter: Column | column1 |
Parameter: Split on | \n |
Parameter: Ignore matches between | \" |
Parameter: Quote escape character | \" |
Transformation Name | |
---|---|
Parameter: Column | column1 |
Parameter: Option | On pattern |
Parameter: Match pattern | ',' |
Parameter: Number of Matches | 5 |
Parameter: Ignore matches between | \" |
The first transform splits the raw source data into separate rows in the carriage return character (\r
), ignoring all values between the double-quote characters. Note that this value must be escaped. The double-quote character does not require escaping. While there are no carriage returns within the actual data, the application recognizes that these double-quotes are identifying single values and adds the quote value.
The second transform splits each row of data into separate columns. Since it is comma-separated data, the application recognizes that this value is the column delimiter, so the on
value is set to the comma character (,
). In this case, the quoting is necessary, as there are commas in the values in column4
and column6
, which are easy to clean up.
To finish clean up of the dataset, you can promote the first row to be your column headers:
Transformation Name | |
---|---|
Parameter: Option | Use row(s) as column names |
Parameter: Type | Use a single row to name columns |
Parameter: Row number | 1 |
You can remove the quotes now. Note that the following applies to two columns:
Transformation Name | |
---|---|
Parameter: Column | Address,Role |
Parameter: Find | '\"' |
Parameter: Replace | '' |
Parameter: Match all occurrences | true |
Now, you can split up the Address
column. You can highlight one of the commas and the space after it in the column, but make sure that your final statement looks like the following:
Transformation Name | |
---|---|
Parameter: Column | column1 |
Parameter: Option | On pattern |
Parameter: Match pattern | ',' |
Parameter: Number of Matches | 2 |
Notice that there is some dirtiness to the resulting Address3
column:
Address3 |
---|
CA 94601 |
CA, 94401 |
CA, 94105 |
Use the following to remove the comma. In this case, it's important to leave the space between the two values in the column, so the on
value should only be a comma. Below, the width
value is two single quotes:
Transformation Name | |
---|---|
Parameter: Column | Address3 |
Parameter: Find | ',' |
Parameter: Replace | '' |
Parameter: Match all occurrences | true |
You can now split the Address3
column on the space delimiter:
Transformation Name | |
---|---|
Parameter: Column | Address3 |
Parameter: Option | by delimiter |
Parameter: Delimiter | ' ' |
Parameter: Number of columns to create | 2 |
Results:
After you rename the columns, you should see the following:
LastName | FirstName | Role | Company | Address | City | State | Zipcode | Status |
---|---|---|---|---|---|---|---|---|
Wagner | Melody | VP of Engineering | Example.com | 123 Main Street | Oakland | CA | 94601 | Prospect |
Gruber | Hans | Director, IT | Example.com | 456 Broadway | Burlingame | CA | 94401 | Customer |
Franks | Mandy | Sr. Manager, Analytics | Tricorp | 789 Market Street | San Francisco | CA | 94105 | Customer |
Example - Splitting with different delimiter types
This example shows how you can split data from a single column into multiple columns using delimiters.
single-pattern delimiter: One pattern is applied one or more times to the source column to define the delimiters for the output columns
multi-pattern delimiter: Multiple patterns, in the form of explicit strings, character index positions, or fixed-width fields, are used to split the column.
Source:
In this example, your CSV dataset contains status messages from a set of servers. In this case, the data about the server and the timestamp is contained in a single value within the CSV.
Server|Date Time,Status admin.examplecom|2016-03-05 07:04:00,down webapp.examplecom|2016-03-05 07:04:00,ok admin.examplecom|2016-03-05 07:04:30,rebooting webapp.examplecom|2016-03-05 07:04:00,ok admin.examplecom|2016-03-05 07:05:00,ok webapp.examplecom|2016-03-05 07:05:00,ok
Transformation:
When the data is first loaded into the Transformer page, the CSV data is split using the following two transformations:
Transformation Name | |
---|---|
Parameter: Column | column1 |
Parameter: Split on | \n |
Transformation Name | |
---|---|
Parameter: Column | column1 |
Parameter: Option | On pattern |
Parameter: Match pattern | ',' |
Parameter: Ignore matches between | \" |
You might need to add a header
as the first step:
Transformation Name | |
---|---|
Parameter: Option | Use row(s) as column names |
Parameter: Type | Use a single row to name columns |
Parameter: Row number | 1 |
At this point, your data should look like the following:
Server_Date_Time | Status |
---|---|
admin.example.com|2016-03-05 07:04:00 | down |
webapp.example.com|2016-03-05 07:04:00 | ok |
admin.example.com|2016-03-05 07:04:30 | rebooting |
webapp.example.com|2016-03-05 07:04:30 | ok |
admin.example.com|2016-03-05 07:05:00 | ok |
webapp.example.com|2016-03-05 07:05:00 | ok |
The first column contains three distinct sets of data: the server name, the date, and the time. Note that the delimiters between these fields are different, so you should use a multi-pattern delimiter to break them apart:
Transformation Name | |
---|---|
Parameter: Column | Server|Date Time |
Parameter: Option | Sequence of patterns |
Parameter: Pattern1 | ',' |
Parameter: Pattern2 | ' ' |
When the above is added, you should see three separate columns with the individual fields of information. Note that the source column has been automatically dropped.
Now, you decide that it would be useful to break apart the date information column into separate columns for year, month, and day. Since the column delimiter of this field is consistently a dash (-
), you can use a single-pattern delimiter with the following transformation:
Transformation Name | |
---|---|
Parameter: Column | Server|Date Time2 |
Parameter: Option | By delimiter |
Parameter: Delimiter | '-' |
Parameter: Number of columns to create | 2 |
Results:
After you rename the generated columns, your dataset should look like the following. Note that the source timestamp column has been automatically dropped.
server | year | month | day | time | Status |
---|---|---|---|---|---|
admin.example.com | 2016 | 03 | 05 | 07:04:00 | down |
webapp.example.com | 2016 | 03 | 05 | 07:04:00 | ok |
admin.example.com | 2016 | 03 | 05 | 07:04:30 | rebooting |
webapp.example.com | 2016 | 03 | 05 | 07:04:30 | ok |
admin.example.com | 2016 | 03 | 05 | 07:05:00 | ok |
webapp.example.com | 2016 | 03 | 05 | 07:05:00 | ok |