Extracting one or more values from within a column of values can turn data into meaningful and discrete information. This section describes how to extract column data, the methods for which may vary depending on the data type.
Extract and split transformations do not do the same thing:
A split transformation separates a single column into one or more separate columns based on one or more values in the source column that identify where the data should be split. These delimiters can be determined by the application or specified by the user when defining the transformation.
An extract transformation matches literal or pattern values from a source column and stores it in a separate column.
Nota
The source column is untouched by extract transformations.
In the Transformer page, you can use the following methods to extract values:
Method | Description |
---|---|
By selection | Select part of a value in the data grid to prompt a series of suggestions on what to do with the data. Typically, extract options are near the top of the suggestions when you select part of a value. |
By column menu | From the menu to the right of the column, select Extract and a sub-menu item to begin configuring a transformation. |
By Transformer toolbar | At the top of the data grid, click the Extract icon in the Transformer toolbar to begin configuring extract transformations. |
By Search panel | In the Search panel, enter |
A primary use of extraction is to remove literal or patterned values of text from a column of values. Suppose your dataset included a column of LinkedIn updates. You can use one of the following methods to extract keywords from these values.
The following example transformation extracts the word #bigdata
from the column msg_LinkedIn
:
Transformation Name | |
---|---|
Parameter: Column to extract from | msg_LinkedIn |
Parameter: Option | Custom text or pattern |
Parameter: Text to extract | '#bigdata' |
Parameter: Number of matches to extract | 1 |
Notes:
The
option
parameter identifies that the pattern to match is a custom one specified by the user.The
Number of matches to extract
parameter defaults to1
, meaning that the transformation extracts a maximum of one value from each cell. This value can be set from 1-50.
You can generate a new column of values extracted from a source column by entering example values to match with source values. Values with similar patterns may also be matched based on your entered example value.
Suggerimento
This method provides an easy way to build pattern-based matching for values in a source column.
For more information on transformation by example, see Overview of TBE.
Within the extract transformation, you can specify literals or patterns before or after which the match is found. This method can be used to remove parts of each cell value from erroneously matching on the literal or pattern that is desired.
The following example extracts the second three-digit element of a phone number, skipping the area code:
Transformation Name | |
---|---|
Parameter: Column to extract from | phone_num |
Parameter: Option | Custom text or pattern |
Parameter: Text to extract | `{digit}` |
Parameter: Number of matches to extract | 1 |
Parameter: Ignore matches between | `{start}{digit}{3}\-` |
You can also do pattern-based extractions using Alteryx patterns or regular expressions.
Regular expressions are a standards-based method of describing patterns of characters for matching purposes. Regular expressions are very powerful but can be difficult to use.
A Alteryx pattern is a proprietary method of describing patterns, which is much simpler to use than regular expressions.
For more information on both types of patterns, see Text Matching.
The following example extracts all words that begin with #
in the msg_LinkedIn
column:
Transformation Name | |
---|---|
Parameter: Column to extract from | msg_LinkedIn |
Parameter: Option | Custom text or pattern |
Parameter: Text to extract | `\#{alphanum-underscore}+` |
Parameter: Number of matches to extract | 50 |
Notes:
The
Text to extract
parameter has changed:Element
Description
Two back-ticks (
`
)Indicate that the expression between them represents a Alteryx pattern.
\#
The slash indicates that the character right after it should be interpreted as a character only; it should not be interpreted as any special character in the pattern.
{alphanum-underscore}
This Alteryx pattern element is used to indicate a single alphanumeric or underscore character.
+
Adding the plus sign after the above character signifies that the pattern can match on a sequence of alphanumeric or underscore characters of one or more length.
The
Number of matches to extract
parameter has been increased to grab up to 50 hashtags.
Option | Description |
---|---|
Number of patterns to extract | Set this value to the total number of patterns you wish to extract. Nota This value determines the number of columns that are generated by the extraction. If no value is available, an empty value is written into the corresponding column. The default is |
Ignore case | By default, pattern matching is case-sensitive. Select this checkbox to ignore case when matching. |
Ignore matches between | You can enter a pattern here to describe any patterns that should not be part of any match. This option is useful if you have multiple instances of text but want to ignore the first one, for example. |
In your pattern expressions, you can use the vertical pipe character (|
) to define multiple patterns to find. The following example extracts any value from the myDate
column that ends in 7
pr in 8
:
Transformation Name | |
---|---|
Parameter: Column to extract from | myDate |
Parameter: Text to extract | `{any}+7|{any}+8` |
Parameter: End extracting before | `{end}` |
You can use the vertical pipe in both Alteryx patterns and regular expressions.
You can extract the first or last set of characters from a column into a new column. In the following example, the first five characters from the ProductName
column are extracted into a new product identifier column:
Transformation Name | |
---|---|
Parameter: Column to extract from | ProductName |
Parameter: Option | First characters |
Parameter: Number of characters to extract | 5 |
You can change the Option value to Last characters
to extract from the right side of the column value.
If you need to remove the characters that you extracted, you can use the following transformation. In this case, the first five characters, which were extracted in the previous transformation, are removed:
Transformation Name | |
---|---|
Parameter: Columns | ProductName |
Parameter: Formula | RIGHT(ProductName, LEN(ProductName)-5) |
You can extract values between specified index positions within a set of column values. In the following example, the text between the fifth and tenth characters in a column are extracted to a new column.
Suggerimento
This extraction method is useful if the content before and after the match area is inconsistent and cannot be described using patterns. If it is consistent, you should use the Extract text or pattern transformation.
Transformation Name | |
---|---|
Parameter: Column to extract from | ProductName |
Parameter: Option | Between two positions |
Parameter: Starting position | 5 |
Parameter: Ending position | 10 |
You can perform extractions that are specific to a data type or based on failures of the data to match a specified data type.
You can use functions to extract values from Datetime columns. The example below extracts the year value from the myDate
column:
Transformation Name | |
---|---|
Parameter: Formula type | Single row formula |
Parameter: Formula | YEAR(myDate) |
Parameter: New column name | myYear |
The following functions can be used to extract values from a Datetime column, as long as the values are present in the formatted date:
Function Name | Description |
---|---|
Derives the numeric day value from a Datetime value. Source value can be a a reference to a column containing Datetime values or a literal. | |
Derives the month integer value from a Datetime value. Source value can be a a reference to a column containing Datetime values or a literal. | |
Derives the four-digit year value from a Datetime value. Source value can be a a reference to a column containing Datetime values or a literal. | |
Derives the hour value from a Datetime value. Generated hours are expressed according to the 24-hour clock. | |
Derives the minutes value from a Datetime value. Minutes are expressed as integers from 0 to 59. | |
Derives the seconds value from a Datetime value. Source value can be a a reference to a column containing Datetime values or a literal. |
You can also reformat the whole Datetime column using the DATEFORMAT
function. The following reformats the column to show only the two-digit year:
Transformation Name | |
---|---|
Parameter: Columns | myDate |
Parameter: Formula | DATEFORMAT(myDate, "yy") |
You can extract numerical data from text values. In the following example, the first number is extracted from the address
column, which would correspond to extracting the street number for the address:
Transformation Name | |
---|---|
Parameter: Column to extract from | address |
Parameter: Option | Numbers |
Parameter: Number of matches to extract | 1 |
Empty values in this new column might indicate a formatting problem with the address.
Suggerimento
If you set the number of patterns to extract to 2
for the address
column, you might extract apartment or suite information.
Using functions, you can extract specific elements of a valid URL. The following transformation pulls the domain values from the myURL
column:
Transformation Name | |
---|---|
Parameter: Formula type | Single row formula |
Parameter: Formula | DOMAIN(myURL) |
Parameter: New column name | myDomain |
In some cases, the function may not return values. For example, the SUBDOMAIN function returns empty values if there is no sub-domain part of the URL.
The following functions can be used to extract values from a set of URLs:
Function Name | Description |
---|---|
Finds the host value from a valid URL. Input values must be of URL or String type and can be literals or column references. | |
Finds the value for the domain from a valid URL. Input values must be of URL or String type. | |
Finds the value a subdomain value from a valid URL. Input values must be of URL or String type. | |
Finds the suffix value after the domain from a valid URL. Input values must be of URL or String type. | |
Extracts the query parameters of a URL into an Object. The Object keys are the parameter's names, and its values are the parameter's values. Input values must be of URL or String type. |
You can extract query parameter values from an URL. The following example extracts the store_id
value from the storeURL
field value:
Transformation Name | |
---|---|
Parameter: Column to extract from | storeURL |
Parameter: Option | HTTP Query strings |
Parameter: Fields to extract | store_id |
If your data includes sets of arrays, you can extract array elements into columns for each key, with the values written to each key column.
Suppose your restaurant dataset includes a set of characteristics in the restFeatures
column in the following JSON format:
{
"Credit": "Y",
"Accessible": "Y",
"Restrooms": "Y",
"EatIn": "Y",
"ToGo": "N",
"AlcoholBeer": "Y",
"AlcoholHard": "N",
"TotalTables": "10",
"TotalTableSeats": "36",
"Counter": "Y",
"CounterSeats": "8"
}
You can use the following transformation to extract the values from TotalTableSeats
and CounterSeats
into separate columns:
Transformation Name | |
---|---|
Parameter: Column | restFeatures |
Parameter: Paths to elements - 1 | TotalTableSeats |
Parameter: Paths to elements - 2 | CounterSeats |
Parameter: Include original column name | Selected |
After the above is executed, you can perform a simple sum of the TotalTableSeats
and CounterSeats
columns to determine the total number of seats in the restaurant.
In some cases, your data may contain arrays of repeated key-value pairs, where each pair would exist on a separate line. Suppose you have a column called, Events
, which contains date and time information about the musician described in the same row of data. The Events
column might look like the following:
[{"Date":"2018-06-15","Time":"19:00"},{"Date":"2018-06-17","Time":"19:00"},{"Date":"2018-06-19","Time":"20:00"},{"Date":"2018-06-20","Time":"20:00"}]
The following transformation creates a separate row for each entry in the Events
column, populating the other fields in the new rows with the data from the original row:
Nota
This type of transformation can significantly increase the size of your dataset.
Transformation Name | |
---|---|
Parameter: Column | Events |
You can also extract sets of values into an array list of values.
Suggerimento
This transformation is useful for extracting types or patterns of information from a single column.
Using Alteryx patterns, you can extract the values of the column to form a new column of arrays. The following example shows the usage of {any} pattern to extract the cell values and form a new array column.
Transformation:
Transformation Name | |
---|---|
Parameter: Column | product |
Parameter: Pattern matching elements in the list | `{any}` |
Parameter: Delimiter separating each element | `,` |
Results:
Before | After |
---|---|
socks, socks, socks | ["socks", "socks", "socks"] |
pants, pants | ["pants", "pants"] |
In this example, you extract one or more values from a source column and assemble them in an Array column.
Suppose you need to extract the hashtags from customer tweets to another column. In such cases, you can use the {hashtag}
Alteryx pattern to extract all hashtag values from a customer's tweets into a new column.
Source:
The following dataset contains customer tweets across different locations.
User Name | Location | Customer tweets |
---|---|---|
James | U.K | Excited to announce that we’ve transitioned Wrangler from a hybrid desktop application to a completely cloud-based service! #dataprep #businessintelligence #CommitToCleanData # London |
Mark | Berlin | Learnt more about the importance of identifying issues in your data—early and often #CommitToCleanData #predictivetransformations #realbusinessintelligence |
Catherine | Paris | Clean data is the foundation of your analysis. Learn more about what we consider the five tenets of sound #dataprep, starting with #1a prioritizing and setting targets. #startwiththeuser #realbusinessintelligence #Paris |
Dave | New York | Learn how #NewYorklife onboarded as part of their #bigdata #dataprep initiative to unlock hidden insights and make them accessible across departments. |
Christy | San Francisco | How can you quickly determine the number of times a user ID appears in your data?#dataprep #pivot #aggregation#machinelearning initiatives #SFO |
Transformation:
The following transformation extracts the hashtag messages from customer tweets.
Transformation Name | |
---|---|
Parameter: Column | customer_tweets |
Parameter: Pattern matching elements in the list | `{hashtag}` |
Parameter: New column name | Hashtag tweets |
Then, the source column can be deleted.
Results:
User Name | Location | Hashtag tweets |
---|---|---|
James | U.K | ["#dataprep", "#businessintelligence", "#CommitToCleanData", " # London"] |
Mark | Berlin | ["#CommitToCleanData", "#predictivetransformations", "#realbusinessintelligence", "0"] |
Catherine | Paris | ["#dataprep", "#startwiththeuser","#realbusinessintelligence", "# Paris"] |
Dave | New York | ["#NewYorklife", "dataprep", "bigdata", "0"] |
Christy | SanFrancisco | [ "dataprep", "#pivot", "#aggregation", "#machinelearning"] |