API Task - Run Job on Dataset with Parameters
Warning
API access is migrating to Enterprise only. Beginning in Release 9.5, all new or renewed subscriptions have access to public API endpoints on the Enterprise product edition only. Existing customers on non-Enterprise editions retain access to their available endpoints (Legacy) until their subscription expires. To use API endpoints after renewal, you must upgrade to the Enterprise product edition or use a reduced set of endpoints (Current). For more information on the differences between product editions in the new model, please visit Pricing and Packaging.
Overview
This example task describes how to run jobs on datasets with parameters through Dataprep by Trifacta.
A dataset with parameters is a dataset in which some part of the path to the data objects has been parameterized. Because one or more parts of the path can vary, you can build a dataset with parameters to capture data that spans multiple files. For example, datasets with parameters can be used to parameterize serialized data by region, date, or another variable. For more information on datasets with parameters, see Overview of Parameterization.
Basic Task
The basic method for building and running a job for a dataset with parameters is very similar to the method for a non-parameterized dataset, with a few notable exceptions. The steps in this task follow the same sequence as the standard task. Where the steps overlap, links to the non-parameterized task have been provided. For more information, see API Task - Develop a Flow.
Example Datasets
This example covers four different datasets, each of which features a different type of dataset with parameters.
Example Number | Parameter Type | Description |
---|---|---|
1 | Datetime parameter | In this example, a directory is used to store daily orders transactions. This dataset must be defined with a Datetime parameter to capture the preceding 7 days of data. Jobs can be configured to process all of this data as it appears in the directory. |
2 | Variable | This dataset segments data into four timezones across the US. These timezones are identified by the following text values in the path: eastern, central, mountain, and pacific. |
3 | Pattern parameter | This example is a directory containing point-of-sale transactions captured into individual files for each region. Since each region is identified by a two-digit numeric value (for example, 01, 02, or 03), a pattern parameter can be used to match all of the files in the directory. |
4 | Environment parameter | An environment parameter is defined by an admin and is available for every user of the project or workspace. In particular, environment parameters are useful for defining source bucket names, which may vary between environments in the same organization. |
Step - Create Containing Flow
You must create the flow to host your dataset with parameters.
In the response, you must capture and retain the flow Identifier. For more information, see API Task - Develop a Flow.
Step - Create Datasets with Parameters
Note
When you import a dataset with parameters, only the first matching file is used for the initial sample. If you want to see data from other matching files, you must collect a new sample within the Transformer page.
Example 1 - Dataset with Datetime parameter
Suppose your files are stored in the following paths:
MyFiles/1/Datetime/2018-04-06-orders.csv
MyFiles/1/Datetime/2018-04-05-orders.csv
MyFiles/1/Datetime/2018-04-04-orders.csv
MyFiles/1/Datetime/2018-04-03-orders.csv
MyFiles/1/Datetime/2018-04-02-orders.csv
MyFiles/1/Datetime/2018-04-01-orders.csv
MyFiles/1/Datetime/2018-03-31-orders.csv
When you navigate to the directory through the application, hover over one of these files and select Parameterize.
In the window, select the date value (e.g. YYYY-MM-DD) and then click the Datetime icon.
Datetime Parameter:
Format: YYYY-MM-DD
Date Range: Date is last 7 days.
Click Save.
The Datetime parameter should match all files in the directory. Import this dataset and wrangle it.
After you wrangle the dataset, return to its flow view and select the recipe. You should be able to extract the flowId and recipeId values from the URL.
For purposes of this example, here are some key values:
flowId: 35
recipeId: 127
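To make the date-range behavior concrete, here is a minimal Python sketch (standard library only) that generates the paths a "last 7 days" Datetime parameter with format YYYY-MM-DD would cover. The run date and the MyFiles/1/Datetime/ prefix are taken from this example; in practice, the platform evaluates the date range at job run time, not your client code.

```python
from datetime import date, timedelta

# Hypothetical run date; the platform normally evaluates "last 7 days"
# relative to the time the job actually runs.
run_date = date(2018, 4, 6)

# Build the YYYY-MM-DD values covered by a "last 7 days" Datetime parameter.
matching_paths = [
    f"MyFiles/1/Datetime/{(run_date - timedelta(days=n)).isoformat()}-orders.csv"
    for n in range(7)
]

for path in matching_paths:
    print(path)
# Prints the seven daily files from 2018-04-06 back to 2018-03-31,
# which matches the directory listing above.
```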
Example 2 - Dataset with Variable
Suppose your files are stored in the following paths:
MyFiles/1/variable/census-eastern.csv
MyFiles/1/variable/census-central.csv
MyFiles/1/variable/census-mountain.csv
MyFiles/1/variable/census-pacific.csv
When you navigate to the directory through the application, hover over one of these files and select Parameterize.
In the window, select the region value, which could be one of the following depending on the file: eastern, central, mountain, or pacific. Click the Variable icon.
Variable Parameter:
Name: region
Default Value: Set this default to pacific.
Click Save.
In this case, the variable only matches one value in the directory. However, when you apply runtime overrides to the region variable, you can set it to any value.
Import this dataset and wrangle it.
After you wrangle the dataset, return to its flow view and select the recipe. You should be able to extract the flowId and recipeId values from the URL.
For purposes of this example, here are some key values:
flowId: 33
recipeId: 123
Example 3 - Dataset with pattern parameter
Suppose your files are stored in the following paths:
MyFiles/1/pattern/POS-r01.csv
MyFiles/1/pattern/POS-r02.csv
MyFiles/1/pattern/POS-r03.csv
When you navigate to the directory through the application, hover over one of these files and select Parameterize.
In the window, select the two numeric digits (e.g. 02). Click the Pattern icon.
Pattern Parameter:
Type: Wrangle
Matching regular expression: {digit}{2}
Click Save.
In this case, the Wrangle pattern should match any sequence of two digits in a row. In the above example, this expression matches 01, 02, and 03, which covers all of the files in the directory.
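Note that {digit}{2} is Wrangle (Trifacta pattern) syntax rather than a plain regular expression. As an illustration only, the following Python sketch uses the roughly equivalent regular expression [0-9]{2} to show that all three example filenames are selected:

```python
import re

filenames = [
    "MyFiles/1/pattern/POS-r01.csv",
    "MyFiles/1/pattern/POS-r02.csv",
    "MyFiles/1/pattern/POS-r03.csv",
]

# {digit}{2} in Wrangle corresponds roughly to [0-9]{2} in regex terms.
pattern = re.compile(r"POS-r([0-9]{2})\.csv$")

for name in filenames:
    match = pattern.search(name)
    if match:
        print(name, "-> region", match.group(1))
# All three files match, so the dataset with parameters spans the whole directory.
```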
Import this dataset and wrangle it.
After you wrangle the dataset, return to its flow view and select the recipe. You should be able to extract the flowId and recipeId values from the URL.
For purposes of this example, here are some key values:
flowId: 32
recipeId: 121
Note
You have created flows for each type of dataset with parameters.
Example 4 - Dataset with parameterized bucket name
You can parameterize part or all of the bucket name in your source or target paths.
Suppose you have multiple workspaces that use different S3 buckets for sources of data. For example, your environments might look like the following:
Environment | S3 Bucket Name |
---|---|
Dev | myco-dev |
Prod | myco-prod |
For your datasources, you can parameterize the name of the bucket, so that if you migrate your flow between these environments, the references to datasources are updated based on the parameterized value for the bucket in the new environment.
Create environment parameter
Parameterized buckets are a good use case for environment parameters. An environment parameter is a parameter that is available for use by every user in the project or workspace. In this case, the bucket name can be referenced for all datasets in the project or workspace, so turning that value into a parameter makes managing your datasources much more efficient.
You can use the following example to create an environment parameter called env.bucketName, with a value of myco-dev. This environment parameter would be created in your Dev environment:
Note
The overrideKey value, which is the name of the environment parameter, must begin with the prefix env. (for example, env.bucketName).
Endpoint | http://www.example.com:3005/v4/environmentParameters |
---|---|
Authentication | Required |
Method | POST |
Request Body | { "overrideKey": "env.bucketName", "value": { "variable": { "value": "myco-dev" } } } |
Response | { "id": 1, "overrideKey": "env.bucketName", "value": { "variable": { "value": "myco-dev" } }, "createdAt": "2021-06-24T14:15:22Z", "updatedAt": "2021-06-24T14:15:22Z", "deleted_at": "2021-06-24T14:15:22Z", "usageInfo": { "runParameters": 1 } } |
For more information on creating environment parameters, see the Dataprep by Trifacta: API Reference docs.
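As a rough sketch, the same call can be made from Python with the requests library. The base URL and access token below are placeholders for this example, and your environment may use a different authentication scheme:

```python
import requests

# Placeholder values assumed for this sketch.
BASE_URL = "http://www.example.com:3005"
TOKEN = "<your-access-token>"

payload = {
    "overrideKey": "env.bucketName",
    "value": {"variable": {"value": "myco-dev"}},
}

resp = requests.post(
    f"{BASE_URL}/v4/environmentParameters",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
# Keep the environment parameter id if you need to update or delete it later.
print(resp.json()["id"])
```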
Create dataset with parameterized bucket name
The following example creates an imported dataset with two parameters:
Parameter Name | Parameter Type | Environment Parameter? | Description |
---|---|---|---|
myPath | path | No | The parameterized part of the path. The static value is The default value is In this case, for the job run, the value is overridden with |
env.bucketName | bucket | Yes | The parameterized part of the bucket path. The static value is In this case, for the job run, the value |
Endpoint | http://www.example.com:3005/v4/importedDatasets |
---|---|
Authentication | Required |
Method | POST |
Request Body | { "name": "Dummy Dataset", "uri": "/path", "description": "My S3 parameterized dataset", "type": "S3", "isDynamic": true, "runParameters": [ { "type": "path", "overrideKey": "myPath", "insertionIndices": [ { "index": 1, "order": 0 } ], "value": { "variable": { "value": "dummy2" } } }, { "type": "bucket", "overrideKey": "env.bucketParam", "insertionIndices": [ { "index": 5, "order": 0 } ], "value": { "variable": { "value": "dev" } } } ], "dynamicBucket": "myco-", "dynamicPath": "/" } |
Response | { "visible": true, "numFlows": 0, "path": "/dummy", "bucket": "", "type": "s3", "isDynamic": true, "runParameters": [ { "type": "path", "overrideKey": "myPath", "insertionIndices": [ { "index": 1, "order": 0 } ], "value": { "variable": { "value": "dummy2" } }, "isEnvironmentParameter": false }, { "type": "bucket", "overrideKey": "env.bucketParam", "insertionIndices": [ { "index": 5, "order": 0 } ], "value": { "variable": { "value": "dev" } }, "isEnvironmentParameter": true } ], "dynamicBucket": "myco-", "dynamicPath": "/" } |
For more information, see the Dataprep by Trifacta: API Reference docs.
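A comparable Python sketch for this call is shown below. The host and token are placeholders, and the request body is the same one shown in the table above:

```python
import requests

BASE_URL = "http://www.example.com:3005"   # placeholder for this example
TOKEN = "<your-access-token>"              # placeholder for this example

# Imported dataset with a parameterized path and a parameterized bucket.
dataset = {
    "name": "Dummy Dataset",
    "uri": "/path",
    "description": "My S3 parameterized dataset",
    "type": "S3",
    "isDynamic": True,
    "runParameters": [
        {
            "type": "path",
            "overrideKey": "myPath",
            "insertionIndices": [{"index": 1, "order": 0}],
            "value": {"variable": {"value": "dummy2"}},
        },
        {
            "type": "bucket",
            "overrideKey": "env.bucketParam",
            "insertionIndices": [{"index": 5, "order": 0}],
            "value": {"variable": {"value": "dev"}},
        },
    ],
    "dynamicBucket": "myco-",
    "dynamicPath": "/",
}

resp = requests.post(
    f"{BASE_URL}/v4/importedDatasets",
    json=dataset,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json())
```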
Step - Wrangle Data
After you have created your dataset with parameters, you can wrangle it through the application. For more information, see Transformer Page.
Step - Run Job
Below, you can review the API calls to run a job for each type of dataset with parameters, including relevant information about overrides.
Example 1 - Dataset with Datetime parameter
Note
You cannot apply overrides to these types of datasets with parameters. The following request contains overrides for write settings but no overrides for parameters.
Endpoint | http://www.example.com:3005/v4/jobGroups |
---|---|
Authentication | Required |
Method | POST |
Request Body | { "wrangledDataset": { "id": 127 }, "overrides": { "execution": "photon", "profiler": true, "writesettings": [ { "path": "MyFiles/queryResults/joe@example.com/2018-04-03-orders.csv", "action": "create", "format": "csv", "compression": "none", "header": false, "asSingleFile": false } ] }, "runParameters": {} } |
In the above example, the job has been launched for recipe 127 to execute on the Trifacta Photon running environment with profiling enabled. Output format is CSV, written to the designated path. Output is written as a new file, with no overwriting of previous files. For more information on these properties, see the Dataprep by Trifacta: API Reference docs.
A response code of 201 - Created is returned. The response body should look like the following:
{ "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1", "reason": "JobStarted", "jobGraph": { "vertices": [ 21, 22 ], "edges": [ { "source": 21, "target": 22 } ] }, "id": 29, "jobs": { "data": [ { "id": 21 }, { "id": 22 } ] } }
Retain the jobgroupId=29 value for monitoring.
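For reference, here is a minimal Python sketch of the same job request. The host, token, and output path are placeholders from this example, and error handling is reduced to a single raise_for_status call:

```python
import requests

BASE_URL = "http://www.example.com:3005"  # placeholder
TOKEN = "<your-access-token>"             # placeholder

job_request = {
    "wrangledDataset": {"id": 127},
    "overrides": {
        "execution": "photon",
        "profiler": True,
        "writesettings": [
            {
                "path": "MyFiles/queryResults/joe@example.com/2018-04-03-orders.csv",
                "action": "create",
                "format": "csv",
                "compression": "none",
                "header": False,
                "asSingleFile": False,
            }
        ],
    },
    "runParameters": {},
}

resp = requests.post(
    f"{BASE_URL}/v4/jobGroups",
    json=job_request,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()            # expect 201 - Created
job_group_id = resp.json()["id"]   # 29 in the example response above
print(job_group_id)
```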
Example 2 - Dataset with Variable
In the following example, the region variable has been overwritten with the value central to execute the job on census-central.csv:
Endpoint | http://www.example.com:3005/v4/jobGroups |
---|---|
Authentication | Required |
Method | POST |
Request Body | { "wrangledDataset": { "id": 123 }, "overrides": { "execution": "photon", "profiler": true, "writesettings": [ { "path": "MyFiles/queryResults/joe@example.com/region-eastern.csv", "action": "create", "format": "csv", "compression": "none", "header": false, "asSingleFile": false } ] }, "runParameters": { "overrides": { "data": [{ "key": "region", "value": "central" } ]} } } |
In the above example, the job has been launched for recipe 123 to execute on the Trifacta Photon running environment with profiling enabled. Output format is CSV, written to the designated path. Output is written as a new file, with no overwriting of previous files. For more information on these properties, see the Dataprep by Trifacta: API Reference docs.
A response code of 201 - Created is returned. The response body should look like the following:
{ "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1", "reason": "JobStarted", "jobGraph": { "vertices": [ 21, 22 ], "edges": [ { "source": 21, "target": 22 } ] }, "id": 27, "jobs": { "data": [ { "id": 21 }, { "id": 22 } ] } }
Retain the jobgroupId=27 value for monitoring.
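The interesting difference from the previous request is the runParameters block, which carries the variable override. A minimal Python sketch follows (placeholder host and token; write settings omitted here for brevity):

```python
import requests

BASE_URL = "http://www.example.com:3005"  # placeholder
TOKEN = "<your-access-token>"             # placeholder

job_request = {
    "wrangledDataset": {"id": 123},
    "runParameters": {
        # Override the "region" variable so only census-central.csv is processed.
        "overrides": {"data": [{"key": "region", "value": "central"}]}
    },
}

resp = requests.post(
    f"{BASE_URL}/v4/jobGroups",
    json=job_request,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json()["id"])  # jobgroupId, 27 in the example response above
```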
Example 3 - Dataset with pattern parameter
Note
You cannot apply overrides to these types of datasets with parameters. The following request contains overrides for write settings but no overrides for parameters.
Endpoint | http://www.example.com:3005/v4/jobGroups |
---|---|
Authentication | Required |
Method | POST |
Request Body | { "wrangledDataset": { "id": 121 }, "overrides": { "execution": "photon", "profiler": false, "writesettings": [ { "path": "hdfs://hadoop:50070/trifacta/queryResults/admin@example.com/POS-r02.txt", "action": "create", "format": "csv", "compression": "none", "header": false, "asSingleFile": false } ] }, "runParameters": {} } |
In the above example, the job has been launched for recipe 121 to execute on the Trifacta Photon running environment with profiling disabled. Output format is CSV, written to the designated path. Output is written as a new file, with no overwriting of previous files. For more information on these properties, see the Dataprep by Trifacta: API Reference docs.
A response code of 201 - Created is returned. The response body should look like the following:
{ "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1", "reason": "JobStarted", "jobGraph": { "vertices": [ 21, 22 ], "edges": [ { "source": 21, "target": 22 } ] }, "id": 28, "jobs": { "data": [ { "id": 21 }, { "id": 22 } ] } }
Retain the jobgroupId=28 value for monitoring.
Example 4 - Dataset with parameterized bucket name
The following example contains a parameterized bucket reference, with a specified override value. Administrators and project owners can specify the default value for environment parameters, and users can specify overrides for these values at job execution time.
Endpoint | http://www.example.com:3005/v4/jobGroups |
---|---|
Authentication | Required |
Method | POST |
Request Body | { "wrangledDataset": { "id": 121 }, "runParameters": { "overrides": { "data": [ { "key": "env.bucketName", "value": "myco-dev2" } ] } } } |
In the above example, the job has been launched for recipe 121 to execute with the env.bucketName override value (myco-dev2) for the environment parameter.
For more information on these properties, see the Dataprep by Trifacta: API Reference docs.
Step - Monitor Your Job
After the job has been created and you have captured the jobGroup id, you can use it to monitor the status of your job. For more information, see the Dataprep by Trifacta: API Reference docs.
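As a sketch of one way to poll for completion from Python, assuming the job group can be fetched at /v4/jobGroups/&lt;id&gt; and exposes a status field; the terminal status values below are illustrative, so check the API Reference docs for the authoritative endpoint and values:

```python
import time
import requests

BASE_URL = "http://www.example.com:3005"  # placeholder
TOKEN = "<your-access-token>"             # placeholder
job_group_id = 29                         # from the create-job response

# Poll until the job group reaches a terminal state.
while True:
    resp = requests.get(
        f"{BASE_URL}/v4/jobGroups/{job_group_id}",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    resp.raise_for_status()
    status = resp.json().get("status")
    print(status)
    if status in ("Complete", "Failed", "Canceled"):  # assumed terminal values
        break
    time.sleep(10)
```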
Step - Re-run Job
If you need to re-run the job as specified, you can use the wrangledDataset identifier to re-run the most recent job.
Tip
When you re-run a job, you can change any variable values as part of the request.
Example request:
Endpoint | http://www.example.com:3005/v4/jobGroups |
---|---|
Authentication | Required |
Method | POST |
Request Body | { "wrangledDataset": { "id": 123 }, "runParameters": { "overrides": { "data": [{ "key": "region", "value": "central" } ]} } } |
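As a sketch, a small Python helper can wrap this re-run request and let you swap in different variable values per run. The helper name, host, and token are illustrative only, not part of the product API:

```python
import requests

BASE_URL = "http://www.example.com:3005"  # placeholder
TOKEN = "<your-access-token>"             # placeholder

def rerun_job(recipe_id, variable_overrides=None):
    """Re-run the most recent job for a recipe, optionally overriding variables.

    variable_overrides is a dict such as {"region": "central"}; this helper
    and its shape are illustrative, not part of the product API.
    """
    body = {"wrangledDataset": {"id": recipe_id}}
    if variable_overrides:
        body["runParameters"] = {
            "overrides": {
                "data": [{"key": k, "value": v} for k, v in variable_overrides.items()]
            }
        }
    resp = requests.post(
        f"{BASE_URL}/v4/jobGroups",
        json=body,
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    resp.raise_for_status()
    return resp.json()["id"]

# Example: re-run the Example 2 job with a different region value.
print(rerun_job(123, {"region": "mountain"}))
```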
For more information, see API Task - Develop a Flow.