API Task - Run Job
Warning
API access is migrating to Enterprise only. Beginning in Release 9.5, all new or renewed subscriptions have access to public API endpoints on the Enterprise product edition only. Existing customers on non-Enterprise editions retain access to their available endpoints (Legacy) until their subscription expires. To use API endpoints after renewal, you must upgrade to the Enterprise product edition or use a reduced set of endpoints (Current). For more information on the differences between product editions under the new model, see Pricing and Packaging.
This section describes how to run a job using the APIs available in Dataprep by Trifacta.
A note about API URLs:
In the listed examples, URLs are referenced in the following manner:
<protocol>://<platform_base_url>/
In your product, these references map to the following:
https://www.api.clouddataprep.com/
For more information, see API Reference.
Run Job Endpoints
Depending on the type of job that you are running, you must use one of the following endpoints:
Run job
Run a job to generate the outputs from a single recipe in a flow.
Tip
This method is covered on this page.
Endpoint | /v4/jobGroups |
---|---|
Method | POST |
Reference documentation | Dataprep by Trifacta: API Reference docs |
Run flow
Run all outputs specified in a flow. Optionally, you can run all scheduled outputs.
Endpoint | /v4/flows/:id/run |
---|---|
Method | POST |
Reference documentation | Dataprep by Trifacta: API Reference docs |
Prerequisites
Before you begin, you should verify the following:
Get authentication credentials. As part of each request, you must pass in authentication credentials to the platform. For more information, see Manage API Access Tokens.
For more information, see Dataprep by Trifacta: API Reference docs
Verify job execution. Run the desired job through the Trifacta Application and verify that the output objects are properly generated.
Note
By default, when scheduled or API jobs are executed, no validation is performed on the writesettings objects for file-based outputs. Issues with these objects may cause failures during the transformation or publishing stages of job execution. Jobs of these types should be tested through the Trifacta Application first. A workspace administrator can configure the workspace so that these validations are not skipped.
Acquire recipe (wrangled dataset) identifier. In Flow View, click the icon for the recipe whose outputs you wish to generate. Acquire the numeric value for the recipe from the URL. In the following URL, the recipe Id is 28629:
http://<platform_base_url>/flows/5479?recipe=28629&tab=recipe
Create output object. A recipe must have at least one output object created for it before you can run a job via APIs. For more information, see Flow View Page.
If you wish to apply overrides to the inputs or outputs of the recipe, you should acquire those identifiers or paths now. For more information, see "Run Job with Parameter Overrides" below.
Step - Run Job
Through the APIs, you can specify and run a job. To run a job with all default settings, construct a request like the following:
Note
A wrangledDataset is an internal object name for the recipe that you wish to run. See the previous section for how to acquire this value.
Tip
This basic request runs the job using the settings that are already defined for the recipe's output objects in the Trifacta Application. To apply overrides as part of the API request, see the override sections later on this page.
Endpoint | <protocol>://<platform_base_url>/v4/jobGroups |
---|---|
Authentication | Required |
Method | POST |
Request Body | { "wrangledDataset": { "id": 28629 } } |
Response Code | 201 - Created |
Response Body | { "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1", "reason": "JobStarted", "jobGraph": { "vertices": [ 21, 22 ], "edges": [ { "source": 21, "target": 22 } ] }, "id": 961247, "jobs": { "data": [ { "id": 21 }, { "id": 22 } ] } } |
If the 201 response code is returned, then the job has been queued for execution.
Tip
Retain the id value in the response. In the above, 961247 is the internal identifier for the job group for the job. You will need this value to check on your job status.
For more information, see Dataprep by Trifacta: API Reference docs
Tip
You have queued your job for execution.
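If you are scripting this call, the following is a minimal sketch in Python using the requests library. The base URL and recipe id (28629) come from the examples on this page; the DATAPREP_TOKEN environment variable and the run_job helper name are assumptions for illustration, and the access token is passed as a Bearer token per Manage API Access Tokens.

```python
import os

import requests

# Assumption for illustration: an access token stored in the DATAPREP_TOKEN environment variable.
API_TOKEN = os.environ["DATAPREP_TOKEN"]
BASE_URL = "https://www.api.clouddataprep.com"
HEADERS = {
    "Authorization": f"Bearer {API_TOKEN}",
    "Content-Type": "application/json",
}


def run_job(recipe_id: int) -> int:
    """Queue a job for the given recipe (wrangledDataset) and return the jobGroup id."""
    response = requests.post(
        f"{BASE_URL}/v4/jobGroups",
        headers=HEADERS,
        json={"wrangledDataset": {"id": recipe_id}},
    )
    response.raise_for_status()  # expect 201 - Created
    return response.json()["id"]


job_group_id = run_job(28629)
print(f"Queued jobGroup {job_group_id}")
```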
Step - Monitoring Your Job
You can monitor the status of your job through the following endpoint:
Endpoint | <protocol>://<platform_base_url>/v4/jobGroups/<id>/ |
---|---|
Authentication | Required |
Method | GET |
Request Body | None. |
Response Code | 200 - Ok |
Response Body | { "id": 961247, "name": null, "description": null, "ranfrom": "ui", "ranfor": "recipe", "status": "Complete", "profilingEnabled": true, "runParameterReferenceDate": "2019-08-20T17:46:27.000Z", "createdAt": "2019-08-20T17:46:28.000Z", "updatedAt": "2019-08-20T17:53:17.000Z", "workspace": { "id": 22 }, "creator": { "id": 38 }, "updater": { "id": 38 }, "snapshot": { "id": 774476 }, "wrangledDataset": { "id": 28629 }, "flowRun": null } |
When the job has successfully completed, the returned status message includes the following:
"status": "Complete",
For more information, see Dataprep by Trifacta: API Reference docs
Tip
You have executed the job. Results have been delivered to the designated output locations.
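If you are scripting the whole flow, the following sketch polls the monitoring endpoint above until the status value reaches Complete. The token variable, polling interval, and timeout are illustrative assumptions; statuses other than Complete (for example, failure states) are intentionally left for the caller to handle.

```python
import os
import time

import requests

API_TOKEN = os.environ["DATAPREP_TOKEN"]  # assumed token variable, as in the earlier sketch
BASE_URL = "https://www.api.clouddataprep.com"
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}


def wait_for_job(job_group_id: int, poll_seconds: int = 30, max_polls: int = 120) -> str:
    """Poll the jobGroup until it reports Complete, or raise after max_polls attempts."""
    for _ in range(max_polls):
        response = requests.get(f"{BASE_URL}/v4/jobGroups/{job_group_id}", headers=HEADERS)
        response.raise_for_status()
        status = response.json()["status"]
        print(f"jobGroup {job_group_id}: {status}")
        if status == "Complete":
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"jobGroup {job_group_id} did not complete within the polling window")


wait_for_job(961247)
```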
Step - Re-run Job
In the future, you can re-run the job using the same, simple request:
Endpoint | <protocol>://<platform_base_url>/v4/jobGroups |
---|---|
Authentication | Required |
Method | POST |
Request Body | { "wrangledDataset": { "id": 28629 } } |
The job is re-run as it was previously specified.
For more information, see Dataprep by Trifacta: API Reference docs
Step - Run Job with Overrides - Files
As needed, you can specify runtime overrides for any of the settings related to the job definition or its outputs. For file-based jobs, these overrides include:
Data sources
Execution environment
Profiling
Output file, format, and other settings
Input file overrides
You can override the file-based data sources for your job run. In the following example, two datasets are overridden with new files.
Note
Overrides for data sources apply only to file-based sources. File-based sources that are converted during ingestion, such as Microsoft Excel files and JSON files, cannot be swapped in this manner.
Note
Overrides must be applied to the entire file path. As part of these overrides, you can redefine the bucket from which the source data is taken.
Endpoint | <protocol>://<platform_base_url>/v4/jobGroups |
---|---|
Authentication | Required |
Method | POST |
Request Body | { "wrangledDataset": { "id": 28629 }, "overrides": { "datasources": { "airlines - region 1": [ "s3://my-new-bucket/test-override-input/airlines3.csv", "s3://my-new-bucket/test-override-input/airlines4.csv", "s3://my-new-bucket/test-override-input/airlines5.csv" ], "airlines - region 2": [ "s3://my-new-bucket/test-override-input/airlines1.csv" ] } } } |
The job specified for recipe 28629 is re-run using the new data sources.
Notes:
The names of the datasources (airlines - region 1 and airlines - region 2) refer to the display name values for the datasets that are the sources for the wrangledDataset (recipe) in the flow.
You can use this API method to overwrite the bucket name for your source, but you must replace the entire path.
The parameterized list of files can be from different folders, too.
File type and size information is not displayed in the Job Details page for these overridden jobs.
No validation is performed on the existence of these files prior to execution. If the files do not exist, the job fails.
For more information, see Dataprep by Trifacta: API Reference docs
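The same request expressed as a Python sketch follows. The overrides.datasources object maps each source dataset's display name to the full paths of the replacement files; the bucket and file names are the placeholder values from the example above, and the token variable is an assumption.

```python
import os

import requests

API_TOKEN = os.environ["DATAPREP_TOKEN"]  # assumed token variable
BASE_URL = "https://www.api.clouddataprep.com"
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}

# Keys are the display names of the source datasets; values are complete paths to the new files.
payload = {
    "wrangledDataset": {"id": 28629},
    "overrides": {
        "datasources": {
            "airlines - region 1": [
                "s3://my-new-bucket/test-override-input/airlines3.csv",
                "s3://my-new-bucket/test-override-input/airlines4.csv",
                "s3://my-new-bucket/test-override-input/airlines5.csv",
            ],
            "airlines - region 2": [
                "s3://my-new-bucket/test-override-input/airlines1.csv",
            ],
        }
    },
}

response = requests.post(f"{BASE_URL}/v4/jobGroups", headers=HEADERS, json=payload)
response.raise_for_status()  # expect 201 - Created
print("Queued jobGroup", response.json()["id"])
```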
Output file overrides
Note
Override values applied to a job are not validated. Invalid overrides may cause your job to fail.
See Dataprep by Trifacta: API Reference docs
Acquire the internal identifier for the recipe for which you wish to execute a job. In the previous example, this identifier was 28629.
Construct a request using the following:
Endpoint
<protocol>://<platform_base_url>/v4/jobGroups
Authentication
Required
Method
POST
Request Body:
{ "wrangledDataset": { "id": 28629 }, "overrides": { "profiler": true, "execution": "spark", "writesettings": [ { "path": "<new_path_to_output>", "format": "csv", "header": true, "asSingleFile": true, "includeMismatches": true } ] }, "ranfrom": null }
In the above example, the job has been launched with the following overrides:
Job will be executed on the Spark cluster. Other supported values depend on your product edition and available running environments:
Value for overrides.execution | Description |
---|---|
photon | Running environment on the Trifacta node |
spark | Spark on the integrated cluster |
databricksSpark | Spark on Azure Databricks |
emrSpark | Spark on AWS EMR |
dataflow | Dataflow |
Job will be executed with profiling enabled.
Output is written to a new file path.
Output format is CSV to the designated path.
Output has a header and is generated as a single file.
Output will include values that are mismatched against the column's data type.
Note
includeMismatches is false by default. You can set it to true as an override or as part of the output object definition.
A response code of 201 - Created is returned. The response body should look like the following:
{ "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1", "reason": "JobStarted", "jobGraph": { "vertices": [ 21, 22 ], "edges": [ { "source": 21, "target": 22 } ] }, "id": 962221, "jobs": { "data": [ { "id": 21 }, { "id": 22 } ] } }
Retain the id value, which is the job identifier, for monitoring.
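A Python sketch of this request follows. The <new_path_to_output> placeholder is left as-is for you to replace, and the token variable is an assumption; the overrides mirror the request body above.

```python
import os

import requests

API_TOKEN = os.environ["DATAPREP_TOKEN"]  # assumed token variable
BASE_URL = "https://www.api.clouddataprep.com"
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}

payload = {
    "wrangledDataset": {"id": 28629},
    "overrides": {
        "profiler": True,      # run profiling for this job
        "execution": "spark",  # see the overrides.execution table above
        "writesettings": [
            {
                "path": "<new_path_to_output>",  # replace with your output location
                "format": "csv",
                "header": True,
                "asSingleFile": True,
                "includeMismatches": True,
            }
        ],
    },
    "ranfrom": None,
}

response = requests.post(f"{BASE_URL}/v4/jobGroups", headers=HEADERS, json=payload)
response.raise_for_status()  # expect 201 - Created
print("Queued jobGroup", response.json()["id"])
```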
Step - Run Job with Overrides - Tables
Note
This feature may not be available in all product editions. For more information on available features, see Compare Editions.
You can also pass job definition overrides for table-based outputs. For table outputs, overrides include:
Path to database to which to write (must have write access)
Connection to write to the target.
Tip
This identifier is for the connection used to write to the target system. This connection must already exist. For more information on how to retrieve the identifier for a connection, see Dataprep by Trifacta: API Reference docs.
Name of output table
Target table type
Tip
You can acquire the target type from the vendor value in the connection response. For more information, see Dataprep by Trifacta: API Reference docs.
action: The publishing action to apply to the target table. Supported values:
Key value | Description |
---|---|
create | Create a new table with each publication. |
createAndLoad | Append your data to the table. |
truncateAndLoad | Truncate the table and load it with your data. |
dropAndLoad | Drop the table and write the new table in its place. |
Identifier of connection to use to write data.
See Dataprep by Trifacta: API Reference docs
Acquire the internal identifier for the recipe for which you wish to execute a job. In the previous example, this identifier was 28629.
Construct a request using the following:
Endpoint
<protocol>://<platform_base_url>/v4/jobGroups
Authentication
Required
Method
POST
Request Body:
{ "wrangledDataset": { "id": 28629 }, "overrides": { "publications": [ { "path": [ "prod_db" ], "tableName": "Table_CaseFctn2", "action": "createAndLoad", "targetType": "postgres", "connectionId": 3 } ] }, "ranfrom": null }
In the above example, the job has been launched with the following overrides:
Note
When overrides are applied to publishing, any publications that are already attached to the recipe are ignored.
Output path is to the prod_db database, using the table name Table_CaseFctn2.
Output action is "create and load." See above for definitions.
Target table type is a PostgreSQL table.
A response code of 201 - Created is returned. The response body should look like the following:
{ "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1", "reason": "JobStarted", "jobGraph": { "vertices": [ 21, 22 ], "edges": [ { "source": 21, "target": 22 } ] }, "id": 962222, "jobs": { "data": [ { "id": 21 }, { "id": 22 } ] } }
Retain the id value, which is the job identifier, for monitoring.
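For reference, the table-publication overrides can be expressed as the following Python payload, reusing the database, table, and connection values from the example above. It is a sketch only; POST it to /v4/jobGroups exactly as in the earlier run-job sketch.

```python
# Sketch only: overrides for a table-based publication, using the example values above.
payload = {
    "wrangledDataset": {"id": 28629},
    "overrides": {
        "publications": [
            {
                "path": ["prod_db"],            # target database (write access required)
                "tableName": "Table_CaseFctn2",
                "action": "createAndLoad",      # append data to the table
                "targetType": "postgres",
                "connectionId": 3,              # existing connection used to write to the target
            }
        ]
    },
    "ranfrom": None,
}
# POST this payload to <protocol>://<platform_base_url>/v4/jobGroups as in the earlier sketches.
```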
Step - Run Job with Overrides - Webhooks
Note
This feature may not be available in all product editions. For more information on available features, see Compare Editions.
When you execute a job, you can pass in a set of parameters as overrides to generate a webhook message to a third-party application, based on the success or failure of the job.
For more information on webhooks, see Create Flow Webhook Task.
Acquire the internal identifier for the recipe for which you wish to execute a job. In the previous example, this identifier was 28629.
Construct a request using the following:
Endpoint
<protocol>://<platform_base_url>/v4/jobGroups
Authentication
Required
Method
POST
Request Body:
{ "wrangledDataset": { "id": 28629 }, "overrides": { "webhooks": [{ "name": "webhook override", "url": "http://example.com", "method": "post", "triggerEvent": "onJobFailure", "body": { "text": "override" }, "headers": { "testHeader": "val1" }, "sslVerification": true, "secretKey": "123" }] } }
In the above example, the job has been launched with the following overrides:
Override setting | Description |
---|---|
name | Name of the webhook. |
url | URL to which to send the webhook message. |
method | The HTTP method to use. Supported values: POST, PUT, PATCH, GET, or DELETE. Body is ignored for GET and DELETE methods. |
triggerEvent | Supported values: onJobFailure - send a webhook message if the job fails; onJobSuccess - send a webhook message if the job completes successfully; onJobDone - send a webhook message when the job fails or finishes successfully. |
body | (optional) The value of the text field is the message that is sent. Note: Some special token values are supported. See Create Flow Webhook Task. |
headers | (optional) Key-value pairs of headers to include in the HTTP request. |
sslVerification | (optional) Set to true if SSL verification should be completed. If not specified, the value is true. |
secretKey | (optional) If enabled, this value should be set to the secret key to use. |
A response code of 201 - Created is returned. The response body should look like the following:
{ "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1", "reason": "JobStarted", "jobGraph": { "vertices": [ 21, 22 ], "edges": [ { "source": 21, "target": 22 } ] }, "id": 962222, "jobs": { "data": [ { "id": 21 }, { "id": 22 } ] } }
Retain the id value, which is the job identifier, for monitoring.
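As a sketch, the webhook override can be built as the following Python payload, using the placeholder URL, header, and secret key values from the example above, and then POSTed to /v4/jobGroups as in the earlier sketches.

```python
# Sketch only: webhook override that sends a message if the job fails.
payload = {
    "wrangledDataset": {"id": 28629},
    "overrides": {
        "webhooks": [
            {
                "name": "webhook override",
                "url": "http://example.com",
                "method": "post",
                "triggerEvent": "onJobFailure",  # or onJobSuccess, onJobDone
                "body": {"text": "override"},
                "headers": {"testHeader": "val1"},
                "sslVerification": True,
                "secretKey": "123",
            }
        ]
    },
}
# POST this payload to <protocol>://<platform_base_url>/v4/jobGroups as in the earlier sketches.
```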
Step - Run Job with Parameter Overrides
You can pass overrides of the default parameter values as part of the job definition. You can use the following mechanism to pass in parameter overrides of the following types:
Datasets with parameters (variable type)
Output object parameters
Flow parameters
The syntax is the same for each type.
Acquire the internal identifier for the recipe for which you wish to execute a job. In the previous example, this identifier was 28629.
Construct a request using the following:
Endpoint
<protocol>://<platform_base_url>/v4/jobGroups
Authentication
Required
Method
POST
Request Body:
{ "wrangledDataset": { "id": 28629 }, "overrides": { "runParameters": { "overrides": { "data": [ { "key": "varRegion", "value": "02" } ] } } }, "ranfrom": null }
In the above example, the specified job has been launched for recipe 28629. The run parameter varRegion has been set to 02 for this specific job. Depending on how it is defined in the flow, this parameter could change any of the following:
The source for the imported dataset.
The path for the generated output.
A flow parameter reference in the recipe.
For more information, see Overview of Parameterization.
A response code of 201 - Created is returned. The response body should look like the following:
{ "sessionId": "79276c31-c58c-4e79-ae5e-fed1a25ebca1", "reason": "JobStarted", "jobGraph": { "vertices": [ 21, 22 ], "edges": [ { "source": 21, "target": 22 } ] }, "id": 962223, "jobs": { "data": [ { "id": 21 }, { "id": 22 } ] } }
Retain the id value, which is the job identifier, for monitoring.
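As a sketch, the run parameter override is the smallest of these payloads; it can be built in Python as follows and POSTed to /v4/jobGroups as in the earlier sketches.

```python
# Sketch only: override the run parameter varRegion for this job run.
payload = {
    "wrangledDataset": {"id": 28629},
    "overrides": {
        "runParameters": {
            "overrides": {
                "data": [
                    {"key": "varRegion", "value": "02"},
                ]
            }
        }
    },
    "ranfrom": None,
}
# POST this payload to <protocol>://<platform_base_url>/v4/jobGroups as in the earlier sketches.
```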
Step - Dataflow Execution Overrides
Note
Overrides applied to the jobGroup are merged with any overrides specified as part of the output objects associated with the wrangledDataset. For more information, see API Task - Manage Outputs.
If neither object has a specified override for a Dataflow property, the applicable project setting is used. See User Execution Settings Page.
General example
You can submit overrides to a specific set of Dataflow properties for your job execution. For general information on how these settings affect your jobs, see Run Job on Cloud Dataflow.
Note
If you are using automatic VPC network mode, then network, subnetwork, and usePublicIps do not apply.
The following example shows how to run a job for a specified recipe with Dataflow property overrides applied to it:
Endpoint | https://www.api.clouddataprep.com/v4/jobGroups |
---|---|
Authentication | Required |
Method | POST |
Request Body:
{ "wrangledDataset": { "id": 28629 }, "execution": "dataflow", "dataflowOptions": [ {"region": "first-region"}, {"zone": "second-zone"}, {"machineType": "n1-standard-32"}, {"network": ""}, {"subnetwork": ""}, {"autoscalingAlgorithm": "THROUGHPUT_BASED"}, {"maxNumWorkers": "1000"}, {"numWorkers": "10"} ] }
Notes on properties:
You can submit empty or null values for properties in the payload. These values are submitted as-is.
If you are not using auto-scaling on your job:
"autoscalingAlgorithm": "NONE",
Use numWorkers to specify the number of compute nodes to use for the job.
Note
This feature may not be available in all product editions. For more information on available features, see Compare Editions.
If you are using auto-scaling on your job:
"autoscalingAlgorithm": "THROUGHPUT_BASED",
Use maxNumWorkers and numWorkers to specify the maximum and initial number of compute nodes to use for the job.
Note
This feature may not be available in all product editions. For more information on available features, see Compare Editions.
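To make the two autoscaling configurations concrete, the following sketch shows the dataflowOptions portion of the payload in both modes. The region, zone, and machine type values are copied from the general example above; which variant applies depends on your product edition, as noted.

```python
# Sketch only: dataflowOptions without autoscaling (a fixed number of workers).
dataflow_options_fixed = [
    {"region": "first-region"},
    {"zone": "second-zone"},
    {"machineType": "n1-standard-32"},
    {"autoscalingAlgorithm": "NONE"},
    {"numWorkers": "10"},
]

# Sketch only: dataflowOptions with throughput-based autoscaling.
dataflow_options_autoscaled = [
    {"region": "first-region"},
    {"zone": "second-zone"},
    {"machineType": "n1-standard-32"},
    {"autoscalingAlgorithm": "THROUGHPUT_BASED"},
    {"numWorkers": "10"},
    {"maxNumWorkers": "1000"},
]

# Either list is passed as the "dataflowOptions" value alongside "execution": "dataflow",
# as shown in the general example request body above.
```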
Example using VPC
By default, Dataflow expects that submitted jobs are executed across publicly available IP addresses (usePublicIps = true). As needed, you can use resources available through a VPC.
Note
Google Private Access must be enabled on your Virtual Private Cloud (VPC) for Dataprep by Trifacta to access it.
If needed, you can override the default settings to execute the job on workers that are available through your VPC.
The following example shows how to run a job for a specified recipe with Dataflow to use your specified VPC:
Endpoint | https://www.api.clouddataprep.com/v4/jobGroups |
---|---|
Authentication | Required |
Method | POST |
Request Body:
{ "wrangledDataset": { "id": 28629 }, "execution": "dataflow", "dataflowOptions": [ {"region": "first-region"}, {"zone": "second-zone"}, {"machineType": "n1-standard-32"}, {"network": "my-network-name"}, {"subnetwork": "my-subnetwork-url"}, {"autoscalingAlgorithm": "THROUGHPUT_BASED"}, {"serviceAccount": "my-service-account-name@<project-id>.iam.gserviceaccount.com"}, {"numWorkers": "1"}, {"maxNumWorkers": "1000"}, {"usePublicIps": "false"} ] }
Subnetwork values:
To specify a different sub-network, enter the URL of the sub-network. The URL should be in the following format:
regions/<REGION>/subnetworks/<SUBNETWORK>
where:
<REGION> is the region identifier specified under Regional Endpoint. These values must match.
<SUBNETWORK> is the subnetwork identifier.
If you have access to another project, you can execute your Dataflow job through it by specifying a full URL in the following form:
https://www.googleapis.com/compute/v1/projects/<HOST_PROJECT_ID>/regions/<REGION>/subnetworks/<SUBNETWORK>
where:
<HOST_PROJECT_ID> corresponds to the project identifier. This value must be between 6 and 30 characters. The value can contain only lowercase letters, digits, or hyphens. It must start with a letter. Trailing hyphens are prohibited.
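For illustration only (the region, subnetwork, and host project names here are hypothetical), the short form might look like regions/us-central1/subnetworks/my-subnetwork, and the full cross-project form might look like https://www.googleapis.com/compute/v1/projects/my-host-project/regions/us-central1/subnetworks/my-subnetwork.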
Example with labels
You can use labels to assign billing information for the job in your project.
Note
This feature may not be available in all product editions. For more information on available features, see Compare Editions.
The following example shows how to run a job for a specified recipe with Dataflow labels applied to it:
Endpoint | https://www.api.clouddataprep.com/v4/jobGroups |
---|---|
Authentication | Required |
Method | POST |
Request Body:
{ "wrangledDataset": { "id": 28629 }, "execution": "dataflow", "dataflowOptions": [ {"region": "first-region"}, {"zone": "second-zone"}, {"machineType": "n1-standard-32"}, {"network": ""}, {"subnetwork": ""}, {"autoscalingAlgorithm": "THROUGHPUT_BASED"}, {"maxNumWorkers": "1000"}, {"numWorkers": "10"}, {"labels": [ { "key": "first-new-label-key", "value": "first-new-label-value" }, { "key": "second-new-label-key", "value": "second-new-label-value" } ] } ] }
Notes on labels:
Key: This value must be unique among your job labels.
Value: Assign based on the accepted values for the label.
For more information, see https://cloud.google.com/resource-manager/docs/creating-managing-labels.
You can apply up to 64 labels for a job. For more information on the available properties, see Runtime Dataflow Execution Settings.