API Task - Swap Datasets
Warning
API access is migrating to Enterprise only. Beginning in Release 9.5, all new or renewed subscriptions have access to public API endpoints on the Enterprise product edition only. Existing customers on non-Enterprise editions retain access to their available endpoints (Legacy) until their subscription expires. To use API endpoints after renewal, you must upgrade to the Enterprise product edition or use a reduced set of endpoints (Current). For more information on the differences between product editions in the new model, please visit Pricing and Packaging.
Overview
After you have created a flow, imported a dataset, and created a recipe for that dataset, you may need to swap in a different dataset and run the recipe against that one. This task steps through that process via the APIs.
Note
If you are processing multiple parallel datasources in a single job, you should create a dataset with parameters and then run the job. For more information, see API Task - Run Job on Dataset with Parameters.
This task utilizes the following methods:
Create an imported dataset. After the new file has been added to the backend datastore, you can import it into Dataprep by Trifacta as an imported dataset.
Swap dataset. Using the ID of the imported dataset you created, you can now assign the dataset to the recipe in your flow.
Run a job. Run the job against the dataset.
Monitor progress. Monitor the progress of the job until it is complete.
Example Datasets
In this example, you are wrangling data from orders placed in different regions on a quarterly basis. When a new file drops, you want to be able to swap out the current dataset that is assigned to the recipe and swap in the new one. Then, run the job.
Example Files:
The following files are stored in HDFS:
Path and Filename | Description |
---|---|
hdfs:///user/orders/MyCo-orders-west-Q1.txt | Orders from West region for Q1 |
hdfs:///user/orders/MyCo-orders-west-Q2.txt | Orders from West region for Q2 |
hdfs:///user/orders/MyCo-orders-north-Q1.txt | Orders from North region for Q1 |
hdfs:///user/orders/MyCo-orders-north-Q2.txt | Orders from North region for Q2 |
hdfs:///user/orders/MyCo-orders-east-Q1.txt | Orders from East region for Q1 |
hdfs:///user/orders/MyCo-orders-east-Q2.txt | Orders from East region for Q2 |
Assumptions
You have already created a flow, which contains the following imported dataset and recipe:
Note
When an imported dataset is created via API, it is always imported as an unstructured dataset. Any recipe that references this dataset should contain initial parsing steps required to structure the data.
Tip
Through the UI, you can import one of your datasets as unstructured. Create a recipe for this dataset and then edit it. In the Recipe panel, you should be able to see the structuring steps. Back in Flow View, you can chain your transformation recipe off of this structuring recipe. Dataset swapping should happen on the first recipe.
Object Type | Name | Id |
---|---|---|
flow | MyCo-Orders-Quarter | 2 |
Imported Dataset | MyCo-orders-west-Q1.txt | 8 |
Recipe (wrangledDataset) | n/a | 9 |
Job | n/a | 3 |
Base URL:
For purposes of this example, the base URL for the platform is the following:
http://www.example.com:3005
Step - Import Dataset
Note
You cannot add datasets to the flow through the flows endpoint. Moving pre-existing datasets into a flow is not supported in this release. Create or locate the flow first; then, when you create the datasets, associate them with the flow at the time of creation.
See Dataprep by Trifacta: API Reference docs
Note
When an imported dataset is created via API, it is always imported as an unstructured dataset. Any recipe that references this dataset should contain initial parsing steps required to structure the data.
The following steps describe how to create an imported dataset and assign it to the flow that has already been created (flowId=2).
Steps:
To create an imported dataset, you must acquire the following information about the source.
path
type
name
description
bucket (if a file stored on S3)
In this example, the file you are importing is MyCo-orders-west-Q2.txt. Since the files are similar and stored in the same directory, you can copy most of these values from the imported dataset that is already part of the flow. Execute the following:
Endpoint
http://www.example.com:3005/v4/importedDatasets
Authentication
Required
Method
POST
Request Body
{ "path": "hdfs:///user/orders/MyCo-orders-west-Q2.txt", "name": "MyCo-orders-west-Q2.txt", "description": "MyCo-orders-west-Q2" }
The response should be a 201 - Created status code with something like the following:

{
  "id": 12,
  "size": "281032",
  "path": "hdfs:///user/orders/MyCo-orders-west-Q2.txt",
  "dynamicPath": null,
  "workspaceId": 1,
  "isSchematized": false,
  "isDynamic": false,
  "disableTypeInference": false,
  "createdAt": "2018-10-29T23:15:01.831Z",
  "updatedAt": "2018-10-29T23:15:01.889Z",
  "parsingRecipe": {
    "id": 11
  },
  "runParameters": [],
  "name": "MyCo-orders-west-Q2.txt",
  "description": "MyCo-orders-west-Q2",
  "creator": {
    "id": 1
  },
  "updater": {
    "id": 1
  },
  "connection": null
}
You must retain the id value so that you can reference it when you swap the dataset into the recipe.
See Dataprep by Trifacta: API Reference docs
Note
You have imported a dataset that is unstructured and is not associated with any flow.
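If you are scripting this step, the same call can be made from any HTTP client. The following is a minimal sketch in Python using the requests library. The base URL and file path come from this example; the Authorization header is a placeholder for whatever authentication mechanism your deployment uses.

```python
import requests

BASE_URL = "http://www.example.com:3005"
# Placeholder credentials; substitute the auth scheme your deployment requires.
HEADERS = {"Authorization": "Bearer <your-access-token>"}

payload = {
    "path": "hdfs:///user/orders/MyCo-orders-west-Q2.txt",
    "name": "MyCo-orders-west-Q2.txt",
    "description": "MyCo-orders-west-Q2",
}

# POST /v4/importedDatasets creates the (unstructured) imported dataset.
resp = requests.post(f"{BASE_URL}/v4/importedDatasets",
                     json=payload, headers=HEADERS)
resp.raise_for_status()                      # expect 201 - Created
imported_dataset_id = resp.json()["id"]      # retain this id (12 in this example)
```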
Step - Swap Dataset from Recipe
The next step is to swap the primary input dataset for the recipe to point at the newly imported dataset. This step automatically adds the imported dataset to the flow and drops the previous imported dataset from the flow.
Note
When you swap datasets, existing samples are not automatically discarded, but they are no longer valid for the new data. As a workaround, you can generate a new sample manually. For more information on generating samples through the application, see Samples Panel.
Use the following to swap the primary input dataset for the recipe:
Endpoint
http://www.example.com:3005/v4/wrangledDatasets/9/primaryInputDataset
Authentication
Required
Method
PUT
Request Body
{ "importedDataset": { "id": 12 } }
The response should be a 200 - OK status code with something like the following:

{
  "id": 9,
  "wrangled": true,
  "createdAt": "2019-03-03T17:58:53.979Z",
  "updatedAt": "2019-03-03T18:01:11.310Z",
  "recipe": {
    "id": 9,
    "name": "POS-r01",
    "description": null,
    "active": true,
    "nextPortId": 1,
    "createdAt": "2019-03-03T17:58:53.965Z",
    "updatedAt": "2019-03-03T18:01:11.308Z",
    "currentEdit": {
      "id": 8
    },
    "redoLeafEdit": {
      "id": 7
    },
    "creator": {
      "id": 1
    },
    "updater": {
      "id": 1
    }
  },
  "referenceInfo": null,
  "activeSample": {
    "id": 7
  },
  "creator": {
    "id": 1
  },
  "updater": {
    "id": 1
  },
  "referencedFlowNode": null,
  "flow": {
    "id": 2
  }
}
The new imported dataset is now the primary input for the recipe, and the old imported dataset has been removed from the flow.
See Dataprep by Trifacta: API Reference docs
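A comparable sketch of the swap call in Python, under the same assumptions as above (requests library, placeholder Authorization header), using the ids from this example:

```python
import requests

BASE_URL = "http://www.example.com:3005"
HEADERS = {"Authorization": "Bearer <your-access-token>"}  # placeholder auth

recipe_id = 9            # wrangledDataset (recipe) id in this example
new_dataset_id = 12      # id returned when the new dataset was imported

# PUT /v4/wrangledDatasets/{id}/primaryInputDataset swaps the recipe's input.
resp = requests.put(
    f"{BASE_URL}/v4/wrangledDatasets/{recipe_id}/primaryInputDataset",
    json={"importedDataset": {"id": new_dataset_id}},
    headers=HEADERS,
)
resp.raise_for_status()  # expect 200 - OK; the response echoes the recipe and flow
```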
Step - Swap Bucket and Path for Imported Dataset
Note
This feature may not be available in all product editions. For more information on available features, see Compare Editions.
For a file-based backend datastore, you can change the source of your imported dataset to use a different Cloud Storage bucket and path.
Note
This endpoint changes the source of the imported dataset. The wrangledDataset (recipe) continues to point to the imported dataset, which now points to the new source. Since the source of the imported dataset object is altered, this change affects all objects that reference the imported dataset, even in other flows.
Tip
This endpoint is useful if you have imported your flow into a different project that uses a different source bucket.
Use the following to swap the source of your imported dataset for the recipe:
Endpoint
http://www.example.com:3005/v4/importedDatasets/12
Authentication
Required
Method
PUT
Request Body
{ "bucket": "MyBucket", "path": "/path/to/my/file.csv", }
The response should be a 200 - OK status code with the imported dataset definition. The new definition of the imported dataset is now applicable to all objects that reference it.
See Dataprep by Trifacta: API Reference docs
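A minimal Python sketch of the same call, with the bucket and path values from the request body above treated as placeholders:

```python
import requests

BASE_URL = "http://www.example.com:3005"
HEADERS = {"Authorization": "Bearer <your-access-token>"}  # placeholder auth

imported_dataset_id = 12   # id of the imported dataset whose source is changing

# PUT /v4/importedDatasets/{id} points the dataset at a new bucket and path.
resp = requests.put(
    f"{BASE_URL}/v4/importedDatasets/{imported_dataset_id}",
    json={"bucket": "MyBucket", "path": "/path/to/my/file.csv"},
    headers=HEADERS,
)
resp.raise_for_status()    # expect 200 - OK with the updated dataset definition
```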
Step - Rerun Job
To execute a job on this recipe, you can simply re-run any job that was previously executed on the old imported dataset, since the job request references the recipe (wrangledDataset) id rather than the imported dataset.
Endpoint
http://www.example.com:3005/v4/jobGroups
Authentication
Required
Method
POST
Request Body
{ "wrangledDataset": { "id": 9 } }
The job is re-run as it was previously specified.
If you need to modify any job parameters, you must create a new job definition.
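As a sketch, the job launch looks like the following in Python, under the same assumptions as the earlier snippets (requests library, placeholder Authorization header):

```python
import requests

BASE_URL = "http://www.example.com:3005"
HEADERS = {"Authorization": "Bearer <your-access-token>"}  # placeholder auth

# POST /v4/jobGroups runs a job against the recipe's current primary input.
resp = requests.post(
    f"{BASE_URL}/v4/jobGroups",
    json={"wrangledDataset": {"id": 9}},   # recipe id from this example
    headers=HEADERS,
)
resp.raise_for_status()
job_group_id = resp.json()["id"]           # keep this id to monitor the job
```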
Step - Monitor Your Job
After the job has been queued, you can track it to completion. See API Task - Develop a Flow.
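A rough polling loop might look like the following. The status endpoint and status values shown here are assumptions based on API Task - Develop a Flow; confirm the exact call and terminal states against that task and the API reference.

```python
import time
import requests

BASE_URL = "http://www.example.com:3005"
HEADERS = {"Authorization": "Bearer <your-access-token>"}  # placeholder auth

def wait_for_job(job_group_id, poll_seconds=30):
    """Poll the job group until it reaches a terminal state (assumed endpoint)."""
    while True:
        resp = requests.get(f"{BASE_URL}/v4/jobGroups/{job_group_id}/status",
                            headers=HEADERS)
        resp.raise_for_status()
        status = resp.json()              # e.g. "InProgress", "Complete", "Failed"
        if status in ("Complete", "Failed", "Canceled"):
            return status
        time.sleep(poll_seconds)
```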
Step - Schedule Your Job
When you are satisfied with how your flow is working, you can use a third-party scheduling tool to execute the job on a regular basis.
The tool must call the above endpoints to swap in the new dataset and run the job.
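For example, a scheduler such as cron could invoke a small script that chains the calls from this task. The sketch below reuses the snippets above; file paths, names, and authentication are placeholders.

```python
import requests

BASE_URL = "http://www.example.com:3005"
HEADERS = {"Authorization": "Bearer <your-access-token>"}  # placeholder auth
RECIPE_ID = 9   # recipe (wrangledDataset) id from this example

def swap_and_run(new_file_path, name):
    """Import the latest file, swap it into the recipe, and launch the job."""
    ds = requests.post(f"{BASE_URL}/v4/importedDatasets",
                       json={"path": new_file_path, "name": name},
                       headers=HEADERS)
    ds.raise_for_status()
    swap = requests.put(f"{BASE_URL}/v4/wrangledDatasets/{RECIPE_ID}/primaryInputDataset",
                        json={"importedDataset": {"id": ds.json()["id"]}},
                        headers=HEADERS)
    swap.raise_for_status()
    job = requests.post(f"{BASE_URL}/v4/jobGroups",
                        json={"wrangledDataset": {"id": RECIPE_ID}},
                        headers=HEADERS)
    job.raise_for_status()
    return job.json()["id"]

# The scheduler would call this on each new quarterly drop, for example:
# swap_and_run("hdfs:///user/orders/MyCo-orders-north-Q2.txt", "MyCo-orders-north-Q2.txt")
```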