API Task - Swap Datasets
Warning
API access is migrating to Enterprise only. Beginning in Release 9.5, all new or renewed subscriptions have access to public API endpoints on the Enterprise product edition only. Existing customers on non-Enterprise editions will retain access to their available endpoints (Legacy) until their subscription expires. To use API endpoints after renewal, you must upgrade to the Enterprise product edition or use a reduced set of endpoints (Current). For more information on differences between product editions in the new model, please visit Pricing and Packaging.
Overview
After you have created a flow, imported a dataset, and created a recipe for that dataset, you may need to swap in a different dataset and run the recipe against that one. This task steps through that process via the APIs.
Note
If you are processing multiple parallel datasources in a single job, you should create a dataset with parameters and then run the job. For more information, see API Task - Run Job on Dataset with Parameters.
This task utilizes the following methods:
Create an imported dataset. After the new file has been added to the backend datastore, you can import it into Designer Cloud as an imported dataset.
Swap dataset. Using the ID of the imported dataset you created, you can now assign the dataset to the recipe in your flow.
Run a job. Run the job against the dataset.
Monitor progress. Monitor the progress of the job until it is complete.
Example Datasets
In this example, you are wrangling data from orders placed in different regions on a quarterly basis. When a new file drops, you want to be able to swap out the current dataset that is assigned to the recipe and swap in the new one. Then, run the job.
Example Files:
The following files are stored in HDFS:
Path and Filename | Description |
---|---|
hdfs:///user/orders/MyCo-orders-west-Q1.txt | Orders from West region for Q1 |
hdfs:///user/orders/MyCo-orders-west-Q2.txt | Orders from West region for Q2 |
hdfs:///user/orders/MyCo-orders-north-Q1.txt | Orders from North region for Q1 |
hdfs:///user/orders/MyCo-orders-north-Q2.txt | Orders from North region for Q2 |
hdfs:///user/orders/MyCo-orders-east-Q1.txt | Orders from East region for Q1 |
hdfs:///user/orders/MyCo-orders-east-Q2.txt | Orders from East region for Q2 |
Assumptions
You have already created a flow, which contains the following imported dataset and recipe:
Note
When an imported dataset is created via API, it is always imported as an unstructured dataset. Any recipe that references this dataset should contain initial parsing steps required to structure the data.
Tip
Through the UI, you can import one of your datasets as unstructured. Create a recipe for this dataset and then edit it. In the Recipe panel, you should be able to see the structuring steps. Back in Flow View, you can chain your structural recipe off of this one. Dataset swapping should happen on the first recipe.
Object Type | Name | Id |
---|---|---|
flow | MyCo-Orders-Quarter | 2 |
Imported Dataset | MyCo-orders-west-Q1.txt | 8 |
Recipe (wrangledDataset) | n/a | 9 |
Job | n/a | 3 |
Base URL:
For purposes of this example, the base URL for the platform is the following:
http://www.example.com:3005
Step - Import Dataset
Note
You cannot add datasets to the flow through the flows endpoint. Moving pre-existing datasets into a flow is not supported in this release. Create or locate the flow first and then when you create the datasets, associate them with the flow at the time of creation.
See Designer Cloud Powered by Trifacta: API Reference docs
The following steps describe how to create an imported dataset and assign it to the flow that has already been created (flowId=2).
Steps:
To create an imported dataset, you must acquire the following information about the source.
path
type
name
description
bucket (if a file stored on S3)
In this example, the file you are importing is MyCo-orders-west-Q2.txt. Since the files are similar in nature and are stored in the same directory, you can acquire this information by gathering the information from the imported dataset that is already part of the flow. Execute the following:
Endpoint | http://www.example.com:3005/v4/importedDatasets |
---|---|
Authentication | Required |
Method | POST |
Request Body | { "path": "hdfs:///user/orders/MyCo-orders-west-Q2.txt", "name": "MyCo-orders-west-Q2.txt", "description": "MyCo-orders-west-Q2" } |
The response should be a 201 - Created status code with something like the following:
{ "id": 12, "size": "281032", "path": "hdfs:///user/orders/MyCo-orders-west-Q2.txt", "dynamicPath": null, "workspaceId": 1, "isSchematized": false, "isDynamic": false, "disableTypeInference": false, "createdAt": "2018-10-29T23:15:01.831Z", "updatedAt": "2018-10-29T23:15:01.889Z", "parsingRecipe": { "id": 11 }, "runParameters": [], "name": "MyCo-orders-west-Q2.txt.txt", "description": "MyCo-orders-west-Q2.txt", "creator": { "id": 1 }, "updater": { "id": 1 }, "connection": null }
You must retain the id value so you can reference it when you swap the dataset into the recipe.
See Designer Cloud Powered by Trifacta: API Reference docs
Note
You have imported a dataset that is unstructured and is not associated with any flow.
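The import request above can be sketched in Python using only the standard library. This is a minimal illustration, not the official client: the function names are hypothetical, and the bearer-token header is an assumption, since this task states only that authentication is required.

```python
import json
import urllib.request

BASE_URL = "http://www.example.com:3005"  # example base URL from this task


def build_import_request(path, name, description, token):
    """Build, without sending, the POST request that creates an imported dataset."""
    body = {"path": path, "name": name, "description": description}
    return urllib.request.Request(
        url=f"{BASE_URL}/v4/importedDatasets",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token authentication; use your deployment's scheme.
            "Authorization": f"Bearer {token}",
        },
    )


def new_dataset_id(response_text):
    """Return the id field from the JSON body of the 201 response."""
    return json.loads(response_text)["id"]
```

To send the request, pass the returned object to urllib.request.urlopen and hand the decoded response body to new_dataset_id. For the example response shown above, the value to retain would be 12.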
Step - Swap Dataset from Recipe
The next step is to swap the primary input dataset for the recipe to point at the newly imported dataset. This step automatically adds the imported dataset to the flow and drops the previous imported dataset from the flow.
Use the following to swap the primary input dataset for the recipe:
Endpoint | http://www.example.com:3005/v4/wrangledDatasets/9/primaryInputDataset |
---|---|
Authentication | Required |
Method | PUT |
Request Body | { "importedDataset": { "id": 12 } } |
The response should be a 200 - OK status code with something like the following:
{ "id": 9, "wrangled": true, "createdAt": "2019-03-03T17:58:53.979Z", "updatedAt": "2019-03-03T18:01:11.310Z", "recipe": { "id": 9, "name": "POS-r01", "description": null, "active": true, "nextPortId": 1, "createdAt": "2019-03-03T17:58:53.965Z", "updatedAt": "2019-03-03T18:01:11.308Z", "currentEdit": { "id": 8 }, "redoLeafEdit": { "id": 7 }, "creator": { "id": 1 }, "updater": { "id": 1 } }, "referenceInfo": null, "activeSample": { "id": 7 }, "creator": { "id": 1 }, "updater": { "id": 1 }, "referencedFlowNode": null, "flow": { "id": 2 } }
The new imported dataset is now the primary input for the recipe, and the old imported dataset has been removed from the flow.
See Designer Cloud Powered by Trifacta: API Reference docs
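As a rough Python sketch of the swap call, the following builds the PUT request and checks the reply against the recipe and flow IDs from this example. The helper names are hypothetical and the bearer-token header is an assumption; only the endpoint, method, and body come from this task.

```python
import json
import urllib.request

BASE_URL = "http://www.example.com:3005"  # example base URL from this task


def build_swap_request(recipe_id, imported_dataset_id, token):
    """Build, without sending, the PUT request that swaps the recipe's primary input."""
    body = {"importedDataset": {"id": imported_dataset_id}}
    return urllib.request.Request(
        url=f"{BASE_URL}/v4/wrangledDatasets/{recipe_id}/primaryInputDataset",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token authentication; use your deployment's scheme.
            "Authorization": f"Bearer {token}",
        },
    )


def swap_matches(response_text, recipe_id, flow_id):
    """Check that the 200 response refers to the expected recipe and flow."""
    reply = json.loads(response_text)
    return reply["id"] == recipe_id and reply["flow"]["id"] == flow_id
```

With the IDs used in this task, build_swap_request(9, 12, token) targets recipe 9 and assigns imported dataset 12, and swap_matches(body, 9, 2) confirms that the reply still belongs to flow 2.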
Step - Rerun Job
To execute a job on this recipe, you can simply re-run any job that was executed on the old imported dataset, since you reference the job by jobId and wrangledDataset (recipe) Id.
Endpoint | http://www.example.com:3005/v4/jobGroups |
---|---|
Authentication | Required |
Method | POST |
Request Body | { "wrangledDataset": { "id": 9 } } |
The job is re-run as it was previously specified.
If you need to modify any job parameters, you must create a new job definition.
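The job request can be sketched the same way. This is an illustration under assumptions: the function name is hypothetical, the bearer-token header is assumed, and no response handling is shown because this task does not include a sample response for this endpoint.

```python
import json
import urllib.request

BASE_URL = "http://www.example.com:3005"  # example base URL from this task


def build_run_job_request(recipe_id, token):
    """Build, without sending, the POST request that queues a job for a recipe."""
    body = {"wrangledDataset": {"id": recipe_id}}
    return urllib.request.Request(
        url=f"{BASE_URL}/v4/jobGroups",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token authentication; use your deployment's scheme.
            "Authorization": f"Bearer {token}",
        },
    )
```

Because the body references only the recipe ID, the same request can be reused unchanged after each dataset swap.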
Step - Monitor Your Job
After the job has been queued, you can track it to completion. See API Task - Develop a Flow.
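Tracking a job to completion usually amounts to a polling loop. The sketch below is generic and makes several assumptions: get_status is a hypothetical caller-supplied function that returns the job's current status string, and the terminal status names are placeholders that should be checked against the API reference.

```python
import time

# Assumption: placeholder names for statuses at which polling should stop.
TERMINAL_STATUSES = {"Complete", "Failed", "Canceled"}


def wait_for_job(get_status, poll_seconds=10.0, max_polls=360, sleep=time.sleep):
    """Call get_status() until it returns a terminal status, then return it.

    Raises TimeoutError if no terminal status is seen within max_polls calls.
    """
    for attempt in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        # Wait between polls, but not after the final attempt.
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"job still running after {max_polls} polls")
```

Passing sleep as a parameter keeps the loop easy to exercise without real delays.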
Step - Schedule Your Job
When you are satisfied with how your flow is working, you can set up periodic schedules using a third-party tool to execute the job on a regular basis.
The tool must hit the above endpoints to swap in the new dataset and run the job.
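A scheduled script could chain the three calls in order. The following is a minimal sketch, assuming a hypothetical send(method, endpoint, body) function, supplied by the caller, that performs the authenticated HTTP call against the base URL and returns the parsed JSON reply.

```python
def swap_and_run(send, path, name, description, recipe_id):
    """Import a file, make it the recipe's primary input, then queue a job.

    send(method, endpoint, body) is a hypothetical caller-supplied function
    that performs the HTTP call and returns the parsed JSON reply.
    """
    # Step 1: create the imported dataset and keep its id.
    dataset = send(
        "POST",
        "/v4/importedDatasets",
        {"path": path, "name": name, "description": description},
    )
    # Step 2: point the recipe's primary input at the new dataset.
    send(
        "PUT",
        f"/v4/wrangledDatasets/{recipe_id}/primaryInputDataset",
        {"importedDataset": {"id": dataset["id"]}},
    )
    # Step 3: run the job for the recipe and return the reply.
    return send("POST", "/v4/jobGroups", {"wrangledDataset": {"id": recipe_id}})
```

Keeping the HTTP call behind a single function means the scheduling tool can supply its own authentication and error handling.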