Run Job on Cloud Dataflow
In Dataprep by Trifacta, most jobs to transform your data are executed by default on Dataflow, a managed service for executing data pipelines within the Google Cloud Platform. Dataprep by Trifacta is designed to integrate with Dataflow and to take advantage of multiple features available in the service. This section describes how to execute a job on Dataflow and the options available for it.
Project owners can choose to enable Trifacta Photon, an in-memory running environment hosted on the Trifacta node. This running environment yields faster performance on small- to medium-sized jobs. For more information, see Dataprep Project Settings Page.
Default Dataflow Jobs
Steps:
To run a job, open the flow containing the recipe whose output you wish to generate.
Locate the recipe. Click the recipe's output object icon.
On the right side of the screen, information about the output object is displayed. The output object defines:
The type and location of the outputs, including filenames and method of updating.
Profiling options
Execution options
For now, you can ignore the options for the output object. Click Run Job.
In the Run Job page, you can review the job as it is currently specified.
To run the job on Dataflow, select Dataflow.
Click Run Job.
The job is queued with default settings for execution on Dataflow.
Tracking progress
You can track progress of your job through the following areas:
Flow View: select the output object. On the right side of the screen, click the Jobs tab. Your job in progress is listed.
Job Details Page: Click the link in the Jobs tab. You can review progress and individual details related to your job.
Download results
When your job has finished successfully, a Completed message is displayed in the Job Details page.
Steps:
In the Job Details page, click the Output Destinations tab.
The generated outputs are listed. For each output, you can select download or view choices from the context menu on the right side of the screen.
Publish
If you have created the connections to do so, you can choose to publish your generated results to external systems. In the Output Destinations tab, click the Publish link.
Note
You must have a connection configured to publish to an external datastore available through the Connections page.
Output Options
When specifying the job you wish to run, you can define the following types of output options.
Profiling
When you select the Profiling checkbox, a visual profile of your results is generated as part of your data transformation job. This visual profile can be useful for identifying any remaining issues with your data after the transformation is complete.
Tip
Use visual profiling when you are building your recipes. It can also be useful as a quick check of outputs for production flows.
When the job completes, if visual profiling was enabled, the visual profile is available for review through the Profile tab in the Job Details page.
Tip
You can download PDF and JSON versions of the visual profile for offline analysis in the Job Details page.
For more information, see Overview of Visual Profiling.
Publishing actions
For each output object, you can define one or more publishing actions. A publishing action specifies the following:
  | GCS | BigQuery |
---|---|---|
Type of output | file type | table |
Location of output | path | database |
Name | filename | table name |
Update method | create, append, replace | create, append, truncate, drop |
Other options |  |  |
Parameterized destination
You can parameterize the output filename or table name as needed.
Parameter values can be defined at the flow level through Flow View.
These parameter values can be passed into the running environment and inserted into the output filename or table name.
For more information, see Overview of Parameterization.
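As a hypothetical illustration: if you define a flow parameter named region, you could include it in the output filename so that a run with region set to us writes sales_us.csv, while a run with region set to eu writes sales_eu.csv. The exact parameter syntax is covered in the documentation referenced above.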
Execution Overrides
You can specify settings for the following aspects of job execution on Dataflow:
Endpoint, region, and zone where the job is executed
Machine resources and billing account to use for the job
Network and subnetwork where the job is executed
These settings can be specified at the project level or at the individual output object level:
Execution settings: Within your preferences, you can define your execution options for jobs. By default, all of your jobs executed from flows within the project use these settings. For more information, see User Execution Settings Page.
Output object settings: The execution settings in the Execution Settings page can be overridden at the output object level. When you define an individual output object for a recipe, the execution settings that you specify in the Run Job page apply whenever the outputs are generated for this flow. See Runtime Dataflow Execution Settings.
Some examples of how these settings can be used are provided below.
Run Job Options
Run job in a different region and zone
If needed, you can run your job in a different region and zone.
The region determines the geographic location where your job is executed and where the execution details of your Dataflow job are handled.
The zone is a subdivision of the region.
You might want to change the default settings for the following reasons:
Security and compliance: You may need to constrain your Dataflow work to a specific region for your enterprise's security requirements.
Data locality: If you know your project data is stored in a specific region, you may wish to set the job to run in this region to minimize network latency and potential costs associated with cross-region execution.
Resilience: If there are outages in your default Google Cloud Platform region, you may need to switch regions.
For more information, see https://cloud.google.com/dataflow/docs/concepts/regional-endpoints.
Steps:
In the Dataflow Execution Settings:
Region: Choose the new Regional Endpoint from the drop-down list.
Zone: By default, the zone within the selected region is auto-selected for you. As needed, you can select a specific zone.
Tip
Unless you have a specific reason to do so, you should leave the Zone value at Auto-Select to allow the platform to choose it for you.
Run job in custom VPC network
Dataprep by Trifacta supports job execution in the following Virtual Private Cloud (VPC) network modes:
Auto: (default) Dataflow job is executed over publicly available IP addresses using the VPC Network and Subnetwork settings determined by Google Cloud Platform.
Note
In Auto mode, do not set values in the Dataflow Execution Settings for Network, Subnetwork, or (if available) Worker IP address configuration. These settings are ignored in Auto mode.
Custom: Optionally, you can customize the VPC network settings that are applied to your job if you need to apply specific network settings, including a private network. Set the VPC Network Mode to Custom and apply the additional settings described below.
Note
Custom network settings do not apply to data previewing or sampling, which use the default network settings.
For more information on Google Virtual Private Clouds (VPCs), see https://cloud.google.com/vpc/docs/overview.
Public vs. internal IP addresses
If the VPC Network Mode is set to Custom, then choose one of the following:
Allow public IP addresses - Use Dataflow workers that are available through public IP addresses. No further configuration is required.
Use internal IP addresses only - Dataflow workers use private IP addresses for all communication. Additional configuration is described below.
Run job in custom VPC using Network value (internal IP addresses)
You can specify the VPC network to use in the Network value.
Note
This network must be in the region that you have specified for the job. Do not specify a Subnetwork value.
Note
The Network must have Google Private Access enabled.
Run job in custom VPC using Subnetwork value (internal IP addresses)
For a subnetwork that is associated with your project, you can specify the subnetwork using a short URI.
Note
If the Subnetwork value is specified, then the Network value should be set to default. Dataflow chooses the Network for you.
Note
The Subnetwork must have Google Private Access enabled.
Short URI form:
regions/<REGION>/subnetworks/<SUBNETWORK>
where:
<REGION> is the region to use.
Note
This value must match the Regional Endpoint value.
<SUBNETWORK> is the subnetwork identifier.
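For example (hypothetical values), a subnetwork named example-subnet in the us-central1 region would be referenced as:
regions/us-central1/subnetworks/example-subnet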
Subnet permissions for managed user service accounts
To execute the job on a shared VPC, you must set up subnet-level permissions for the managed user service account:
In your host project, you must add the Cloud Dataprep Service Account with the role of Network User.
If you are using a Shared VPC, you must grant the managed user service account access to the Shared VPC. This account must be added as a member of the shared subnet permissions for your Shared VPC. For more information, see https://cloud.google.com/vpc/docs/provisioning-shared-vpc.
In the Dataflow Execution Settings:
VPC Network Mode: Custom
Network: Leave as default.
Subnetwork: Specify the full URL, including the host project identifier, region, and subnetwork values (an example of the URL form is shown after this list).
Service Account: Enter the name of the managed user service account.
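The full Subnetwork URL typically follows the Compute Engine resource form below, where the angle-bracketed values are placeholders for your own host project, region, and subnetwork:
https://www.googleapis.com/compute/v1/projects/<HOST_PROJECT_ID>/regions/<REGION>/subnetworks/<SUBNETWORK>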
For more information on subnet-level permissions, see https://cloud.google.com/vpc/docs/provisioning-shared-vpc#networkuseratsubnet.
Run job API
You can also run jobs using the REST APIs.
Tip
You can pass in overrides to the dataflow execution settings as part of your API request.
For more information, see API Task - Run Job.
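Below is a minimal sketch in Python, assuming the requests library, an access token generated for your account, and a hypothetical recipe (wrangled dataset) ID of 12345. The endpoint path and override field names are illustrative assumptions; confirm the exact request schema in API Task - Run Job.

```python
# Minimal sketch: queue a job via the REST API with a Dataflow execution override.
# The base URL, payload fields, and IDs below are assumptions for illustration only;
# see "API Task - Run Job" for the authoritative endpoint and request schema.
import requests

API_BASE = "https://api.clouddataprep.com"  # assumed API base URL
ACCESS_TOKEN = "<your-access-token>"        # token generated from your user preferences
RECIPE_ID = 12345                           # hypothetical ID of the recipe (wrangled dataset)

payload = {
    "wrangledDataset": {"id": RECIPE_ID},
    # Hypothetical overrides for the Dataflow execution settings; exact field
    # names and accepted values are documented in "API Task - Run Job".
    "overrides": {
        "execution": "dataflow",
        "profiler": True,
    },
}

response = requests.post(
    f"{API_BASE}/v4/jobGroups",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json=payload,
)
response.raise_for_status()
print("Job group queued:", response.json())
```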
Configure Machine Resources
By default, Dataflow attempts to select the appropriate machine for your job, based on the size of the job and any specified account-level settings. As needed, you can override these settings at the project level or for specific jobs through the Dataflow Execution Settings.
Tip
Unless performance issues related to your resource selections apply to all jobs in the project, you should make changes to your resources for individual output objects. If those changes improve performance and you are comfortable with the higher costs associated with the change, you can consider applying them through the Execution Settings page.
Choose machine type
A machine type is a set of virtualized hardware resources, including memory size, CPU, and persistent disk storage, which are assigned to a virtual machine (VM) responsible for executing your job.
Notes:
Billing for your job depends on the machine type (resources) that have been assigned to the job. If you select a more powerful machine type, you should expect higher costs for each job execution.
Dataprep by Trifacta provides a subset of available machine types from which you can select to execute your jobs. By default, Dataprep by Trifacta uses a machine type that you define in your Execution Settings page.
If you are experiencing long execution times and are willing to incur additional costs, you can select a more powerful machine type.
To select a different machine type, choose your option from the Machine Type drop-down in the Dataflow Execution Settings. Higher numbers in the machine type name indicate more powerful machines.
Machine scaling algorithms
Note
This feature may not be available in all product editions. For more information on available features, see Compare Editions.
By default, Premium Edition utilizes a scaling algorithm based on throughput to scale up or down the Google Compute Engine instances that are deployed to execute your job.
Note
Auto-scaling can increase the costs of job execution. If you use auto-scaling, you should specify a reasonable maximum limit.
Optionally, you can disable this scaling by setting Autoscaling algorithm to None.
Below, you can see the matrix of options.
Auto-scaling Algorithm | Initial number of workers | Maximum number of workers |
---|---|---|
Throughput based | Must be an integer within the permitted range. (Note: This number may be adjusted as part of job execution.) | Must be an integer within the permitted range. |
None | Must be an integer within the permitted range. (Note: This number determines the fixed number of Google Compute Engine instances that are launched when your job begins.) | Not used. |
Change Billing Options
Use different service account
By default, Dataprep by Trifacta uses the service account that is configured for use with your project. In the Dataflow Execution Settings, enter the name for the Service Account to use.
Note
Under the Permissions tab, please verify that Include Google-provided role grants is selected.
To see the current service account for your project:
Click the Google Cloud Platform icon at the bottom of the left nav bar.
In the Google Cloud Console, select IAM & Admin > Service Accounts.
For more information, see https://cloud.google.com/iam/docs/service-accounts.
Apply job labels
As needed, you can add labels to your job. For billing purposes, labels can be applied so that expenses related to jobs are properly categorized within your Google account.
Each label must have a unique key within your project.
You can create up to 64 labels per project.
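As a hypothetical example, you might apply a label with the key cost-center and the value data-analytics to your jobs so that the associated Dataflow charges can be grouped under that category in your billing reports.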