
Dataflow Execution Settings Page

Project administrators can modify the settings related to how all jobs from the project are executed on Dataflow.

Note

These settings are the default settings for all Dataflow jobs for the project. Individual users may be permitted to override these settings.

Jobs may fail in Dataflow due to insufficient quotas. To learn more about enabling more resources in the Google Cloud Platform, see https://cloud.google.com/dataflow/quotas.

Order of evaluation of settings:

Execution settings can be passed to Dataflow through various pages in the application. Settings are evaluated in the following priority order, from highest (Job) to lowest (Project):

Tip

When users specify overrides to the default project settings, they only need to specify the specific settings to override. A blank value means that the value is to be inherited.
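For example, a user who overrides only Machine type continues to inherit the project defaults for all other settings, such as Regional endpoint and Zone.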

Note

Project administrators can enable or disable users from applying user- or job-level overrides to the default settings for the project. For more information, see Dataprep Project Settings Page.

  1. Job - Execution settings can be defined as part of the jobs that you specify. See Runtime Dataflow Execution Settings.

  2. User - Individual users can specify overrides to the default execution settings for the project. See User Execution Settings Page.

  3. Project - Project administrators can define default Dataflow execution settings for all users of the project. See Dataflow Execution Settings Page.

Notes:

  • Values specified here are applied to all jobs executed within the current project. To apply these changes globally, you must edit these settings in each project of which you are a member.

  • If property values are not specified here, then the properties are not passed in with any job execution, and the default property values for Dataprep by Trifacta are used.

Basic Settings


Regional endpoint

A region is a specific geographical location where you can run your jobs.

Zone

A zone is a subsection of a region that contains specific resources.

Select Auto Zone to allow the platform to choose the zone for you.

Machine type

Choose the type of machine on which to run your jobs. The default is n1-standard-1.

Warning

Making changes to Region, Zone, or Machine Type can affect the time and cost of job executions. For more information, see https://cloud.google.com/dataflow/docs/concepts/regional-endpoints.

For more information on machine types, https://cloud.google.com/compute/docs/machine-types.
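For reference, a few common N1 machine types are listed below. These are illustrative examples; consult the machine types link above for current offerings and specifications.

  n1-standard-1    1 vCPU,  3.75 GB memory (default)
  n1-standard-4    4 vCPUs, 15 GB memory
  n1-highmem-8     8 vCPUs, 52 GB memory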

Advanced Settings


VPC network mode

Select the network mode to use.

If the network mode is set to Auto (default), jobs are executed over publicly available IP addresses. Do not set values for Network, Subnetwork, and Worker IP address configuration.

Note

Unless you have specific reasons to modify these settings, you should leave them as the default values. These network settings apply to job execution. Preview and sampling use the default network settings.

For Custom VPC networks:

  1. Specify the name of the VPC network in your region.

  2. Specify the short or full URL of the Subnetwork. If both Network and Subnetwork are specified, Subnetwork is used. See https://cloud.google.com/dataflow/docs/guides/specifying-networks.

  3. Review and specify the Worker IP address configuration setting. See below.

For more information on these settings, see Network and Subnetwork below.

Network

To use a different VPC network, enter the name of the VPC network to use. Click Save to apply.

Subnetwork

To specify a different subnetwork, enter the URL of the subnetwork. The URL should be in the following format:

regions/<REGION>/subnetworks/<SUBNETWORK>

where:

  • <REGION> is the region identifier. This value must match the region selected for the Regional endpoint setting.

  • <SUBNETWORK> is the subnetwork identifier.

If you have access to another project within your organization, you can execute your Dataflow job through that project by specifying the full URL of the subnetwork, in the following form:

https://www.googleapis.com/compute/v1/projects/<HOST_PROJECT_ID>/regions/<REGION>/subnetworks/<SUBNETWORK>

where:

  • <HOST_PROJECT_ID> is the identifier of the host project. This value must be between 6 and 30 characters and can contain only lowercase letters, digits, and hyphens. It must start with a letter; trailing hyphens are prohibited.

Click Save to apply the override.
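For example, using hypothetical identifiers (a us-central1 region, a subnetwork named example-subnet, and a host project named example-host-project), the short and full URL forms look like this:

  regions/us-central1/subnetworks/example-subnet

  https://www.googleapis.com/compute/v1/projects/example-host-project/regions/us-central1/subnetworks/example-subnet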

Note

This feature may not be available in all product editions. For more information on available features, see Compare Editions.


Worker IP address configuration

If VPC network mode is set to Custom, then choose one of the following options for the Dataflow jobs in this project:

  • Allow public IP addresses - Use Dataflow workers that are available through public IP addresses. No further configuration is required.

  • Use internal IP addresses only - Dataflow workers use private IP addresses for all communication.

    • If a Subnetwork is specified, then the Network value is ignored.

    • The specified Network or Subnetwork must have Private Google Access enabled. An example of enabling this access is shown after this list.
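For example, assuming a hypothetical subnetwork named example-subnet in us-central1, Private Google Access can be enabled with the following gcloud command, run by a user with permission to modify the subnetwork:

  gcloud compute networks subnets update example-subnet \
      --region=us-central1 \
      --enable-private-ip-google-access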

Autoscaling algorithms

The type of algorithm to use to scale the number of Google Compute Engine instances to accommodate the size of your jobs. Possible values:

  • Throughput based - Scaling is determined by the volume of data expected to be passed through Dataflow.

  • None - No autoscaling algorithm is applied.

    • If None is selected, use Initial number of workers to specify a fixed number of Google Compute Engine instances.

Initial number of workers

Number of Google Compute Engine instances with which to launch jobs. This number may be adjusted as part of job execution. This number must be an integer between 1 and 1000, inclusive.

Maximum number of workers

Maximum number of Google Compute Engine instances to use during execution. This number must be an integer between 1 and 1000, inclusive, and must be greater than the initial number of workers.
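For example, with the hypothetical values below, the job launches with 5 Google Compute Engine instances, and throughput-based autoscaling can scale it to at most 50:

  Initial number of workers:   5
  Maximum number of workers:   50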

Service account

Every job executed in Dataflow must be submitted through a service account. By default, Dataprep by Trifacta uses a single Compute Engine service account under which jobs from all project users are run.

Optionally, you can specify a different service account under which to run your jobs for the project.

Note

When using a named service account to access data and run jobs in other projects, you must be granted the roles/iam.serviceAccountUser role on the service account to use it.

Note

If companion service accounts are enabled, each user must have a service account specified for use in their Preferences.

For more information on service accounts, see Google Service Account Management.
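As an illustration of the first note above, the following gcloud command grants a user the roles/iam.serviceAccountUser role on a named service account. The account and user names are hypothetical:

  gcloud iam service-accounts add-iam-policy-binding \
      dataflow-runner@example-project.iam.gserviceaccount.com \
      --member="user:analyst@example.com" \
      --role="roles/iam.serviceAccountUser"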

Labels

Create or assign labels to apply to the billing for the Dataprep by Trifacta jobs run in your project. You may reference up to 64 labels.

Note

Each label must have a unique key name.

For more information, see https://cloud.google.com/resource-manager/docs/creating-managing-labels.
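For example, a project might apply labels such as the following to group job costs. These are hypothetical key-value pairs; each key must be unique:

  cost-center : cc-1234
  team        : data-engineering
  environment : production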