BigQuery Running Environment
Note
This feature may not be available in all product editions. For more information on available features, see Compare Editions.
BigQuery is a scalable cloud data warehouse integrated with the Google Cloud Platform for storage of a wide range of datasets. In some use cases, your transformation jobs can be executed completely in BigQuery. If all of your source datasets and outputs are in BigQuery locations, then transferring the execution steps from the Trifacta node to BigQuery yields the following benefits:
A minimum of data (recipe steps and associated metadata) is transferred between systems. Everything else remains in BigQuery.
Recipe steps are converted into SQL that is understandable and native to BigQuery. Execution times are much faster.
Depending on your environment, total cost of executing the job may be lower in BigQuery.
In this scenario, the recipe steps are converted to SQL, which is executed sequentially against the datasets and tables of your source data. Intermediate results are written to temporary tables, from which the results that you have defined for your output are written.
Tip
When running a job in BigQuery, your data never leaves BigQuery.
Tip
For jobs that are executed in BigQuery, you can optionally enable the execution of the visual profile in BigQuery, too. This option is enabled for individual flows. For more information, see Flow Optimization Settings Dialog.
Requirements
This feature must be enabled by the project owner. See Configure Running Environments.
The permission to execute jobs in BigQuery must be enabled. In most environments, it is enabled by default. For more information, see Required Dataprep User Permissions.
In your flow, you must enable all general and BigQuery-specific flow optimizations. When all of these optimizations are enabled, the job can be pushed down to BigQuery for execution. For more information, see Flow Optimization Settings Dialog.
If the requirements and limitations are met, the Trifacta Application automatically executes the job in BigQuery.
Limitations
BigQuery as a running environment requires that pushdowns be enabled for the project and for the specific flow for which the job is executed. If the flow and the project are properly configured, the job is automatically executed in BigQuery.
Note
BigQuery is not a running environment that you explicitly select or specify as part of a job. If all of the requirements are met, then the job is executed in BigQuery when you select Dataflow.
Note
Datasources that require conversion are not supported for execution in BigQuery.
For any job to be executed in BigQuery:
All datasources must be located in BigQuery or Cloud Storage.
If the output is a file published to Cloud Storage, then inputs must be in BigQuery.
BigQuery does not publish CSV files with quoted values.
For example, when a column value is empty in the Transformer page, BigQuery publishes it as an empty string.
If a column string value has quotes or another delimiter, BigQuery encloses that string value with double quotes for CSV and JSON files.
When publishing to JSON format, BigQuery does not write column values that are empty or null.
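The publishing behaviors above can be sketched in Python using the standard csv and json modules. This is an illustration of the described output conventions only; the actual serialization is performed by BigQuery, and the column names here are hypothetical.

```python
import csv
import io
import json

# Illustrative rows: one value contains quotes, one is null.
rows = [{"name": "Alice", "note": 'said "hi"'},
        {"name": "Bob", "note": None}]

# CSV: values containing quotes or the delimiter are enclosed in double
# quotes (with inner quotes doubled); empty or null values are written
# as empty strings, as described above.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "note"],
                        quoting=csv.QUOTE_MINIMAL)
writer.writeheader()
for row in rows:
    writer.writerow({k: ("" if v is None else v) for k, v in row.items()})
print(buf.getvalue())
# Alice's note is written as "said ""hi""" and Bob's note as an empty field.

# JSON (newline-delimited): empty or null column values are omitted entirely.
for row in rows:
    print(json.dumps({k: v for k, v in row.items() if v not in (None, "")}))
```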
Dataflow must be selected as the running environment.
All recipe steps, including all Wrangle functions in the recipe, must be translatable to SQL.
Note
When attempting to execute a job in BigQuery, the Trifacta Application executes each recipe step in BigQuery until it reaches a step that cannot be executed there. At that point, data is transferred to Dataflow, where the remainder of the job is executed.
BigQuery imposes a limit of 1 MB for all submitted SQL queries. If this limit is exceeded during job execution, Dataprep by Trifacta falls back to submitting the job through Dataflow.
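The 1 MB limit applies to the size of the generated SQL text. A minimal sketch of such a pre-check is shown below; the function name and constant are illustrative assumptions, not part of the product.

```python
# BigQuery's documented maximum length for submitted SQL text is 1 MB.
MAX_QUERY_BYTES = 1_000_000  # illustrative constant for the 1 MB limit

def exceeds_query_limit(sql: str) -> bool:
    """Return True if the SQL text is too large to submit to BigQuery."""
    return len(sql.encode("utf-8")) > MAX_QUERY_BYTES

generated_sql = "SELECT * FROM `project.dataset.table` WHERE id = 1"
print(exceeds_query_limit(generated_sql))  # → False
```

If a check like this fails for the SQL generated from a recipe, the job falls back to Dataflow as described above.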
If the schemas have changed for your datasets, pushdown execution on BigQuery is not supported. Dataprep by Trifacta falls back to submitting the job through Dataflow.
Some transformations and functions are not currently supported for execution in BigQuery. See below.
If your recipe includes data quality rules, the job cannot be fully executed in BigQuery.
BigQuery does not permit partitioned tables to be replaced. As a result, the Drop and Load publishing action is not supported when writing to a partitioned table during BigQuery execution. For more information, see https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_statement.
In BigQuery, escaped whitespace characters (\s) match a broader set of Unicode space characters than in Dataflow, due to differences in the implementation of regular expressions between the two running environments. Depending on your dataset, this difference may result in mismatches between rows in your results when running the same job across different running environments.
Some uncommon date formats are not supported for pushdown.
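The \s behavior difference described above can be illustrated with Python's re module as a stand-in: with Unicode matching (the default), \s matches characters such as the no-break space, while ASCII-only matching does not. The exact character sets matched by BigQuery and Dataflow may differ from this sketch.

```python
import re

sample = "a\u00a0b"  # contains U+00A0 NO-BREAK SPACE

# Broader, Unicode-aware matching (analogous to the BigQuery behavior).
unicode_matches = re.findall(r"\s", sample)
# Narrower, ASCII-only matching.
ascii_matches = re.findall(r"\s", sample, re.ASCII)

print(len(unicode_matches))  # 1 — the no-break space is matched
print(len(ascii_matches))    # 0 — it is not
```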
Cloud Storage File Support
Note
This feature may not be available in all product editions. For more information on available features, see Compare Editions.
In addition to BigQuery sources, you can execute jobs in BigQuery on source files from Cloud Storage.
By default, the Full execution for GCS file option is enabled for new flows. For more information, see Flow Optimization Settings Dialog.
For more information, see Google Cloud Storage Access.
Tip
The BigQuery running environment also supports hybrid sources, so you can use Cloud Storage files and BigQuery tables as sources in the same flow.
Requirements:
Publishing actions must be defined to target BigQuery.
External views must be enabled in BigQuery. External views are used to query GCS files. For more information, see Required Dataprep User Permissions.
In the Trifacta Application, the following flow optimization settings must be enabled at the flow level.
BigQuery optimization
Full execution for GCS file
In the Run Job page, the Dataflow + BigQuery running environment must be selected. For more information, see Run Job Page.
Supported file formats from Cloud Storage:
CSV. CSV files that fail to meet the following requirements may cause job failures when executed in BigQuery, even though they can be imported into Dataprep by Trifacta. Requirements:
For job execution of CSV files in BigQuery, source CSV files must be well-formatted:
Rows must be terminated by newlines.
Fields must be delimited by commas and enclosed in quotes where needed.
Quotes within a field value must be escaped by doubling them ("") when needed.
Each row must have the same number of columns.
For more information, see https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv.
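A CSV file meeting the above requirements can be produced with Python's standard csv module, which terminates rows with newlines, delimits fields with commas, and escapes quotes by doubling them. The filename and data below are illustrative.

```python
import csv

rows = [
    ["id", "comment"],
    ["1", 'said "hello"'],
    ["2", "contains, a comma"],
]

# newline="" lets the csv module control row terminators itself.
with open("well_formed.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=",", quotechar='"',
                        quoting=csv.QUOTE_MINIMAL)
    # Fields containing commas or quotes are quoted automatically,
    # and embedded quotes are escaped as "".
    writer.writerows(rows)
```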
TSV
JSON (newline-delimited)
TXT
LOG
Compressed Files (gz and bz)
Note
Snappy and bz2 file formats are not supported for pushdown execution in BigQuery. When these file formats are encountered as datasources, the job automatically reverts to run on Dataflow.
Supported file encodings:
UTF-8
ISO-8859-1
Supported delimiters:
Comma
Tab
Pipe
Supported quote characters:
No quotes
Double quotes
Unsupported Wrangle for BigQuery Execution
The following transformations and functions are not currently supported for execution in BigQuery.
Note
If your recipe contains any of the following transformations or functions, full job execution in BigQuery is not possible at this time. Support for these transformations is expected in future releases, at which point they will be removed from this list.
General limitations
Regex patterns used must be valid RE2. Operations on non-RE2 regex patterns are not pushed down.
Source metadata references such as $rownumber and $filepath are not supported for pushdown.
For more information on limitations on specific pushdowns, see Flow Optimization Settings Dialog.
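The RE2 restriction above can be screened for heuristically. The sketch below flags a few well-known constructs that RE2 rejects (backreferences and lookaround); it is not an exhaustive RE2 validator, and the function name and pattern list are illustrative assumptions, not part of the product.

```python
import re

# Regex constructs that RE2 does not support (non-exhaustive).
_NON_RE2_CONSTRUCTS = [
    r"\\[1-9]",   # backreferences such as \1
    r"\(\?=",     # lookahead
    r"\(\?!",     # negative lookahead
    r"\(\?<=",    # lookbehind
    r"\(\?<!",    # negative lookbehind
]

def looks_re2_compatible(pattern: str) -> bool:
    """Return False if the pattern uses constructs RE2 rejects."""
    return not any(re.search(c, pattern) for c in _NON_RE2_CONSTRUCTS)

print(looks_re2_compatible(r"\d{4}-\d{2}"))  # True — plain RE2-safe pattern
print(looks_re2_compatible(r"(\w+)\s+\1"))   # False — uses a backreference
```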
Unsupported transformations
None.
Unsupported functions
The following Wrangle functions are not currently supported for execution in BigQuery.
KTHLARGEST
KTHLARGESTIF
KTHLARGESTUNIQUE
KTHLARGESTUNIQUEIF
MODE
MODEIF
QUARTILE
APPROXIMATEMEDIAN
APPROXIMATEPERCENTILE
APPROXIMATEQUARTILE
For more information, see Aggregate Functions.
LCM
NUMVALUE
Partially supported:
NUMFORMAT: Only supported when used for rounding.
For more information, see Math Functions.
NETWORKDAYS
NETWORKDAYSINTL
MODEDATE
WORKDAY
WORKDAYINTL
MODEDATEIF
KTHLARGESTDATE
KTHLARGESTUNIQUEDATE
KTHLARGESTUNIQUEDATEIF
KTHLARGESTDATEIF
EOMONTH
SERIALNUMBER
Partially supported:
DATEFORMAT: Some uncommon formatting options are not supported for pushdown.
For more information, see Date Functions.
RIGHTFIND
EXACT
STRINGGREATERTHAN
STRINGGREATERTHANEQUAL
STRINGLESSTHAN
STRINGLESSTHANEQUAL
DOUBLEMETAPHONEEQUALS
TRANSLITERATE
For more information, see String Functions.
SESSION
For more information, see Window Functions.
Verify Execution
To verify execution in BigQuery, please do the following:
Steps:
In the left nav bar, click the Jobs link.
In the Job History page, select the job that you executed.
In the Overview tab, the value for Environment under the Execution summary should be BigQuery.
For more information, see Job Details Page.