Snowflake Running Environment

Snowflake provides cloud-based data storage and analytics as a service. Among other infrastructures, Snowflake runs on Amazon S3. If all of your source datasets and outputs are in Snowflake locations and other conditions are met, then the entire execution of the transformations can occur in Snowflake.

Transferring the execution steps from the Trifacta node to Snowflake yields the following benefits:

  • A minimum of data (recipe steps and associated metadata) is transferred between systems. Everything else remains in Snowflake.

  • Recipe steps are converted into SQL that is understandable and native to Snowflake. Execution times are much faster.

  • Depending on your environment, total cost of executing the job may be lower in Snowflake.

In this scenario, the recipe steps are converted to SQL, which is executed sequentially against your source data in temporary tables, from which the results that you have defined for your output are written.
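
The generated SQL follows a staged pattern: each recipe step reads from the result of the previous step in a temporary table, and the final result is written to the output that you defined. The sketch below illustrates that pattern using the snowflake-connector-python package; the connection parameters, table names, and SQL are placeholders for illustration only, not the SQL that the product actually generates.

  import snowflake.connector

  # Placeholder connection values; use the values from your Snowflake connection.
  conn = snowflake.connector.connect(
      account="my_account", user="my_user", password="...",
      warehouse="MY_WH", database="MY_DB", schema="PUBLIC",
  )
  cur = conn.cursor()

  # A recipe step rendered as SQL into a temporary table (hypothetical step).
  cur.execute("""
      CREATE TEMPORARY TABLE step_1 AS
      SELECT id, UPPER(name) AS name        -- e.g. a "change case" step
      FROM MY_DB.PUBLIC.SOURCE_TABLE
  """)

  # The next step reads from the previous temporary table.
  cur.execute("""
      CREATE TEMPORARY TABLE step_2 AS
      SELECT id, name
      FROM step_1
      WHERE name IS NOT NULL                -- e.g. a filter step
  """)

  # The final results are written to the output table defined for the job.
  cur.execute(
      "CREATE OR REPLACE TABLE MY_DB.PUBLIC.OUTPUT_TABLE AS SELECT * FROM step_2"
  )

  cur.close()
  conn.close()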

Tip

When running a job in Snowflake, your data never leaves Snowflake.

Tip

Execution on datasets created with custom SQL is supported.

If the requirements and limitations are met, the Cloud Portal automatically executes the job in Snowflake.

Requirements

General

  • This feature must be enabled by the workspace admin. See below.

  • Cloud Portal must be integrated with Snowflake. See Snowflake Connections.

    • The permission to execute jobs in Snowflake must be enabled.

  • All sources and outputs must reside in Snowflake.

  • Permissions required to run a Snowflake-to-file job (see the illustrative grant sketch after this list):

    • If a stage DB is defined in the connection, the connecting user requires permission to read, and to create a stage, create a temporary table, and create functions in the public schema of the stage DB.

    • If a stage DB is not defined in the connection, the connecting user requires the same permissions in the public schema of the load DB.

  • Spark + Snowflake must be selected as a running environment. See Run Job Page.

  • Jobs are executed in the virtual warehouse that is specified as part of the Snowflake connection.

    Note

    Job execution requires significantly more resources than ingest or publish jobs on Snowflake. Before you begin using Snowflake, you should verify that your Snowflake virtual warehouse has sufficient resources to handle the expected load. For more information, see Snowflake Connections.

  • In your flow, you must enable all general and Snowflake-specific flow optimizations. When all of these optimizations are enabled, the job can be pushed down to Snowflake for execution. See "Flow Optimizations" below.
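
The stage DB and load DB permissions listed above correspond to standard Snowflake grants. The sketch below is an illustration only of the kind of grants an administrator might issue; STAGE_DB and LOADER_ROLE are placeholder names, and your environment may organize privileges differently.

  import snowflake.connector

  # Placeholder names: STAGE_DB, LOADER_ROLE. Run with a role that is allowed
  # to grant privileges on the database (for example, the database owner).
  conn = snowflake.connector.connect(account="my_account", user="admin_user", password="...")
  cur = conn.cursor()

  for stmt in (
      "GRANT USAGE ON DATABASE STAGE_DB TO ROLE LOADER_ROLE",
      "GRANT USAGE ON SCHEMA STAGE_DB.PUBLIC TO ROLE LOADER_ROLE",
      # Read access to the data being transformed.
      "GRANT SELECT ON ALL TABLES IN SCHEMA STAGE_DB.PUBLIC TO ROLE LOADER_ROLE",
      # Create a stage, temporary tables, and functions in the PUBLIC schema.
      "GRANT CREATE STAGE ON SCHEMA STAGE_DB.PUBLIC TO ROLE LOADER_ROLE",
      "GRANT CREATE TABLE ON SCHEMA STAGE_DB.PUBLIC TO ROLE LOADER_ROLE",
      "GRANT CREATE FUNCTION ON SCHEMA STAGE_DB.PUBLIC TO ROLE LOADER_ROLE",
  ):
      cur.execute(stmt)

  cur.close()
  conn.close()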

Requirements across multiple Snowflake connections

If you are executing a job that utilizes multiple Snowflake connections, the following requirements must also be met for the job to execute in Snowflake:

Limitations

Snowflake as a running environment requires that pushdowns be enabled for the workspace and for the specific flow for which the job is executed. If the flow and the workspace are properly configured, the job is automatically executed in Snowflake.

Note

Snowflake is not a running environment that you explicitly select or specify as part of a job. If all of the requirements are met, then the job is executed in Snowflake when you select the Spark + Snowflake running environment.

  • Except as noted below, all datasources and all outputs specified in a job must be located within Snowflake.

    Note

    Datasources that require conversion are not supported for execution in Snowflake.

  • All recipe steps, including all Wrangle functions in the recipe, must be translatable to SQL.

    Note

    When attempting to execute a job in Snowflake, the Cloud Portal executes each recipe in Snowflake until it reaches a step that cannot be executed there. At that point, the data is transferred to EMR, where the remainder of the job is executed.

  • If the schemas have changed for your datasets, pushdown execution on Snowflake is not supported. Designer Cloud falls back to submitting the job through another running environment.

  • Some transformations and functions are not currently supported for execution in Snowflake. See below.

  • If your recipe includes data quality rules, the job cannot be fully executed in Snowflake.

  • Visual profiling is supported with the following conditions or requirements.

    • Visual profiles are unloaded to a stage in an S3 bucket.

    • If a stage is named in the connection, it is used. This stage must point to the default S3 bucket in use. (A sketch of creating such a named stage appears after this list.)

    • If no stage is named, a temporary stage is created in the PUBLIC schema. The connecting user must have write access to PUBLIC.

      Note

      Creating a temporary stage requires temporary credentials from AWS. These credentials are valid for 1 hour only. If a job is expected to run longer than one hour, you should define a named stage.

    • For more information, see Snowflake Connections.

  • Sampling in Snowflake is supported with the following limitations:

    • The following sampling methods are not supported:

      • Stratified

      • Cluster-based

    • The following file formats are not supported for sampling in Snowflake when stored in S3:

      • CSV

      • JSON

    • Other JDBC-based sources are not supported.
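
For visual profiling, a named stage must point to the default S3 bucket in use. The sketch below shows one way such a stage could be created; the stage, bucket, and storage integration names are placeholders, and the authentication method (storage integration versus explicit credentials) depends on how your Snowflake connection reaches S3.

  import snowflake.connector

  conn = snowflake.connector.connect(
      account="my_account", user="my_user", password="...",
      warehouse="MY_WH", database="MY_DB", schema="PUBLIC",
  )
  cur = conn.cursor()

  # Placeholder names throughout. The stage must point at the default S3
  # bucket; a storage integration is assumed here for authentication.
  cur.execute("""
      CREATE STAGE MY_DB.PUBLIC.PROFILE_STAGE
        URL = 's3://my-default-bucket/profiles/'
        STORAGE_INTEGRATION = MY_S3_INTEGRATION
  """)

  cur.close()
  conn.close()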

Enable

Workspace Settings

The following setting must be enabled in the workspace. Select User menu > Admin console > Settings.

Optimization: Logical and physical optimization of jobs

Description: When enabled, the Cloud Portal attempts to optimize job execution through logical optimizations of your recipe and physical optimizations of your recipe's interactions with data.

For more information, see Workspace Settings Page.

Flow Optimizations

You must enable the Snowflake optimizations in your flow. In Flow View, select More menu > Optimization settings.

Note

All general optimizations must be enabled for your flow, as well as the following optimizations, which are specific to Snowflake.

Optimization: Snowflake > Column pruning from source

Description: When enabled, job execution performance is improved by removing any unused or redundant columns from the source database.

Optimization: Snowflake > Filter pushdown

Description: When enabled, the Cloud Portal optimizes job performance on this flow by pushing data filters directly to the source database.

Optimization: Snowflake > Full pushdown

Description: When enabled, all supported pushdown operations, including full transformation and profiling job execution, are pushed down to Snowflake where possible.

Full execution for S3

If requirements are met for data sourced from S3, you can enable execution of your S3 datasources in Snowflake.

Note

Snowflake pushdown is not supported for external S3 connections.

Optimization: Source to Files

Description: When this setting is enabled, Snowflake tables that meet all pushdown requirements can be executed through Snowflake and published to S3.

For more information, see Flow Optimization Settings Dialog.

Run Job

To execute a job in Snowflake in the Cloud Portal:

  • Your job must meet the requirements listed above.

    Tip

    Some S3-based jobs can be executed in Snowflake. Additional requirements are listed below.

  • Your job must not include the functions, transformations, or other unsupported elements that are listed below.

  • You must select Spark + Snowflake as your running environment in the Run Job page.

    Note

    If this running environment option does not appear in the Run Job page, then the required optimization settings have not been enabled for the workspace or the flow (see above), or the data or recipes do not meet the criteria for execution in Snowflake.

    See Run Job Page.

Tip

After launching the job, you can monitor job execution through the Job Details page, which includes a link to the corresponding job in the Snowflake console.

S3 File Support

In addition to Snowflake sources, you can execute jobs in Snowflake on source files from S3.

Tip

The Snowflake running environment also supports hybrid sources, so you can use S3 files and Snowflake tables as sources in the same flow.

Requirements

  • S3 or ADS is supported as the default storage layer.

  • Publishing actions must be defined with a target in Snowflake.

  • The policies in use must support all of the requirements for Snowflake access and execution. See Required AWS Account Permissions.

  • In the Cloud Portal, the required flow optimization settings (see "Flow Optimizations" above) must be enabled at the flow level.

  • In the Run Job page, the Spark + Snowflake running environment must be selected.

    Tip

    If this option is not available, one or more requirements for S3 file execution on Snowflake have not been met.

    For more information, see Run Job Page.

Execution requirements

Note

For execution of S3 jobs in Snowflake, AWS credentials are passed in encrypted format as part of the SQL that is executed within Snowflake.

  • Jobs executed on an empty directory fail. Each directory that is read must contain at least one file.

  • When defining a publishing action, the schema that is used for the publication must match the staging schema defined in the connection. See Snowflake Connections.

  • If temporary credentials are used to access S3, these credentials are valid for one hour only and cannot be refreshed. As a result, any ingestion process from S3 into Snowflake that takes longer than one hour is likely to cause the job to fail.

  • Datasets with parameters can be composed of a maximum of 1000 files.

  • All files referenced in a dataset with parameters must use a single and consistent delimiter.

Supported file formats from S3

  • CSV: Files that fail to meet the following requirements may cause job failures when executed in Snowflake, even though they can be imported into Designer Cloud. Requirements (a short example of a well-formatted file appears after these lists):

    • For job execution of CSV files in Snowflake, source CSV files must be well-formatted.

    • Each record must be terminated with a newline.

    • Fields must be demarcated with quotes and separated by commas.

      Note

      Quotes inside field values must be escaped by doubling them (""). Escaping quotes with a backslash is not supported.

    • Each row must have the same number of columns.

  • TSV

  • JSON (newline-delimited)

    Note

    Snowflake only supports UTF-8 encoding for JSON files.

  • TXT

  • gzip and bz2 compressed formats are supported.

    Note

    Snappy compression is not supported for S3 execution on Snowflake.

Supported file encodings:

  • UTF-8

  • ISO-8859-1

Supported delimiters:

  • Comma

  • Tab

  • Pipe

Supported quote characters:

  • No quotes

  • Double quotes
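
To illustrate the CSV requirements above, the following sketch writes a file in the expected shape: every field quoted, fields separated by commas, each record terminated with a newline, and embedded quotes escaped by doubling them. The file name and data are placeholders.

  import csv

  rows = [
      ["id", "name", "comment"],
      ["1", "Acme, Inc.", 'She said "hello"'],  # comma and quote inside fields
      ["2", "Widgets Ltd", ""],
  ]

  with open("example.csv", "w", newline="", encoding="utf-8") as f:
      # QUOTE_ALL plus doublequote=True quotes every field and writes embedded
      # quotes as "" (backslash escaping is not accepted).
      writer = csv.writer(f, delimiter=",", quoting=csv.QUOTE_ALL, doublequote=True)
      writer.writerows(rows)

  # Resulting file content:
  # "id","name","comment"
  # "1","Acme, Inc.","She said ""hello"""
  # "2","Widgets Ltd",""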

Uploaded File Support

When a file is uploaded from your desktop, ingested, and stored in a storage layer that is supported for file pushdown, jobs that reference datasets created from that file are eligible for execution in Snowflake. For example, if your base storage layer is S3, then files uploaded from your desktop could be used for jobs that execute like S3 files in Snowflake. The requirements and limitations listed in the previous section apply.

Unsupported Wrangle for Snowflake Execution

The following transformations and functions are not currently supported for execution in Snowflake.

Note

If your recipe contains any of the following transformations or functions, full job execution in Snowflake is not possible at this time. These transformations are expected to be supported and removed from this list in future releases.

General limitations

For more information on limitations on specific pushdowns, see Flow Optimization Settings Dialog.

Unsupported input data types

The following Snowflake data types are not supported for input into Designer Cloud:

  • BINARY

  • VARBINARY

  • GEOGRAPHY

Note

For mismatched values in columns of Integer data type, the mismatched value is published instead of a NULL value. This is a known issue.

Unsupported Alteryx data types

None.

Unsupported transformations

The following Wrangle transformations are not currently supported for execution in Snowflake.

  • None.

Unsupported functions

The following Wrangle functions are not currently supported for execution in Snowflake.

Aggregate functions

KTHLARGEST

KTHLARGESTIF

KTHLARGESTUNIQUE

KTHLARGESTUNIQUEIF

APPROXIMATEMEDIAN

APPROXIMATEPERCENTILE

APPROXIMATEQUARTILE

QUARTILE

For more information, see Aggregate Functions.

Math functions

LCM

NUMVALUE

Partially supported:

NUMFORMAT: Only supported when used for rounding.

For more information, see Math Functions.

Nested functions

Partially supported:

Note

For most array functions, such as ARRAYUNIQUE and KEYS functions, the order of elements in the output cannot be guaranteed.

Date functions

NETWORKDAYS

NETWORKDAYSINTL

WORKDAY

WORKDAYINTL

KTHLARGESTDATE

KTHLARGESTUNIQUEDATE

KTHLARGESTUNIQUEDATEIF

KTHLARGESTDATEIF

EOMONTH

SERIALNUMBER

String functions

DOUBLEMETAPHONEEQUALS

TRANSLITERATE

For more information, see String Functions.

Type functions

Partially supported:

IFMISSING

Note

When the IFMISSING function immediately follows the PREV function in your recipe steps, Snowflake generates an incorrect value. This is a known issue and will be fixed in a future Snowflake release.

Window functions

SESSION

For more information, see Window Functions.

Verify Execution

To verify execution in Snowflake, do the following:

Steps:

  1. In the left nav bar, click the Jobs link.

  2. In the Job History page, select the job that you executed.

  3. In the Overview tab, the value for Environment under the Execution summary should be: Snowflake.

For more information, see Job Details Page.
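
You can also confirm from the Snowflake side by reviewing the queries run by the connection's user in the job's virtual warehouse. The sketch below uses the snowflake-connector-python package and the INFORMATION_SCHEMA.QUERY_HISTORY_BY_USER table function; the account, user, and warehouse values are placeholders.

  import snowflake.connector

  conn = snowflake.connector.connect(
      account="my_account", user="my_user", password="...",
      warehouse="MY_WH", database="MY_DB", schema="PUBLIC",
  )
  cur = conn.cursor()

  # Recent queries run by the connection user. Jobs pushed down to Snowflake
  # appear here as the SQL generated from your recipe.
  cur.execute("""
      SELECT query_id, start_time, warehouse_name, query_text
      FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_USER(
          USER_NAME => 'MY_USER', RESULT_LIMIT => 50))
      ORDER BY start_time DESC
  """)
  for query_id, start_time, warehouse_name, query_text in cur:
      print(query_id, start_time, warehouse_name, query_text[:80])

  cur.close()
  conn.close()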