Skip to main content

Overview of Parameterization

In Dataprep by Trifacta, parameterization enables you to apply dynamic values to the data that you import and that you generate as part of job execution.

Parameter types:

  • Environment Parameters: A workspace administrator or project owner can specify parameters that are available across the environment, including default values for them.

  • Dataset Parameters: You can parameterize the paths to inputs for your imported datasets, creating datasets with parameters. For file-based imported datasets, you can parameterize the bucket where the source is stored.

  • Flow Parameters: You can create parameters of multiple types at the flow level, which can be referenced in any recipe in the flow.

  • Output Parameters: When you run a job, you can create parameters for the output paths for file- or table-based outputs.

These parameters can be defined by timestamp, patterns, wildcards, or variable values that you specify at runtime.

Environment Parameters

Project owners or workspace administrators can define parameters that apply across the project or workspace environment. These parameters can be referenced by any user in the environment, but only a user with admin access can define, modify, or delete these parameters.

Tip

Environment parameters are a useful means of ensuring that all users of the project or workspace share common reference values to buckets, output locations, and more. Environment parameter definitions can be exported and then imported into other projects or workspaces to ensure commonality across the enterprise. The values assigned to environment parameters can be modified after they have been imported into a new project or workspace.

Note

You must have admin access to the project or workspace to define environment parameters.

  • Names of environment parameters must begin with env..

Limitations

  • Environment parameter names are unique within the environment.

  • You cannot use environment parameters in recipes.

  • You cannot use environment parameters in plans.

  • You cannot use environment parameters in schedule triggers.

Example - parameterized bucket names

In this example, you have three Alteryx workspaces, each of which has a different set of resources, although the only difference between them is the name of the S3 bucket in which they are stored:

Environment Name

S3 Bucket Name

Dev

myco-s3-dev

Test

myco-s3-test

Prod

myco-s3-prod

In your Dev workspace, you can create an environment parameter called the following:

env.bucket-source

The default value for this parameter is set to:

myco-s3-dev

When creating imported datasets in this workspace, you insert the environment parameter for the source bucket for each one.

For your Test and Prod environments:

  1. Export your environment parameters from Dev.

  2. Import them into Test and Prod. During import, the importing user can map the imported parameters to existing parameters in the environment.

  3. In the imported environments, an administrator can manage the imported parameters and values as needed.

When you later export your flows from Dev and move them to Test and Prod, the imported flows automatically connect to the correct bucket for the target environment, since the bucket name is referenced by an environment parameter.

Example - parameterized database names

In the previous example, bucket names differed between the development and production workspaces. Similarly, there may be different values for the database to use between development and production.

Tip

If you are creating datasets using custom SQL, this method can be used for migrating flows between workspaces. If the flows reference imported datasets that use custom SQL, you only need to update the value of the environment parameter in the target workspace and reset the connection identifier in order to reconnect the imported flow to its data sources.

Environment Name

Database Name

Dev

myco-DB-dev

Test

myco-DB-test

Prod

myco-DB-prod

In your Dev workspace, you can create the environment parameter as follows:

env.db-source

Export and Import

You can export environment parameters from one environment and import them to another. For example, you may be building your flows in a Dev workspace before they are exported and imported into a Prod workspace. If your flows make use of environment parameters from the Dev space, you may want to export the parameters and their values from the Dev workspace for migration to the Prod workspace.

Note

As part of the import process, you must reconcile name conflicts between imported environment parameters and the parameters that already exist in the workspace.

For more information, see Manage Environment Parameters.

Datasets with Parameters

In some cases, you may need to be able to execute a recipe across multiple instances of identical datasets. For example, if your source dataset is refreshed each week under a parallel directory with a different timestamp, you can create a variable to replace the parts of the file path that change with each refresh. This variable can be modified as needed at job runtime.

Example

Suppose you have imported data from a file system source, which has the following source path to weekly transactions:

<file_system>:///source/transactions/2018/01/29/transactions.csv

In the above, you can infer a date pattern in the form of 2018/01/29, which suggests that there may be a pattern of paths to transaction files. Based on the pattern, it'd be useful to be able to do the following:

  • Import data from parallel paths for other weeks' data.

  • Sample across all of the available datasets.

  • Execute jobs based on runtime variables that you set for other transaction sets fitting the pattern.

  • Pass in parameterized values through API to operationalize the execution of jobs across weeks of transaction data.

In this case, you would want to parameterize the date values in the path, such that the dynamic path would look like the following:

<file_system>:///source/transactions/YYYY/MM/DD/transactions.csv

The above example implements a Datetime parameter on the path values, creating a dataset with parameters.

Parameter Types

You can use the following types of parameters to create datasets with parameters:

  • Datetime parameters: Apply parameters to date and time values appearing in source paths.

    • When specifying a Datetime parameter, you must also specify a range, which limits the range of the Datetime values.

  • Variables: Define variable names and default values for a dataset with parameters.

    • Variable parameters can be applied to elements of the source path or to the bucket name, if applicable.

    • Modify these values at runtime to parameterize execution.

  • Pattern parameters:

    • Wildcards: Apply wildcards to replace path values.

    • Regular Expressions: You can apply regular expressions to specify your dataset matches. Please see the limitations section below for more information.

    • Wrangle : The platform supports a simplified means of expressing patterns.

For more information, see Create Dataset with Parameters.

Guidelines for Sources

The source files or tables for a dataset with parameters should have consistent structures. Since the sources are parsed with the same recipe or recipes, variations in schema could cause breakages in the recipe or initial parsing steps, which are applied based on the schema of the first matching source.

Note

All datasets imported through a single parameter are expected to have exactly matching schemas. For more information on variations, see Mismatched Schemas below.

Tip

If there have been changes to the schema of the sources of your dataset with parameters, you can edit the dataset and update the parameters. See Library for Data Page.

Parameters in paths for imported datasets are rendered as regular expressions. Depending on the number of parameters and the comparative depth of them in a parameterized dataset, the process of performing all pattern checks can grow large, impacting import performance.

Tip

When specifying an imported dataset with parameters, you should attempt to be as specific as possible in your parameter definitions.

Note

When importing one or more Excel files as a parameterized dataset, you select worksheets to include from the first file. If there are worksheets in other Excel files that match the names of the worksheets that you selected, those worksheets are also imported. All worksheets are unioned together into a single imported dataset with parameters. Pattern-based parameters are not supported for import of Excel worksheets.

Mismatched Schemas

Dataprep by Trifacta expects that all datasets imported using a single parameter have schemas that match exactly. The schema for the entire dataset is taken from the first dataset that matches for import.

If schemas do not match:

  • When the first dataset contains extra columns at the end, the subsequent datasets that match should import without issues.

  • If the subsequent datasets contain extra columns at the end, the datasets may import. Depending on the situation, there may be issues.

  • If the subsequent datasets have additional or missing columns in the middle of the dataset, results of the import are unpredictable.

    • If there are extra columns in the middle of the dataset, you may see extra data in the final column, in which the spill-over data has not been split.

  • Ideally, you should fix these issues in the source of the data. But if you cannot, you can try the following:

Tips:

  • After import of a dataset with parameters, perform a full scan random sample. When the new sample is selected:

    • Check the last column of your imported to see if you have multiple columns of data. See if you can perform split the columns yourself.

    • Scan the column histograms to see if there are columns where the number of mismatches or anomalous or outlier values has suddenly increased. This could be a sign of mismatches in the schemas.

  • Edit the dataset with parameters. Review the parameter definition. Click Update to re-infer the data types of the schemas. This step may address some issues.

  • You can use the union tool to import the oldest and most recent sources in your dataset with parameters. If you see variations in the schema, you can look to modify the sources to match.

    • If your sources have variation in structure, you should remove the structure from the imported dataset and create your own initial parsing steps to account for the variations. See Initial Parsing Steps.

Limitations

  • You cannot create datasets with parameters from uploaded data.

  • You cannot create dataset with parameters from multiple file types.

    • File extensions can be parameterized. Mixing of file types (e.g. TXT and CSV) only works if they are processed in an identical manner, which is rare.

    • You cannot create parameters across text and binary file types.

  • For parameterized datasets sourced from Cloud Storage, only the first 100,000 files are read.

  • For datasources that require conversion, such as Excel, PDF, or JSON files, you can create a dataset with parameters from a maximum of 500 converted files.

  • Parameter and variable names can be up to 255 characters in length.

  • For regular expressions, the following reference types are not supported due to the length of time to evaluate:

    • Backreferences. The following example matches on axa, bxb, and cxc yet generates an error:

      ([a-c])x\1
    • Lookahead assertions: The following example matches on a, but only when it is part of an ab pattern. It generates an error:

      a(?=b)
  • For some source file types, such as Parquet, the schemas between source files must match exactly.

Creating Dataset with Parameters

From file system

When browsing for data on your default storage layer, you can choose to parameterize elements of the path. Through the Import Data page, you can select elements of the path, apply one of the supported parameter types and then create the dataset with parameters.

Note

Matching file path patterns in a large directory can be slow. Where possible, avoid using multiple patterns to match a file pattern or scanning directories with a large number of files. To increase matching speed, avoid wildcards in top-level directories and be as specific as possible with your wildcards and patterns.

Tip

For best results when parameterizing directories in your file path, include the trailing slash (/ ) as part of your parameterized value.

Options:

  • You can choose to search nested folders for files that match your specified pattern.

    Note

    You cannot create multiple wildcard parameters when the Include nested folders option is selected. When this option is selected for the dataset, only one wildcard parameter is supported.

  • You can choose to search hidden folders.

    Note

    Including hidden folders must be enabled by an administrator. For more information, see Dataprep Project Settings Page.

Tip

If your imported dataset is stored in a bucket, you can parameterize the bucket name, which can be useful if you are migrating flows between environments or must change the bucket at some point in the future.

For more information, see Create Dataset with Parameters.

From relational source

If you are creating a dataset from a relational source, you can apply parameters to the custom SQL that pulls the data from the source.

Note

Avoid using parameters in places in the SQL statement that change the structure of the data. For example, within a SELECT statement, you should not add parameters between the SELECT and FROM keywords.

For more information, see Create Dataset with SQL.

Matching parameters

When a dataset with parameters is imported for use, all matching source files or tables are automatically unioned together.

Note

Sources for a dataset with parameters should have matching schemas.

The initial sample that is loaded in the Transformer page is drawn from the first matching source file or table. If the initial sample is larger than the first file, rows may be pulled from other source objects.

Managing Datasets with Parameters

Datasets with parameters in your flows

After you have imported a dataset with parameters into your flow:

  • You can review any parameters that have been applied to the dataset through the Parameterization in Flow view.

  • When the dataset with parameters is selected, you can use the right panel to review and edit the parameters that are applied to it.

  • You can override the default value applied to the parameter through Flow View. See Manage Parameters Dialog.

For more information, see Flow View Page.

Tip

You can review details on the parameters applied to your dataset. See Dataset Details Page.

Sampling from datasets with parameters

When a dataset with parameters is first loaded into the Transformer page, the initial sample is loaded from the first found match in the range of matching datasets. If this match is a multi-sheet Excel file, the sample is taken from the first sheet in the file.

With parameters:

To work with data that appears in files other than the first match in the dataset, you must create a new sample in the Transformer page. Any sampling operations performed within the Transformer page sample across all matching sources of the dataset.

With variables:

If you have created a variable with your dataset, you can apply a variable value to override the default at sampling time. In this manner, you can specify sampling to occur from specific source files from your dataset with parameters.

For more information, see Overview of Sampling.

Scheduling for datasets with parameters

Schedules can been applied to a dataset with parameters. When resolving date range rules for scheduling a dataset with parameters, the schedule time is used.

For more information, see Add Schedule Dialog.

Sharing for datasets with parameters

By default, when a flow containing parameters is copied, any changes to parameter values in the copied flow also affect parameters in the original flow. To separate these parameters, you have the following options:

  1. Optionally, when the flow is copied, you can copy the underlying datasets.

  2. As a workaround, you can export and import the flow into the same system and replace the datasets in the imported flow.

Note

For copying flows using parameterized datasets, you should duplicate the datasets, which creates separate copies of parameters and their values in the new flow. If datasets are not copied, then parameter changes in the copied flow modify the values in the source flow.

For more information, see Overview of Sharing.

Housekeeping

Since Dataprep by Trifacta never touches the source data, after a source that is matched for a dataset with parameters has been executed, you should consider removing it from the source system or adjusting any applicable ranges on the matching parameters. Otherwise, outdated data may continue to factor into operations on the dataset with parameters.

Note

Housekeeping of source data is outside the scope of Dataprep by Trifacta. Please contact your IT staff to assist as needed.

Flow Parameters

You can specify flow parameters and their permitted values, which can be invoked in the recipe steps of your flow. Wherever the flow parameter is invoked, it is replaced by the value you set for the parameter.

Uses:

  • Dynamically affect recipe steps

  • Improve flow usability; build fewer flows and recipes to maintain

  • Parameters are evaluated at design time in the Transformer page and at runtime during job execution

  • All parameter values can be overridden, as needed.

Flow parameter types:

  • Variable type flow parameters:

    • Literal values: These values are always of String data type.

      Tip

      You can wrap flow parameter references in your transformations with one of the PARSE functions.

      Note

      Wildcards are not supported.

    • Wrangle . For more information, see Text Matching.

    • Regular expressions.

  • Selector type flow parameters:

    • These parameters are specified as a list of permitted values.

    • The first value is the default value.

    • The parameter values are treated as String values.

Limitations

  • Flow parameters are converted to constants in macros. Use of the macro in other recipes results in the constant value being applied.

  • A flow parameter cannot be used in some transformation steps or fields.

Example - variable type

Suppose you need to provide variable date ranges to your transformations. For example, you may need to be able to transform the data for the preceding 30, 60, 90, or custom days.

You can create a Variable parameter called:

$paramDaysBack

To verify that this critical parameter is properly specified before job execution, you set the default value to:

##UNSPECIFIED##

The above setting implies two things:

  • If the above value appears in the output, then an override value for the parameter was not specified when the job was executed, which prevents the default value being used erroneously.

  • Before the job is executed, you must specify an override value. You can specify an override:

    • At the flow level to assist in recipe development.

    • At run time to insert the proper DaysBack value for the job run.

  • Since all values are treated as String values, you must interpret the numeric values within your transformation steps as Integer data types.

For more examples, see Create Flow Parameter.

Example - selector type

Suppose you need to process your flow across several regions of your country. These regions are identified using a region ID value.

From the Flow View context menu, you select Manage parameters. In the Parameters tab, you specify Selector for the Type.

Specify the parameter name:

paramRegion

You can then add Choice values for the permitted regions:

WEST
MOUNTAIN
CENTRAL
EAST
OTHER

After the flow parameter has been created, you can invoke it in a transformation step using the following syntax.

$paramRegion

Where the parameter is referenced, the default or applicable override value is applied.

You can test for the parameter by using the following conditional in your transformation step and selecting from one of the listed values:

$paramRegion == <selected_value>

For more examples, see Create Flow Parameter.

Upstream flow parameters

If your flow references a recipe or dataset that is sourced from an upstream flow, the flow parameters from that flow are available in your current flow. That value of the parameter at time of execution is passed to the current flow.

Note

Downstream values and overrides of parameters that share the same name take precedence. When you execute the downstream flow, the parameter value is applied to the current flow and to all upstream objects. For more information, see "Order of Evaluation" below.

Creating flow parameters

Flow parameters are created at the flow level from the context menu in Flow View. See Manage Parameters Dialog.

Managing flow parameters

Flow parameters can be edited, deleted, and overridden through the Flow View context menu. See Manage Parameters Dialog.

Flow parameters in plans

You can also apply overrides to your flow parameters as part of your plan definition. For more information, see Plan View Page.

Output Parameters

You can specify variable and timestamp parameters to apply to the file or table paths of your outputs.

Note

Output parameters are independent of dataset parameters.

Parameter Types

You can create the following types of output parameters:

  • Datetime parameters: Insert date and time values in output paths based on the job's start time.

  • Variables: Define variable names and default values for an output parameter. Modify these values at runtime to parameterize execution.

Tip

These types of parameters can be applied to file or table paths. An output path can contain multiple parameters.

Example

Suppose you are generating a JSON file as the results of job execution.

/outputs/myFlow/myOutput.json

Since this job is scheduled and will be executed on a regular interval, you want to insert a timestamp as part of the output, so that your output filenames are unique and timestamped:

/outputs/myFlow/myOutput_<timestamp>.json

In this case, you would create an output parameter of timestamp type as part of the write settings for the job you are scheduling.

Creating output parameters

For more information, see Create Outputs.

Using output parameters

Whenever you execute a job using the specified publishing action, the output parameters are applied.

After specifying variable parameters, you can insert new values for them at the time of job execution in the Run Job page.

For more information, see Run Job Page.

Bucket Name Parameters

In addition to parameterizing the paths to imported datasets or outputs, you can also apply parameters to the buckets where these assets are stored. For example, if you are developing flows in one workspace and deploying them into a production workspace, it may be useful to create a parameter for the name of the bucket where outputs are written for the workspace.

Bucket names can be parameterized for the buckets in the following datastores:

  • Cloud Storage

Bucket names can be parameterized as variable parameters or as environment parameters. For more information on examples of parameterized bucket names, see "Environment Parameters" above.

For more information:

Parameter Overrides

For each of the following types of parameter, you can apply override values as needed.

Override Type

Description

dataset parameters

When you run a job, you can apply override values to variables for your imported datasets. See Run Job Page.

flow parameters

At the flow level, you can apply override values to flow parameters. These values are passed into the recipe and the rest of the flow for evaluation during recipe development and job execution.

Note

Overrides applied at the flow level are passed into all recipes and other objects in the flow. Wherever there is case-sensitive match between the name of the overridden parameter and a parameter name in the flow, the override value is applied. These values can be overridden by ad-hoc values. See "Order of Precedence" below.

When flows are referenced in a plan task, their parameters and default values are available for override.

output parameters

When you define your output objects in Flow View, you can apply override values to the parameterized output paths on an as-needed basis when you specify your job settings. See Run Job Page.

environment parameters

Environment parameters cannot be overridden.

Triggers

In flow and plan triggers, you can apply overrides to parameter values that are inherited from the object being triggered.

Note

In a trigger, displayed parameter values may be inherited from the plan or flow that is being triggered. To ensure that the proper value is used, you should set a specific value for the override in the trigger. This is a known issue.

Order of Parameter Evaluation

Wherever a parameter value or override is specified in the following list, the value is applied to all matching parameters within the execution tree. Suppose you have created a parameter called varRegion, which is referenced in your imported dataset, recipe, and output object. If you specify an override value for varRegion in the Run Job page, that value is applied to the data you import (dataset parameter), the recipe during execution (flow parameter), and the path of the output that you generate (output parameter). Name matches are case-sensitive.

Note

Override values are applied to upstream flows, as well. Any overrides specified in the current flow are passed to downstream flows, where they can be overridden as needed.

Parameter values are evaluated based on the following order of precedence (highest to lowest):

Note

The following does not apply to environment parameters, which cannot be overridden.

  1. Runtime or trigger-based overrides: Parameter values specified at runtime for ad-hoc or scheduled jobs.

    Note

    The override value is applied to all subsequent operations in the platform. When a job is submitted to the job queue, any overrides are applied at that time. Changes to override values do not affect jobs that are already in flight.

    1. For more information on runtime overrides, see Run Job Page.

    2. For more information for trigger-based overrides for flows, see Add Schedule Dialog.

    3. For more information on trigger-based overrides for plans, see Plan View for Triggers.

  2. Flow- or plan-level overrides: At the flow or plan level, you can specify override values, which are passed into the flow's objects. These values can be overridden by overrides set in the above locations. See Manage Parameters Dialog.

  3. Default values: If no overrides are specified, the default values are applied:

    1. Imported datasets: See Create Dataset with Parameters.

    2. Flow parameters: See Manage Parameters Dialog.

    3. Output parameters: See Run Job Page.

  4. Inherited (upstream) values: Any parameter values that are passed into a flow can be overridden by any matching override specified within the downstream flow.

Run Jobs with Parameters

When running a job based on datasets with parameters, results are written into separate folders for each parameterized path.

Note

During job execution, a canary file is written for each set of results to validate the path. For datasets with parameters, if the path includes folder-level parameterization, a separate folder is created for each parameterized path. During cleanup, only the the canary files and the original folder path are removed. The parameterized folders are not removed. This is a known issue.

Note

Due to a limitation in Dataflow, when you run a job on a parameterized dataset containing more than 100 files, the input paths data must be compressed, which results in non-readable location values in the Dataflow console. Running jobs on datasets sourced from more than 6000 files may fail.

Runtime Parameter Overrides

When you choose to run a job on a dataset with parameters from the user interface, any variables are specified using their default values.

Through the Run Job page, you can specify different values to apply to variables for the job.

Note

Override values applied to a job are not validated. Invalid overrides may cause your job to fail.

Note

Values applied through the Run Job page to variables override the default values for the current execution of the job. Default values for the next job are not modified.

Note

When you edit an imported dataset, if a variable is renamed, a new variable is created using the new name. Any override values assigned under the old variable name for the dataset must be re-applied. Instances of the variable and override values used in other imported datasets remain unchanged.

For more information, see Run Job Page.

Scheduling jobs on datasets with parameters

You can schedule jobs for datasets with parameters.

Note

When a job is executed, the expected time of execution is used during execution. For scheduled jobs, this value is the scheduled time. For example, if a job scheduled for 08:00 begins execution at 08:05, any parameters that reference "now" time use 08:00 during the job run.

For a scheduled job:

  • Parameter values are evaluated based on the scheduled time of execution. Relative times are evaluated based on the scheduled time of execution.

  • If there are interruptions in service due to maintenance windows or other reasons, scheduled jobs are queued for execution on restart. These queued jobs are attempted only once.

See Schedule a Job.

Parameters in Job Details

In the Job Details page:

  • Data sources tab: For file-based parameterized datasets, you can review the files that were matched at runtime for the specified parameters.

  • Parameters tab: View the parameter names and values that were used as part of the job, including the list of matching datasets.

See Job Details Page.

Operationalization with Parameters

APIs

Through the API, you can apply runtime parameters to datasets with parameters during job execution. For more information, see Dataprep by Trifacta: API Reference docsFor more information on working with parameters and the APIs, see API Task - Run Job on Dataset with Parameters.

Use of parameters to create imported datasets through the API is not supported.

For other parameter types, you can apply overrides as key-value pairs in the API request to execute a new job. See API Task - Run Job.