Skip to main content

Join Data

You can join together data based on the presence of one or more keys in your source dataset and the joined-in dataset or recipe.

A join is a standard data operation that brings together multiple datasets based on one or more fields that they have in common to render a new dataset.

  • Two datasets are compared for joining based on one or more shared columns, called join keys.

  • There are multiple types of join. The simplest one to understand is the inner join. In an inner join, a value that is in the join key column of Dataset A must appear in the join key column of Dataset B.

    • If the join key value appears in both source datasets, a row in which values from the rows in both source datasets is included in the new dataset.

    • If the join key value does not appear in one of the source datasets, in an inner join, the row data from both datasets is excluded from the new dataset.

Overview

Using a Join transformation, you can join a recipe or dataset to any of the following objects:

  • Another recipe

  • An imported dataset

  • A reference dataset

Create Join

You can join datasets through the following mechanisms:

  • Flow View:

    1. To start a join, begin from Dataset A:

    2. In Flow View, right-click a dataset and select Add > Join.

    3. If your first dataset already has a recipe, you can right-click the recipe, and select Append Join.

    4. In the Transformer page, you can enter join in the Search panel.

    5. Choose dataset or recipe to join: select the recipe or dataset to which you want to join to Dataset A

      Tip

      The join you specify from Flow View is added as the last step to the recipe. If you selected a dataset to which to add the join, a recipe is created from the object, and the join is added as the first step of the new recipe.

  • Transform Builder: Search for and select Join.

Joins are created through the Join window. This task is described below.

Step - Choose dataset or recipe to join

Steps:

  1. In the Choose dataset or recipe panel:

    1. Search for a dataset or recipe to which you have access. Your search includes objects outside of the current flow.

    2. You can also select from:

      1. Recipes in your current flow

      2. Datasets in your current flow

      3. All datasets to which you have access.

  2. When you have found the dataset to use in your join, click Accept.

Step - Choose join keys and conditions

Steps:

  1. Next, you select the join key columns and other conditions from each dataset.

  2. Join type: Select the type of join to apply. See "Join types" below.

  3. Join keys:The application attempts to find the best columns to match as the join keys.

    1. Mouse over the percentage match to get more detailed statistics.

      Note

      For formatted data types, such as Datetime, the formatting of the join keys must also match. For example, the values 2021-01-01 and January 01, 2021 may not be interpreted as matching values.

      Note

    2. To change a join key, mouse over the key name and then click the Pencil icon. Select your new key.

    3. For more information on the options, see "Modify Keys and Conditions" below.

    4. Click Save & Continue.

  4. Click Next.

Example datasets

For discussion purposes, the following datasets are referenced in the sections below.

  • The CustId column is shared between both datasets. This column is the join key, as there are no matches between the other colums.

  • Some values in CustId in one dataset do not appear in the other.

Dataset A:

The first dataset to which you are joining in another is typically called the left dataset.

CustId

LastName

FirstName

c001

Jones

Jack

c002

Kim

Ken

c003

Lee

Larry

c004

Miller

Mike

c005

Dataset B:

The second dataset that you are joining in to the first is typically called the right dataset.

CustId

Region

CompanyName

c002

East

ACME, Inc.

c003

West

Trifax, Inc.

c005

North

Example Co.

c006

South

Ace Industries

Join types

There are multiple types of joins, which generate very different results. When you perform a join, you specify the type of join that is applied. The joined-together rows that appear in the output dataset are determined by the type of join that you selected and matching of values in the join key columns.

The following are the basic join types. The Example column references Dataset A (left) and Dataset B (right) from above.

Join Type

Description

Example

inner join

If a join key value appears in the left dataset and the right dataset, the joined rows are included in the output dataset.

In the above output, rows c002 and c003 are included only.

left join

In a left join, all of the rows that appear in the left dataset appear in the output, even if there is no matching join key value in the right dataset.

In the above output, rows c001, c002, c003, c004, and c005 are included.

Rows c006 is excluded.

right join

In a right join, all of the rows that appear in the right dataset appear in the output, even if there is no matching join key value in the right dataset.

In the above output, rows c002, c003, c005, and c006 are included.

Rows c001 and c004 are excluded.

outer join

An outer join combines the effects of a left and a right join. Each key value from both datasets is included in the output. If the key value is not present in one of the datasets, then null values are written into the columns from that dataset.

In the above output, rows c001, c002, c003, c004, c005, and c006 are included.

Rows c001, c004, c005, and c006 contain some null values.

cross join

A cross join matches every row in the source dataset with a row in the joined-in dataset, regardless of whether the join keys match.

Note

A cross join can greatly expand the number of rows in your dataset, which may impact performance.

If Dataset A has 5 rows and Dataset B has 4 rows, the output has 20 rows.

self join

A self join matches the rows in the left dataset with a version of itself (dataset or recipe) on the right side. Some limitations apply.

Step - Specify output columns for the join

Steps:

  1. In the Output columns step, you can specify the columns to include in the output dataset.

    1. Include All: To include all columns from the left and right datasets, click the checkbox below All.

    2. Use the Search box to search for specific columns to include or exclude.

  2. Advanced options: See below.

  3. Click Review.

Apply prefix for column names

In the output dataset, the column names are taken directly from the column names in the source dataset. Potential issues:

  • In some cases, source column names may be an exact match between datasets.

  • For development purposes, you may wish to track the source of a column for a period of time.

You can apply a prefix to the column names that are sourced from the left dataset, the right dataset, or both.

  • Name prefix for columns in Current data: Enter a text value to include as the prefix to any output columns that are sourced from the current (left) dataset. For example, you could enter left_.

  • Name prefix for columns in Joined-In data: Enter a text value to include as the prefix to any output columns that are sourced from the joined-in (right) dataset. For example, you could enter right_.

Apply dynamic updates of selected columns

In the recipe step that produces the join, the columns that you select are mentioned specifically by name. Optionally, you can choose to automatically add in all columns to your output. For example, if your source data for an imported dataset is augmented with 10 new columns, when you re-run your join, those new columns can be automatically added to the output dataset.

Tip

You should consider using these options if the schema of your data sources is likely to change in the future.

  • Include all columns from Current data: When selected, all columns that are subsequently added to the Current (left) dataset are automatically included as part of the join.

  • Include all columns from Joined-In data: When selected, all columns that are subsequently added to the Joined-In (right) dataset are automatically included as part of the join.

Step - Review join

Steps:

  1. In the Review step, you can verify that the specified join is as you expected.

  2. You should review the columns that are previewed as in the data grid.

  3. To add the join as a recipe step, click Add to Recipe.

Modify Keys and Conditions

Note

If you modify the selected dataset to join, the joined dataset, the join keys, or the fields to include in the output, subsequent steps in your transform recipe can be broken by the change. After you modify the join, you should select the last step in your recipe to validate all steps in the recipe.

You can apply the following modifications to how keys are matched. To modify a join key and condition, click the Pencil icon in the Join Keys & Conditions panel.

Ignore special characters

Optionally, you can configure the Cloud Portal to ignore the following special characters, when matching values in join keys:

  • Ignore case: Ignore differences in case between values in the join key columns. MyValue matches with MYVALUE.

  • Ignore special characters: Ignore special characters that appear in the join key values.

  • Ignore whitespace: Ignore spaces, tabs, and other whitespace values that may appear in join key values.

Create fuzzy join

A fuzzy join applies a fuzzy matching algorithm to String values in the join key column to account for slight differences in how values are written.

Note

Fuzzy joins can only be applied to String data types. Other data types cannot be fuzzy-matched using the algorithm.

This algorithm relies on the double metaphone function, which attempts to normalize text values based on how the string is spoken by an English speaker. For more information, see https://en.wikipedia.org/wiki/Metaphone#Double_Metaphone.

  • Fuzzy match: Enable fuzzy matching based on English language pronunciation using the doublemetaphone function.

Create range join

Values in the join key columns are matched across a range of values, instead of matching single value to single value. When range joins are enabled, you can set the Condition value between the two join key columns when specifying the join keys. For more information, seeConfigure Range Join.

Add multiple join keys

For more complex join operations, you can add additional join keys to evaluate. Multi-key joins can be helpful for:

  • Providing more finely specified join keys. For example, lastName and firstName.

  • Performance

To add a second join key, click Add when modifying the join keys and conditions. Specify the keys in each dataset as needed.