By default, Microsoft Azure deployments integrate with Azure Data Lake Store (ADLS). Optionally, you can configure your deployment to integrate with WASB.
Microsoft Azure Data Lake Store (ADLS Gen1) is a scalable repository for big data analytics.
ADLS Gen1 is accessible from Azure Databricks.
For more information, see https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview.
For more information on the newer version of ADLS, see ADLS Gen2 Access.
Supported Environments:
Operation | Designer Cloud Powered by Trifacta Enterprise Edition | Amazon | Microsoft Azure |
---|---|---|---|
Read | Not supported | Not supported | Supported |
Write | Not supported | Not supported | Supported (only if ADLS Gen1 is base storage layer) |
A single public connection to ADLS Gen1 is supported.
In this release, the Designer Cloud Powered by Trifacta platform supports integration with the default store only. Extra stores are not supported.
If the base storage layer has been set to WASB, you can follow these instructions to set up read-only access to ADLS Gen1.
Note
To enable read-only access to ADLS Gen1, do not set the base storage layer to adl. The base storage layer for ADLS read-only access must remain wasbs.
The Designer Cloud Powered by Trifacta platform has already been installed and integrated with an Azure Databricks cluster. See Configure for Azure Databricks.
ADL must be set as the base storage layer for the Designer Cloud Powered by Trifacta platform instance. See Set Base Storage Layer.
Before you integrate with Azure ADLS Gen1, you must create the Designer Cloud Powered by Trifacta platform as a registered application. See Configure for Azure.
The following properties should already be specified in the Admin Settings page. Please verify that they have been set (a sample listing appears after these prerequisites):
azure.applicationId
azure.secret
azure.directoryId
The above properties are needed for this configuration. For more information, see Configure for Azure.
An Azure Key Vault has already been set up and configured for use by the Designer Cloud Powered by Trifacta platform. For more information, see Configure for Azure.
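For reference, the three azure properties listed in the prerequisites might appear as follows in the Admin Settings page. This is a minimal sketch; the values are placeholders for the identifiers and secret of your registered application:

"azure.applicationId": "<registered_application_id>",
"azure.directoryId": "<azure_ad_directory_id>",
"azure.secret": "<registered_application_secret>",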
Authentication to ADLS storage is supported for the following modes, which are described below.
Mode | Description |
---|---|
System | All users authenticate to ADLS using a single system key/secret combination. This combination is specified in the azure.applicationId, azure.secret, and azure.directoryId parameters, which you should have already defined. These properties define the registered application in Azure Active Directory. System authentication mode uses the registered application identifier as the service principal for authentication to ADLS Gen1. All users have the same permissions in ADLS Gen1. For more information on these settings, see Configure for Azure. |
User | Per-user mode allows individual users to authenticate to ADLS Gen1 through their Azure Active Directory login. Note: Additional configuration for AD SSO is required. Details are below. |
Please complete the following steps to specify the ADLS Gen1 access mode.

Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Set the following parameter to the preferred mode (system or user):

   "azure.adl.mode": "<your_preferred_mode>",

3. Save your changes.
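For example, to have all users authenticate to ADLS Gen1 through the registered application's service principal, you would set the parameter to system mode (a minimal sketch of the single setting from the steps above):

"azure.adl.mode": "system",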
When access to ADLS is requested, the platform uses the combination of Azure directory ID, Azure application ID, and Azure secret to complete access.
After defining the properties in the Designer Cloud Powered by Trifacta platform, system mode access requires no additional configuration.
In user mode, a user ID hash is generated from the Key Vault key/secret and the user's AD login. This hash is used to generate the access token, which is stored in the Key Vault.
Note
User mode access to ADLS requires Single Sign On (SSO) to be enabled for integration with Azure Active Directory. For more information, see Configure SSO for Azure AD.
You must configure the platform to use the ADL storage protocol when accessing ADLS Gen1.
Note
Per earlier configuration, the base storage layer must be set to adl for read/write access to ADLS. See Set Base Storage Layer.
Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Locate the following parameter and change its value to adl:

   "webapp.storageProtocol": "adl",

3. Set the following parameter to false:

   "hdfs.enabled": false,

4. Save your changes and restart the platform.
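After completing these steps, the two settings from this procedure should read as follows (a consolidated view of the values set above, shown for reference):

"webapp.storageProtocol": "adl",
"hdfs.enabled": false,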
You must define the base storage location and supported protocol for storing data on ADLS.
Note
You can specify only one storage location for ADLS.
Steps:

1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Locate the following configuration block. Specify the listed changes:

   "fileStorage": {
     "defaultBaseUris": [
       "<baseURIOfYourLocation>"
     ],
     "whitelist": ["adl"]
   }

   Parameter | Description |
   ---|---|
   defaultBaseUris | For each supported protocol, this array must contain a top-level path to the location where Designer Cloud Powered by Trifacta platform files can be stored. These files include uploads, samples, and temporary storage used during job execution. Note: The adl:// protocol identifier must be included. Example value: adl://<YOUR_STORE_NAME>.azuredatalakestore.net |
   whitelist | A comma-separated list of protocols that are permitted to read and write with ADLS storage. Note: This array of values must include adl. |

3. Save your changes and restart the platform.
Restart services. See Start and Stop the Platform.
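For example, a completed fileStorage block for a hypothetical ADLS Gen1 store named examplestore might look like the following; the store name is illustrative only, so substitute the name of your own store:

"fileStorage": {
  "defaultBaseUris": [
    "adl://examplestore.azuredatalakestore.net"
  ],
  "whitelist": ["adl"]
}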
After the configuration has been specified, an ADLS connection appears in the Import Data page. Select it to begin navigating for data sources.
Specify ADLS Gen1 Path:
In the ADLS Gen1 browser, you can specify an explicit path to resources. Click the Pencil icon, paste the path value, and click Go.
For example, suppose your source is stored in the following location:
/trifacta/input/username@example.com
You should paste the following in the Path textbox:
hdfs://trifacta/input/username@example.com
Note
When inserting values directly into the Path textbox, you must use the hdfs:// protocol identifier. Do not use the adl:// protocol identifier.
Tip
You can retrieve your home directory from your profile. See Storage Config Page.
Try running a simple job from the Trifacta Application. For more information, see Verify Operations.
For additional troubleshooting information, see ADLS Gen2 Access.
The Designer Cloud Powered by Trifacta platform can use ADLS for the following reading and writing tasks:
Creating Datasets from ADLS Files: You can read in from a data source stored in ADLS. A source may be a single ADLS file or a folder of identically structured files. See Reading from Sources in ADLS below.
Reading Datasets: When creating a dataset, you can pull your data from another dataset defined in ADLS. See Creating Datasets below.
Writing Job Results: After a job has been executed, you can write the results back to ADLS. See Writing Job Results below.
In the Trifacta Application, ADLS is accessed through the ADLS browser.
Note
When the Designer Cloud Powered by Trifacta platform executes a job on a dataset, the source data is untouched. Results are written to a new location, so that no data is disturbed by the process.
Read/Write Access: Your cluster administrator must configure read/write permissions to locations in ADLS. Please see the ADLS documentation.
Warning
Avoid using /trifacta/uploads for reading and writing data. This directory is used by the Trifacta Application.

Your cluster administrator should provide a place or mechanism for raw data to be uploaded to your datastore.
Your cluster administrator should provide a writeable home output directory for you, which you can review. See Storage Config Page.
Depending on the security features you've enabled, the technical methods by which Alteryx users access ADLS may vary. For more information, see ADLS Gen1 Access.
Your cluster administrator should provide raw data or locations and access for storing raw data within ADLS. All Alteryx users should have a clear understanding of the folder structure within ADLS where each individual can read from and write their job results.
Users should know where shared data is located and where personal data can be saved without interfering with or confusing other users.
Note
The Designer Cloud Powered by Trifacta platform does not modify source data in ADLS. Sources stored in ADLS are read without modification from their source locations, and sources that are uploaded to the platform are stored in /trifacta/uploads.
You can create a dataset from one or more files stored in ADLS.
Wildcards:
You can parameterize your input paths to import source files as part of the same imported dataset. For more information, see Overview of Parameterization.
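For example, assuming your sources are stored under a hypothetical directory such as /trifacta/input/username@example.com, a parameterized path using a wildcard might look like the following, which would import all matching files as a single dataset:

/trifacta/input/username@example.com/sales-*.csv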
Folder selection:
When you select a folder in ADLS to create your dataset, you select all files in the folder to be included. Notes:
This option selects all files in all sub-folders. If your sub-folders contain separate datasets, you should be more specific in your folder selection.
All files used in a single dataset must be of the same format and have the same structure. For example, you cannot mix and match CSV and JSON files if you are reading from a single directory.
When a folder is selected from ADLS, the following file types are ignored:
*_SUCCESS and *_FAILED files, which may be present if the folder has been populated by the running environment.
If you have stored files in ADLS that begin with an underscore (_), these files cannot be read during batch transformation and are ignored. Please rename these files through ADLS so that they do not begin with an underscore.
When creating a dataset, you can choose to read data in from a source stored in ADLS or from a local file.
ADLS sources are not moved or changed.
Local file sources are uploaded to /trifacta/uploads, where they remain and are not changed.
Data may be individual files or all of the files in a folder. For more information, see Reading from Sources in ADLS above.
In the Import Data page, click the ADLS tab. See Import Data Page.
When your job results are generated, they can be stored back in ADLS for you at the location defined for your user account.
The ADLS location is available through the Publishing dialog in the Output Destinations tab of the Job Details page. See Publishing Dialog.
Each set of job results must be stored in a separate folder within your ADLS output home directory.
For more information on your output home directory, see Storage Config Page.
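For example, assuming a hypothetical output home directory of /trifacta/queryResults/username@example.com, the results of two different jobs might be organized into separate folders such as the following (folder names are illustrative only):

/trifacta/queryResults/username@example.com/cleaned-orders/
/trifacta/queryResults/username@example.com/weekly-report/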
Warning
If your deployment is using ADLS, do not use the /trifacta/uploads directory. This directory is used for storing uploads and metadata, which may be used by multiple users. Manipulating files outside of the Trifacta Application can destroy other users' data. Please use the tools provided through the interface for managing uploads from ADLS.
Users can specify a default output home directory and, during job execution, an output directory for the current job.
Access to results:
Depending on how the platform is integrated with ADLS, other users may or may not be able to access your job results.
If user mode is enabled, results are written to ADLS through the ADLS account configured for your use. Depending on the permissions of your ADLS account, you may be the only person who can access these results.
If user mode is not enabled, then each Alteryx user writes results to ADLS using a shared account. Depending on the permissions of that account, your results may be visible to all platform users.
As part of writing job results, you can choose to create a new dataset, so that you can chain together data wrangling tasks.
Note
When you create a new dataset as part of your job results, the file or files are written to the designated output location for your user account. Depending on how your cluster permissions are configured, this location may not be accessible to other users.
Supported Versions: n/a
Create New Connection: n/a