Configure for Hadoop
The Designer Cloud Powered by Trifacta platform supports integration with a number of Hadoop distributions, using a range of components within each distribution. This page provides information on the set of configuration tasks that you need to complete to integrate the platform with your Hadoop environment.
Before You Begin
Key deployment considerations
Hadoop cluster: The Hadoop cluster should already be installed and operational. As part of the install preparation, you should have prepared the Hadoop platform for integration with the Designer Cloud Powered by Trifacta platform. See Prepare Hadoop for Integration with the Platform.
For more information on the components supported in your Hadoop distribution, see Install Reference.
Storage: on-premises, cloud, or hybrid.
The Designer Cloud Powered by Trifacta platform can interact with storage that is in the local environment, in the cloud, or in some combination. How your storage is deployed affects your configuration scenarios. See Storage Deployment Options.
Base storage layer: You must configure one storage platform to be the base storage layer. Details are described later.
Note
Some deployments require that you select a specific base storage layer.
Warning
After you have defined the base storage layer, it cannot be changed. Please review your Storage Deployment Options carefully. The required configuration is described later.
Hadoop versions
The Designer Cloud Powered by Trifacta platform supports integration only with the versions of Hadoop that are supported for your version of the platform.
Note
The versions of your Hadoop software and the libraries in use by the Designer Cloud Powered by Trifacta platform must match. Unless specifically directed by Alteryx Support, integration with your Hadoop cluster using a set of Hadoop libraries from a different version of Hadoop is not supported.
For more information, see Product Support Matrix.
Platform configuration
After the Designer Cloud Powered by Trifacta platform and its databases have been installed, you can perform platform configuration. You can apply configuration changes through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Note
Some platform configuration is required, regardless of your deployment. See Required Platform Configuration.
Required Configuration for Hadoop
Please complete the following sections to configure the platform to work with Hadoop.
Specify Alteryx user
Note
Where possible, you should define or select a user with a userID value greater than 1000. In some environments, lower userID values can result in failures when running jobs on Hadoop.
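To check the userID of an existing account on the Trifacta node, you can use the standard id command. A minimal sketch, assuming the default trifacta account name; substitute the user you plan to use:
# Prints the numeric userID of the account; values above 1000 are preferred
id -u trifacta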
Set the Hadoop username [hadoop.user (default=trifacta)] for the Designer Cloud Powered by Trifacta platform to use for executing jobs:
"hdfs.username": [hadoop.user],
If the Alteryx software is installed in a Kerberos environment, additional steps are required, which are described later.
Data storage
The Designer Cloud Powered by Trifacta platform supports access to the following Hadoop storage layers:
HDFS
S3
Set the base storage layer
At this time, you should define the base storage layer from the platform.
The platform requires that one backend datastore be configured as the base storage layer. This base storage layer is used for storing uploaded data and writing results and profiles. Please complete the following steps to set the base storage layer for the Designer Cloud Powered by Trifacta platform.
Warning
You cannot change the base storage layer after it has been set. You must uninstall and reinstall the platform to change it.
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Locate the following parameter and set it to the value for your base storage layer:
"webapp.storageProtocol": "hdfs",
Save your changes and restart the platform.
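The hdfs value shown above selects HDFS as the base storage layer. As a hedged illustration, if your deployment uses S3 as the base storage layer, the same parameter would presumably be set to the s3 protocol value instead; verify the exact value in S3 Access in the Configuration Guide before applying it:
"webapp.storageProtocol": "s3",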
Note
To complete the integration with the base storage layer, additional configuration is required.
Required configuration for each type of storage is described below.
S3
The Designer Cloud Powered by Trifacta platform can integrate with an S3 bucket:
If you are using HDFS as the base storage layer, you can integrate with S3 for read-only access.
If you are using S3 as the base storage layer, read-write access to S3 is required.
Note
If you are integrating with S3, additional configuration is required. Instead of completing the HDFS configuration below, please enable read-write access to S3. See S3 Access in the Configuration Guide.
HDFS
If output files are to be written to an HDFS environment, you must configure the Designer Cloud Powered by Trifacta platform to interact with HDFS.
Hadoop Distributed File Service (HDFS) is a distributed file system that provides read-write access to large datasets in a Hadoop cluster. For more information, see http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
Warning
If your deployment is using HDFS, do not use the trifacta/uploads directory. This directory is used for storing uploads and metadata, which may be used by multiple users. Manipulating files outside of the Trifacta Application can destroy other users' data. Please use the tools provided through the interface for managing uploads from HDFS.
Note
Use of HDFS in safe mode is not supported.
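To confirm that the cluster is not in safe mode before integration, you can check from any node with the HDFS client installed. A minimal sketch using the standard HDFS admin command:
# Reports whether the NameNode's safe mode is ON or OFF
hdfs dfsadmin -safemode get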
Below, replace the value for [hadoop.user (default=trifacta)] with the value appropriate for your environment. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
"hdfs": { "username": "[hadoop.user]", ... "namenode": { "host": "hdfs.example.com", "port": 8080 }, },
Parameter | Description |
---|---|
username | Username in the Hadoop cluster to be used by the Designer Cloud Powered by Trifacta platform for executing jobs. |
namenode.host | Host name of namenode in the Hadoop cluster. You may reference multiple namenodes. |
namenode.port | Port to use to access the namenode. You may reference multiple namenodes. Note Default values for the port number depend on your Hadoop distribution. See System Ports in the Planning Guide. |
Individual users can configure the HDFS directory where exported results are stored.
Note
Multiple users cannot share the same home directory.
See Storage Config Page in the User Guide.
Access to HDFS is supported over one of the following protocols:
See WebHDFS below.
See HttpFS below.
WebHDFS
If you are using HDFS, it is assumed that WebHDFS has been enabled on the cluster. Apache WebHDFS enables access to an HDFS instance over HTTP REST APIs. For more information, see https://hadoop.apache.org/docs/r1.0.4/webhdfs.html.
The following properties can be modified:
"webhdfs": { ... "version": "/webhdfs/v1", "host": "", "port": 50070, "httpfs": false },
Parameter | Description |
---|---|
version | Path to locally installed version of WebHDFS. Note For |
host | Hostname for the WebHDFS service. Note If this value is not specified, then the expected host must be defined in |
port | Port number for WebHDFS. The default value is 50070. Note The default port number for SSL to WebHDFS is 50470. |
httpfs | To use HttpFS instead of WebHDFS, set this value to |
Steps:
Set webhdfs.host to be the hostname of the node that hosts WebHDFS.
Set webhdfs.port to be the port number over which WebHDFS communicates. The default value is 50070. For SSL, the default value is 50470.
Set webhdfs.httpfs to false.
For hdfs.namenodes, you must set the host and port values to point to the active namenode for WebHDFS.
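Putting the WebHDFS steps together, the relevant entries might look like the following sketch. The hostname and the namenode port are placeholders, and the exact nesting of these entries in trifacta-conf.json may differ in your release, so treat this as an assumption to verify rather than a definitive layout:
"webhdfs": {
  "version": "/webhdfs/v1",
  "host": "hdfs.example.com",
  "port": 50070,
  "httpfs": false
},
"namenodes": [
  {
    "host": "hdfs.example.com",
    "port": 8020
  }
],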
HttpFS
You can configure the Designer Cloud Powered by Trifacta platform to use the HttpFS service to communicate with HDFS, in addition to WebHDFS.
Note
HttpFS serves as a proxy to WebHDFS. When HttpFS is enabled, both services are required.
In some cases, HttpFS is required:
High availability failover is enabled for HDFS on the cluster.
Your secured HDFS user account has access restrictions.
If your environment meets any of the above requirements, you must enable HttpFS. For more information, see Enable HttpFS in the Configuration Guide.
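As a rough sketch of the direction this takes, enabling HttpFS generally means pointing the webhdfs block at the HttpFS service and setting the httpfs flag to true. The hostname is a placeholder and 14000 is the stock Apache HttpFS port, both assumptions here; follow Enable HttpFS in the Configuration Guide for the authoritative settings:
"webhdfs": {
  ...
  "host": "httpfs.example.com",
  "port": 14000,
  "httpfs": true
},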
Configure ResourceManager settings
Configure the following:
"yarn.resourcemanager.host": "hadoop", "yarn.resourcemanager.port": 8032,
Note
Do not modify the other host/port settings unless you have specific information requiring the modifications.
For more information, see System Ports in the Planning Guide.
Specify distribution client bundle
The Designer Cloud Powered by Trifacta platform ships with client bundles supporting a number of major Hadoop distributions. You must configure the jarfile for the distribution to use. These client bundles are stored in the following directory:
/opt/trifacta/hadoop-deps
Configure the bundle distribution property (hadoopBundleJar) in platform configuration. Examples:
Hadoop Distribution | hadoopBundleJar Value |
---|---|
Cloudera | |
Cloudera Data Platform | |
where x.y is the major-minor build number (e.g. 5.4).
Note
The path must be specified relative to the install directory.
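As a hypothetical illustration only, a Cloudera entry typically points to the bundle jar beneath the hadoop-deps directory; the path below is an assumption, so confirm the actual jar name under /opt/trifacta/hadoop-deps for your release:
"hadoopBundleJar": "hadoop-deps/cdh-x.y/build/libs/cdh-x.y-bundle.jar",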
Tip
If there is no bundle for the distribution you need, you might try the one that is the closest match in terms of Apache Hadoop baseline. For example, CDH5 is based on Apache 2.3.0, so that client bundle will probably run ok against a vanilla Apache Hadoop 2.3.0 installation. For more information, see Alteryx Support.
Cloudera distribution
Some additional configuration is required. See Configure for Cloudera in the Configuration Guide.
Default Hadoop job results format
For smaller datasets, the platform recommends using the Trifacta Photon running environment.
For larger datasets, or if size information is unavailable, the platform by default recommends that you run the job on the Hadoop cluster. For these jobs, the default publishing action for the job is specified to run on the Hadoop cluster, generating the output format defined by this parameter. Publishing actions, including output format, can always be changed as part of the job specification.
As needed, you can change this default format. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
"webapp.defaultHadoopFileFormat": "csv",
Accepted values: csv, json, avro, pqt
For more information, see Run Job Page in the User Guide.
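For example, to make Parquet the default output format for Hadoop jobs, you would set the parameter to the pqt value listed above. This is a minimal sketch; apply it through the same configuration methods described above:
"webapp.defaultHadoopFileFormat": "pqt",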
Additional Configuration for Hadoop
Authentication
Kerberos
The Designer Cloud Powered by Trifacta platform supports integration with Kerberos security. The platform can utilize Kerberos' secure impersonation to broker interactions with the Hadoop environment.
Single Sign-On
The Designer Cloud Powered by Trifacta platform can integrate with your SSO platform to manage authentication to the Trifacta Application. See Configure SSO for AD-LDAP.
Hadoop KMS
If you are using Hadoop KMS to encrypt data transfers to and from the Hadoop cluster, additional configuration is required. See Configure for KMS.
Hive access
Apache Hive is a data warehouse service for querying and managing large datasets in a Hadoop environment using a SQL-like querying language. For more information, see https://hive.apache.org/.
See Configure for Hive.
High availability environment
You can integrate the platform with the Hadoop cluster's high availability configuration, so that the Designer Cloud Powered by Trifacta platform can match the failover configuration for the cluster.
Note
If you are deploying high availability failover, you must use HttpFS, instead of WebHDFS, for communicating with HDFS, which is described in a previous section.
For more information, see Enable Integration with Cluster High Availability.
After you have performed the base installation of the Designer Cloud Powered by Trifacta platform, please complete the following steps if you are integrating with a Hadoop cluster.
Apply cluster configuration files via symlink
If the Designer Cloud Powered by Trifacta platform is being installed on an edge node of the cluster, you can create a symlink from a local directory to the source cluster files so that they are automatically updated as needed.
Navigate to the following directory on the Trifacta node:
cd /opt/trifacta/conf/hadoop-site
Create a symlink for each of the Hadoop Client Configuration files referenced in the previous steps. Example:
ln -s /etc/hadoop/conf/core-site.xml core-site.xml
Repeat the above steps for each of the Hadoop Client Configuration files.
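As a sketch of the repeated step, the following commands link a typical set of Hadoop client configuration files. The file list is illustrative, not definitive; link whichever client files your distribution and services actually provide under /etc/hadoop/conf:
ln -s /etc/hadoop/conf/hdfs-site.xml hdfs-site.xml
ln -s /etc/hadoop/conf/yarn-site.xml yarn-site.xml
ln -s /etc/hadoop/conf/mapred-site.xml mapred-site.xml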
Apply Alteryx configuration changes
To apply this configuration change, log in as an administrator to the Trifacta node. Then, edit trifacta-conf.json. For more information, see Platform Configuration Methods.
HDFS: Change the host and port information for HDFS as needed. Please apply the port numbers for your distribution:
"hdfs.namenode.host": "<namenode>", "hdfs.namenode.port": <hdfs_port_num> "hdfs.yarn.resourcemanager": { "hdfs.yarn.webappPort": 8088, "hdfs.yarn.adminPort": 8033, "hdfs.yarn.host": "<resourcemanager_host>", "hdfs.yarn.port": <resourcemanager_port>, "hdfs.yarn.schedulerPort": 8030
Save your changes and restart the platform.
Configure Snappy publication
If you are publishing using Snappy compression, you may need to perform the following additional configuration.
Steps:
Verify that the snappy and snappy-devel packages have been installed on the Trifacta node. For more information, see https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/NativeLibraries.html.
From the Trifacta node, execute the following command:
hadoop checknative
The above command identifies where the native libraries are located on the Trifacta node.
Cloudera:
On the cluster, locate the libsnappy.so file. Verify that this file has been installed on all nodes of the cluster, including the Trifacta node. Retain the path to the file on the Trifacta node.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
Locate the spark.props configuration block. Insert the following properties and values inside the block (a filled-in sketch follows these steps):
"spark.driver.extraLibraryPath": "/path/to/file",
"spark.executor.extraLibraryPath": "/path/to/file",
Save your changes and restart the platform.
Verify that the /tmp directory has the proper permissions for publication. For more information, see Supported File Formats.
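Tying the Snappy configuration together, the spark.props block might end up looking like the following. This is a hedged sketch: the native library directory shown is a common Cloudera parcel location and is only an assumption, so use the path reported by hadoop checknative on your Trifacta node:
"spark.props": {
  ...
  "spark.driver.extraLibraryPath": "/opt/cloudera/parcels/CDH/lib/hadoop/lib/native",
  "spark.executor.extraLibraryPath": "/opt/cloudera/parcels/CDH/lib/hadoop/lib/native"
},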
Debugging
You can review system services and download log files through the Trifacta Application.