S3 Access
Below are instructions on how to configure Designer Cloud Powered by Trifacta Enterprise Edition to point to S3.
Amazon Simple Storage Service (S3) is an online data storage service that provides low-latency access through web services. For more information, see https://aws.amazon.com/s3/.
Base Storage Layer
If the base storage layer is S3, you can enable read/write access to S3.
If the base storage layer is not S3, you can enable read-only access to S3.
Limitations
The Designer Cloud Powered by Trifacta platform only supports running S3-enabled instances over AWS.
Access to AWS S3 Regional Endpoints through internet protocol is required. If the machine hosting the Designer Cloud Powered by Trifacta platform is in a VPC with no internet access, a VPC endpoint enabled for S3 services is required. The Designer Cloud Powered by Trifacta platform does not support access to S3 through a proxy server.
Write access requires using S3 as the base storage layer. See Set Base Storage Layer.
Note
Spark 2.3.0 jobs may fail on S3-based datasets due to a known incompatibility. For details, see https://github.com/apache/incubator-druid/issues/4456.
If you encounter this issue, set `spark.version` to `2.1.0` in platform configuration. For more information, see Admin Settings Page.
Prerequisites
On the Trifacta node, you must install the Oracle Java Runtime Environment for Java 8. Other versions of the JRE are not supported. For more information on the JRE, see http://www.oracle.com/technetwork/java/javase/downloads/index.html.
If an IAM instance role is used for S3 access, it must have access to resources at the bucket level.
Required AWS Account Permissions
For more information, see Required AWS Account Permissions.
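For illustration only, the following is a minimal IAM policy sketch granting bucket-level read access of the kind described above. The bucket name is a placeholder, and the exact set of actions your deployment requires is defined on the Required AWS Account Permissions page, not here:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListBuckets",
      "Effect": "Allow",
      "Action": ["s3:ListAllMyBuckets", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::*"
    },
    {
      "Sid": "ReadBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::<YOUR_BUCKET>",
        "arn:aws:s3:::<YOUR_BUCKET>/*"
      ]
    }
  ]
}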
Configuration
Depending on your S3 environment, you can define:
- read access to S3
- access to additional S3 buckets
- S3 as the base storage layer
- write access to S3
- the S3 bucket that is the default write destination
Define base storage layer
The base storage layer is the default platform for storing results.
Required for:
- write access to S3
- connectivity to Redshift
Warning
The base storage layer for your Alteryx instance is defined during initial installation and cannot be changed afterward.
If S3 is the base storage layer, you must also define the default storage bucket to use during initial installation, which cannot be changed at a later time. See Define default S3 Write bucket below.
For more information on the various options for storage, see Storage Deployment Options.
For more information on setting the base storage layer, see Set Base Storage Layer.
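As a reference sketch only: the base storage layer is controlled by a single platform setting, which might look like the following in `trifacta-conf.json`. The parameter name is an assumption drawn from the Set Base Storage Layer page and should be verified there:

"webapp.storageProtocol": "s3",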
Enable read access to S3
When read access is enabled, Alteryx users can explore S3 buckets for creating datasets.
Note
When read access is enabled, Alteryx users have automatic access to all buckets to which the specified S3 user has access. You may want to create a specific user account for S3 access.
Note
Data that is mirrored from one S3 bucket to another might inherit the permissions from the bucket where it is owned.
Steps:
1. You apply this change through the Workspace Settings Page. For more information, see Platform Configuration Methods.
2. Set the following property to `enabled`: Enable S3 Connectivity
3. Save your changes.
4. In the S3 configuration section, set `enabled=true`, which allows Alteryx users to browse S3 buckets through the Trifacta Application.
5. Specify the AWS `key` and `secret` values for the user to access S3 storage.
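A minimal sketch of how the resulting S3 configuration section might look in `trifacta-conf.json`; the nesting shown here is illustrative, and the key and secret values are placeholders:

"aws": {
  "s3": {
    "enabled": true,
    "key": "<AWS_KEY>",
    "secret": "<AWS_SECRET>"
  }
}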
Configure file storage protocols and locations
The Designer Cloud Powered by Trifacta platform must be provided the list of protocols and locations for accessing S3.
Steps:
1. You can apply this change through the Admin Settings Page (recommended) or `trifacta-conf.json`. For more information, see Platform Configuration Methods.
2. Locate the following parameters and set their values according to the table below:

   "fileStorage.whitelist": ["s3"],
   "fileStorage.defaultBaseUris": ["s3:///"],
Parameter | Description
---|---
`fileStorage.whitelist` | A comma-separated list of protocols that are permitted to access S3. Note: The protocol identifier `"s3"` must be included in this list.
`fileStorage.defaultBaseUris` | For each supported protocol, this parameter must contain a top-level path to the location where Designer Cloud Powered by Trifacta platform files can be stored. These files include uploads, samples, and temporary storage used during job execution. Note: A separate base URI is required for each supported protocol, and you may have only one base URI per protocol.

Note
For S3, three slashes at the end are required; the third slash terminates the (empty) path value. This value is used as the base URI for all S3 connections created in Designer Cloud Powered by Trifacta Enterprise Edition.

Example: `s3:///`

This is the most common value, as it serves as the base URI for all S3 connections that you create. Do not add a bucket name to this URI.
Save your changes and restart the platform.
Java VFS service
Use of SFTP connections requires the Java VFS service in the Designer Cloud Powered by Trifacta platform.
Note
This service is enabled by default.
For more information on configuring this service, see Configure Java VFS Service.
S3 access modes
The Designer Cloud Powered by Trifacta platform supports the following modes for accessing S3. You must choose one access mode and then complete the related configuration steps.
Note
Avoid switching between user mode and system mode, which can disable user access to data. Choose your preferred mode at install time.
System mode
(default) Access to S3 buckets is enabled and defined for all users of the platform. All users use the same AWS access key, secret, and default bucket.
System mode - read-only access
For read-only access, the key, secret, and default bucket must be specified in configuration.
Note
Verify that the AWS account has all required permissions to access the S3 buckets in use. The account must include the s3:ListAllMyBuckets permission.
Steps:
1. You can apply this change through the Admin Settings Page (recommended) or `trifacta-conf.json`. For more information, see Platform Configuration Methods.
2. Locate the following parameters and set them accordingly:

Parameter | Description
---|---
`aws.s3.key` | Set this value to the AWS key to use to access S3.
`aws.s3.secret` | Set this value to the secret corresponding to the AWS key provided.
`aws.s3.bucket.name` | Set this value to the name of the S3 bucket from which users may read data. Note: Bucket names are not validated. Additional buckets may be specified; see below.

3. Save your changes.
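For example, a read-only system mode configuration in `trifacta-conf.json` might look like the following sketch; all three values are placeholders:

"aws.s3.key": "<AWS_KEY>",
"aws.s3.secret": "<AWS_SECRET>",
"aws.s3.bucket.name": "<READ_BUCKET_NAME>",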
User mode
Optionally, access to S3 can be defined on a per-user basis. This mode allows administrators to define access to specific buckets using various key/secret combinations as a means of controlling permissions.
Note
When this mode is enabled, individual users must have AWS configuration settings applied to their account, either by an administrator or by themselves. The global settings in this section do not apply in this mode.
To enable:
1. You apply this change through the Workspace Settings Page. For more information, see Platform Configuration Methods.
2. Verify that the following setting has been set to `enabled`: Enable S3 Connectivity
3. You can apply this change through the Admin Settings Page (recommended) or `trifacta-conf.json`. For more information, see Platform Configuration Methods.
4. Verify that the following setting has been configured:

   "aws.mode": "user",
Additional configuration is required for per-user authentication.
You can choose to enable session tags to leverage your existing S3 permission scheme.
For more information, see Configure AWS Per-User Authentication.
Note
If you have enabled user mode for S3 access, you must create and deploy an encryption key file. For more information, see Create Encryption Key File.
Note
If you have enabled user access mode, you can skip the following sections, which pertain to system access mode, and jump to the Create Redshift Connection section below.
System mode - additional configuration
The following sections apply only to system access mode.
Define default S3 write bucket
When S3 is defined as the base storage layer, write access to S3 is enabled. The Designer Cloud Powered by Trifacta platform attempts to store outputs in the designated default S3 bucket.
Note
This bucket must be set during initial installation. Modifying it at a later time is not recommended and can result in inaccessible data in the platform.
Note
Bucket names cannot have underscores in them. See http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html.
Steps:
1. Define S3 to be the base storage layer. See Set Base Storage Layer.
2. Enable read access.
3. Specify a value for `aws.s3.bucket.name`, which defines the S3 bucket where data is written. Do not include a protocol identifier. For example, if your bucket address is `s3://MyOutputBucket`, the value to specify is the following:

   MyOutputBucket

Note
Bucket names are not validated.
Note
Specify the top-level bucket name only. There should not be any backslashes in your entry.
Note
This bucket also appears as a read-access bucket if the specified S3 user has access.
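As a sketch, the corresponding entry in `trifacta-conf.json` for the example above would be the following; the bucket name shown is the example value, so substitute your own:

"aws.s3.bucket.name": "MyOutputBucket",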
Adding additional S3 buckets
When read access is enabled, all S3 buckets owned by the specified user appear in the Trifacta Application. You can also add additional S3 buckets from which to read.
Note
Additional buckets are accessible only if the specified S3 user has read privileges.
Note
Bucket names cannot have underscores in them.
Steps:
1. You can apply this change through the Admin Settings Page (recommended) or `trifacta-conf.json`. For more information, see Platform Configuration Methods.
2. Locate the following parameter: `aws.s3.extraBuckets`
3. In the Admin Settings Page, specify the extra buckets as a comma-separated string of additional S3 buckets that are available for storage. Do not put any quotes around the string. Whitespace between string values is ignored.
4. In `trifacta-conf.json`, specify the `extraBuckets` array as a comma-separated list of buckets, as in the following:

   "extraBuckets": ["MyExtraBucket01","MyExtraBucket02","MyExtraBucket03"]
Note
Specify the top-level bucket name only. There should not be any backslashes in your entry.
Note
Bucket names are not validated.
These values are mapped to the following bucket addresses:
s3://MyExtraBucket01
s3://MyExtraBucket02
s3://MyExtraBucket03
S3 Configuration
Configuration reference
You apply this change through the Workspace Settings Page. For more information, see Platform Configuration Methods.
Setting | Description
---|---
Enable S3 Connectivity | When set to `enabled`, S3 connectivity is enabled, and Alteryx users can browse S3 buckets through the Trifacta Application.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json
. For more information, see Platform Configuration Methods.
"aws.s3.bucket.name": "<BUCKET_FOR_OUTPUT_IF_WRITING_TO_S3>" "aws.s3.key": "<AWS_KEY>", "aws.s3.secret": "<AWS_SECRET>", "aws.s3.extraBuckets": ["<ADDITIONAL_BUCKETS_TO_SHOW_IN_FILE_BROWSER>"]
Setting | Description
---|---
`bucket.name` | Set this value to the name of the S3 bucket to which you are writing.
`key` | Access Key ID for the AWS account to use. Note: This value cannot contain a slash (`/`).
`secret` | Secret Access Key for the AWS account.
`extraBuckets` | Add references to any additional S3 buckets to this comma-separated array of values. The S3 user must have read access to these buckets.
Enable use of server-side encryption
You can configure the Designer Cloud Powered by Trifacta platform to publish data on S3 when a server-side encryption policy is enabled. SSE-S3 and SSE-KMS methods are supported. For more information, see http://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html.
Notes:
When encryption is enabled, all buckets to which you are writing must share the same encryption policy. Read operations are unaffected.
To enable, please specify the following parameters.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json
. For more information, see Platform Configuration Methods.
Server-side encryption method
"aws.s3.serverSideEncryption": "none",
Set this value to the method of encryption used by the S3 server. Lowercase values are required. Supported values:
- `sse-s3`
- `sse-kms`
- `none`
Server-side KMS key identifier
When KMS encryption is enabled, you must specify the AWS KMS key ID to use for the server-side encryption.
"aws.s3.serverSideKmsKeyId": "",
Notes:
- Access to the key:
  - Access must be provided to the authenticating user.
  - The AWS IAM role must be assigned to this key.
- Encrypt/Decrypt permissions for the specified KMS key ID:
  - These permissions must be assigned to the authenticating user.
  - The AWS IAM role must be given these permissions.

For more information, see https://docs.aws.amazon.com/kms/latest/developerguide/key-policy-modifying.html.
The format for referencing this key is the following:
"arn:aws:kms:<regionId>:<acctId>:key/<keyId>"
You can use an AWS alias in the following formats. The format of the AWS-managed alias is the following:
"alias/aws/s3"
The format for a custom alias is the following:
"alias/<FSR>"
where `<FSR>` is the name of the alias for the entire key.
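Putting the two encryption settings together, an SSE-KMS configuration might look like the following sketch; the region, account, and key identifiers are placeholders:

"aws.s3.serverSideEncryption": "sse-kms",
"aws.s3.serverSideKmsKeyId": "arn:aws:kms:<regionId>:<acctId>:key/<keyId>",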
Save your changes and restart the platform.
Configure S3 filewriter
The following configuration can be applied through the Hadoop `site-config.xml` file. If your installation does not have a copy of this file, you can insert the properties listed in the steps below into `trifacta-conf.json` to configure the behavior of the S3 filewriter.
Steps:
1. You can apply this change through the Admin Settings Page (recommended) or `trifacta-conf.json`. For more information, see Platform Configuration Methods.
2. Locate the `filewriter.hadoopConfig` block, where you can insert the following Hadoop configuration properties:

   "filewriter": {
     "max": 16,
     "hadoopConfig": {
       "fs.s3a.buffer.dir": "/tmp",
       "fs.s3a.fast.upload": "true"
     },
     ...
   }
Property | Description
---|---
`fs.s3a.buffer.dir` | Specifies the temporary directory on the Trifacta node to use for buffering when uploading to S3. If `fs.s3a.fast.upload` is set to `false`, this parameter is unused. Note: This directory must be accessible to the Batch Job Runner process during job execution.
`fs.s3a.fast.upload` | Set to `true` to enable buffering in blocks. When set to `false`, buffering in blocks is disabled, and for a given file, the entire object is buffered to the disk of the Trifacta node. Depending on the size and volume of your datasets, the node can run out of disk space.

3. Save your changes and restart the platform.
Create Redshift Connection
For more information, see Amazon Redshift Connections.
Create Additional S3 Connections
Creating additional S3 connections is not required. After you define S3 as your base storage layer, you can create user-specific access to S3 buckets through the Trifacta Application. For more information, see External S3 Connections.
Additional Configuration for S3
The following parameters can be configured through the Designer Cloud Powered by Trifacta platform to affect the integration with S3. You may or may not need to modify these values for your deployment.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json
. For more information, see Platform Configuration Methods.
Parameter | Description
---|---
`aws.s3.endpoint` | Set this value to the S3 endpoint DNS name. Do not include the protocol identifier. Example value: `s3.us-east-1.amazonaws.com`. If your S3 deployment requires a non-default endpoint, you can specify this setting to point to the S3 endpoint for Java/Spark services. For more information on endpoint locations, see https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region.
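For example, using the example endpoint value from the table above, the entry in `trifacta-conf.json` would look like the following sketch:

"aws.s3.endpoint": "s3.us-east-1.amazonaws.com",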
Testing
Restart services. See Start and Stop the Platform.
Try running a simple job from the Trifacta Application. For more information, see Verify Operations.
Troubleshooting
Profiling consistently fails for S3 sources of data
If you are executing visual profiles of datasets sourced from S3, you may see an error similar to the following in the batch-job-runner.log file:

01:19:52.297 [Job 3] ERROR com.trifacta.hadoopdata.joblaunch.server.BatchFileWriterWorker - BatchFileWriterException: Batch File Writer unknown error: {jobId=3, why=bound must be positive}
01:19:52.298 [Job 3] INFO com.trifacta.hadoopdata.joblaunch.server.BatchFileWriterWorker - Notifying monitor for job 3 with status code FAILURE

This issue is caused by improperly configured buffering when writing to S3. The specified local buffer directory cannot be accessed by the batch job runner process, and the job fails to write results to S3.
Solution:
You may do one of the following:
- Use a valid temp directory when buffering to S3.
- Disable buffering to disk completely.
Steps:
1. You can apply this change through the Admin Settings Page (recommended) or `trifacta-conf.json`. For more information, see Platform Configuration Methods.
2. Locate the `filewriter.hadoopConfig` block, where you can insert either of the following Hadoop configuration properties:

   "filewriter": {
     "max": 16,
     "hadoopConfig": {
       "fs.s3a.buffer.dir": "/tmp",
       "fs.s3a.fast.upload": "false"
     },
     ...
   }

Property | Description
---|---
`fs.s3a.buffer.dir` | Specifies the temporary directory on the Trifacta node to use for buffering when uploading to S3. If `fs.s3a.fast.upload` is set to `false`, this parameter is unused.
`fs.s3a.fast.upload` | When set to `false`, buffering is disabled.

3. Save your changes and restart the platform.
Spark local directory has no space
During execution of a Spark job, you may encounter the following error:
org.apache.hadoop.util.DiskChecker$DiskErrorException: No space available in any of the local directories.
Solution:
Restart Alteryx services, which may free up some temporary space.
Use the steps in the preceding solution to reassign a temporary directory for Spark to use (`fs.s3a.buffer.dir`).