Amazon S3 Connections
Note
This feature may not be available in all product editions. For more information on available features, see Compare Editions.
You can create a single, global connection to your default S3 bucket through the Trifacta Application. This connection type enables workspace users to access S3.
Simple Storage Service (S3) is an online data storage service provided by Amazon, which provides low-latency access through web services. For more information, see https://aws.amazon.com/s3/.
Note
A single, global connection to S3 is supported for workspace mode only. In per-user mode, individual users must configure their own access to S3.
Trifacta Application supports S3 authentication either 1) for the entire workspace (workspace mode) or 2) for individual users (user mode).
The preferred mode of authentication can be specified by a workspace administrator in the Admin console. For more information, see AWS Account Page.
If you have configured your workspace to support user mode authentication, then individual users must supply their S3 authentication credentials through their User Profile. For more information, see Storage Page.
Tip
After you have specified a default Amazon S3 connection, you can connect to additional S3 buckets through a different connection type. For more information, see External S3 Connections.
Prerequisites
Before you begin, please verify that your Alteryx environment meets the following requirements:
Integration: Your workspace is connected to a running environment supported by your product edition.
Verify that Enable S3 Connectivity has been enabled in the Workspace Settings Page.
Information to acquire
Before you specify this connection, you should acquire the following information. For more information on the permissions required by the Alteryx Analytics Cloud, see Required AWS Account Permissions.
Authentication methods
Tip
Credentials may be available through your S3 administrator.
You must choose one of the following authentication methods and acquire the information listed for it.
IAM role: Use a cross-account (IAM) role to define the AWS resources, including S3, to which the Trifacta Application has access. For more information, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html.
Tip
When you choose to create this connection type, instructions are provided in the connection window for how to create and apply the IAM policies and roles for the connection.
Access keys: Acquire the Access Key ID and Secret Key for the S3 bucket to which you are connecting. For more information, see https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html.
S3 bucket names
Acquire the name of the default S3 bucket.
If you are planning to use this connection to connect to additional S3 buckets, you must acquire their names, too.
Encryption
If your buckets are encrypted, you must acquire the encryption method that is used.
Only one encryption method can be specified per connection.
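Once the bucket names are known, the IAM policy attached to a cross-account role can be drafted ahead of time. The following is a minimal sketch, assuming a policy that grants list, read, write, and delete access on the default bucket and any additional buckets. The function name `build_s3_policy` and the chosen action lists are illustrative assumptions; the permissions that the product actually requires are listed in Required AWS Account Permissions.

```python
import json


def build_s3_policy(default_bucket, additional_buckets=()):
    """Return an IAM policy document (as a dict) covering the listed buckets.

    The action lists below are illustrative assumptions, not the
    product's documented requirements.
    """
    buckets = [default_bucket, *additional_buckets]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": [f"arn:aws:s3:::{b}" for b in buckets],
            },
            {
                # Object-level actions apply to the keys inside each bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{b}/*" for b in buckets],
            },
        ],
    }


policy = build_s3_policy("myBucket1", ["myBucket2"])
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the AWS policy editor and adjusted to match the documented permission list before it is applied to the role.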
Limitations
After you have created this connection, it does not appear as a connection object in the Connections page.
Publishing the output to multi-part files is not supported.
Note
For some file formats, like Parquet, multi-part files are the default output.
Publishing the output using a compression option is not supported for Trifacta Photon jobs.
Note
If you need to generate an output using compression to this S3 bucket, you can run the job on another running environment.
Create Connection
You can create this S3 connection through the application.
Note
You can create a single, global connection of this type. This connection is available to all workspace users.
Steps:
Log in to the application.
In the left navigation bar, click the Connections icon.
In the Create Connection page, click the Amazon S3 card.
Authentication method
Use cross-account role: Select this option if you are using or plan to create IAM policies in an IAM role to apply to this connection.
Choose an S3 bucket: Enter the name of the default S3 bucket for this connection. See below.
Create an IAM policy: Follow the listed steps to create an IAM policy that the Alteryx Analytics Cloud can use to access S3. An IAM policy is a set of permissions applied to a set of AWS assets, such as an S3 bucket.
Create an IAM role: Follow the steps to create an IAM role to which you must apply the IAM policy.
IAM roles are assigned to AWS user accounts. Roles access policies, which grant access privileges.
Copy the IAM Role ARN (Amazon Resource Name) from the created role and paste it into the connection window.
Use access keys: Select this option if you have a key-secret combination to provide access to the S3 bucket or buckets.
AWS Access Key ID: Paste in the AWS access key.
AWS Secret Access ID: Paste in the secret identifier for the access key.
Choose an S3 bucket: Enter the name of the default S3 bucket for this connection. See below.
Default S3 bucket
When the connection is first accessed for browsing, the contents of this bucket are displayed. If this value is not provided, then the list of available buckets based on the key/secret combination is displayed when browsing through the connection.
Note
To see the list of available buckets, the connecting user must have the getBucketList permission. If that permission is not present and no default bucket is listed, then the user cannot browse S3.
Storage and encryption
Additional S3 buckets: If these credentials enable access to additional S3 buckets, you can specify them as a comma-separated list of bucket names:
myBucket1,myBucket2,myBucket3
Encryption type: If server-side encryption has been enabled on your bucket, you can select the server-side encryption policy to use when writing to the bucket. SSE-S3 and SSE-KMS methods are supported. For more information, see http://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html.
Server Side Kms key Id: When KMS encryption is enabled, you must specify the AWS KMS key ID to use for the server-side encryption. For more information, see "Server Side KMS Key Identifier" below.
Click Save.
Note
After you have created this connection, it does not appear in the Connections page. To modify this connection, select User menu > Admin console > AWS Account. See AWS Account Page.
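The bucket settings in the steps above can be summarized in code. This is a minimal sketch, not the application's implementation: `parse_additional_buckets` and `first_browse_view` are hypothetical helper names, trimming whitespace around bucket names is an assumption, and `can_list_buckets` stands in for whether the connecting user holds the bucket-listing permission mentioned in the note on the default bucket.

```python
def parse_additional_buckets(value):
    """Split the comma-separated 'Additional S3 buckets' value into names.

    Assumption: surrounding whitespace and empty entries are dropped.
    """
    return [name.strip() for name in value.split(",") if name.strip()]


def first_browse_view(default_bucket, can_list_buckets):
    """Describe what is shown when the connection is first browsed."""
    if default_bucket:
        # A default bucket always wins: its contents are displayed.
        return f"contents of {default_bucket}"
    if can_list_buckets:
        # No default bucket: fall back to the list of available buckets.
        return "list of available buckets"
    # Neither a default bucket nor the listing permission is present.
    return "cannot browse S3"


print(parse_additional_buckets("myBucket1,myBucket2,myBucket3"))
print(first_browse_view("", can_list_buckets=True))
```

The third branch is the case to avoid: if no default bucket is set, confirm that the credentials can list buckets before saving the connection.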
Server Side KMS Key Identifier
When KMS encryption is enabled, you must specify the AWS KMS key ID to use for the server-side encryption.
Access to the key:
Access must be provided to the authenticating user.
The AWS IAM role must be assigned to this key.
Encrypt/Decrypt permissions for the specified KMS key ID:
Permissions must be assigned to the authenticating user.
The AWS IAM role must be given these permissions.
For more information, see https://docs.aws.amazon.com/kms/latest/developerguide/key-policy-modifying.html.
The format for referencing this key is the following:
"arn:aws:kms:<regionId>:<acctId>:key/<keyId>"
You can use an AWS alias in the following formats. The format of the AWS-managed alias is the following:
"alias/aws/s3"
The format for a custom alias is the following:
"alias/<FSR>"
where <FSR> is the name of the alias for the entire key.
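The three reference formats above can be checked before the value is pasted into the connection window. The following is a minimal sketch, assuming the unquoted value is being checked and that only the standard `aws` partition shown in the format above is used. The function name `classify_kms_reference` and the regular expressions are illustrative, not part of the product.

```python
import re

# Full key ARN: arn:aws:kms:<regionId>:<acctId>:key/<keyId>
# AWS account IDs are 12 digits.
KEY_ARN = re.compile(r"^arn:aws:kms:[a-z0-9-]+:\d{12}:key/[A-Za-z0-9-]+$")

# Alias names may contain alphanumerics, '/', '_' and '-'.
ALIAS = re.compile(r"^alias/[A-Za-z0-9/_-]+$")


def classify_kms_reference(value):
    """Return which of the documented formats `value` matches."""
    if KEY_ARN.match(value):
        return "key-arn"
    if ALIAS.match(value):
        # The 'alias/aws/' prefix is reserved for AWS-managed keys.
        if value.startswith("alias/aws/"):
            return "aws-managed-alias"
        return "custom-alias"
    raise ValueError(f"unrecognized KMS key reference: {value!r}")


print(classify_kms_reference("alias/aws/s3"))
```

A check like this only confirms the shape of the reference. It does not confirm that the key exists or that the authenticating user or IAM role has Encrypt/Decrypt permissions on it.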
Create via API
For more information, see Designer Cloud Powered by Trifacta: API Reference docs.
Java VFS Service
The Java VFS Service has been modified to handle an optional connection ID, enabling S3 URLs with connection ID and credentials. The other connection details are fetched through the Trifacta Application to create the required URL and configuration.
// sample URI
s3://bucket-name/path/to/object?connectionId=136

// sample java-vfs-service CURL request with s3
curl -H 'x-trifacta-person-workspace-id: 1' -X GET 'http://localhost:41917/vfsList?uri=s3://bucket-name/path/to/object?connectionId=136'
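The same request can be assembled programmatically. This sketch only builds the URL and header shown in the sample above and does not send anything. The function name `build_vfs_list_request` is hypothetical, and the host and port defaults are taken from the sample rather than from a documented configuration. The source does not say whether the nested query string should be percent-encoded, so the sketch reproduces the unencoded form used in the sample.

```python
def build_vfs_list_request(bucket, key, connection_id, workspace_id,
                           host="localhost", port=41917):
    """Return (url, headers) for a vfsList call against an S3 connection."""
    # S3 URI carrying the optional connection ID, as in the sample URI.
    s3_uri = f"s3://{bucket}/{key}?connectionId={connection_id}"
    url = f"http://{host}:{port}/vfsList?uri={s3_uri}"
    # Header name copied from the sample curl command.
    headers = {"x-trifacta-person-workspace-id": str(workspace_id)}
    return url, headers


url, headers = build_vfs_list_request("bucket-name", "path/to/object", 136, 1)
print(url)
print(headers)
```

The returned pair can be passed to any HTTP client as a GET request.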
Testing
For more information, see Verify Operations.
Using S3 Connections
Uses of S3
The Alteryx Analytics Cloud can use S3 for the following tasks:
Creating Datasets from S3 Files: You can read in source data stored in S3. An imported dataset may be a single S3 file or a folder of identically structured files. See Reading from sources in S3 below.
Reading Datasets: When creating a dataset, you can pull your data from a source in S3. See Creating Datasets below.
Writing Results: After a job has been executed, you can write the results back to S3.
In the Trifacta Application, S3 is accessed through the S3 browser. See S3 Browser.
Note
When Trifacta Application executes a job on a dataset, the source data is untouched. Results are written to a new location, so that no data is disturbed by the process.
Before you begin using S3
Warning
Avoid using /trifacta/uploads for reading and writing data. This directory is used by the Trifacta Application.
Your administrator should provide a writeable home output directory for you. This directory location is available through your user profile. See Storage Page.
Secure access
Your administrator can grant access on a per-user basis or for the entire workspace.
The Alteryx Analytics Cloud utilizes an S3 key and secret to access your S3 instance. These keys must enable read/write access to the appropriate directories in the S3 instance.
Note
If you disable or revoke your S3 access key, you must update the S3 keys for each user or for the entire system.
Storing data in S3
Your administrator should provide raw data or locations and access for storing raw data within S3. All Alteryx users should have a clear understanding of the folder structure within S3 where each individual can read from and write results.
Users should know where shared data is located and where personal data can be saved without interfering with or confusing other users.
The Trifacta Application stores the results of each job in a separate folder in S3.
Note
The Alteryx Analytics Cloud does not modify source data in S3. Source data stored in S3 is read without modification from source locations.
Reading from sources in S3
You can create an imported dataset from one or more files stored in S3.
Note
Import of glaciered objects is not supported.
Wildcards:
You can parameterize your input paths to import source files as part of the same imported dataset. For more information, see Overview of Parameterization.
Folder selection:
When you select a folder in S3 to create your dataset, you select all files in the folder to be included.
Notes:
This option selects all files in all sub-folders and bundles them into a single dataset. If your sub-folders contain separate datasets, you should be more specific in your folder selection.
All files used in a single imported dataset must be of the same format and have the same structure. For example, you cannot mix and match CSV and JSON files if you are reading from a single directory.
When a folder is selected from S3, the following file types are ignored:
*_SUCCESS and *_FAILED files, which may be present if the folder has been populated by the running environment.
Note
If you have a folder and file with the same name in S3, search only retrieves the file. You can still navigate to locate the folder.
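The folder-selection rules above can be approximated in a few lines. This is a rough sketch of the described behavior, not the application's code: `select_folder_files` is a hypothetical name, `keys` is assumed to be a plain list of S3 object keys, and the file extension is used as a stand-in for the file format.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Marker files that are ignored when a folder is selected.
IGNORED_PATTERNS = ("*_SUCCESS", "*_FAILED")


def select_folder_files(keys, folder):
    """Return the keys under `folder`, including sub-folders, to import."""
    prefix = folder.rstrip("/") + "/"
    selected = []
    for key in keys:
        # Skip keys outside the folder and "folder" placeholder keys.
        if not key.startswith(prefix) or key.endswith("/"):
            continue
        name = PurePosixPath(key).name
        if any(fnmatchcase(name, pattern) for pattern in IGNORED_PATTERNS):
            continue
        selected.append(key)

    # Assumption: the file extension stands in for the file format.
    extensions = {PurePosixPath(key).suffix.lower() for key in selected}
    if len(extensions) > 1:
        raise ValueError(f"mixed file formats in one dataset: {sorted(extensions)}")
    return selected


keys = ["sales/2023/a.csv", "sales/2023/_SUCCESS", "sales/2024/b.csv", "other/c.csv"]
print(select_folder_files(keys, "sales"))
```

Because sub-folders are included, selecting `sales` here bundles files from both `sales/2023` and `sales/2024`. Selecting `sales/2023` instead would keep the two years as separate datasets.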
Creating datasets
When creating a dataset, you can choose to read data from a source stored in S3 or from a local file.
S3 sources are not moved or changed.
Local file sources are uploaded to /trifacta/uploads, where they remain and are not changed.
Data may be individual files or all of the files in a folder. In the Import Data page, click the S3 tab. See Import Data Page.
Writing results
When you run a job, you can specify the S3 bucket and file path where the generated results are written. By default, the output is generated in your default bucket and default output home directory.
Each set of results must be stored in a separate folder within your S3 output home directory.
For more information on your output home directory, see Storage Page.
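One way to keep each set of results in its own folder is to derive the output path from a job identifier. The following is a minimal sketch under stated assumptions: the function name `job_output_path`, the `job-<id>` folder naming, and the example paths are all hypothetical and do not reflect the naming that the application itself uses.

```python
import posixpath


def job_output_path(bucket, output_home, job_id, filename):
    """Build an S3 URI that places each job's results in a separate folder."""
    # Normalize the output home so leading/trailing slashes do not matter.
    folder = posixpath.join(output_home.strip("/"), f"job-{job_id}")
    return f"s3://{bucket}/{folder}/{filename}"


print(job_output_path("myBucket1", "/users/jane/output/", 42, "results.csv"))
```

Using a distinct folder per job avoids one run overwriting the results of another run in the same output home directory.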