AWS Databricks Admin Setup

After you've completed the initial Databricks workspace configuration, follow this setup guide to provision AWS Databricks workspaces for Alteryx One Platform users.

Important

Before you set up Databricks workspaces, you must set your base storage environment to S3 and disable ADS. Go to AWS S3 as Private Data Storage to learn more.

Workspace Details

Enter a unique Workspace Name under Workspace Details. The Service URL automatically populates.

Cluster for Spark Jobs

Alteryx One uses this cluster configuration to schedule imports from, and publishing to, Databricks via Spark Jobs. Alteryx One creates a new cluster based on these details:

Select a Cluster Policy. This defines the limits on the attributes available during cluster creation. For an unrestricted policy (default), leave this option blank. Refer to the later Cluster Policy Requirements section for details on these options.
Select between Private Data Storage Credentials or Instance Profile for the S3 Auth Mode. This determines the mode with which you communicate with AWS S3. Refer to the later S3 Access Configuration section for details on these options.
If you selected Instance Profile, enter an Instance Profile ARN.
Select a Driver Node Type.
Select a Worker Node Type. These are the AWS EC2 instance types to use for launching cluster nodes. You can also select a pool if you have one available rather than standalone instances. To reduce workflow job run latency, use pools with a reasonable number of warm, idle instances.
Enter the Minimum Workers and Maximum Workers for the Databricks job cluster. Every cluster starts with the minimum number of workers provisioned. If the workload requires it, workers are added to the cluster dynamically, up to the maximum.
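As a rough illustration of how the fields above fit together, the following sketch assembles them into a cluster specification. The key names (driver_node_type_id, node_type_id, autoscale, policy_id, aws_attributes.instance_profile_arn) follow the public Databricks Clusters API. The function name build_job_cluster_spec and the sample node types are hypothetical, and the request that Alteryx One actually sends isn't documented here, so treat this as an assumption-laden sketch rather than the product's implementation.

```python
from typing import Any, Dict, Optional


def build_job_cluster_spec(
    driver_node_type: str,
    worker_node_type: str,
    min_workers: int,
    max_workers: int,
    policy_id: Optional[str] = None,
    instance_profile_arn: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble a Clusters API-style spec from the form fields (illustrative only)."""
    # The cluster starts at min_workers and can scale up to max_workers.
    if min_workers < 0 or max_workers < min_workers:
        raise ValueError("need 0 <= min_workers <= max_workers")

    spec: Dict[str, Any] = {
        "driver_node_type_id": driver_node_type,
        "node_type_id": worker_node_type,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }
    # Leaving the policy blank corresponds to the unrestricted policy.
    if policy_id:
        spec["policy_id"] = policy_id
    # Only relevant when the S3 Auth Mode is Instance Profile.
    if instance_profile_arn:
        spec["aws_attributes"] = {"instance_profile_arn": instance_profile_arn}
    return spec


if __name__ == "__main__":
    # Sample values are placeholders, not recommendations.
    print(build_job_cluster_spec("i3.xlarge", "i3.xlarge", 2, 8))
```

Rejecting a maximum below the minimum early mirrors the autoscaling rule described above: the cluster can only grow from the minimum toward the maximum.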

Cluster Policy Requirements

With Databricks, you can create policies with a specific set of restrictions on cluster configurations. Select one of these policies from the Cluster Policy dropdown. You can choose between Unrestricted Policy or Other Policy.

Unrestricted Policy (Recommended): The unrestricted policy lets you define any cluster configuration without limitations. This is the default policy.

Other Policy: If you select a policy from the dropdown, ensure that the selected cluster policy permits the chosen cluster configuration. Additionally, Alteryx One provides some default configurations while creating a job cluster. Make sure that the chosen policy allows the following default configurations:

{
	"spark_version":"12.2.x-scala2.12",
	"runtime_engine": "photon",
	"cluster_type": "JOB",
	"cluster_log_conf.path": "/trifacta/logs",
	"autoterminationMinutes": 60,
	"enable_local_disk_encryption": false,
	"aws_attributes.availability":"SPOT_WITH_FALLBACK",
	"aws_attributes.ebs_volume_count": 0,
	"aws_attributes.ebs_volume_size": 0,
	"aws_attributes.ebs_volume_type": "NONE",
	"aws_attributes.first_on_demand": 1,
	"aws_attributes.spot_bid_price_percent":100
}

Note

The default configuration might change in future releases. For that reason, we recommend that you don't define any of these default configurations in the cluster policy.

During Databricks workspace creation, Alteryx One performs basic cluster policy validation, but the actual validation takes place during job execution. If the configuration doesn’t match the cluster policy, the Databricks job will fail with a validation error indicating a configuration mismatch.
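To catch a mismatch before a job fails, you can compare a policy definition against the default configurations listed above. The sketch below assumes the policy is available as a Python dict in the Databricks policy-definition format (attribute paths mapped to rules of type fixed, forbidden, allowlist, blocklist, regex, range, or unlimited). The defaults are copied as they appear in this guide; the function names are hypothetical, and this is not the validation that Alteryx One or Databricks performs.

```python
import re
from typing import Any, Dict, List

# Defaults copied verbatim from this guide; they might change in future releases.
ALTERYX_DEFAULTS: Dict[str, Any] = {
    "spark_version": "12.2.x-scala2.12",
    "runtime_engine": "photon",
    "cluster_type": "JOB",
    "cluster_log_conf.path": "/trifacta/logs",
    "autoterminationMinutes": 60,
    "enable_local_disk_encryption": False,
    "aws_attributes.availability": "SPOT_WITH_FALLBACK",
    "aws_attributes.ebs_volume_count": 0,
    "aws_attributes.ebs_volume_size": 0,
    "aws_attributes.ebs_volume_type": "NONE",
    "aws_attributes.first_on_demand": 1,
    "aws_attributes.spot_bid_price_percent": 100,
}


def violates(rule: Dict[str, Any], value: Any) -> bool:
    """Return True if a single policy rule rejects the given value."""
    kind = rule.get("type")
    if kind == "fixed":
        return rule.get("value") != value
    if kind == "forbidden":
        return True
    if kind == "allowlist":
        return value not in rule.get("values", [])
    if kind == "blocklist":
        return value in rule.get("values", [])
    if kind == "regex":
        # Assumption: the pattern must match the entire value.
        return re.fullmatch(rule["pattern"], str(value)) is None
    if kind == "range":
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return True
        low = rule.get("minValue", float("-inf"))
        high = rule.get("maxValue", float("inf"))
        return not (low <= value <= high)
    return False  # "unlimited" or an unrecognized rule type


def find_conflicts(policy: Dict[str, Dict[str, Any]]) -> List[str]:
    """List the default attributes that the policy would reject."""
    return sorted(
        key
        for key, value in ALTERYX_DEFAULTS.items()
        if key in policy and violates(policy[key], value)
    )


if __name__ == "__main__":
    sample_policy = {
        "spark_version": {"type": "fixed", "value": "13.3.x-scala2.12"},
        "aws_attributes.first_on_demand": {"type": "range", "minValue": 1},
    }
    print(find_conflicts(sample_policy))
```

Attributes that the policy doesn't mention are treated as unrestricted, which matches the recommendation not to define these defaults in the policy.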

S3 Access Configuration

To support workflow runs with any sources/destinations that aren’t Databricks tables, Databricks clusters require access to the configured S3 bucket in Alteryx One as the default storage bucket. There are two ways to provide this access:

Instance Profile

When you select this mode, the admin must also enter an Instance Profile ARN. The instance profile must have read and write access to the S3 bucket that Alteryx One uses as the default storage bucket.

To configure an instance profile with the required permissions, go to the Databricks tutorial.

This is a secure and recommended option for authorizing S3 access to Databricks: no sensitive S3 credentials are exchanged between Alteryx One and Databricks.

Private Data Storage Credentials

When you select this mode, Alteryx One attempts to use the same credentials that you provided:

If you configured Alteryx One with an AWS key-secret…
- You don’t need additional configuration. Alteryx One passes the key and secret to the Databricks cluster, and the job uses them to access the S3 bucket.

Caution

This is NOT a recommended method for authorizing S3 access to Databricks.

If you configured Alteryx One with an AWS cross-account IAM role…
- When you use a cross-account IAM role, the admin must also enter an Instance Profile ARN. Alteryx One uses the identity of the instance profile to securely impersonate the configured IAM role. To configure an instance profile with the required permissions, go to the Databricks tutorial. The instance profile doesn’t need S3 access permissions. However, it does require permission to assume any cross-account IAM role associated with Alteryx One. Use these permissions and trust relationships:

Note

Replace <account-id> with your AWS account ID.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::<account-id>:role/<ROLE_1>",
          "arn:aws:iam::<account-id>:role/<ROLE_2>",
          "arn:aws:iam::<account-id>:role/<ROLE_3>"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

The cross-account role also needs a new trust relationship to be assumed by the instance profile above. This is in addition to the trust relationship it already requires with Alteryx One. Use these permissions and trust relationships:

Note

Replace <account-id> with your AWS account ID.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<aws_account_id>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringLike": {
          "sts:ExternalId": [
            "<external_id>"
          ]
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account-id>:role/<INSTANCE_PROFILE_ROLE>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringLike": {
          "sts:ExternalId": [
            "<external_id>"
          ]
        }
      }
    }
  ]
}
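If you generate this trust relationship programmatically, a small helper can fill in the placeholders and refuse values that still contain angle brackets. The sketch below only mirrors the JSON structure shown above. The function name build_trust_policy and the sample values are hypothetical, and the helper doesn't call AWS or apply the policy.

```python
import json
from typing import Any, Dict


def build_trust_policy(
    alteryx_principal: str,
    instance_profile_role_arn: str,
    external_id: str,
) -> Dict[str, Any]:
    """Build the two-statement trust relationship shown in this guide."""
    for value in (alteryx_principal, instance_profile_role_arn, external_id):
        # Guard against placeholders such as <account-id> left unfilled.
        if "<" in value or ">" in value:
            raise ValueError(f"unfilled placeholder in {value!r}")

    def statement(principal: str) -> Dict[str, Any]:
        return {
            "Effect": "Allow",
            "Principal": {"AWS": principal},
            "Action": "sts:AssumeRole",
            "Condition": {"StringLike": {"sts:ExternalId": [external_id]}},
        }

    return {
        "Version": "2012-10-17",
        "Statement": [
            statement(alteryx_principal),          # existing trust with Alteryx One
            statement(instance_profile_role_arn),  # new trust for the instance profile
        ],
    }


if __name__ == "__main__":
    # All values below are made-up examples.
    policy = build_trust_policy(
        "123456789012",
        "arn:aws:iam::123456789012:role/example-instance-profile-role",
        "example-external-id",
    )
    print(json.dumps(policy, indent=2))
```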

This is a secure and recommended option for authorizing S3 access to Databricks: no sensitive S3 credentials are exchanged between Alteryx One and Databricks.

Cluster for Photon Jobs

This is a long-running cluster required to browse, preview, and import Databricks tables as datasets in Alteryx One. The cluster must meet these requirements to show up as an option:

Run in shared-access mode.
Use Databricks runtime version 12.2 LTS.
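The two requirements above can be checked against cluster metadata before you open the dropdown. This sketch assumes each cluster is a dict shaped like an entry from the Databricks Clusters API, that shared access mode is reported as data_security_mode "USER_ISOLATION", and that 12.2 LTS runtimes have a spark_version beginning with "12.2.x". The function name is hypothetical, and the filter that Alteryx One applies might differ.

```python
from typing import Any, Dict, List


def eligible_photon_clusters(clusters: List[Dict[str, Any]]) -> List[str]:
    """Return the names of clusters that meet both requirements (illustrative)."""
    names: List[str] = []
    for cluster in clusters:
        # Assumption: shared access mode appears as USER_ISOLATION.
        shared = cluster.get("data_security_mode") == "USER_ISOLATION"
        # Assumption: 12.2 LTS runtime keys start with "12.2.x".
        runtime_ok = str(cluster.get("spark_version", "")).startswith("12.2.x")
        if shared and runtime_ok:
            names.append(cluster.get("cluster_name", ""))
    return names


if __name__ == "__main__":
    # Made-up sample data.
    sample = [
        {"cluster_name": "browse", "data_security_mode": "USER_ISOLATION",
         "spark_version": "12.2.x-scala2.12"},
        {"cluster_name": "personal", "data_security_mode": "SINGLE_USER",
         "spark_version": "12.2.x-scala2.12"},
    ]
    print(eligible_photon_clusters(sample))
```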

Once you’ve determined your Photon cluster, select Save.

You've now configured your Databricks workspace for use in Alteryx One.

To Edit or Delete your Databricks workspace, select the 3-dot menu next to your workspace.

Use Databricks for Workflow Execution

After you’ve configured at least 1 Databricks workspace, you can enable the Databricks runtime for workflows in Admin Console > Settings > Job execution > Spark Engine. This switches the scalable runtime used for executing workflows from EMR Spark to Databricks.

Once you’ve switched the engine, Databricks becomes available as a workflow job run option for users who’ve registered a personal access token against at least 1 Databricks workspace.

When you run a full workflow, Alteryx One launches a dedicated job cluster using the Databricks configuration defined by the admin (for example, the driver and worker node types and the auto-scaling configuration). Every workflow job run gets its own cluster, which lasts only for the duration of the run and then terminates automatically. Alteryx One never shares these clusters between users or between workflow runs.
