Dataproc Engine Setup Guide
Connect your Alteryx Analytics Cloud (AAC) workspace to your Dataproc Serverless account to enable the Dataproc Engine. Dataproc is a distributed Spark engine that can run your Designer Cloud workflows if your workspace is set up with GCS as Private Data Storage. Follow these steps to enable the Dataproc engine in your AAC workspace.
Prerequisites
You must be a Workspace Admin in AAC.
Your AAC workspace must be set up with GCS as Private Data Storage.
You must have a GCP service account to run Dataproc batches (jobs).
You must have administrative access to the target GCP project.
Create a VPC network for all the regions you want to use.
Set the constraint constraints/compute.requireOsLogin to false in the project you want to use.
GCP Service Accounts
You need 2 types of service accounts:
Base storage service account for GCS. You only need this account if you use workspace mode. AAC uses this account to access GCS at design time and to create Dataproc batches, so the account must have permission to create and monitor Dataproc batches. These are the recommended roles:
Dataproc Editor (roles/dataproc.editor) in the project where you want to run Dataproc.
Service Account User (roles/iam.serviceAccountUser) on the Dataproc service account. For more information, go to the GCS roles documentation.
Note
If you use user mode, AAC doesn’t use the base storage service account. Instead, AAC uses your SSO identity to launch the Dataproc batch. However, you need the same roles as listed for the base storage service account.
Dataproc service account. AAC passes this service account as an argument when it creates a Dataproc batch. The account must have the Dataproc Worker role (roles/dataproc.worker) in the project where the batch runs.
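As a rough sketch of the role assignments described in this section, the following Python snippet assembles the gcloud commands that would grant each role. The project ID and the two service account emails are placeholder values chosen for illustration, not values from this guide. The snippet only prints the commands and doesn't run them, so you can review them against your own IAM setup first.

```python
import shlex

# Placeholder values (assumptions for illustration only).
PROJECT_ID = "my-gcp-project"
BASE_STORAGE_SA = "aac-base-storage@my-gcp-project.iam.gserviceaccount.com"
DATAPROC_SA = "aac-dataproc@my-gcp-project.iam.gserviceaccount.com"


def project_role_binding(project_id, member_sa, role):
    """Build a gcloud command that grants a project-level role to a service account."""
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member=serviceAccount:{member_sa}",
        f"--role={role}",
    ]


def service_account_role_binding(target_sa, member_sa, role):
    """Build a gcloud command that grants a role on one specific service account."""
    return [
        "gcloud", "iam", "service-accounts", "add-iam-policy-binding", target_sa,
        f"--member=serviceAccount:{member_sa}",
        f"--role={role}",
    ]


def build_role_commands(project_id, base_storage_sa, dataproc_sa):
    """Return the three role grants described in this section, in order."""
    return [
        # Base storage account: create and monitor Dataproc batches.
        project_role_binding(project_id, base_storage_sa, "roles/dataproc.editor"),
        # Base storage account: act as the Dataproc service account.
        service_account_role_binding(
            dataproc_sa, base_storage_sa, "roles/iam.serviceAccountUser"
        ),
        # Dataproc service account: run the batch workload.
        project_role_binding(project_id, dataproc_sa, "roles/dataproc.worker"),
    ]


if __name__ == "__main__":
    for command in build_role_commands(PROJECT_ID, BASE_STORAGE_SA, DATAPROC_SA):
        print(shlex.join(command))
```

In user mode, the first two grants would presumably target the user's identity (a user: member) rather than a service account, so adjust the member prefix accordingly.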
GCP Project Configuration
Set the constraint constraints/compute.requireOsLogin to false in the Google Cloud Platform (GCP) project you want to use. For more information, go to the GCS policy documentation.
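One possible way to express this setting is as an Organization Policy document that turns off enforcement of the constraint for a single project. The sketch below builds that document in Python and prints it as JSON. The project ID is a placeholder, and the approach assumes you manage policies with the gcloud org-policies commands and have permission to administer organization policies. Your organization might manage this constraint differently.

```python
import json

# Placeholder project ID (an assumption for illustration only).
PROJECT_ID = "my-gcp-project"


def os_login_policy(project_id):
    """Build a policy document that stops enforcing compute.requireOsLogin."""
    return {
        "name": f"projects/{project_id}/policies/compute.requireOsLogin",
        "spec": {"rules": [{"enforce": False}]},
    }


if __name__ == "__main__":
    # Save the output to a file (for example, policy.json) and apply it with:
    #   gcloud org-policies set-policy policy.json
    print(json.dumps(os_login_policy(PROJECT_ID), indent=2))
```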
VPC Network Configuration
You must have a VPC network set up to run Dataproc jobs. For more information on how to configure this network, go to the Dataproc Serverless documentation.
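If you don't have a network yet, a minimal sketch of creating one with auto-subnets might look like the following. The project ID and network name are placeholders. The snippet only prints the gcloud command and doesn't run it. The Dataproc Serverless documentation lists further network requirements (for example, Private Google Access on the subnet) that this sketch doesn't cover.

```python
import shlex

# Placeholder values (assumptions for illustration only).
PROJECT_ID = "my-gcp-project"
NETWORK_NAME = "aac-dataproc-network"


def create_auto_network_command(project_id, network_name):
    """Build a gcloud command that creates a VPC network with auto-subnets."""
    return [
        "gcloud", "compute", "networks", "create", network_name,
        f"--project={project_id}",
        "--subnet-mode=auto",
    ]


if __name__ == "__main__":
    print(shlex.join(create_auto_network_command(PROJECT_ID, NETWORK_NAME)))
```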
Complete Setup
The workspace admin can configure Dataproc for their workspace using the admin console.
Go to the workspace Admin section > Data Warehouses > Dataproc section.
Fill in the configuration form:
Project ID | The Google project in which the Dataproc batch is executed. |
VPC Network Name | The VPC network used to run the Dataproc batch. If the network uses auto-subnets, you don't need to specify the subnet name. If the network is configured with custom subnets, you must also specify the subnet name in the form. |
Region | The region where the Dataproc batch is executed. |
Service Account Name | The service account used to run the Dataproc batch. This is specified as a parameter at launch time, and is not necessarily the same service account as the base storage service account. |
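Before you fill in the form, it can help to collect the values in one place and check them. The sketch below is purely illustrative: the class name, field names, and checks are hypothetical and aren't part of AAC. It only encodes the rule from the table that a subnet name is needed when the network uses custom subnets.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataprocFormValues:
    """Hypothetical container that mirrors the fields of the configuration form."""

    project_id: str
    vpc_network_name: str
    region: str
    service_account_name: str
    subnet_name: Optional[str] = None
    uses_custom_subnets: bool = False

    def problems(self) -> List[str]:
        """Return a list of human-readable problems; an empty list means none found."""
        issues = []
        required = {
            "Project ID": self.project_id,
            "VPC Network Name": self.vpc_network_name,
            "Region": self.region,
            "Service Account Name": self.service_account_name,
        }
        for label, value in required.items():
            if not value.strip():
                issues.append(f"{label} is required")
        # Auto-subnet networks don't need an explicit subnet name.
        if self.uses_custom_subnets and not (self.subnet_name or "").strip():
            issues.append("Subnet name is required when the network uses custom subnets")
        return issues


if __name__ == "__main__":
    values = DataprocFormValues(
        project_id="my-gcp-project",          # placeholder
        vpc_network_name="aac-dataproc-network",  # placeholder
        region="us-central1",                 # placeholder
        service_account_name="aac-dataproc@my-gcp-project.iam.gserviceaccount.com",
    )
    print(values.problems() or "No problems found")
```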