Dataprep In-VPC Execution
This section describes how you can configure Dataprep by Trifacta to operate within your enterprise's virtual private cloud (VPC).
The Trifacta Application runs in your VPC in the Google Cloud Platform. No additional configuration is required.
Dataflow
Optionally, you can configure Dataflow jobs to be executed within your VPC. When enabled, data remains in your VPC during full execution of the job.
Note
Previewing and sampling use the default network settings.
To enable in-VPC execution, the VPC network mode must be set to custom, and additional VPC properties must be provided. In-VPC job execution can be configured per-user or per-output:
Per-user: For more information, see User Execution Settings Page.
Per-output: For more information, see Runtime Dataflow Execution Settings.
Note
Per-output settings override any settings specified in your preferences.
Running Jobs
Note
This feature may not be available in all product editions. For more information on available features, see Compare Editions.
Note
When job execution is migrated from the platform VPC to your enterprise VPC, you may incur additional costs to execute each job.
Job Types
By default, Trifacta Photon and connectivity jobs execute in the Alteryx VPC. As needed, you can configure these jobs to run in your VPC.
Tip
Service accounts may be used for execution of these jobs where possible.
Tip
All job types supported for in-VPC execution are supported for manual and scheduled execution.
Job Type | Description |
---|---|
Batch job processing | For execution of batch jobs within your VPC, you must perform the configuration, including specifying the appropriate service accounts to use. After configuration, these jobs are automatically executed within your VPC. |
Trifacta Photon | These jobs are transformation and quick scan sampling jobs that execute in memory. This type of job execution is suitable for small- to medium-sized jobs. |
Connectivity | If your data source or publishing target is a relational or API-based source, some or all of the job occurs through the connectivity framework. Tip If connectivity jobs have been enabled for execution in your environment, then BigQuery connectivity is enabled, including publishing and using BigQuery for running transformation jobs, using the appropriate service account. |
Connectivity - design time | In-VPC execution supports connection from the design time functions of the Trifacta Application to an in-VPC data service instance. This connection to the data service allows for testing connections, viewing table and schema information, and collecting initial samples from datasources hosted within your VPC. Note When this feature is enabled, SSH tunneling for connections does not work. |
Conversion | Ingestion jobs of datasources that need to be converted, such as binary formats like PDF, XLSX, and Google Sheets, can be executed within your VPC. Note Google Sheets conversion jobs use user credentials within the project, even if service accounts are enabled. |
For these job types, there are two types of configuration:
Configuration Type | Description |
---|---|
Basic | Uses the GKE default namespace and default node pool. See below. |
Advanced | User-configured GKE namespace and user-specified node pool. See Dataprep In-VPC Execution - Advanced. |
Details on these configuration methods are provided below.
Limitations
The following limitations apply to this release. These limitations may change in the future:
A running job is permitted to execute for no more than 1 hour.
For this release, only regions in the U.S. and Europe are supported.
Prerequisites
Before you begin, please verify that your VPC environment has the following:
The project owner must perform configuration in Dataprep by Trifacta as part of this setup.
A GKE cluster is available for transformation jobs to use.
Your GKE cluster must have a public endpoint.
Use VPC-native clusters. Routes-based clusters are not supported.
If using a GKE cluster with private nodes, a Cloud NAT (network address translation) gateway must be available in your VPC to access the Alteryx image registry.
Workload identity must be enabled on the GKE cluster. Additional configuration for Dataprep by Trifacta is described later.
The use of service accounts (Compute Engine or Companion Service Accounts) is required to run jobs in your VPC.
Use of individual user credentials is not supported for Workload Identity.
Access to the following tools:
gcloud command line interface (CLI)
kubectl
openssl
base64
Acquire from Alteryx:
IP address for authorized control plane access.
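Before you start, it can help to confirm that the command line tools listed above are installed. The following is a minimal sketch of such a check. The function name `missing_tools` is illustrative and is not part of Dataprep by Trifacta or Google Cloud tooling.

```shell
# Print the name of every tool passed as an argument that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the four tools named in the prerequisites.
missing=$(missing_tools gcloud kubectl openssl base64)
if [ -n "$missing" ]; then
  echo "Missing required tools:" $missing
else
  echo "All required tools are available."
fi
```

If any tool is reported as missing, install it before continuing with the configuration steps.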
Enable
In-VPC execution must be enabled by an administrator. In the Dataprep Settings page, you can enable the following settings.
Setting | Description |
---|---|
In-VPC execution | Enables general in-VPC execution, which includes execution of the following job types: |
In-VPC Conversion job execution | Enables execution of conversion jobs within your VPC. Note This setting is available when In-VPC Execution has been enabled. |
In-VPC Data-Service communication | Enables design-time connectivity jobs to be executed within your VPC. Note This setting is available when In-VPC Execution has been enabled. |
Note
The Scheduling feature must also be enabled for the project.
For more information, see Dataprep Project Settings Page.
Basic configuration
Please complete the following steps for the Basic configuration.
Google Cloud IAM Service Account
This Service Account is assigned to the nodes in the GKE node pool and is configured to have minimal privileges.
Following are variables listed in the configuration steps. They can be modified based on your requirements and supported values:
Variable | Description |
---|---|
| Default service account name |
myproject | Name of your Google project |
myregion | Your Google Cloud region |
Please execute the following commands from the gcloud CLI:
Note
Below, some commands are too long for a single line. Lines that continue onto the next line end with a backslash (\). Omit the backslash if you enter the command on a single line.
gcloud iam service-accounts create trifacta-service-account \
  --display-name="Service Account for running Trifacta Remote jobs"

gcloud projects add-iam-policy-binding myproject \
  --member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
  --role roles/logging.logWriter

gcloud projects add-iam-policy-binding myproject \
  --member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
  --role roles/monitoring.metricWriter

gcloud projects add-iam-policy-binding myproject \
  --member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
  --role roles/monitoring.viewer

gcloud projects add-iam-policy-binding myproject \
  --member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
  --role roles/stackdriver.resourceMetadata.writer

gcloud projects add-iam-policy-binding myproject \
  --member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
  --role roles/artifactregistry.reader
Verification steps:
Command:
gcloud projects get-iam-policy myproject --flatten="bindings[].members" --format="table(bindings.role)" --filter="bindings.members:serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com"
The output should look like the following:
ROLE
roles/artifactregistry.reader
roles/logging.logWriter
roles/monitoring.metricWriter
roles/monitoring.viewer
roles/stackdriver.resourceMetadata.writer
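Rather than comparing the role list by eye, you can compare it against the five roles granted in the commands above. The following is a sketch only. The names `EXPECTED_ROLES` and `missing_roles` are illustrative, and the function expects one role per line on standard input.

```shell
# Roles that the configuration steps grant to the node service account.
EXPECTED_ROLES="roles/artifactregistry.reader
roles/logging.logWriter
roles/monitoring.metricWriter
roles/monitoring.viewer
roles/stackdriver.resourceMetadata.writer"

# Read granted roles (one per line) on stdin and print each expected role
# that does not appear in the input. No output means nothing is missing.
missing_roles() {
  granted=$(cat)
  printf '%s\n' "$EXPECTED_ROLES" | while IFS= read -r role; do
    printf '%s\n' "$granted" | grep -Fxq -- "$role" || printf '%s\n' "$role"
  done
}

# Example usage (requires gcloud and access to the project):
# gcloud projects get-iam-policy myproject --flatten="bindings[].members" \
#   --format="value(bindings.role)" \
#   --filter="bindings.members:serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
#   | missing_roles
```

The usage example switches the output format to `value(bindings.role)` so that the `ROLE` header line is not printed.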
Router and NAT
The following configuration is required for Internet access to acquire assets from Dataprep by Trifacta, if the GKE cluster has private nodes.
Note
Below, some commands are too long for a single line. Lines that continue onto the next line end with a backslash (\). Omit the backslash if you enter the command on a single line.
gcloud compute routers create myproject-myregion \
  --network myproject-network \
  --region=myregion

gcloud compute routers nats create myproject-myregion \
  --router=myproject-myregion \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges \
  --enable-logging
Verification Steps:
You can verify that the router NAT was created in the Google Cloud Platform Console: https://console.cloud.google.com/net-services/nat/list.
GKE cluster
This configuration creates the GKE cluster for use in executing jobs. This cluster must be created in the VPC/sub-network that has access to your datasources, such as your databases and Cloud Storage.
In the following, please replace w.x.y.z with the IP address provided to you by Alteryx for authorized control plane access.
The Pod address range limits the maximum size of the cluster. See https://cloud.google.com/kubernetes-engine/docs/how-to/flexible-pod-cidr.
For more information about the available zones in node-locations, please see https://console.cloud.google.com/compute/zones.
If your account does not have sufficient quota, you must reconfigure the node size to fit within your quota, or your cluster may not start.
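To illustrate how the Pod address range limits cluster size, the following sketch estimates the maximum number of nodes for a given range. It assumes each node is allocated a /24 block of Pod addresses, which is what GKE documents for a maximum of 110 Pods per node, the value used in this configuration. The function name `max_nodes_for_pod_range` is illustrative.

```shell
# Estimate how many nodes a Pod address range can hold when every node
# receives a /24 block (the allocation for up to 110 Pods per node).
max_nodes_for_pod_range() {
  prefix=$1   # prefix length of the Pod range, for example 20 for a /20
  if [ "$prefix" -gt 24 ]; then
    # A range smaller than /24 cannot hold even one node's block.
    echo 0
  else
    echo $(( 1 << (24 - prefix) ))
  fi
}

max_nodes_for_pod_range 14   # prints 1024
max_nodes_for_pod_range 20   # prints 16
```

A /20 Pod range, for example, holds 16 node blocks, so the autoscaler cannot grow the cluster beyond 16 nodes regardless of the configured maximum.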
Note
Below, some commands are too long for a single line. Lines that continue onto the next line end with a backslash (\). Omit the backslash if you enter the command on a single line.
gcloud container clusters create "trifacta-cluster" \
  --project "myproject" \
  --region "myregion" \
  --no-enable-basic-auth \
  --cluster-version "1.20.8-gke.900" \
  --release-channel "None" \
  --machine-type "n1-standard-16" \
  --image-type "COS_CONTAINERD" \
  --disk-type "pd-standard" \
  --disk-size "100" \
  --metadata disable-legacy-endpoints=true \
  --service-account "trifacta-service-account@myproject.iam.gserviceaccount.com" \
  --max-pods-per-node "110" \
  --num-nodes "1" \
  --logging=SYSTEM,WORKLOAD \
  --monitoring=SYSTEM \
  --enable-ip-alias \
  --network "projects/myproject/global/networks/myproject-network" \
  --subnetwork "projects/myproject/regions/myregion/subnetworks/myproject-subnet-myregion" \
  --no-enable-intra-node-visibility \
  --default-max-pods-per-node "110" \
  --enable-autoscaling \
  --min-nodes "0" \
  --max-nodes "3" \
  --enable-master-authorized-networks \
  --master-authorized-networks w.x.y.z/32 \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
  --no-enable-autoupgrade \
  --enable-autorepair \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0 \
  --workload-pool "myproject.svc.id.goog" \
  --enable-private-nodes \
  --enable-shielded-nodes \
  --shielded-secure-boot \
  --node-locations "myregion-a","myregion-b","myregion-c" \
  --master-ipv4-cidr=10.1.0.0/28 \
  --enable-binauthz
Verification Steps:
You can verify that the cluster was created through the Google Cloud Platform Console: https://console.cloud.google.com/kubernetes/list/overview.
Use the following command to set up configuration to connect to the new cluster:
gcloud container clusters get-credentials trifacta-cluster --region myregion --project myproject
The following commands whitelist the Cloud shell for use on the cluster:
After you have acquired access, you can whitelist the following account:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  namespace: default
  name: trifacta-job-runner
EOF
You can whitelist the following role using the appropriate definition below:
Use the following if you are enabling design-time connectivity to a remote data service instance:
cat <<EOF | kubectl apply -n data-system-job-namespace -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: trifacta-job-runner-role
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["list", "create", "delete"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list", "update"]
- apiGroups: [""]
  resources: ["pods/log", "pods/portforward"]
  verbs: ["get", "list", "create"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "create", "delete", "watch"]
- apiGroups: [""]
  resources: ["serviceaccounts"]
  verbs: ["list", "get"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["patch", "list", "get", "create"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["create", "list", "get"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["patch", "create", "list", "get"]
EOF
Specify the following role bindings and cluster roles:
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: trifacta-job-runner-rb
subjects:
- kind: ServiceAccount
  name: trifacta-job-runner
  namespace: default
roleRef:
  kind: Role
  name: trifacta-job-runner-role
  apiGroup: rbac.authorization.k8s.io
EOF

cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-list-role
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["list"]
EOF

cat <<EOF | kubectl apply -f -
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: node-list-rb
subjects:
- kind: ServiceAccount
  name: trifacta-job-runner
  namespace: default
roleRef:
  kind: ClusterRole
  name: node-list-role
  apiGroup: rbac.authorization.k8s.io
EOF

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  name: trifacta-pod-sa
EOF
Node pool - diff
For basic configuration, Trifacta Photon uses the default node pool. No additional configuration is required.
Kubernetes namespace - diff
For basic configuration, Trifacta Photon uses the default namespace. No additional configuration is required.
Kubernetes Service Accounts - diff
Variable | Description |
---|---|
trifacta-job-runner | Service Account used by Dataprep by Trifacta externally to launch jobs into the GKE cluster. |
trifacta-pod-sa | Service Account assigned to the job pod running in the GKE cluster. |
Please execute the following commands:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  namespace: default
  name: trifacta-job-runner
EOF

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: trifacta-job-runner-secret
  annotations:
    kubernetes.io/service-account.name: trifacta-job-runner
type: kubernetes.io/service-account-token
EOF

cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: trifacta-job-runner-role
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create", "delete"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "create", "delete", "watch"]
- apiGroups: [""]
  resources: ["serviceaccounts"]
  verbs: ["list", "get"]
EOF

cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: trifacta-job-runner-rb
subjects:
- kind: ServiceAccount
  name: trifacta-job-runner
  namespace: default
roleRef:
  kind: Role
  name: trifacta-job-runner-role
  apiGroup: rbac.authorization.k8s.io
EOF

cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-list-role
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["list"]
EOF

cat <<EOF | kubectl apply -f -
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: node-list-rb
subjects:
- kind: ServiceAccount
  name: trifacta-job-runner
  namespace: default
roleRef:
  kind: ClusterRole
  name: node-list-role
  apiGroup: rbac.authorization.k8s.io
EOF

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  name: trifacta-pod-sa
EOF
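After the role and role binding are applied, you may want to confirm that the trifacta-job-runner Service Account received the expected permissions. The following sketch uses `kubectl auth can-i` with impersonation and checks a subset of the namespaced permissions in the role above. It assumes your own account is allowed to impersonate service accounts on the cluster. The `KUBECTL` variable and the function name are illustrative, and the variable exists only so that the command can be swapped out for testing.

```shell
# Command used to reach the cluster. Override KUBECTL to substitute a stub.
KUBECTL="${KUBECTL:-kubectl}"
RUNNER="system:serviceaccount:default:trifacta-job-runner"

# Print one "verb resource: answer" line for each permission checked.
# The answer is "yes" or "no", or "unknown" if the command produced no output.
check_runner_permissions() {
  while read -r verb resource; do
    answer=$($KUBECTL auth can-i "$verb" "$resource" --as="$RUNNER" -n default </dev/null 2>/dev/null) || true
    echo "$verb $resource: ${answer:-unknown}"
  done <<'EOF'
create secrets
delete secrets
list pods
create jobs
delete jobs
watch jobs
list serviceaccounts
EOF
}
```

Run `check_runner_permissions` after you have connected to the cluster. Any line that ends in `no` points to a rule that was not applied.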
Credential encryption keys
The following commands create the encryption keys for credentials:
Note
Below, some commands are too long for a single line. Lines that continue onto the next line end with a backslash (\). Omit the backslash if you enter the command on a single line.
openssl genrsa -out private_key.pem 2048
openssl pkcs8 -topk8 -inform PEM -outform DER -in private_key.pem -out private_key.der -nocrypt
openssl rsa -in private_key.pem -pubout -outform DER -out public_key.der
base64 -i public_key.der > public_key.der.base64
base64 -i private_key.der > private_key.der.base64
kubectl create secret generic trifacta-credential-encryption -n default \
  --from-file=privateKey=private_key.der.base64
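Because the public key is pasted into the application while the private key is stored in the cluster, a mismatched pair can be hard to diagnose later. The following sketch checks that the two DER files produced above belong together by deriving the public key from the private key and comparing the result. The function name `keys_match` is illustrative, and the check assumes an unencrypted PKCS#8 private key in DER form, as generated by the commands above.

```shell
# Succeed only when the public key derived from the private key is identical
# to the exported public key. Both are compared as base64 text.
keys_match() {
  private_der=$1
  public_der=$2
  derived=$(openssl pkey -inform DER -in "$private_der" -pubout -outform DER | base64 | tr -d '\n')
  exported=$(base64 < "$public_der" | tr -d '\n')
  [ -n "$derived" ] && [ "$derived" = "$exported" ]
}

# Example usage in the directory that holds the generated files:
# keys_match private_key.der public_key.der && echo "Key pair matches."
```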
Dataprep by Trifacta application configuration
After you have completed the above configuration, you must configure the Trifacta Application based on the commands that you have executed.
Steps:
Log in to the Trifacta Application as a project owner.
Select Admin console > VPC runtime settings.
Please complete the following configuration. For more information on these settings, see VPC Runtime Settings Page.
Kubernetes cluster tab
Setting | Command or Value |
---|---|
Master URL | Command: gcloud container clusters describe trifacta-cluster --zone=myregion --format="value(endpoint)" Returns: This command returns a URL that looks similar to the following: https://34.0.0.0 |
OAuth token | Command: kubectl get secret/trifacta-job-runner-secret -o json | jq -r '.data.token' | base64 --decode |
Cluster CA certificate | Command: gcloud container clusters describe trifacta-cluster --zone=myregion --format="value(masterAuth.clusterCaCertificate)" |
Service account name | Value: |
Public key | Insert the contents of: To acquire this value: cat public_key.der.base64 Note To process Google Sheets data in your VPC, this value is required. Otherwise, it is optional. |
Private key secret name | Value: Note To process Google Sheets data in your VPC, this value is required. The private key must be accessible within your VPC. Otherwise, this value is optional. |
Photon tab
Setting | Command or Value |
---|---|
Kubernetes namespace | Value: To acquire the namespace value: kubectl get namespace |
CPU, memory - request, limits | Adjust as needed. Note CPU and memory requests and limits should be lower than the CPU and memory that can be allocated on the GKE node. |
Node selector, tolerations | Values: Node selector = "" Node tolerations = "" |
Connectivity tab
Setting | Command or Value |
---|---|
Kubernetes namespace | default |
CPU, memory - request, limits | Adjust defaults, if necessary. |
Node selector, tolerations | Node selector = "" Node tolerations = "" |
To test your configuration, click Test. A success message should be displayed.
To save your configuration, click Save.
Conversion tab
Setting | Command or Value |
---|---|
Kubernetes namespace | default |
CPU, memory - request, limits | Adjust defaults, if necessary. |
Node selector, tolerations | Node selector = "" Node tolerations = "" |
To test your configuration, click Test. A success message should be displayed.
To save your configuration, click Save.
Configure Workload Identity
Note
This feature may not be available in all product editions. For more information on available features, see Compare Editions.
Google access tokens are valid for 1 hour. Some jobs can be long running. To protect against timeouts during these jobs and to support security practices recommended by Google, Dataprep by Trifacta supports the use of Workload Identity, which is Google's recommended approach for accessing Google APIs.
For more information on Workload Identity, see https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity.
For more information on enabling Workload Identity in your project, see https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity.
Note
Workload Identity is required for running jobs on a GKE cluster, which is required for In-VPC job execution.
Note
Workload Identity requires the use of Compute Engine or Companion service accounts. Use of individual user credentials is not supported. For more information, see Google Service Account Management.
Warning
For each Companion Service Account assigned to a user in Dataprep by Trifacta:
A new Kubernetes ServiceAccount must be created on the GKE cluster.
Note
This step must be completed by your Google Cloud Platform administrator.
Using Workload Identity, the Companion Service Account must be bound to the newly created Kubernetes ServiceAccount.
The following assumes that a Companion Service Account named allAccess@myproject.iam.gserviceaccount.com already exists:
# Create a new Kubernetes ServiceAccount on the GKE cluster with an annotation to bind it
# to the allAccess@myproject.iam.gserviceaccount.com Companion Service Account.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  annotations:
    iam.gke.io/gcp-service-account: allAccess@myproject.iam.gserviceaccount.com
  name: trifacta-pod-sa-allaccess
EOF

# Allow the Kubernetes ServiceAccount to impersonate the Google IAM service account by
# adding an IAM policy binding between the two service accounts. This binding allows the
# Kubernetes ServiceAccount to act as the IAM service account.
gcloud iam service-accounts add-iam-policy-binding \
  --project <project_name> --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:<project_name>.svc.id.goog[default/trifacta-pod-sa-allaccess]" \
  allAccess@myproject.iam.gserviceaccount.com
Wait a couple of minutes for the binding to take effect.
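The member string in the binding follows a fixed pattern: project, namespace, and Kubernetes ServiceAccount name. When you repeat these steps for several Companion Service Accounts, a small helper can reduce typing errors. The following is a sketch only, and the function names `wi_member` and `show_wi_bindings` are illustrative.

```shell
# Build the IAM member string that identifies a Kubernetes ServiceAccount
# under Workload Identity: serviceAccount:PROJECT.svc.id.goog[NAMESPACE/NAME].
wi_member() {
  project=$1
  namespace=$2
  ksa=$3
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$project" "$namespace" "$ksa"
}

# List the roles and members bound to a Companion Service Account so that you
# can confirm the member above holds roles/iam.workloadIdentityUser.
# Requires gcloud and permission to read the service account's IAM policy.
show_wi_bindings() {
  gsa=$1
  project=$2
  gcloud iam service-accounts get-iam-policy "$gsa" --project "$project" \
    --flatten="bindings[].members" \
    --format="table(bindings.role, bindings.members)"
}

wi_member myproject default trifacta-pod-sa-allaccess
```

The final line prints the member string for the example accounts used above. To inspect the bindings, call `show_wi_bindings allAccess@myproject.iam.gserviceaccount.com myproject`.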
Note
For relational connectivity, additional configuration is required. Search for data-system in Dataprep In-VPC Execution - Advanced.
Testing
You can use the following command to watch the Kubernetes cluster for job execution and to check active pods:
kubectl get pods -n default -w
To get details on a specific pod:
kubectl describe pod <podId> -n default
Then, run a job in Trifacta Photon through the Trifacta Application. If the job runs successfully, then the configuration has been properly applied. See Run Job Page.
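If a test job does not succeed, a quick way to find the pods to inspect is to filter the pod list by phase. The following sketch reads pod names and phases and prints the names of pods in the Failed phase. The function name `failed_pods` is illustrative, and the input format (name, tab, phase) is an assumption matched by the usage example in the comments.

```shell
# Read "name<TAB>phase" lines on stdin and print the names of pods whose
# phase is Failed.
failed_pods() {
  awk -F '\t' '$2 == "Failed" { print $1 }'
}

# Example usage (requires kubectl and access to the cluster):
# kubectl get pods -n default \
#   -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' \
#   | failed_pods
```

Each name that is printed can then be passed to `kubectl describe pod` for details.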