
Dataprep In-VPC Execution

This section describes how you can configure Dataprep by Trifacta to operate within your enterprise's virtual private cloud (VPC).

The Trifacta Application runs in your VPC in the Google Cloud Platform. No additional configuration is required.

Dataflow

Optionally, you can configure Dataflow jobs to be executed within your VPC. When enabled, data remains in your VPC during full execution of the job.

Note

Previewing and sampling use the default network settings.

To enable in-VPC execution, the VPC network mode must be set to custom, and additional VPC properties must be provided. In-VPC job execution can be configured per-user or per-output:

Running Jobs

Note

This feature may not be available in all product editions. For more information on available features, see Compare Editions.

Note

When job execution is migrated from the platform VPC to your enterprise VPC, you may incur additional costs to execute each job.

Job Types

By default, Trifacta Photon and connectivity jobs execute in the Alteryx VPC. As needed, you can configure these jobs to run in your VPC.

Tip

Service accounts may be used for execution of these jobs where possible.

Tip

All job types supported for in-VPC execution are supported for manual and scheduled execution.

Job Type

Description

Batch job processing

For execution of batch jobs within your VPC, you must perform the configuration, including specifying the appropriate service accounts to use. After configuration, these jobs are automatically executed within your VPC.

Trifacta Photon

These jobs are transformation and quick scan sampling jobs that execute in memory. This type of job execution is suitable for small- to medium-sized jobs.

Connectivity

If your data source or publishing target is a relational or API-based source, some or all of the job occurs through the connectivity framework.

Tip

If connectivity jobs have been enabled for execution in your environment, then BigQuery connectivity is enabled, including publishing and using BigQuery for running transformation jobs, using the appropriate service account.

Connectivity - design time

In-VPC execution supports connection from the design time functions of the Trifacta Application to an in-VPC data service instance. This connection to the data service allows for testing connections, viewing table and schema information, and collecting initial samples from datasources hosted within your VPC.

Note

When this feature is enabled, SSH tunneling for connections does not work.

Conversion

Ingestion jobs of datasources that need to be converted, such as binary formats like PDF, XLSX, and Google Sheets, can be executed within your VPC.

Note

Google Sheets conversion jobs use user credentials within the project, even if service accounts are enabled.

For these job types, there are two types of configuration:

Configuration Type

Description

Basic

Uses the GKE default namespace and default node pool. See below.

Advanced

User-configured GKE namespace and user-specified node pool. See Dataprep In-VPC Execution - Advanced.

Details on these configuration methods are provided below.

Limitations

The following limitations apply to this release. These limitations may change in the future:

  • A running job is permitted to execute for no more than 1 hour.

  • For this release, only regions in the U.S. and Europe are supported.

Prerequisites

Before you begin, please verify that your VPC environment has the following:

  • The project owner must perform configuration in Dataprep by Trifacta as part of this setup.

  • A GKE cluster is available for transformation jobs to use.

    • Your GKE cluster must have a public endpoint.

    • Use VPC-native clusters. Routes-based clusters are not supported.

    • If using a GKE cluster with private nodes, a Cloud NAT (network address translation) gateway must be available in your VPC to access the Alteryx image registry.

  • Workload identity must be enabled on the GKE cluster. Additional configuration for Dataprep by Trifacta is described later.

  • The use of service accounts (Compute Engine or Companion Service Accounts) is required to run jobs in your VPC.

    • Use of individual user credentials is not supported for Workload Identity.

  • Access to the following tools:

    • gcloud command line interface (CLI)

    • kubectl

    • openssl

    • base64

Acquire from Alteryx:

  • IP address for authorized control plane access.

Enable

In-VPC execution must be enabled by an administrator. In the Dataprep Settings page, you can enable the following settings.

Setting

Description

In-VPC execution

Enables general in-VPC execution, which includes execution of the following job types:

  • Trifacta Photon jobs

  • Batch processing jobs

  • Connectivity jobs

In-VPC Conversion job execution

Enables execution of conversion jobs within your VPC.

Note

This setting is available when In-VPC Execution has been enabled.

In-VPC Data-Service communication

Enables design-time connectivity jobs to be executed within your VPC.

Note

This setting is available when In-VPC Execution has been enabled.

Note

The Scheduling feature must also be enabled for the project.

For more information, see Dataprep Project Settings Page.

Basic configuration

Please complete the following steps for the Basic configuration.

Google Cloud IAM Service Account

This Service Account is assigned to the nodes in the GKE node pool and is configured to have minimal privileges.

The following variables are used in the configuration steps. You can change them based on your requirements and the supported values:

Variable

Description

trifacta-service-account

Default service account name

myproject

Name of your Google project

myregion

Your Google Cloud region

Please execute the following commands from the gcloud CLI:

Note

Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a \. The backslash should not be included if the line is used as input.

gcloud iam service-accounts create trifacta-service-account \
--display-name="Service Account for running Trifacta Remote jobs"

gcloud projects add-iam-policy-binding myproject \
--member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
--role roles/logging.logWriter


gcloud projects add-iam-policy-binding myproject \
--member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
--role roles/monitoring.metricWriter


gcloud projects add-iam-policy-binding myproject \
--member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
--role roles/monitoring.viewer


gcloud projects add-iam-policy-binding myproject \
--member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
--role roles/stackdriver.resourceMetadata.writer

gcloud projects add-iam-policy-binding myproject \
--member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
--role roles/artifactregistry.reader
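
The five bindings above differ only in the role name, so they can also be applied in a loop. The following is a minimal sketch, not part of the official procedure. The function name bind_node_roles is hypothetical, and the arguments are the placeholder values from the variable table.

```shell
# Hypothetical helper (not part of the product): grants the node service
# account the same five roles that the individual commands above grant.
# Usage: bind_node_roles myproject trifacta-service-account
bind_node_roles() {
  project="$1"
  sa_email="$2@$1.iam.gserviceaccount.com"
  for role in \
    roles/logging.logWriter \
    roles/monitoring.metricWriter \
    roles/monitoring.viewer \
    roles/stackdriver.resourceMetadata.writer \
    roles/artifactregistry.reader
  do
    # Stop at the first failure so that a partial setup is noticed.
    gcloud projects add-iam-policy-binding "$project" \
      --member "serviceAccount:$sa_email" \
      --role "$role" || return 1
  done
}
```

Defining the function has no effect until you call it, for example with bind_node_roles myproject trifacta-service-account.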

Verification steps:

Command:

gcloud projects get-iam-policy myproject --flatten="bindings[].members" --format="table(bindings.role)" --filter="bindings.members:serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com"

The output should look like the following:

ROLE
roles/artifactregistry.reader
roles/logging.logWriter
roles/monitoring.metricWriter
roles/monitoring.viewer
roles/stackdriver.resourceMetadata.writer
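
If you prefer a scripted check over comparing the table by eye, the verification can be sketched as follows. This is an illustrative helper only. The name check_node_roles is hypothetical, and it assumes the same placeholder project and service account names used above.

```shell
# Hypothetical helper: reports any expected role that is missing from the
# policy for the node service account. Returns 0 when all roles are present.
# Usage: check_node_roles myproject trifacta-service-account
check_node_roles() {
  project="$1"
  member="serviceAccount:$2@$1.iam.gserviceaccount.com"
  granted=$(gcloud projects get-iam-policy "$project" \
    --flatten="bindings[].members" \
    --format="value(bindings.role)" \
    --filter="bindings.members:$member") || return 1
  missing=0
  for role in \
    roles/artifactregistry.reader \
    roles/logging.logWriter \
    roles/monitoring.metricWriter \
    roles/monitoring.viewer \
    roles/stackdriver.resourceMetadata.writer
  do
    # -x matches whole lines, -F treats the role as a fixed string.
    if ! printf '%s\n' "$granted" | grep -qxF "$role"; then
      echo "missing: $role"
      missing=1
    fi
  done
  return "$missing"
}
```

The function prints one line per missing role and prints nothing when the policy is complete.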

Router and NAT

The following configuration is required for Internet access to acquire assets from Dataprep by Trifacta, if the GKE cluster has private nodes.

Note

Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a \. The backslash should not be included if the line is used as input.

gcloud compute routers create myproject-myregion \
--network myproject-network \
--region=myregion

gcloud compute routers nats create myproject-myregion \
--router=myproject-myregion \
--region=myregion \
--auto-allocate-nat-external-ips \
--nat-all-subnet-ip-ranges \
--enable-logging

Verification Steps:

You can verify that the router NAT was created in the Google Cloud Platform Console: https://console.cloud.google.com/net-services/nat/list.
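
The same check can be made from the command line. The sketch below is an assumption-labeled convenience, not an official step. The function name check_nat is hypothetical, and the router, NAT, and region names are the placeholders used in the commands above.

```shell
# Hypothetical helper: lists the NAT gateways on a Cloud Router and returns 0
# only if the named NAT is present.
# Usage: check_nat myproject-myregion myproject-myregion myregion
check_nat() {
  nat_name="$1"; router="$2"; region="$3"
  gcloud compute routers nats list \
    --router="$router" \
    --region="$region" \
    --format="value(name)" | grep -qxF "$nat_name"
}
```

For example, check_nat myproject-myregion myproject-myregion myregion && echo "NAT found".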

GKE cluster

This configuration creates the GKE cluster for use in executing jobs. This cluster must be created in the VPC/sub-network that has access to your datasources, such as your databases and Cloud Storage.

In the following, please replace w.x.y.z with the IP address provided to you by Alteryx for authorized control plane access.

Note

Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a \. The backslash should not be included if the line is used as input.

gcloud container clusters create "trifacta-cluster" \
--project "myproject" \
--region "myregion" \
--no-enable-basic-auth \
--cluster-version "1.20.8-gke.900" \
--release-channel "None" \
--machine-type "n1-standard-16" \
--image-type "COS_CONTAINERD" \
--disk-type "pd-standard" \
--disk-size "100" \
--metadata disable-legacy-endpoints=true \
--service-account "trifacta-service-account@myproject.iam.gserviceaccount.com" \
--max-pods-per-node "110" \
--num-nodes "1" \
--logging=SYSTEM,WORKLOAD \
--monitoring=SYSTEM \
--enable-ip-alias \
--network "projects/myproject/global/networks/myproject-network" \
--subnetwork "projects/myproject/regions/myregion/subnetworks/myproject-subnet-myregion" \
--no-enable-intra-node-visibility \
--default-max-pods-per-node "110" \
--enable-autoscaling \
--min-nodes "0" \
--max-nodes "3" \
--enable-master-authorized-networks \
--master-authorized-networks w.x.y.z/32 \
--addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
--no-enable-autoupgrade \
--enable-autorepair \
--max-surge-upgrade 1 \
--max-unavailable-upgrade 0 \
--workload-pool "myproject.svc.id.goog" \
--enable-private-nodes \
--enable-shielded-nodes \
--shielded-secure-boot \
--node-locations "myregion-a","myregion-b","myregion-c" \
--master-ipv4-cidr=10.1.0.0/28 \
--enable-binauthz 
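
Because the w.x.y.z placeholder is easy to leave unreplaced, it may help to validate the address before passing it to --master-authorized-networks. The following is a small sketch. The function name authorized_cidr is hypothetical, and it only checks that the value is a dotted IPv4 address before appending /32.

```shell
# Hypothetical helper: validates a dotted IPv4 address and prints it as a
# /32 CIDR block. Returns 1 for anything that is not four octets in 0-255.
authorized_cidr() {
  ip="$1"
  # Reject empty input, non-digit characters, and leading or trailing dots.
  case "$ip" in
    ''|*[!0-9.]*|.*|*.) return 1 ;;
  esac
  old_ifs="$IFS"; IFS=.
  set -- $ip   # split into octets; safe because only digits and dots remain
  IFS="$old_ifs"
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    case "$octet" in
      ''|????*) return 1 ;;
    esac
    [ "$octet" -le 255 ] || return 1
  done
  printf '%s/32\n' "$ip"
}
```

The result can then be used as --master-authorized-networks "$(authorized_cidr 203.0.113.10)", where 203.0.113.10 is a documentation-range example and not an address supplied by Alteryx.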

Verification Steps:

You can verify that the cluster was created through the Google Cloud Platform Console: https://console.cloud.google.com/kubernetes/list/overview.
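
As an alternative to the console, the cluster state can be checked from the CLI. This is a hedged sketch: the function name check_cluster is hypothetical, and it assumes the cluster, region, and project placeholders from the create command. It reads the status and Workload Identity pool fields of the cluster description.

```shell
# Hypothetical helper: returns 0 when the cluster reports RUNNING and its
# Workload Identity pool matches the project.
# Usage: check_cluster trifacta-cluster myregion myproject
check_cluster() {
  cluster="$1"; region="$2"; project="$3"
  status=$(gcloud container clusters describe "$cluster" \
    --region "$region" --project "$project" \
    --format="value(status)") || return 1
  pool=$(gcloud container clusters describe "$cluster" \
    --region "$region" --project "$project" \
    --format="value(workloadIdentityConfig.workloadPool)") || return 1
  echo "status=$status workload_pool=$pool"
  [ "$status" = "RUNNING" ] && [ "$pool" = "$project.svc.id.goog" ]
}
```

A cluster that is still provisioning reports a different status, so the check can simply be repeated until it succeeds.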

Switch to new cluster

Use the following command to set up configuration to connect to the new cluster:

gcloud container clusters get-credentials trifacta-cluster --region myregion --project myproject

The following commands whitelist the Cloud shell for use on the cluster:

  1. After you have acquired access, you can whitelist the following account:

    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ServiceAccount
    automountServiceAccountToken: false
    metadata:
      namespace: default
      name: trifacta-job-runner
    EOF
  2. You can whitelist the following role using the appropriate definition below:

    1. Use the following if you are enabling design-time connectivity to a remote data service instance:

      cat <<EOF | kubectl apply -n data-system-job-namespace -f -
      apiVersion: rbac.authorization.k8s.io/v1
      kind: Role
      metadata:
        name: trifacta-job-runner-role
      rules:
      - apiGroups: [""]
        resources: ["secrets"]
        verbs: ["list", "create", "delete"]
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["list", "update"]
      - apiGroups: [""]
        resources: ["pods/log", "pods/portforward"]
        verbs: ["get", "list", "create"]
      - apiGroups: ["batch"]
        resources: ["jobs"]
        verbs: ["get", "create", "delete", "watch"]
      - apiGroups: [""]
        resources: ["serviceaccounts"]
        verbs: ["list", "get"]
      - apiGroups: [""]
        resources: ["configmaps"]
        verbs: ["patch", "list", "get", "create"]
      - apiGroups: [""]
        resources: ["services"]
        verbs: ["create", "list", "get"]
      - apiGroups: ["apps"]
        resources: ["deployments"]
        verbs: ["patch", "create", "list", "get"]
      EOF
  3. Specify the following role bindings and cluster roles:

    cat <<EOF | kubectl apply -f -
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: trifacta-job-runner-rb
    subjects:
    - kind: ServiceAccount
      name: trifacta-job-runner
      namespace: default
    roleRef:
      kind: Role
      name: trifacta-job-runner-role
      apiGroup: rbac.authorization.k8s.io
    EOF
    cat <<EOF | kubectl apply -f -
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: node-list-role
    rules:
    - apiGroups: [""]
      resources: ["nodes"]
      verbs: ["list"]
    EOF
    cat <<EOF | kubectl apply -f -
    kind: ClusterRoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: node-list-rb
    subjects:
    - kind: ServiceAccount
      name: trifacta-job-runner
      namespace: default
    roleRef:
      kind: ClusterRole
      name: node-list-role
      apiGroup: rbac.authorization.k8s.io
    EOF
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ServiceAccount
    automountServiceAccountToken: false
    metadata:
      name: trifacta-pod-sa
    EOF

Node pool

For basic configuration, Trifacta Photon uses the default node pool. No additional configuration is required.

Kubernetes namespace

For basic configuration, Trifacta Photon uses the default namespace. No additional configuration is required.

Kubernetes Service Accounts

Variable

Description

trifacta-job-runner

Service Account used by Dataprep by Trifacta externally to launch jobs into the GKE cluster.

trifacta-pod-sa

Service Account assigned to the job pod running in the GKE cluster.

Please execute the following commands:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  namespace: default
  name: trifacta-job-runner
EOF
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
 name: trifacta-job-runner-secret
 annotations:
  kubernetes.io/service-account.name: trifacta-job-runner
type: kubernetes.io/service-account-token
EOF
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: trifacta-job-runner-role
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create", "delete"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "create", "delete", "watch"]
- apiGroups: [""]
  resources: ["serviceaccounts"]
  verbs: ["list", "get"]
EOF
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: trifacta-job-runner-rb
subjects:
- kind: ServiceAccount
  name: trifacta-job-runner
  namespace: default
roleRef:
  kind: Role
  name: trifacta-job-runner-role
  apiGroup: rbac.authorization.k8s.io
EOF
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-list-role
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["list"]
EOF
cat <<EOF | kubectl apply -f -
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: node-list-rb
subjects:
- kind: ServiceAccount
  name: trifacta-job-runner
  namespace: default
roleRef:
  kind: ClusterRole
  name: node-list-role
  apiGroup: rbac.authorization.k8s.io
EOF
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  name: trifacta-pod-sa
EOF
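
To confirm that the Role and RoleBinding took effect, you can ask the API server whether the trifacta-job-runner ServiceAccount is permitted to perform a sample of the actions. The sketch below uses kubectl auth can-i with impersonation. The function name check_job_runner_rbac is hypothetical, and the check assumes that your own account is allowed to impersonate service accounts.

```shell
# Hypothetical helper: checks a sample of the permissions granted above by
# impersonating the trifacta-job-runner ServiceAccount.
# Usage: check_job_runner_rbac default
check_job_runner_rbac() {
  ns="${1:-default}"
  as="system:serviceaccount:$ns:trifacta-job-runner"
  failed=0
  # Each entry is "verb resource".
  for check in "create jobs.batch" "list pods" "create secrets" "list serviceaccounts"; do
    # Intentional word splitting: $check expands to the verb and the resource.
    if kubectl auth can-i $check --as="$as" -n "$ns" >/dev/null 2>&1; then
      echo "ok: $check"
    else
      echo "denied: $check"
      failed=1
    fi
  done
  return "$failed"
}
```

A "denied" line usually points to a Role or RoleBinding that was applied in a different namespace than the ServiceAccount.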

Credential encryption keys

The following commands create the encryption keys for credentials:

Note

Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a \. The backslash should not be included if the line is used as input.

openssl genrsa -out private_key.pem 2048


openssl pkcs8 -topk8 -inform PEM -outform DER -in private_key.pem -out private_key.der -nocrypt


openssl rsa -in private_key.pem -pubout -outform DER -out public_key.der

base64 -i public_key.der > public_key.der.base64

base64 -i private_key.der > private_key.der.base64


kubectl create secret generic trifacta-credential-encryption -n default \
--from-file=privateKey=private_key.der.base64
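
Before you paste the public key into the application, it can be worth confirming that the generated files belong together. The following sketch is illustrative. The function name check_keypair is hypothetical, and it assumes that the files created above are in the directory passed as the first argument.

```shell
# Hypothetical helper: returns 0 when the DER key pair in directory $1 is
# consistent and the base64 file decodes back to the private key.
# Usage: check_keypair .
check_keypair() {
  dir="${1:-.}"
  # Re-derive the public key from the private key and compare the bytes
  # with public_key.der.
  openssl pkey -inform DER -in "$dir/private_key.der" -pubout -outform DER 2>/dev/null \
    | cmp -s - "$dir/public_key.der" || return 1
  # Confirm that the base64 file is an encoding of private_key.der.
  base64 --decode < "$dir/private_key.der.base64" | cmp -s - "$dir/private_key.der"
}
```

For example, check_keypair . && echo "key files are consistent".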

Dataprep by Trifacta application configuration

After you have completed the above configuration, you must configure the Trifacta Application based on the commands that you have executed.

Steps:

  1. Log in to the Trifacta Application as a project owner.

  2. Select Admin console > VPC runtime settings.

Please complete the following configuration. For more information on these settings, see VPC Runtime Settings Page.

Kubernetes cluster tab

Setting

Command or Value

Master URL

Command:

gcloud container clusters describe trifacta-cluster --region=myregion --format="value(endpoint)"

Returns:

This command returns the IP address of the cluster endpoint. Enter it as a URL with the https:// prefix, similar to the following:

https://34.0.0.0

OAuth token

Command:

kubectl get secret/trifacta-job-runner-secret -o json | jq -r '.data.token' | base64 --decode

Cluster CA certificate

Command:

gcloud container clusters describe trifacta-cluster --region=myregion --format="value(masterAuth.clusterCaCertificate)"

Service account name

Value: trifacta-pod-sa

Public key

Insert the contents of: public_key.der.base64.

To acquire this value:

cat public_key.der.base64

Note

To process Google Sheets data in your VPC, this value is required. Otherwise, it is optional.

Private key secret name

Value: trifacta-credential-encryption

Note

To process Google Sheets data in your VPC, this value is required. The private key must be accessible within your VPC. Otherwise, this value is optional.
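
The values for this tab can be collected in one pass. The sketch below is a convenience under stated assumptions, not an official tool. The function name print_cluster_tab_values is hypothetical, it assumes the resource names used earlier in this section, and it reads the token with kubectl's built-in jsonpath output so that jq is not needed. It also assumes that the endpoint is returned as a bare IP address.

```shell
# Hypothetical helper: prints the values for the Kubernetes cluster tab.
# Usage: print_cluster_tab_values trifacta-cluster myregion
print_cluster_tab_values() {
  cluster="$1"; region="$2"
  endpoint=$(gcloud container clusters describe "$cluster" --region "$region" \
    --format="value(endpoint)") || return 1
  echo "Master URL: https://$endpoint"
  echo "OAuth token:"
  # Secret data is stored base64-encoded, so decode it before use.
  kubectl get secret trifacta-job-runner-secret -n default \
    -o jsonpath='{.data.token}' | base64 --decode
  echo
  echo "Cluster CA certificate:"
  gcloud container clusters describe "$cluster" --region "$region" \
    --format="value(masterAuth.clusterCaCertificate)"
  echo "Service account name: trifacta-pod-sa"
  echo "Private key secret name: trifacta-credential-encryption"
}
```

The output includes the service account token in clear text, so run it only in a terminal whose history and scrollback you control.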

Photon tab

Setting

Command or Value

Kubernetes namespace

Value: default

To acquire the namespace value:

kubectl get namespace

CPU, memory - request, limits

Adjust as needed.

Note

CPU and memory requests and limits should be lower than the CPU and memory that can be allocated on the GKE node.

Node selector, tolerations


Values:

Node selector = ""
Node tolerations = ""
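
To choose CPU and memory values that stay below what a node can allocate, you can list the allocatable resources of each node. The following is a minimal sketch. The function name show_node_allocatable is hypothetical, and the column headings are arbitrary labels.

```shell
# Hypothetical helper: lists allocatable CPU and memory per node so that
# requests and limits can be set below those values.
show_node_allocatable() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'
}
```

Allocatable values are lower than the machine type's nominal capacity, because part of each node is reserved for system components.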

Connectivity tab

Setting

Command or Value

Kubernetes namespace

default

CPU, memory - request, limits

Adjust defaults, if necessary.

Node selector, tolerations

Node selector = ""
Node tolerations = ""

  • To test your configuration, click Test. A success message should be displayed.

  • To save your configuration, click Save.

Conversion tab

Setting

Command or Value

Kubernetes namespace

default

CPU, memory - request, limits

Adjust defaults, if necessary.

Node selector, tolerations

Node selector = ""
Node tolerations = ""

  • To test your configuration, click Test. A success message should be displayed.

  • To save your configuration, click Save.

Configure Workload Identity

Note

This feature may not be available in all product editions. For more information on available features, see Compare Editions.

Google access tokens are valid for 1 hour. Some jobs can be long running. To protect against timeouts during these jobs and to support security practices recommended by Google, Dataprep by Trifacta supports the use of Workload Identity, which is Google's recommended approach for accessing Google APIs.

Note

Workload Identity is required for running jobs on a GKE cluster, which is required for In-VPC job execution.

Note

Workload Identity requires the use of Compute Engine or Companion service accounts. Use of individual user credentials is not supported. For more information, see Google Service Account Management.

Warning

This section describes how to bind a Companion Service Account to a Kubernetes ServiceAccount on the GKE cluster using Workload Identity. These steps need to be modified if you are binding a Compute Engine service account.

For each Companion Service Account assigned to a user in Dataprep by Trifacta:

  1. A new Kubernetes ServiceAccount must be created on the GKE cluster.

    Note

    This step must be completed by your Google Cloud Platform administrator.

  2. Using Workload Identity, the Companion Service Account must be bound to the newly created Kubernetes ServiceAccount.

The following assumes that a Companion Service Account named allAccess@myproject.iam.gserviceaccount.com already exists:

# Create a new Kubernetes ServiceAccount on the GKE cluster with an annotation to bind it to the allAccess@myproject.iam.gserviceaccount.com Companion Service Account.

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  annotations:
    iam.gke.io/gcp-service-account: allAccess@myproject.iam.gserviceaccount.com
  name: trifacta-pod-sa-allaccess
EOF

# Allow the Kubernetes ServiceAccount to impersonate the Google IAM service account by adding an IAM policy binding between the two service accounts. This binding allows the Kubernetes ServiceAccount to act as the IAM service account.
gcloud iam service-accounts add-iam-policy-binding \
  allAccess@myproject.iam.gserviceaccount.com \
  --project <project_name> \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:<project_name>.svc.id.goog[default/trifacta-pod-sa-allaccess]"

Wait a couple of minutes for the binding to take effect.
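
When several Companion Service Accounts must be bound, the member string and the annotation check are the parts most prone to typing errors. The sketch below covers both. The function names wi_member and check_wi_annotation are hypothetical, and the example values are the placeholders used in this section.

```shell
# Hypothetical helpers for the Workload Identity binding.

# Prints the IAM member string for a Kubernetes ServiceAccount.
# Usage: wi_member myproject default trifacta-pod-sa-allaccess
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$1" "$2" "$3"
}

# Returns 0 when the Kubernetes ServiceAccount carries the expected
# iam.gke.io/gcp-service-account annotation.
# Usage: check_wi_annotation default trifacta-pod-sa-allaccess \
#          allAccess@myproject.iam.gserviceaccount.com
check_wi_annotation() {
  ns="$1"; ksa="$2"; gsa="$3"
  # Dots inside the annotation key must be escaped in jsonpath.
  actual=$(kubectl get serviceaccount "$ksa" -n "$ns" \
    -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}') || return 1
  [ "$actual" = "$gsa" ]
}
```

The output of wi_member can be passed directly as the --member value of the add-iam-policy-binding command.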

Note

For relational connectivity, additional configuration is required. Search for data-system in Dataprep In-VPC Execution - Advanced.

Testing

You can use the following command to watch the Kubernetes clusters for job execution and to check active pods:

kubectl get pods -n default -w

To get details on a specific pod:

kubectl describe pod <podId>

Then, run a job in Trifacta Photon through the Trifacta Application. If the job runs successfully, then the configuration has been properly applied. See Run Job Page.
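
If the test job fails, the logs of the most recent pod are usually the first place to look. The following sketch fetches them. The function name latest_pod_logs is hypothetical, and it assumes that jobs run in the default namespace, as in the Basic configuration.

```shell
# Hypothetical helper: prints the logs of the most recently created pod.
# Usage: latest_pod_logs default
latest_pod_logs() {
  ns="${1:-default}"
  # Pods are sorted oldest first, so the last line is the newest pod.
  pod=$(kubectl get pods -n "$ns" \
    --sort-by=.metadata.creationTimestamp -o name | tail -n 1)
  [ -n "$pod" ] || { echo "no pods found in namespace $ns" >&2; return 1; }
  kubectl logs -n "$ns" "$pod"
}
```

Job pods may be cleaned up after completion, so run the helper soon after the job finishes.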