Dataprep In-VPC Execution - Advanced
This section details advanced configuration options for in-VPC processing for Dataprep by Trifacta within your enterprise VPC.
Please complete the following steps for the Advanced setup, which provides finer-grained controls over the cluster and job execution settings.
For more information on basic setup, see Dataprep In-VPC Execution.
Google Cloud IAM Service Account
This Service Account is assigned to the nodes in the GKE node pool and is configured to have minimal privileges.
Following are variables listed in the configuration steps. They can be modified based on your requirements and supported values:
| Variable | Description |
|---|---|
| trifacta-service-account | Default service account name |
| myproject | Name of your Google project |
| myregion | Your Google Cloud region |
Please execute the following commands from the gcloud CLI:
Note
Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a backslash (\). The backslash should not be included if the line is entered as a single line of input.
gcloud iam service-accounts create trifacta-service-account \
  --display-name="Service Account for running Trifacta Remote jobs"

gcloud projects add-iam-policy-binding myproject \
  --member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
  --role roles/logging.logWriter

gcloud projects add-iam-policy-binding myproject \
  --member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
  --role roles/monitoring.metricWriter

gcloud projects add-iam-policy-binding myproject \
  --member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
  --role roles/monitoring.viewer

gcloud projects add-iam-policy-binding myproject \
  --member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
  --role roles/stackdriver.resourceMetadata.writer
Verification steps:
Command:
gcloud projects get-iam-policy myproject \
  --flatten="bindings[].members" \
  --format="table(bindings.role)" \
  --filter="bindings.members:serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com"
The output should look like the following:
ROLE
roles/artifactregistry.reader
roles/logging.logWriter
roles/monitoring.metricWriter
roles/monitoring.viewer
roles/stackdriver.resourceMetadata.writer
Limitations
For design-time connectivity, you should install the design-time service image on the default node pool.
Router and NAT
If the GKE cluster has private nodes, the following configuration is required to provide Internet access for acquiring assets from Dataprep by Trifacta.
Note
Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a backslash (\). The backslash should not be included if the line is entered as a single line of input.
gcloud compute routers create myproject-myregion \
  --network myproject-network \
  --region=myregion

gcloud compute routers nats create myproject-myregion \
  --router=myproject-myregion \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges \
  --enable-logging
Verification Steps:
You can verify that the router NAT was created in the Google Cloud Platform Console: https://console.cloud.google.com/net-services/nat/list.
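If you prefer to verify from the command line instead of the Console, the following optional check lists NAT configurations on the router, assuming the router name and region used above:

```shell
# List NAT configurations on the router created above;
# the output should include myproject-myregion.
gcloud compute routers nats list \
  --router=myproject-myregion \
  --router-region=myregion
```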
GKE cluster
This configuration creates the GKE cluster for use in executing jobs. This cluster must be created in the VPC/sub-network that has access to your datasources, such as your databases and Cloud Storage.
In the following, please replace w.x.y.z with the IP address provided to you by Alteryx for authorized control plane access.
The Pod address range limits the maximum size of the cluster. See https://cloud.google.com/kubernetes-engine/docs/how-to/flexible-pod-cidr.
For more information about the available zones for node-locations, see https://console.cloud.google.com/compute/zones.
If your account has quota limits, you must reconfigure the node size to fit inside your quota, or your cluster may not start.
Note
Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a backslash (\). The backslash should not be included if the line is entered as a single line of input.
gcloud container clusters create "trifacta-cluster" \
  --project "myproject" \
  --region "myregion" \
  --no-enable-basic-auth \
  --cluster-version "1.20.8-gke.900" \
  --release-channel "None" \
  --machine-type "n1-standard-16" \
  --image-type "COS_CONTAINERD" \
  --disk-type "pd-standard" \
  --disk-size "100" \
  --metadata disable-legacy-endpoints=true \
  --service-account "trifacta-service-account@myproject.iam.gserviceaccount.com" \
  --max-pods-per-node "110" \
  --num-nodes "1" \
  --logging=SYSTEM,WORKLOAD \
  --monitoring=SYSTEM \
  --enable-ip-alias \
  --network "projects/myproject/global/networks/myproject-network" \
  --subnetwork "projects/myproject/regions/myregion/subnetworks/myproject-subnet-myregion" \
  --no-enable-intra-node-visibility \
  --default-max-pods-per-node "110" \
  --enable-autoscaling \
  --min-nodes "0" \
  --max-nodes "3" \
  --enable-master-authorized-networks \
  --master-authorized-networks w.x.y.z/32 \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
  --no-enable-autoupgrade \
  --enable-autorepair \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0 \
  --workload-pool "myproject.svc.id.goog" \
  --enable-private-nodes \
  --enable-shielded-nodes \
  --shielded-secure-boot \
  --node-locations "myregion-a","myregion-b","myregion-c" \
  --master-ipv4-cidr=10.1.0.0/28 \
  --enable-binauthz
Verification Steps:
You can verify that the cluster was created through the Google Cloud Platform Console: https://console.cloud.google.com/kubernetes/list/overview.
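As an alternative to the Console, you can also confirm the cluster from the gcloud CLI. This is an optional check, assuming the cluster name, project, and region used above:

```shell
# The cluster should be listed with STATUS=RUNNING once provisioning completes.
gcloud container clusters list \
  --region myregion \
  --project myproject \
  --filter="name=trifacta-cluster"
```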
Use the following command to switch to the new GKE cluster that you just created:
gcloud container clusters get-credentials trifacta-cluster --region myregion --project myproject
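To confirm that kubectl now points at the new cluster, you can check the active context. GKE derives the context name from the project, region, and cluster name:

```shell
# Expected form: gke_myproject_myregion_trifacta-cluster
kubectl config current-context
```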
Node pool
Please complete the following configuration to specify a non-default node pool. In this example, the value is photon-job-pool:
gcloud container node-pools create photon-job-pool \
  --cluster trifacta-cluster \
  --enable-autorepair \
  --no-enable-autoupgrade \
  --image-type=COS_CONTAINERD \
  --machine-type=n1-standard-16 \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade=0 \
  --node-locations=myregion-a,myregion-b,myregion-c \
  --node-taints=jobType=photon:NoSchedule \
  --node-version=1.20.8-gke.900 \
  --num-nodes=1 \
  --shielded-integrity-monitoring \
  --shielded-secure-boot \
  --workload-metadata=GKE_METADATA \
  --enable-autoscaling \
  --max-nodes=10 \
  --min-nodes=1 \
  --region=myregion \
  --service-account=trifacta-service-account@myproject.iam.gserviceaccount.com
You can use the following command to get the list of available node pools for your cluster:
gcloud container node-pools list --cluster trifacta-cluster --region=myregion
Kubernetes namespace
Please complete the following configuration to specify a non-default namespace. In this example, the value is photon-job-namespace:
kubectl create namespace photon-job-namespace
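As an optional verification, you can confirm that the new namespace exists and is active:

```shell
# The namespace should be listed with STATUS=Active.
kubectl get namespace photon-job-namespace
```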
Kubernetes Service Accounts
| Variable | Description |
|---|---|
| trifacta-job-runner | Service Account used by Dataprep by Trifacta externally to launch jobs into the GKE cluster. |
| trifacta-pod-sa | Service Account assigned to the job pod running in the GKE cluster. |
Please execute the following commands:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  namespace: default
  name: trifacta-job-runner
EOF
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: trifacta-job-runner-secret
  annotations:
    kubernetes.io/service-account.name: trifacta-job-runner
type: kubernetes.io/service-account-token
EOF
cat <<EOF | kubectl apply -n photon-job-namespace -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: trifacta-job-runner-role
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["list", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list", "update"]
  - apiGroups: [""]
    resources: ["pods/log", "pods/portforward"]
    verbs: ["get", "list", "create"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "create", "delete", "watch"]
  - apiGroups: [""]
    resources: ["serviceaccounts"]
    verbs: ["list", "get"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["patch", "list", "get", "create"]
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["create", "list", "get"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["patch", "create", "list", "get"]
EOF
cat <<EOF | kubectl apply -n photon-job-namespace -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: trifacta-job-runner-rb
subjects:
  - kind: ServiceAccount
    name: trifacta-job-runner
    namespace: default
roleRef:
  kind: Role
  name: trifacta-job-runner-role
  apiGroup: rbac.authorization.k8s.io
EOF
cat <<EOF | kubectl apply -n photon-job-namespace -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  name: trifacta-pod-sa
EOF
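As an optional verification of the objects created above, assuming the default and photon-job-namespace namespaces used in this example:

```shell
# The job-runner Service Account and its token secret live in the default namespace.
kubectl get serviceaccount trifacta-job-runner -n default
kubectl get secret trifacta-job-runner-secret -n default

# The pod Service Account is created in the job namespace.
kubectl get serviceaccount trifacta-pod-sa -n photon-job-namespace
```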
Credential encryption keys
The following commands create the encryption keys for credentials:
Note
Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a backslash (\). The backslash should not be included if the line is entered as a single line of input.
openssl genrsa -out private_key.pem 2048
openssl pkcs8 -topk8 -inform PEM -outform DER -in private_key.pem -out private_key.der -nocrypt
openssl rsa -in private_key.pem -pubout -outform DER -out public_key.der
base64 -i public_key.der > public_key.der.base64
base64 -i private_key.der > private_key.der.base64

kubectl create secret generic trifacta-credential-encryption -n photon-job-namespace \
  --from-file=privateKey=private_key.der.base64
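You can verify that the credential encryption secret was created, assuming the namespace used above:

```shell
# The secret should exist and contain a single data key named privateKey.
kubectl get secret trifacta-credential-encryption -n photon-job-namespace
kubectl describe secret trifacta-credential-encryption -n photon-job-namespace
```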
Connectivity Jobs
Please complete the following steps to enable use of your preferred service account, project, and region settings for execution of connectivity jobs in your VPC.
gcloud container node-pools create data-system-job-pool \
  --cluster=trifacta-cluster \
  --enable-autorepair \
  --no-enable-autoupgrade \
  --image-type=COS_CONTAINERD \
  --machine-type=n1-standard-16 \
  --max-surge-upgrade=1 \
  --max-unavailable-upgrade=0 \
  --node-locations=myregion-a,myregion-b,myregion-c \
  --node-taints=jobType=dataSystem:NoSchedule \
  --node-version=1.22.7-gke.1300 \
  --num-nodes=1 \
  --shielded-integrity-monitoring \
  --shielded-secure-boot \
  --workload-metadata=GKE_METADATA \
  --enable-autoscaling \
  --max-nodes=10 \
  --min-nodes=1 \
  --region=myregion \
  --service-account=trifacta-service-account@myproject.iam.gserviceaccount.com
kubectl create namespace data-system-job-namespace
The following allows the Kubernetes ServiceAccount to impersonate the Google IAM ServiceAccount by adding an IAM policy binding between the two service accounts. This binding allows the Kubernetes ServiceAccount to act as the IAM ServiceAccount for the namespace defined above.
gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:myproject.svc.id.goog[data-system-job-namespace/trifacta-pod-sa]" \
  gcs-access-service-account@myproject.iam.gserviceaccount.com
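As an optional check, you can confirm the Workload Identity binding on the IAM service account (gcs-access-service-account is the example IAM service account name used above):

```shell
# The policy should show roles/iam.workloadIdentityUser bound to
# serviceAccount:myproject.svc.id.goog[data-system-job-namespace/trifacta-pod-sa]
gcloud iam service-accounts get-iam-policy \
  gcs-access-service-account@myproject.iam.gserviceaccount.com \
  --format=json
```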
cat <<EOF | kubectl apply -n data-system-job-namespace -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: trifacta-job-runner-role
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create", "delete"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "create", "delete", "watch"]
  - apiGroups: [""]
    resources: ["serviceaccounts"]
    verbs: ["list", "get"]
EOF

cat <<EOF | kubectl apply -n data-system-job-namespace -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: trifacta-job-runner-rb
subjects:
  - kind: ServiceAccount
    name: trifacta-job-runner
    namespace: default
roleRef:
  kind: Role
  name: trifacta-job-runner-role
  apiGroup: rbac.authorization.k8s.io
EOF

cat <<EOF | kubectl apply -n data-system-job-namespace -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  name: trifacta-pod-sa
EOF
Create a secret to store the private key in the Connectivity/DataSystem job namespace.
kubectl create secret generic trifacta-credential-encryption -n data-system-job-namespace \ --from-file=privateKey=private_key.der.base64
Conversion Jobs
Please complete the following steps to enable use of your preferred service account, project, and region settings for execution of conversion jobs in your VPC.
gcloud container node-pools create convert-job-pool \
  --cluster=trifacta-cluster \
  --enable-autorepair \
  --no-enable-autoupgrade \
  --image-type=COS_CONTAINERD \
  --machine-type=n1-standard-16 \
  --max-surge-upgrade=1 \
  --max-unavailable-upgrade=0 \
  --node-locations=myregion-a,myregion-b,myregion-c \
  --node-taints=jobType=conversion:NoSchedule \
  --node-version=1.22.7-gke.1300 \
  --num-nodes=1 \
  --shielded-integrity-monitoring \
  --shielded-secure-boot \
  --workload-metadata=GKE_METADATA \
  --enable-autoscaling \
  --max-nodes=10 \
  --min-nodes=1 \
  --region=myregion \
  --service-account=trifacta-service-account@myproject.iam.gserviceaccount.com
kubectl create namespace convert-job-namespace
The following allows the Kubernetes ServiceAccount to impersonate the Google IAM ServiceAccount by adding an IAM policy binding between the two service accounts. This binding allows the Kubernetes ServiceAccount to act as the IAM ServiceAccount for the namespace defined above.
gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:myproject.svc.id.goog[convert-job-namespace/trifacta-pod-sa]" \
  gcs-access-service-account@myproject.iam.gserviceaccount.com
cat <<EOF | kubectl apply -n convert-job-namespace -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: trifacta-job-runner-role
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create", "delete"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "create", "delete", "watch"]
  - apiGroups: [""]
    resources: ["serviceaccounts"]
    verbs: ["list", "get"]
EOF

cat <<EOF | kubectl apply -n convert-job-namespace -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: trifacta-job-runner-rb
subjects:
  - kind: ServiceAccount
    name: trifacta-job-runner
    namespace: default
roleRef:
  kind: Role
  name: trifacta-job-runner-role
  apiGroup: rbac.authorization.k8s.io
EOF

cat <<EOF | kubectl apply -n convert-job-namespace -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  name: trifacta-pod-sa
EOF
Create a secret to store the private key in the Conversion job namespace.
kubectl create secret generic trifacta-credential-encryption -n convert-job-namespace \ --from-file=privateKey=private_key.der.base64
Dataprep by Trifacta application configuration
After you have completed the above configuration, you must populate the following values in the Trifacta Application, using the commands listed below to acquire them.
Steps:
Log in to the Trifacta Application as a project owner.
Select Admin console > VPC runtime settings.
Please complete the following configuration. For more information, see VPC Runtime Settings Page.
| Setting | Command or Value |
|---|---|
| Master URL | Command: gcloud container clusters describe trifacta-cluster --region=myregion --format="value(endpoint)" |
| OAuth token | Command: kubectl get secret/trifacta-job-runner-secret -o json \| jq -r '.data.token' \| base64 --decode |
| Cluster CA certificate | Command: gcloud container clusters describe trifacta-cluster --region=myregion --format="value(masterAuth.clusterCaCertificate)" |
| Service account name | Value: trifacta-pod-sa |
| Public key | Insert the contents of public_key.der.base64. To acquire this value: cat public_key.der.base64. Note: To process Google Sheets data in your VPC, this value is required. Otherwise, this value is optional. |
| Private key secret name | Value: trifacta-credential-encryption. Note: To process Google Sheets data in your VPC, this value is required. The private key must be accessible within your VPC. Otherwise, this value is optional. |
| Setting | Command or Value |
|---|---|
| Namespace | Value: photon-job-namespace. To acquire the namespace value: kubectl get namespace |
| CPU, memory - request, limits | Adjust as needed. |
| Node selector, tolerations | Node selector = {"cloud.google.com/gke-nodepool": "photon-job-pool"} Node tolerations = [{"effect":"NoSchedule","key":"jobType","operator":"Equal","value":"photon"}] |
To test your configuration, click Test. A success message should be displayed.
To save your configuration, click Save.
If you have tested and saved your configuration, you should be able to run a Trifacta Photon job in your VPC. See "Testing" below.
| Setting | Command or Value |
|---|---|
| Kubernetes namespace | data-system-job-namespace |
| CPU, memory - request, limits | Adjust defaults, if necessary. |
| Node selector, tolerations | Node selector = {"cloud.google.com/gke-nodepool": "data-system-job-pool"} Node tolerations = [{"effect":"NoSchedule","key":"jobType","operator":"Equal","value":"dataSystem"}] |
To test your configuration, click Test. A success message should be displayed.
To save your configuration, click Save.
If you have tested and saved your configuration, you should be able to run a connectivity job to pull in data in your VPC. See "Testing" below.
| Setting | Command or Value |
|---|---|
| Kubernetes namespace | convert-job-namespace |
| CPU, memory - request, limits | Adjust defaults, if necessary. |
| Node selector, tolerations | Node selector = {"cloud.google.com/gke-nodepool": "convert-job-pool"} Node tolerations = [{"effect":"NoSchedule","key":"jobType","operator":"Equal","value":"conversion"}] |
To test your configuration, click Test. A success message should be displayed.
To save your configuration, click Save.
If you have tested and saved your configuration, you should be able to run a conversion job to import datasources that require conversion from within your VPC. See "Testing" below.
Testing
For more information, see Dataprep In-VPC Execution.