Dataprep In-VPC Execution - Advanced
This section details advanced configuration options for in-VPC processing for Dataprep by Trifacta within your enterprise VPC.
Please complete the following steps for the Advanced setup, which provides finer-grained controls over the cluster and job execution settings.
For more information on basic setup, see Dataprep In-VPC Execution.
Google Cloud IAM Service Account
This Service Account is assigned to the nodes in the GKE node pool and is configured to have minimal privileges.
Following are variables listed in the configuration steps. They can be modified based on your requirements and supported values:
| Variable | Description |
|---|---|
| trifacta-service-account | Default service account name |
| myproject | Name of your Google project |
| myregion | Your Google Cloud region |
Please execute the following commands from the gcloud CLI:
Note
Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a backslash (\). The backslash should not be included if the line is entered as a single line of input.
gcloud iam service-accounts create trifacta-service-account \
  --display-name="Service Account for running Trifacta Remote jobs"

gcloud projects add-iam-policy-binding myproject \
  --member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
  --role roles/logging.logWriter

gcloud projects add-iam-policy-binding myproject \
  --member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
  --role roles/monitoring.metricWriter

gcloud projects add-iam-policy-binding myproject \
  --member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
  --role roles/monitoring.viewer

gcloud projects add-iam-policy-binding myproject \
  --member "serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com" \
  --role roles/stackdriver.resourceMetadata.writer
Verification steps:
Command:
gcloud projects get-iam-policy myproject \
  --flatten="bindings[].members" \
  --format="table(bindings.role)" \
  --filter="bindings.members:serviceAccount:trifacta-service-account@myproject.iam.gserviceaccount.com"
The output should look like the following:
ROLE
roles/artifactregistry.reader
roles/logging.logWriter
roles/monitoring.metricWriter
roles/monitoring.viewer
roles/stackdriver.resourceMetadata.writer
Limitations
For design-time connectivity, you should install the design-time service image on the default node pool.
Router and NAT
If the GKE cluster has private nodes, the following configuration is required to provide Internet access for acquiring assets from Dataprep by Trifacta.
Note
Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a backslash (\). The backslash should not be included if the line is entered as a single line of input.
gcloud compute routers create myproject-myregion \
  --network myproject-network \
  --region=myregion

gcloud compute routers nats create myproject-myregion \
  --router=myproject-myregion \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges \
  --enable-logging
Verification Steps:
You can verify that the router NAT was created in the Google Cloud Platform Console: https://console.cloud.google.com/net-services/nat/list.
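If you prefer to verify from the command line instead of the Console, the following optional check lists NAT configurations on the router, assuming the router name and region used above:

```shell
# List NAT configurations on the router created above;
# the output should include myproject-myregion.
gcloud compute routers nats list \
  --router=myproject-myregion \
  --router-region=myregion
```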
GKE cluster
This configuration creates the GKE cluster for use in executing jobs. This cluster must be created in the VPC/sub-network that has access to your datasources, such as your databases and Cloud Storage.
In the following, please replace w.x.y.z with the IP address provided to you by Alteryx for authorized control plane access.
The Pod address range limits the maximum size of the cluster. See https://cloud.google.com/kubernetes-engine/docs/how-to/flexible-pod-cidr.
For more information about the available zones for node-locations, see https://console.cloud.google.com/compute/zones.
If your account has quota limits, you must reconfigure the node size to fit inside your quota, or your cluster may not start.
Note
Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a backslash (\). The backslash should not be included if the line is entered as a single line of input.
gcloud container clusters create "trifacta-cluster" \
  --project "myproject" \
  --region "myregion" \
  --no-enable-basic-auth \
  --cluster-version "1.20.8-gke.900" \
  --release-channel "None" \
  --machine-type "n1-standard-16" \
  --image-type "COS_CONTAINERD" \
  --disk-type "pd-standard" \
  --disk-size "100" \
  --metadata disable-legacy-endpoints=true \
  --service-account "trifacta-service-account@myproject.iam.gserviceaccount.com" \
  --max-pods-per-node "110" \
  --num-nodes "1" \
  --logging=SYSTEM,WORKLOAD \
  --monitoring=SYSTEM \
  --enable-ip-alias \
  --network "projects/myproject/global/networks/myproject-network" \
  --subnetwork "projects/myproject/regions/myregion/subnetworks/myproject-subnet-myregion" \
  --no-enable-intra-node-visibility \
  --default-max-pods-per-node "110" \
  --enable-autoscaling \
  --min-nodes "0" \
  --max-nodes "3" \
  --enable-master-authorized-networks \
  --master-authorized-networks w.x.y.z/32 \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
  --no-enable-autoupgrade \
  --enable-autorepair \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0 \
  --workload-pool "myproject.svc.id.goog" \
  --enable-private-nodes \
  --enable-shielded-nodes \
  --shielded-secure-boot \
  --node-locations "myregion-a","myregion-b","myregion-c" \
  --master-ipv4-cidr=10.1.0.0/28 \
  --enable-binauthz
Verification Steps:
You can verify that the cluster was created through the Google Cloud Platform Console: https://console.cloud.google.com/kubernetes/list/overview.
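As an alternative to the Console, you can also confirm the cluster from the gcloud CLI. This is an optional check, assuming the cluster name, project, and region used above:

```shell
# The cluster should be listed with STATUS=RUNNING once provisioning completes.
gcloud container clusters list \
  --region myregion \
  --project myproject \
  --filter="name=trifacta-cluster"
```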
Use the following command to switch to the new GKE cluster that you just created:
gcloud container clusters get-credentials trifacta-cluster --region myregion --project myproject
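To confirm that kubectl now points at the new cluster, you can check the active context. GKE derives the context name from the project, region, and cluster name:

```shell
# Expected form: gke_myproject_myregion_trifacta-cluster
kubectl config current-context
```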
Node pool
Please complete the following configuration to specify a non-default node pool. In this example, the value is photon-job-pool:
gcloud container node-pools create photon-job-pool \
  --cluster trifacta-cluster \
  --enable-autorepair \
  --no-enable-autoupgrade \
  --image-type=COS_CONTAINERD \
  --machine-type=n1-standard-16 \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade=0 \
  --node-locations=myregion-a,myregion-b,myregion-c \
  --node-taints=jobType=photon:NoSchedule \
  --node-version=1.20.8-gke.900 \
  --num-nodes=1 \
  --shielded-integrity-monitoring \
  --shielded-secure-boot \
  --workload-metadata=GKE_METADATA \
  --enable-autoscaling \
  --max-nodes=10 \
  --min-nodes=1 \
  --region=myregion \
  --service-account=trifacta-service-account@myproject.iam.gserviceaccount.com
You can use the following command to get the list of available node pools for your cluster:
gcloud container node-pools list --cluster trifacta-cluster --region=myregion
Kubernetes namespace
Please complete the following configuration to specify a non-default namespace. In this example, the value is photon-job-namespace:
kubectl create namespace photon-job-namespace
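As an optional verification, you can confirm that the new namespace exists and is active:

```shell
# The namespace should be listed with STATUS=Active.
kubectl get namespace photon-job-namespace
```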
Kubernetes Service Accounts
| Variable | Description |
|---|---|
| trifacta-job-runner | Service Account used by Dataprep by Trifacta externally to launch jobs into the GKE cluster. |
| trifacta-pod-sa | Service Account assigned to the job pod running in the GKE cluster. |
Please execute the following commands:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  namespace: default
  name: trifacta-job-runner
EOF
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: trifacta-job-runner-secret
  annotations:
    kubernetes.io/service-account.name: trifacta-job-runner
type: kubernetes.io/service-account-token
EOF
cat <<EOF | kubectl apply -n photon-job-namespace -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: trifacta-job-runner-role
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["list", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list", "update"]
  - apiGroups: [""]
    resources: ["pods/log", "pods/portforward"]
    verbs: ["get", "list", "create"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "create", "delete", "watch"]
  - apiGroups: [""]
    resources: ["serviceaccounts"]
    verbs: ["list", "get"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["patch", "list", "get", "create"]
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["create", "list", "get"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["patch", "create", "list", "get"]
EOF
cat <<EOF | kubectl apply -n photon-job-namespace -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: trifacta-job-runner-rb
subjects:
  - kind: ServiceAccount
    name: trifacta-job-runner
    namespace: default
roleRef:
  kind: Role
  name: trifacta-job-runner-role
  apiGroup: rbac.authorization.k8s.io
EOF
cat <<EOF | kubectl apply -n photon-job-namespace -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  name: trifacta-pod-sa
EOF
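As an optional verification of the objects created above, assuming the default and photon-job-namespace namespaces used in this example:

```shell
# The job-runner Service Account and its token secret live in the default namespace.
kubectl get serviceaccount trifacta-job-runner -n default
kubectl get secret trifacta-job-runner-secret -n default

# The pod Service Account is created in the job namespace.
kubectl get serviceaccount trifacta-pod-sa -n photon-job-namespace
```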
Credential encryption keys
The following commands create the encryption keys for credentials:
Note
Below, some values are too long for a single line. Single lines that overflow to additional lines are marked with a backslash (\). The backslash should not be included if the line is entered as a single line of input.
openssl genrsa -out private_key.pem 2048
openssl pkcs8 -topk8 -inform PEM -outform DER -in private_key.pem -out private_key.der -nocrypt
openssl rsa -in private_key.pem -pubout -outform DER -out public_key.der
base64 -i public_key.der > public_key.der.base64
base64 -i private_key.der > private_key.der.base64

kubectl create secret generic trifacta-credential-encryption -n photon-job-namespace \
  --from-file=privateKey=private_key.der.base64
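You can verify that the credential encryption secret was created, assuming the namespace used above:

```shell
# The secret should exist and contain a single data key named privateKey.
kubectl get secret trifacta-credential-encryption -n photon-job-namespace
kubectl describe secret trifacta-credential-encryption -n photon-job-namespace
```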
Connectivity Jobs
Please complete the following steps to enable use of your preferred service account, project, and region settings for execution of connectivity jobs in your VPC.
gcloud container node-pools create data-system-job-pool \
  --cluster=trifacta-cluster \
  --enable-autorepair \
  --no-enable-autoupgrade \
  --image-type=COS_CONTAINERD \
  --machine-type=n1-standard-16 \
  --max-surge-upgrade=1 \
  --max-unavailable-upgrade=0 \
  --node-locations=myregion-a,myregion-b,myregion-c \
  --node-taints=jobType=dataSystem:NoSchedule \
  --node-version=1.22.7-gke.1300 \
  --num-nodes=1 \
  --shielded-integrity-monitoring \
  --shielded-secure-boot \
  --workload-metadata=GKE_METADATA \
  --enable-autoscaling \
  --max-nodes=10 \
  --min-nodes=1 \
  --region=myregion \
  --service-account=trifacta-service-account@myproject.iam.gserviceaccount.com
kubectl create namespace data-system-job-namespace
The following allows the Kubernetes ServiceAccount to impersonate the Google IAM ServiceAccount by adding an IAM policy binding between the two service accounts. This binding allows the Kubernetes ServiceAccount to act as the IAM ServiceAccount for the namespace defined above.
gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:myproject.svc.id.goog[data-system-job-namespace/trifacta-pod-sa]" \
  gcs-access-service-account@myproject.iam.gserviceaccount.com
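As an optional check, you can confirm the Workload Identity binding on the IAM service account (gcs-access-service-account is the example IAM service account name used above):

```shell
# The policy should show roles/iam.workloadIdentityUser bound to
# serviceAccount:myproject.svc.id.goog[data-system-job-namespace/trifacta-pod-sa]
gcloud iam service-accounts get-iam-policy \
  gcs-access-service-account@myproject.iam.gserviceaccount.com \
  --format=json
```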
cat <<EOF | kubectl apply -n data-system-job-namespace -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: trifacta-job-runner-role
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create", "delete"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "create", "delete", "watch"]
  - apiGroups: [""]
    resources: ["serviceaccounts"]
    verbs: ["list", "get"]
EOF

cat <<EOF | kubectl apply -n data-system-job-namespace -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: trifacta-job-runner-rb
subjects:
  - kind: ServiceAccount
    name: trifacta-job-runner
    namespace: default
roleRef:
  kind: Role
  name: trifacta-job-runner-role
  apiGroup: rbac.authorization.k8s.io
EOF

cat <<EOF | kubectl apply -n data-system-job-namespace -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  name: trifacta-pod-sa
EOF
Create a secret to store the private key in the Connectivity/DataSystem job namespace.
kubectl create secret generic trifacta-credential-encryption -n data-system-job-namespace \ --from-file=privateKey=private_key.der.base64
Conversion Jobs
Please complete the following steps to enable use of your preferred service account, project, and region settings for execution of conversion jobs in your VPC.
gcloud container node-pools create convert-job-pool \
  --cluster=trifacta-cluster \
  --enable-autorepair \
  --no-enable-autoupgrade \
  --image-type=COS_CONTAINERD \
  --machine-type=n1-standard-16 \
  --max-surge-upgrade=1 \
  --max-unavailable-upgrade=0 \
  --node-locations=myregion-a,myregion-b,myregion-c \
  --node-taints=jobType=conversion:NoSchedule \
  --node-version=1.22.7-gke.1300 \
  --num-nodes=1 \
  --shielded-integrity-monitoring \
  --shielded-secure-boot \
  --workload-metadata=GKE_METADATA \
  --enable-autoscaling \
  --max-nodes=10 \
  --min-nodes=1 \
  --region=myregion \
  --service-account=trifacta-service-account@myproject.iam.gserviceaccount.com
kubectl create namespace convert-job-namespace
The following allows the Kubernetes ServiceAccount to impersonate the Google IAM ServiceAccount by adding an IAM policy binding between the two service accounts. This binding allows the Kubernetes ServiceAccount to act as the IAM ServiceAccount for the namespace defined above.
gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:myproject.svc.id.goog[convert-job-namespace/trifacta-pod-sa]" \
  gcs-access-service-account@myproject.iam.gserviceaccount.com
cat <<EOF | kubectl apply -n convert-job-namespace -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: trifacta-job-runner-role
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create", "delete"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "create", "delete", "watch"]
  - apiGroups: [""]
    resources: ["serviceaccounts"]
    verbs: ["list", "get"]
EOF

cat <<EOF | kubectl apply -n convert-job-namespace -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: trifacta-job-runner-rb
subjects:
  - kind: ServiceAccount
    name: trifacta-job-runner
    namespace: default
roleRef:
  kind: Role
  name: trifacta-job-runner-role
  apiGroup: rbac.authorization.k8s.io
EOF

cat <<EOF | kubectl apply -n convert-job-namespace -f -
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  name: trifacta-pod-sa
EOF
Create a secret to store the private key in the Conversion job namespace.
kubectl create secret generic trifacta-credential-encryption -n convert-job-namespace \ --from-file=privateKey=private_key.der.base64
Dataprep by Trifacta application configuration
After you have completed the above configuration, you must populate the following values in the Trifacta Application, using the commands listed below to acquire them.
Steps:
Log in to the Trifacta Application as a project owner.
Select Admin console > VPC runtime settings.
Please complete the following configuration. For more information, see VPC Runtime Settings Page.
| Setting | Command or Value |
|---|---|
| Master URL | Command: gcloud container clusters describe trifacta-cluster --region=myregion --format="value(endpoint)" |
| OAuth token | Command: kubectl get secret/trifacta-job-runner-secret -o json \| jq -r '.data.token' \| base64 --decode |
| Cluster CA certificate | Command: gcloud container clusters describe trifacta-cluster --region=myregion --format="value(masterAuth.clusterCaCertificate)" |
| Service account name | Value: trifacta-pod-sa |
| Public key | Insert the contents of public_key.der.base64. To acquire this value: cat public_key.der.base64. Note: To process Google Sheets data in your VPC, this value is required. Otherwise, this value is optional. |
| Private key secret name | Value: trifacta-credential-encryption. Note: To process Google Sheets data in your VPC, this value is required. The private key must be accessible within your VPC. Otherwise, this value is optional. |
| Setting | Command or Value |
|---|---|
| Namespace | Value: photon-job-namespace. To acquire the namespace value: kubectl get namespace |
| CPU, memory - request, limits | Adjust as needed. |
| Node selector, tolerations | Node selector = {"cloud.google.com/gke-nodepool": "photon-job-pool"} Node tolerations = [{"effect":"NoSchedule","key":"jobType","operator":"Equal","value":"photon"}] |
To test your configuration, click Test. A success message should be displayed.
To save your configuration, click Save.
If you have tested and saved your configuration, you should be able to run a Trifacta Photon job in your VPC. See "Testing" below.
| Setting | Command or Value |
|---|---|
| Kubernetes namespace | data-system-job-namespace |
| CPU, memory - request, limits | Adjust defaults, if necessary. |
| Node selector, tolerations | Node selector = {"cloud.google.com/gke-nodepool": "data-system-job-pool"} Node tolerations = [{"effect":"NoSchedule","key":"jobType","operator":"Equal","value":"dataSystem"}] |
To test your configuration, click Test. A success message should be displayed.
To save your configuration, click Save.
If you have tested and saved your configuration, you should be able to run a connectivity job to pull in data in your VPC. See "Testing" below.
| Setting | Command or Value |
|---|---|
| Kubernetes namespace | convert-job-namespace |
| CPU, memory - request, limits | Adjust defaults, if necessary. |
| Node selector, tolerations | Node selector = {"cloud.google.com/gke-nodepool": "convert-job-pool"} Node tolerations = [{"effect":"NoSchedule","key":"jobType","operator":"Equal","value":"conversion"}] |
To test your configuration, click Test. A success message should be displayed.
To save your configuration, click Save.
If you have tested and saved your configuration, you should be able to run a conversion job to import datasources that require conversion from within your VPC. See "Testing" below.
Testing
For more information, see Dataprep In-VPC Execution.