This repo contains a reference implementation of an end-to-end data tokenization solution designed to migrate sensitive data to BigQuery. Please check out the links below for the reference guides:
- Concept & Overview.
- Create & Manage Cloud DLP Configurations.
- Automated Dataflow Pipeline to De-identify PII Dataset.
- Validate Dataset in BigQuery and Re-identify using Dataflow.
Run the following commands to trigger an automated deployment in your GCP project. The script handles the following tasks:
- Creates a bucket ({project-id}-demo-data) in us-central1 and uploads a sample dataset with mock PII data.
- Creates a BigQuery dataset (demo_dataset) in the US to store the tokenized data.
- Creates a KMS-wrapped key (KEK) by generating an automatic TEK (Token Encryption Key); a manual sketch of this step follows the deployment commands below.
- Creates DLP inspect and re-identification templates with the KEK and the crypto-based transformations identified in this section of the guide.
- Triggers an automated Dataflow pipeline, passing all the required parameters, e.g. data, configuration, and dataset name.
- Please allow 5-10 minutes for the deployment to complete.
gcloud config set project <project_id>
sh deploy-data-tokeninzation-solution.sh
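The script creates the KMS-wrapped KEK for you. If you would rather create one manually, a minimal sketch looks like the following; the key ring and key names are illustrative assumptions, not the ones the script uses.
# Assumption: key ring and key names below are examples only.
gcloud kms keyrings create demo-tokenization-keyring --location=global
gcloud kms keys create demo-tokenization-key --location=global --keyring=demo-tokenization-keyring --purpose=encryption
# Generate a random 32-byte TEK and wrap it with the KMS key to produce the base64-encoded KEK.
openssl rand -out tek.bin 32
gcloud kms encrypt --location=global --keyring=demo-tokenization-keyring --key=demo-tokenization-key --plaintext-file=tek.bin --ciphertext-file=- | base64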
You can run some quick validations on the BigQuery tables to check the tokenized data.
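For example, a quick spot check with the bq CLI; the table name is a placeholder for whatever table the pipeline created in demo_dataset, and the column names match the sample CC dataset:
bq ls demo_dataset
bq query --use_legacy_sql=false 'SELECT card_number, card_holders_name FROM demo_dataset.<table_name> LIMIT 10'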
For re-identification (getting back the original data in a Pub/Sub topic), please follow the instructions here.
This part of the repo provides a reference implementation to process large-scale files for any DLP transformation: Inspect, De-identify, or Re-identify. The solution can be used for CSV or Avro files stored in either a GCS or an AWS S3 bucket. It uses the State and Timer APIs for efficient batching to process the files in an optimal manner.
gradle spotlessApply
gradle build
gradle run -DmainClass=com.google.swarm.tokenization.DLPTextToBigQueryStreamingV2 -Pargs="--region=<region> --project=<project_id> --streaming --enableStreamingEngine --tempLocation=gs://<bucket>/temp --numWorkers=1 --maxNumWorkers=2 --runner=DataflowRunner --filePattern=gs://<path>.csv --dataset=<name> --inspectTemplateName=<inspect_template> --deidentifyTemplateName=<deid_template> --DLPMethod=DEID"
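Once the job is streaming data, you can peek at the output table in the dataset you passed; the table name depends on the input file, so treat it as a placeholder:
bq ls <dataset>
bq head -n 5 <dataset>.<table>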
export AWS_ACCESS_KEY="<access_key>"
export AWS_SECRET_KEY="<secret_key>"
export AWS_CRED="{\"@type\":\"AWSStaticCredentialsProvider\",\"awsAccessKeyId\":\"${AWS_ACCESS_KEY}\",\"awsSecretKey\":\"${AWS_SECRET_KEY}\"}"
gradle spotlessApply
gradle build
# INSPECT is the default DLP method; for de-identification, pass --DLPMethod=DEID
gradle run -DmainClass=com.google.swarm.tokenization.DLPTextToBigQueryStreamingV2 -Pargs="--region=<region> --project=<project_id> --streaming --enableStreamingEngine --tempLocation=gs://<bucket>/temp --numWorkers=1 --maxNumWorkers=2 --runner=DataflowRunner --filePattern=s3://<bucket>/file.csv --dataset=<name> --inspectTemplateName=<inspect_template> --deidentifyTemplateName=<deid_template> --awsRegion=<aws_region> --awsCredentialsProvider=$AWS_CRED"
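Before launching, it can help to confirm that the file pattern actually matches objects in the S3 bucket, assuming the AWS CLI is installed and configured with the same credentials:
aws s3 ls s3://<bucket>/ --region <aws_region>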
You can use the pipeline to read from a BigQuery table and publish the re-identified data to a secure Pub/Sub topic.
Export the standard SQL query used to read data from BigQuery. One example from our solution guide:
export QUERY="select id,card_number,card_holders_name from \`${PROJECT_ID}.${BQ_DATASET_NAME}.100000CCRecords\` where safe_cast(credit_limit as int64)>100000 and safe_cast (age as int64)>50 group by id,card_number,card_holders_name limit 10"
Create a GCS file with the query:
export GCS_REID_QUERY_BUCKET=<name>
cat << EOF | gsutil cp - gs://${GCS_REID_QUERY_BUCKET}/reid_query.sql
${QUERY}
EOF
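To confirm the query file is where the pipeline expects it:
gsutil cat gs://${GCS_REID_QUERY_BUCKET}/reid_query.sql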
Run the pipeline by passing the required parameters:
gradle run -DmainClass=com.google.swarm.tokenization.DLPTextToBigQueryStreamingV2 -Pargs="--region=<region> --project=<project_id> --streaming --enableStreamingEngine --tempLocation=gs://<bucket>/temp --numWorkers=5 --maxNumWorkers=10 --runner=DataflowRunner --tableRef=<project_id>:<dataset>.<table> --dataset=<dataset> --topic=projects/<project_id>/topics/<name> --autoscalingAlgorithm=THROUGHPUT_BASED --workerMachineType=n1-highmem-4 --deidentifyTemplateName=projects/<project_id>/deidentifyTemplates/<name> --DLPMethod=REID --keyRange=1024 --queryPath=gs://<gcs_reid_query_bucket>/reid_query.sql"
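To inspect the re-identified records, you can attach a subscription to the topic passed with --topic and pull a few messages; the subscription name is illustrative:
gcloud pubsub subscriptions create reid-validation --topic=projects/<project_id>/topics/<name>
gcloud pubsub subscriptions pull reid-validation --auto-ack --limit=5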
Pipeline DAG for de-identify and inspect:
Pipeline DAG for re-identify:
You can use the gcloud command to trigger the pipeline using a Dataflow Flex Template. Below is an example of the de-identification transform reading from an S3 bucket.
gcloud beta dataflow flex-template run "dlp-s3-scanner-deid-demo" --project=<project_id> \
--region=<region> --template-file-gcs-location=gs://dataflow-dlp-solution-sample-data/dynamic_template_dlp_v2.json \
--parameters=^~^streaming=true~enableStreamingEngine=true~tempLocation=gs://<path>/temp~numWorkers=5~maxNumWorkers=5~runner=DataflowRunner~filePattern=<s3orgcspath>/filename.csv~dataset=<bq_dataset>~autoscalingAlgorithm=THROUGHPUT_BASED~workerMachineType=n1-highmem-8~inspectTemplateName=<inspect_template>~deidentifyTemplateName=<deid_template>~awsRegion=ca-central-1~awsCredentialsProvider=$AWS_CRED~batchSize=100000~DLPMethod=DEID
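After submitting, you can confirm the job started from the CLI (or in the Dataflow console):
gcloud dataflow jobs list --project=<project_id> --region=<region> --status=active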
- Take out the first row as a header before processing.