Migrate Sensitive Data in BigQuery Using Dataflow & Cloud DLP

This repo contains a reference implementation of an end-to-end data tokenization solution designed to migrate sensitive data in BigQuery. Please check out the links below for the reference guides:

  1. Concept & Overview.
  2. Create & Manage Cloud DLP Configurations.
  3. Automated Dataflow Pipeline to De-identify PII Dataset.
  4. Validate Dataset in BigQuery and Re-identify using Dataflow.

Table of Contents

  • Reference Architecture
  • Quick Start
  • V2 Solution Using the Built-In Java Beam Transform
  • ReIdentification From BigQuery
  • Dataflow DAG
  • Trigger Pipeline Using Public Image
  • To Do

Reference Architecture

(Reference architecture diagram)

Quick Start

Open in Cloud Shell

Run the following commands to trigger an automated deployment in your GCP project; the deployment script handles the end-to-end setup:

gcloud config set project <project_id>
sh deploy-data-tokeninzation-solution.sh

You can run some quick validations on the BigQuery table to check the tokenized data.
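
For example, a minimal spot check with the bq CLI; this sketch assumes the sample table 100000CCRecords and the PROJECT_ID / BQ_DATASET_NAME variables used in the re-identification example below, so substitute your own project, dataset, and table:

bq query --use_legacy_sql=false "SELECT card_number, card_holders_name FROM \`${PROJECT_ID}.${BQ_DATASET_NAME}.100000CCRecords\` LIMIT 10"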

For re-identification (getting the original data back in a Pub/Sub topic), please follow the instructions here.

V2 Solution Using the Built-In Java Beam Transform

This part of the repo provides a reference implementation to process large-scale files with any DLP transformation: Inspect, De-identify, or Re-identify. The solution can be used for CSV or Avro files stored in either a GCS or an AWS S3 bucket. It uses the Beam State and Timer APIs for efficient batching so that files are processed in an optimal manner.

Build and Run

gradle spotlessApply

gradle build

gradle run -DmainClass=com.google.swarm.tokenization.DLPTextToBigQueryStreamingV2 -Pargs="--region=<region> --project=<project_id> --streaming --enableStreamingEngine --tempLocation=gs://<bucket>/temp --numWorkers=1 --maxNumWorkers=2 --runner=DataflowRunner --filePattern=gs://<path>.csv --dataset=<name> --inspectTemplateName=<inspect_template> --deidentifyTemplateName=<deid_template> --DLPMethod=DEID"
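
After the job is submitted, one way to confirm it is running is to list active Dataflow jobs in the same region (a standard gcloud check, not specific to this solution):

gcloud dataflow jobs list --region=<region> --status=active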

S3 Scanner

export AWS_ACCESS_KEY="<access_key>"
export AWS_SECRET_KEY="<secret_key>"
export AWS_CRED="{\"@type\":\"AWSStaticCredentialsProvider\",\"awsAccessKeyId\":\"${AWS_ACCESS_KEY}\",\"awsSecretKey\":\"${AWS_SECRET_KEY}\"}"
gradle spotlessApply

gradle build

// INSPECT is the default DLP method; for de-identification, add --DLPMethod=DEID
gradle run -DmainClass=com.google.swarm.tokenization.DLPTextToBigQueryStreamingV2 -Pargs="--region=<region> --project=<project_id> --streaming --enableStreamingEngine --tempLocation=gs://<bucket>/temp --numWorkers=1 --maxNumWorkers=2 --runner=DataflowRunner --filePattern=s3://<bucket>/file.csv --dataset=<name> --inspectTemplateName=<inspect_template> --deidentifyTemplateName=<deid_template> --awsRegion=<aws_region> --awsCredentialsProvider=$AWS_CRED"
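
Before launching, it can help to verify that the S3 object is reachable; this sketch assumes the AWS CLI is installed and configured with credentials for the same bucket (the CLI reads its own configuration, not the AWS_CRED variable above):

aws s3 ls s3://<bucket>/file.csv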

ReIdentification From BigQuery

You can use the pipeline to read from a BigQuery table and publish the re-identified data to a secure Pub/Sub topic.

Export a Standard SQL query to read data from BigQuery. One example from our solution guide:

export QUERY="select id,card_number,card_holders_name from \`${PROJECT_ID}.${BQ_DATASET_NAME}.100000CCRecords\` where safe_cast(credit_limit as int64)>100000 and safe_cast (age as int64)>50 group by id,card_number,card_holders_name limit 10"

Create a GCS file with the query:

export GCS_REID_QUERY_BUCKET=<name>
cat << EOF | gsutil cp - gs://${GCS_REID_QUERY_BUCKET}/reid_query.sql
${QUERY}
EOF
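
You can optionally confirm that the query was uploaded as expected:

gsutil cat gs://${GCS_REID_QUERY_BUCKET}/reid_query.sql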

Run the pipeline by passing the required parameters:

gradle run -DmainClass=com.google.swarm.tokenization.DLPTextToBigQueryStreamingV2 -Pargs="--region=<region> --project=<project_id> --streaming --enableStreamingEngine --tempLocation=gs://<bucket>/temp --numWorkers=5 --maxNumWorkers=10 --runner=DataflowRunner --tableRef=<project_id>:<dataset>.<table> --dataset=<dataset> --topic=projects/<project_id>/topics/<name> --autoscalingAlgorithm=THROUGHPUT_BASED --workerMachineType=n1-highmem-4 --deidentifyTemplateName=projects/<project_id>/deidentifyTemplates/<name> --DLPMethod=REID --keyRange=1024 --queryPath=gs://<gcs_reid_query_bucket>/reid_query.sql"
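
Once the job is running, the re-identified records are published to the Pub/Sub topic passed via --topic. As a quick spot check, you can pull from a subscription attached to that topic (the subscription name below is a placeholder you would create yourself):

gcloud pubsub subscriptions pull <subscription_name> --limit=5 --auto-ack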

Dataflow DAG

For Deid and Inspect:

(Dataflow job DAG for the de-identify and inspect pipeline)

For Reid:

(Dataflow job DAG for the re-identify pipeline)

Trigger Pipeline Using Public Image

You can use the gcloud command to trigger the pipeline using the Dataflow Flex Template. Below is an example of the de-identification transform reading from an S3 bucket.

gcloud beta dataflow flex-template run "dlp-s3-scanner-deid-demo" --project=<project_id> \
--region=<region> --template-file-gcs-location=gs://dataflow-dlp-solution-sample-data/dynamic_template_dlp_v2.json \
--parameters=^~^streaming=true~enableStreamingEngine=true~tempLocation=gs://<path>/temp~numWorkers=5~maxNumWorkers=5~runner=DataflowRunner~filePattern=<s3orgcspath>/filename.csv~dataset=<bq_dataset>~autoscalingAlgorithm=THROUGHPUT_BASED~workerMachineType=n1-highmem-8~inspectTemplateName=<inspect_template>~deidentifyTemplateName=<deid_template>~awsRegion=ca-central-1~awsCredentialsProvider=$AWS_CRED~batchSize=100000~DLPMethod=DEID

To Do

  • Take out the first row as a header before processing.
