Intro to data science on
Google Cloud
February 1, 2022
Priyanka Vergadia
Lead Developer Advocate, Google
Polong Lin
Developer Advocate
While you likely know that data science is the practice of making data useful, you may not have a clear view of the landscape of tools that can aid each stage of the data science workflow as you use machine learning to tackle your challenges.
Read on to discover the six broad
areas that are critical to the process of
making data useful, and some
corresponding Google Cloud products
and services for those areas.
Data engineering
Perhaps the greatest missed opportunities in data science stem from data that exists somewhere, but hasn't been made accessible for use in further analysis. Laying the critical foundation for downstream systems, data engineering involves the transporting, shaping, and enriching of data for the purposes of making it available and accessible.
Data ingestion and data
preprocessing on Google
Cloud
Here we consider data ingestion as moving data from one place to another, and data preparation as the process of transformation, augmentation, or enrichment prior to
consumption. Global scalability, high
throughput, real-time access, and
robustness are common challenges in
this stage. For scalable, real-time, and
batch data processing, look into
building data ingestion and
preprocessing pipelines with Dataflow, a managed Apache Beam service. There's a reason why Dataflow is called the backbone of analytics on Google Cloud.
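To make this concrete, here is a minimal Apache Beam sketch of an ingestion-and-cleanup pipeline. It runs locally with the DirectRunner; switching the runner to DataflowRunner (and supplying project, region, and temp_location options) would execute the same pipeline on Dataflow. The bucket paths, project, and parsing logic are placeholders, not something from the original post.

```python
# Minimal Apache Beam pipeline: read raw CSV lines, keep well-formed rows,
# and write the cleaned output back out.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DirectRunner",        # use "DataflowRunner" to run on Dataflow
    # project="my-project",       # required for the Dataflow runner
    # region="us-central1",
    # temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/events.csv")
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "KeepValid" >> beam.Filter(lambda fields: len(fields) == 3)
        | "Format" >> beam.Map(lambda fields: ",".join(fields))
        | "Write" >> beam.io.WriteToText("gs://my-bucket/clean/events")
    )
```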
If you're looking for a scalable
messaging system to help you ingest
data, consider Cloud Pub/Sub, a
global, horizontally scalable messaging
infrastructure. Cloud Pub/Sub was
built using the same infrastructure
component that enabled Google
products, including Ads, Search, and
Gmail, to handle hundreds of millions
of events per second.
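As a sketch of what ingestion through Pub/Sub can look like in code, the snippet below publishes a JSON payload to a topic with the google-cloud-pubsub client; the project ID, topic name, and message contents are placeholders.

```python
# Publish a single message to a Pub/Sub topic.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "raw-events")  # placeholder IDs

# Message payloads are bytes; extra keyword arguments become string attributes.
future = publisher.publish(
    topic_path,
    data=b'{"user": "abc", "action": "click"}',
    source="web",
)
print("Published message ID:", future.result())  # blocks until the publish completes
```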
If you want an easy way to automate
data movement to BigQuery, a
serverless data warehouse on Google
Cloud, look into the BigQuery Data
Transfer Service. For transferring data
to Cloud Storage, take a look at the
Storage Transfer Service. Or, for a no-
code data ingestion and
transformation tool, check out Data
Fusion, which has over 150 preconfigured connectors and transformations. In addition to Dataflow and Data Fusion for data preparation, Spark users may want to look at related products and features for Spark on Google Cloud.
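If you prefer to script a batch load yourself rather than use a managed transfer service, the sketch below loads CSV files from Cloud Storage into BigQuery with the BigQuery Python client; the bucket, dataset, and table names are placeholders.

```python
# Batch-load CSV files from Cloud Storage into a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # infer the schema from the files
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/clean/events-*.csv",   # placeholder source files
    "my-project.analytics.events",         # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```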
Data storage and data
cataloging on Google
Cloud
For structured data, consider a data
warehouse like BigQuery, or any of the
Cloud Databases (relational ones like
Cloud SQL and NoSQL ones like Cloud
Bigtable and Cloud Firestore). For
unstructured data, you can always use
Cloud Storage. You may also want to
consider a data lake. For data
discovery, cataloging, and metadata
management, consider Data Catalog.
For a unified solution, take a look at Dataplex, which combines a unified data management solution with an integrated analytics experience.
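As a small illustration of the unstructured-data path, the snippet below writes a local file into a Cloud Storage bucket with the google-cloud-storage client; the project, bucket, and object names are placeholders.

```python
# Store an unstructured file (e.g. an image) as an object in Cloud Storage.
from google.cloud import storage

client = storage.Client(project="my-project")        # placeholder project
bucket = client.bucket("my-data-lake")                # placeholder bucket
blob = bucket.blob("images/products/sku-123.jpg")     # placeholder object name

blob.upload_from_filename("sku-123.jpg")              # upload a local file
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```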
Learn more about data
engineering on Google
Cloud
Explore the data engineering
learning path
Discover reference patterns
Get certified by Google Cloud as a Professional Data Engineer
Data analysis
From descriptive statistics to visualizations, data analysis is where the value of data starts to appear.
Data exploration, data
preprocessing, and data
insights
Data exploration, a highly iterative process, involves slicing and dicing data via data preprocessing before data insights can start to manifest through visualizations or via simple group-by and order-by operations. One hallmark of this phase is that the data scientist may not yet know which questions to ask about the data. In this somewhat ephemeral phase, a data analyst or scientist has likely uncovered some aha moments, but hasn't shared them yet. Once insights are shared, the flow enters the insights activation stage, where those insights are used to guide business decisions, influence consumer choices, or become embedded in other applications or services.
On Google Cloud, there are many ways
to explore, preprocess, and uncover
insights in your data. If you are looking
for a notebook-based end-to-end
data science environment, check out
Vertex AI Workbench, which enables
you to access, analyze, and visualize
your entire data estate: from
structured data at the petabyte-scale
in SQL with BigQuery, to processing
data with Spark on Google Cloud and
its serverless, auto-scaling, and GPU
acceleration capabilities. As a unified data science environment, Vertex AI
Workbench also makes it easy to do
machine learning with TensorFlow,
PyTorch, and Spark, with built-in
MLOps capabilities.
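For a feel of notebook-based exploration, the sketch below queries a public BigQuery dataset into a pandas DataFrame from a Vertex AI Workbench notebook and plots a quick summary; the project ID is a placeholder (the public dataset itself is real).

```python
# Exploratory query from a notebook: BigQuery -> pandas -> quick chart.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""
df = client.query(sql).to_dataframe()       # pull a small slice into memory
df.plot(kind="barh", x="name", y="total")   # quick visual check in the notebook
```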
Finally, if your focus is on analyzing
structured data from data warehouses
and insight activation for business
intelligence, you may want to also
consider using Looker, with its rich
interactive analytics, visualizations,
dashboarding tools, and Looker Blocks
to help you accelerate your time-to-
insight.
Learn more about data
analysis on Google Cloud
Learn about Vertex AI Workbench
for a Jupyter-based fully managed
notebook environment
Learn about how you can use
BigQuery for petabyte-scale data
analysis
Learn about Spark on Google
Cloud
Discover the data analyst learning
path
Explore reference patterns for
common analytics use cases
Model development
From linear regression to XGBoost,
from TensorFlow to PyTorch, the
model development stage is where
machine learning starts to provide new
ways of unlocking value from your
data. Experimentation is a strong
theme here, with data scientists
looking to accelerate iteration speed
between models without worrying
about infrastructure overhead or
context-switching between tools for
data analysis and tools for
productionizing models with MLOps.
To address these challenges, Vertex AI Workbench, a Jupyter-based, fully managed, scalable, and enterprise-ready environment, serves as a one-stop shop for data science, combining analytics and machine learning and integrating with Vertex AI services. Apache Spark, XGBoost, TensorFlow, and PyTorch are just some of the frameworks supported on Vertex AI Workbench. Vertex AI Workbench makes it easy to manage the underlying compute infrastructure needed for model training, with the ability to scale vertically and horizontally, and with idle timeouts and auto-shutdown capabilities to reduce unnecessary costs. Notebooks themselves can be used for distributed training and hyperparameter optimization, and they include Git integration for version control. Thanks to the significant reduction in context switching required, data scientists can build and train models 5x faster using Vertex AI Workbench than with traditional notebooks.
With Vertex AI, custom models can be
trained and deployed using containers.
You can take advantage of pre-built
containers or custom containers to
train and deploy your models.
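The hedged sketch below shows what this can look like with the google-cloud-aiplatform SDK: running a training script in a pre-built container and registering the result as a model. The project, bucket, script, and container image URIs are illustrative placeholders.

```python
# Custom training on Vertex AI using a pre-built training container.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",                 # placeholder project
    location="us-central1",
    staging_bucket="gs://my-bucket",      # placeholder staging bucket
)

job = aiplatform.CustomTrainingJob(
    display_name="churn-training",
    script_path="train.py",               # your training code (placeholder)
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-8:latest",
    model_serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest"
    ),
)

model = job.run(
    args=["--epochs=10"],
    replica_count=1,
    machine_type="n1-standard-4",
)
```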
For low-code model development,
data analysts and data scientists can
use SQL with BigQuery ML to train and
deploy models (including XGBoost,
deep neural networks, and PCA),
directly using BigQuery's built-in
serverless, autoscaling capabilities.
Behind the scenes, BigQuery ML leverages Vertex AI to enable automated hyperparameter tuning and explainable AI. For no-code model development, Vertex AI Training provides a point-and-click interface to train powerful models using AutoML, which comes in multiple flavors: AutoML Tables, AutoML Image, AutoML Text, AutoML Video, and AutoML Translation.
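As an illustration of the low-code BigQuery ML path described above, the sketch below trains a logistic regression model in SQL, submitted through the BigQuery Python client. It follows the familiar pattern used with the public Google Analytics sample dataset; the project and dataset names are placeholders.

```python
# Train a BigQuery ML model with plain SQL (no data movement required).
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

sql = """
CREATE OR REPLACE MODEL `my_dataset.purchase_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['label']) AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, '') AS os,
  device.isMobile AS is_mobile,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
"""
client.query(sql).result()  # wait for training to complete

# Evaluation is also just SQL:
#   SELECT * FROM ML.EVALUATE(MODEL `my_dataset.purchase_model`)
```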
Learn more about model
development on Google
Cloud
Learn about Vertex AI Workbench
for a Jupyter-based fully managed
notebook environment
Learn more about Vertex AI
ML engineering
Once a satisfactory model is
developed, the next step is to
incorporate all the activities of a well-
engineered application lifecycle,
including testing, deployment, and
monitoring. And all of those activities
should be as automated and robust as
possible.
Managed datasets and Feature Store on Vertex AI provide shared repositories for datasets and engineered features, respectively, giving teams a single source of truth for data and promoting reuse and collaboration within and across teams. Vertex AI's model serving capability enables deployment of models with multiple versions, automatic capacity scaling, and user-specified load balancing. Finally, Vertex AI Model Monitoring provides the ability to monitor prediction requests flowing into a deployed model and automatically alert model owners whenever production traffic deviates from historical prediction requests beyond user-defined thresholds.
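A hedged sketch of the serving path with the google-cloud-aiplatform SDK: upload a trained model artifact, deploy it to an endpoint that autoscales between replica bounds, and request an online prediction. The artifact URI, container image, names, and feature values are placeholders.

```python
# Upload, deploy, and query a model on Vertex AI.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

model = aiplatform.Model.upload(
    display_name="churn-model",
    artifact_uri="gs://my-bucket/models/churn/",   # placeholder artifact location
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
)

endpoint = model.deploy(
    machine_type="n1-standard-2",
    min_replica_count=1,
    max_replica_count=3,      # automatic capacity scaling between these bounds
    traffic_percentage=100,   # route all traffic to this model version
)

print(endpoint.predict(instances=[[0.3, 12, 1, 0]]))  # placeholder feature vector
```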
MLOps is the industry term for
modern, well-engineered ML services,
with scalability, monitoring, reliability,
automated CI/CD, and many other
characteristics and functions that are
now taken for granted in the
application domain. The ML engineering features provided by Vertex AI are informed by Google's extensive experience deploying and operating internal ML services. Our goal with Vertex AI is to provide everyone with easy access to essential MLOps services and best practices.
Learn more about ML
engineering and MLOps on
Google Cloud
Follow the guides, tutorials, and documentation for Vertex AI
Watch this video to learn more about Vertex AI
Discover the data
scientist/machine learning
engineer learning path
Get certified as a Professional ML Engineer
Insights activation
The insights activation stage is where
your data has now become useful to
other teams and processes. You can
use Looker and Data Studio to enable
use cases in which data is used to
influence business decisions with charts, reports, and alerts.
Data can also influence customer
decisions and as a result increase
usage or decrease churn, for example.
Finally, the data can also be used by
other services to drive insights; these
services can run outside Google
Cloud, inside Google Cloud on Cloud
Run or Cloud Functions, and/or using
Apigee API Management as an
interface.
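One hedged sketch of that last pattern: an HTTP Cloud Function that forwards incoming requests to a deployed Vertex AI endpoint so other applications can consume predictions through a simple API. The project, region, and endpoint ID are placeholders.

```python
# HTTP Cloud Function that proxies prediction requests to a Vertex AI endpoint.
import functions_framework
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders
endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/1234567890"  # placeholder ID
)

@functions_framework.http
def predict(request):
    payload = request.get_json(silent=True) or {}
    instances = payload.get("instances", [])
    prediction = endpoint.predict(instances=instances)
    return {"predictions": prediction.predictions}
```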
Learn more about insights
activation on Google Cloud
Watch this video to learn about
building interactive ML apps using
Looker and Vertex AI
Learn about Looker, and Looker
solutions for eCommerce, Digital
Media and more
Discover a gallery of interactive
dashboards created with Data
Studio
Watch this video to understand the
difference between Cloud Run and
Cloud Functions
Orchestration
All of the capabilities discussed above
provide the key building blocks to a
modern data science solution, but a
practical application of those
capabilities requires orchestration to
automatically manage the flow of data
from one service to another. This is
where a combination of data pipelines,
ML pipelines, and MLOps comes into
play. Effective orchestration reduces
the amount of time that it takes to
reliably go from data ingestion to
deploying your model in production, in
a way that lets you monitor and
understand your ML system.
For data pipeline orchestration, Cloud
Composer and Cloud Scheduler are
both used to kick off and maintain the
pipeline.
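As an example, a Cloud Composer environment runs Airflow DAGs like the minimal sketch below, which schedules a daily BigQuery transformation; the DAG ID, SQL, and table names are placeholders.

```python
# Minimal Airflow DAG of the kind you might deploy to Cloud Composer.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="daily_events_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = BigQueryInsertJobOperator(
        task_id="transform_events",
        configuration={
            "query": {
                "query": (
                    "CREATE OR REPLACE TABLE analytics.daily_summary AS "
                    "SELECT DATE(ts) AS day, COUNT(*) AS events "
                    "FROM analytics.events GROUP BY day"
                ),
                "useLegacySql": False,
            }
        },
    )
```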
For ML pipeline orchestration, Vertex
AI Pipelines is a managed machine
learning service that enables you to
increase the pace at which you
experiment with and develop machine
learning models and the pace at which
you transition those models to
production. Vertex Pipelines is
serverless, which means that you don’t
need to deal with managing an
underlying GKE cluster or
infrastructure. It scales up when you
need it to, and you pay only for what
you use. In short, it lets you just focus
on building your data science
pipelines.
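A minimal sketch of that workflow with the Kubeflow Pipelines (KFP) v2 SDK: define a component and a pipeline, compile them, and submit the run to Vertex AI Pipelines. The component logic, project, and pipeline root are placeholders.

```python
# Compile a tiny KFP pipeline and run it on Vertex AI Pipelines.
from kfp.v2 import dsl, compiler
from google.cloud import aiplatform

@dsl.component
def prepare_data(message: str) -> str:
    # Placeholder step; real components would ingest, train, evaluate, etc.
    return f"prepared: {message}"

@dsl.pipeline(name="hello-pipeline")
def pipeline(message: str = "events"):
    prepare_data(message=message)

compiler.Compiler().compile(pipeline_func=pipeline, package_path="pipeline.json")

aiplatform.init(project="my-project", location="us-central1")  # placeholders
job = aiplatform.PipelineJob(
    display_name="hello-pipeline",
    template_path="pipeline.json",
    pipeline_root="gs://my-bucket/pipeline-root",  # placeholder bucket
)
job.run()  # serverless execution; no cluster to manage
```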
Learn more about
orchestration on Google
Cloud
Read more about Cloud Composer for Airflow-based pipelines
Try some example notebooks on Github with Vertex AI Pipelines
Learn different ways to trigger Vertex AI Pipeline runs
Read the whitepaper on
Practitioners Guide to MLOps: A
framework for continuous delivery
and automation of machine
learning
Summary
Google Cloud offers a complete suite
of data management, analytics, and
machine learning tools to generate
insights from data. Want to learn
more? Check out the following
resources:
Building the data science driven organization from first principles
Special thanks to the following
contributors to this blogpost: Alok
Pattani, Brad Miro, Saeed
Aghabozorgi, Diptiman Raichaudhuri,
Reza Rokni.