Data Engineering for Machine Learning Pipelines: From Python Libraries to ML Pipelines
Introduction
This chapter is designed to get you started. As we begin our journey into advanced
data engineering practices, it is important to build a solid foundation in key concepts
and tools. This chapter provides a refresher on Python programming, an introduction to
software version control using Git and GitHub, and a high-level review of structured
query language (SQL). While these topics may seem unrelated to one another, each plays
an important role in developing effective data engineering pipelines. Python's data
ecosystem, version control with Git, and working with relational database systems using
SQL will certainly solidify your foundations and help you tackle data engineering
challenges. These are broad and sometimes complex topics, especially when reviewed
together; I hope you persist, learn, and practice on your own.
By the end of the chapter, you will
Understand some of the key Python programming concepts relevant to advanced
software engineering practices.
Gain an appreciation for software version control systems and git; create git
repositories; and perform essential version control operations like branching,
forking, and submitting pull requests.
Have enough knowledge to set up your own code repository in GitHub.
Review fundamental SQL concepts for querying and manipulating data in
database systems.
Identify advanced SQL concepts and best practices for writing efficient and
optimized SQL code.
Engineering data pipelines is not easy. There is a lot of complexity, and the work
demands analytical thinking. The complexity includes, but is not limited to, varying
data models, intricate business logic, tapping into various data sources, and the many
other tasks and projects that make up the profession. Data engineers often build
machine learning models to identify anomalies and perform advanced imputations on the
data. I have selected topics from my own experience that should serve the data engineer
or software developer who is engineering pipelines.
Python Programming
There are two major versions of the Python programming language, Python 2 and
Python 3. Python 2 has reached end of life, and Python 3 is the de facto version for
Python programming. Python 3 remains the most preferred language for this kind of
development, comes with extensive community-backed data processing libraries, and is
widely adopted. In this section, let us take some time to look at some Python
programming concepts that may help, in addition to your basic programming knowledge.
These are discussed at a high level; I encourage you to investigate these classes and
methods, and the arguments that can be passed to them, further on your own.
F Strings
Python f-strings allow you to easily insert variables and expressions within a
string. Compared with traditional string formatting, f-strings are much more readable
and easier to understand. You create an f-string by placing the character F in front
of the string, before the quotes. The character is case insensitive, meaning you can
use either F or f. Here is an example:
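A plain f-string without any inserts (the original example is not reproduced here) could be as simple as:
print(f"I am buying some fruit today.")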
Now you can insert variables and expressions within the f string. We have
shown the preceding f string without inserts for demonstration purposes. Here is
how you can perform that:
Fruit = "berries"
Cost = 1.5
Number = 3
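# The exact f-string from the original is not shown; a plausible example
# using these variables:
print(f"I bought {Number} packs of {Fruit} at ${Cost} each.")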
will return
I bought 3 packs of berries at $1.5 each.
Python Functions
In programming, functions are a set of instructions that can be executed together to
accomplish a specific task or operation. It may or may not accept inputs and may or
may not return output. Let us look at an example of a function:
def add(a,b):
    c = a+b
    return c

print(add(2,3))
it will return
5
Let us look at a bit more substantial function. Here, we have a function that
takes Celsius as input and calculates Fahrenheit as output:
def f_calc(celsius):
    return (9/5 * celsius) + 32
This is the same as the following function; both are functionally equivalent. We will
discuss this way of defining functions in the Lambda Functions section:
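Assuming the alternative form referenced here is a lambda expression (covered shortly), it might look like:
f_calc = lambda celsius: (9/5 * celsius) + 32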
In the preceding code, we have passed a parameter named celsius. Inside the function
we calculate a value and return it as a floating point number. This is how functions
can accept input parameters, process them, and provide output.
def add(a,b):
    c = a+b
    return c

add(2,3)
it will return
5
What if we do not know how many input parameters we need? We can use
arguments here:
def add_args(*args):
    total_sum = 0
    for i in args:
        total_sum += i
    return total_sum
add_args(3,4,5,6,7)
it will return
25
Note You can also pass a list to the function and obtain the same functionality.
But with *args, you can call the function with no arguments at all, and this example
simply returns 0.
**kwargs
This is quite similar to *args; the difference is that each item in the argument list
has a key associated with it (keyword arguments):
def add_kwargs(**kwargs):
    total_sum = 0
    for i in kwargs.values():
        total_sum += i
    return total_sum
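Calling it with keyword arguments might look like:
print(add_kwargs(a=1, b=2, c=3))
which returns 6.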
Using *args and **kwargs can be highly beneficial for enhancing the flexibility of a
function, complementing decorator implementations, and passing a varying number of
parameters as inputs.
Lambda Functions
In Python, lambdas are short for lambda expressions. They are inline, anonymous
functions that evaluate one single expression, and that expression is evaluated when
the lambda is called. Let us look at this with an illustration:
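The lambda behind this call is not shown above; a sketch consistent with the result is:
calculate_fahrenheit = lambda celsius: (9/5 * celsius) + 32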
calculate_fahrenheit(27)
it will return
80.6
Decorators in Python
Decorators in Python are a way to add functionality to an existing function. The
decorator itself is a function that takes another function as its input parameter. In
mathematical terms, it is comparable to a function of a function. If math is not your
cup of tea, then let's look at a simple example.
Here is a test decorator:
def testdec(func):
    def wrapper():
        print("+++Beginning of Decorator Function+++")
        func()
        print("+++End of Decorator Function+++")
    return wrapper

@testdec
def hello():
    print("Hello!")
hello()
it will provide us
+++Beginning of Decorator Function+++
Hello!
+++End of Decorator Function+++
Type Hinting
Type hinting is a Python feature that lets you specify the data types of the variable
declarations, arguments in functions, and the variables that these functions return.
Type hints make it easier to understand what data types to expect and which
function returns what type of data. The idea behind type hinting is better readability,
code maintainability, and being able to debug and troubleshoot errors faster. Type
hints are not mandatory; you can still pass an argument without annotating its type
(Python does not enforce type hints at runtime; they mainly help readers and static
analysis tools).
Here is a simple example demonstrating the type hinting functionality in Python:
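A minimal sketch, reusing the Fahrenheit example from earlier:
def f_calc(celsius: float) -> float:
    return (9/5 * celsius) + 32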
Typing Module
The typing module supports type hints for advanced data structures with various
features. Here is an example for a dictionary:
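A minimal sketch (the variable and values are illustrative):
from typing import Dict

inventory: Dict[str, int] = {"apples": 3, "oranges": 5}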
Notice the usage of “Dict.” It serves the same purpose as “dict,” the dictionary type
offered by core Python. “Dict” does not provide additional functionality; it simply
lets you specify the data types for both the keys and the values of the dictionary. If
you are using Python 3.9 or later, you can parameterize either “Dict” or the built-in
“dict.”
Let us look at another example called the Union, where a variable could be one
of few types:
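A minimal sketch (the function is illustrative):
from typing import Union

def parse_id(value: Union[int, str]) -> str:
    return str(value)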
In Python 3.10+, you can also use the | operator. Here are some examples:
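The same annotation written with the | operator:
def parse_id(value: int | str) -> str:
    return str(value)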
Generators in Python
A generator looks like a regular function in Python but uses the yield keyword instead
of the return keyword to produce its results. The yield keyword returns an iterator
that provides a sequence of values when iterated over. The return keyword, on the
other hand, returns a one-time result from the function.
Let us look at a simple example:
def generator_square(n):
    x = 0
    while x <= n:
        yield x*x
        x += 1

for i in generator_square(10):
    print(i)
Enumerate Functions
The Oxford Dictionary of English defines enumerate as “to list or mention a number of
things, one by one.” This is all you need to remember. The enumerate function, when
applied to an iterable, returns an enumerate object. Iterating over that object yields
an index that starts from 0 together with the value corresponding to that index.
Here is a basic illustration:
a = ['apples','oranges','melons','bananas']
print(a)
would yield
['apples', 'oranges', 'melons', 'bananas']
However, if you enumerate the same list, you would obtain something like this:
for i in enumerate(a):
    print(i)
would yield
(0, 'apples')
(1, 'oranges')
(2, 'melons')
(3, 'bananas')
Here is an illustration for using a tuple with type hints for better readability:
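The definition of var10 is not shown; a sketch consistent with the printed output, using type hints, might be:
from typing import List, Tuple

var10: Tuple[List, ...] = ([1, 'apple'], [2, 'orange'], [3, 'melon'], [4, 'banana'])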
print(var10)
would yield
([1, 'apple'], [2, 'orange'], [3, 'melon'], [4, 'banana'])
When it comes to a dictionary, the dictionary already has a key, so adding an index is
usually unnecessary. However, that is not a limiting factor: we can still iterate over
a dictionary using enumerate, but it needs to be over each (key, value) tuple. We use
the items() method to obtain each tuple from the dictionary.
Here is an illustration iterating over a dictionary:
sample_dict = {1:"apple",
2:"orange",3:"melon",4:"banana"}
print(sample_dict)
would yield
{1: 'apple', 2: 'orange', 3: 'melon', 4: 'banana'}
Let us use enumerate, where the variable “i” stands for index and variable “j”
stands for each key–value pair in a dictionary:
for i, j in enumerate(sample_dict.items()):
    print(i, j)
would yield
0 (1, 'apple')
1 (2, 'orange')
2 (3, 'melon')
3 (4, 'banana')
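You can also unpack each key–value pair directly in the loop; a sketch of the code that produces the output below:
for i, (key, value) in enumerate(sample_dict.items()):
    print(i, key, value)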
would yield
0 1 apple
1 2 orange
2 3 melon
3 4 banana
name = "JohnDoe"
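Enumerate also works on strings; a sketch of the loop that produces the output below:
for i, ch in enumerate(name):
    print(i, "-->", ch)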
would yield
0 --> J
1 --> o
2 --> h
3 --> n
4 --> D
5 --> o
6 --> e
List Comprehension
List comprehension is a method of defining and creating a list through a single
Pythonic expression. A list comprehension returns a list, and it can be seen as a
concise substitute for an explicit loop (or for map/filter calls written with lambda
expressions).
Let us look at a simple example:
readings_in_km = [10,12,13,9,15,21,24,27]
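# The conversion itself is not shown in the original; a plausible sketch:
readings_in_miles = [km * 0.621371 for km in readings_in_km]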
print(readings_in_miles)
We were able to dynamically (and elegantly) create a list of readings that were in
miles from readings that were measured in kilometers.
Let us look at another illustration, where we square or cube a number based on
whether the number is even or odd. This means there needs to be an iterator loop and a
conditional check for each item in the list. Here is how the code may look without
list comprehensions:
a: list[int] = [1,2,3,4,5,6,7,8,9,10]
b: list[int] = []
for x in a:
    if x % 2 == 0:
        b.append(x**2)
    else:
        b.append(x**3)
print(b)
[1, 4, 27, 16, 125, 36, 343, 64, 729, 100]
Here is the same logic expressed as a list comprehension:
a: list[int] = [1,2,3,4,5,6,7,8,9,10]
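# A comprehension equivalent to the loop above (not shown in the original):
b: list[int] = [x**2 if x % 2 == 0 else x**3 for x in a]
print(b)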
Here is another example creating tuples within lists. Here is the regular way of
performing this:
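The input lists and the result list are not defined in the original; definitions consistent with the output below:
a_list = [1, 2, 3]
b_list = [1, 2, 3]
var13 = []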
for x in a_list:
    for y in b_list:
        if x+y <= 4:
            var13.append((x,y))
print(var13)
would yield
[(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1)]
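The equivalent list comprehension (a sketch; the original is not shown) collapses the nested loops and the condition into one expression:
var13 = [(x, y) for x in a_list for y in b_list if x + y <= 4]
print(var13)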
Random Module
The random module is Python's built-in module for generating pseudo-random numbers,
including integers, floats, and sequences. Let's look at some methods that you may
utilize often as part of building your data pipelines.
random( )
This method returns a random float in the range 0.0 (inclusive) to 1.0 (exclusive):
import random
var14 = random.random()
print(var14)
randint( )
This method returns an integer between a minimum and a maximum value supplied by the
user, inclusive of both endpoints:
import random
print(random.randint(1,33))
getrandbits( )
The getrandbits() method from the random library generates a random integer with k
random bits, that is, in the range 0 to 2^k – 1. This is beneficial when you need
random integers larger than randint() comfortably provides, or when you need to
generate random Boolean values.
Let us look at both:
import random
print(random.getrandbits(20))
print(random.getrandbits(20))
print(random.getrandbits(20))
would yield
619869
239874
945215
print(bool(random.getrandbits(1)))
print(bool(random.getrandbits(1)))
print(bool(random.getrandbits(1)))
print(bool(random.getrandbits(1)))
would yield
True
False
True
True
As you can see, you can either generate a value from k random bits or, by requesting
a single bit and casting it to a Boolean, obtain a random Boolean generator as well.
choice( )
The choice function selects a random element from a user-supplied sequence (such
as a list, a tuple, etc.):
import random
fruit = ['apple','banana','melon','orange']
print(random.choice(fruit))
would return
'apple'
print(random.choice(fruit))
'orange'
shuffle( )
The shuffle() method shuffles a sequence in place, changing the order in which its
elements are currently placed. For instance, we already have
fruit = ['apple','banana','melon','orange']
print(fruit)
['apple', 'banana', 'melon', 'orange']
Now
random.shuffle(fruit)
And so
print(fruit)
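Since shuffle() is random, your order will differ; one possible result is:
['melon', 'apple', 'orange', 'banana']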
sample( )
The sample method from the random library returns k unique elements chosen from
a population list or a set. This method generates samples without replacement,
meaning once an item is chosen, it cannot be chosen again:
import random
alpha = list("abcdefghijklmnopqrstuvwxyz")
print(alpha)
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k',
'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
'w', 'x', 'y', 'z']
print(random.sample(alpha, 7))
['n', 'v', 'x', 'u', 'p', 'i', 'k']
print(random.sample(alpha, 7))
['s', 'y', 'i', 't', 'k', 'w', 'u']
print(random.sample(alpha, 7))
['n', 'u', 'm', 'z', 'o', 'r', 'd']
print(random.sample(alpha, 7))
['e', 'y', 'a', 'i', 'o', 'c', 'z']
seed( )
The seed() method from the random library sets the seed used for generating random
numbers. If you specify a seed with a certain integer, you will obtain the same
sequence of pseudo-random numbers each time, so you can reproduce the same output.
This helps when generating random samples in machine learning, simulation models, etc.
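A minimal sketch; the exact numbers depend on your Python version, but the two calls below print the same value because the seed is reset:
import random

random.seed(42)
print(random.randint(1, 100))

random.seed(42)
print(random.randint(1, 100))  # same value as the call above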
Foundations of Git
Git is a distributed version control system, wherein there exists a local copy and a
remote (server) copy of the same code. The code database is called a repository, or
simply repo. You obtain a copy of a repo from the server (cloning) and typically
create a branch to work in; this process is called branching. Then, you can submit a
proposal to merge your branch into the centralized copy (the master or main branch),
which is commonly referred to as a pull request. Upon submission, it goes to a person
who administers the repo, and they do what is called approving the pull request.
GitHub
Let us look at GitHub, one of the leading git platforms. GitHub is the world's
largest host of git repositories; it stores and tracks changes to software code while
offering a plethora of features for development.
Alternatively you may wish to visit the website, git-scm, and obtain the git
package under the Downloads section. Here is the website:
https://git-scm.com/downloads
Once the download and installation is complete, let us proceed to set up the git
in our workstations. Here is how to get started.
Let us set up our username and our email as this information will be used by git
commit every time you commit changes to files. Here are the commands:
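The commands look like the following (replace the placeholder values with your own):
git config --global user.name "your-github-username"
git config --global user.email "youremail@emailprovider.com"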
Remember to supply the unique username you had created during the
registration process and please use quotes for registering the username. For the
email, you need to provide the email you had utilized for the GitHub registration
process. And make sure to use the “--global” keyword option as git will always use
these credentials for any of your interactions with the git system. If you wish to
override this information with a different set of credentials, you may run these same
commands but without the “--global” option.
You can access and write files to your GitHub repository right from your own
terminal using Secure Shell protocol. You generate an SSH key and passphrase, sign
in to your GitHub account, and add the SSH key and passphrase you generated from
your terminal to the GitHub account. This way you have established a secured
handshake between your workstation and your GitHub account.
Let us generate a new SSH key:
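A typical command, assuming the ed25519 key type that GitHub recommends, is:
ssh-keygen -t ed25519 -C "youremail@emailprovider.com"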
For now, please skip (by pressing Enter) when asked for a passphrase among other
options. Let the default location be as is. Here is how that may look for you.
Let us start the SSH agent in your terminal:
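The standard command is:
eval "$(ssh-agent -s)"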
This command starts the ssh-agent process in the background. Now, let
us add the SSH private key to the SSH agent using the following command:
ssh-add ~/.ssh/id_ed25519
Now open a new tab on your web browser and log in to your GitHub (in case
you were signed off); click the icon on the top-right corner of your screen and
choose the option called “Settings” from the list of options. Here is how that may
look:
Once you are inside the GitHub settings, navigate and click the option called “SSH
and GPG Keys” and choose the option “New SSH Key” on the top-right side of the
browser.
You will get an option to add the SSH public key that was created in the
previous steps. To perform that you will have to copy the SSH public key first.
Open your terminal and obtain the SSH public key using the following
command:
cat ~/.ssh/id_ed25519.pub
ssh-ed25519
DH37589D364RIUHFIUSDHsr4SFFsdfx4WFRoc9Ks/7AuOsNpYy4xkjH6X
youremail@emailprovider.com
Copy this public key and navigate to the web browser where you have GitHub
open—specifically, Settings ➤ SSH and GPG Keys—to add a new SSH key. Paste
the key that you just copied. Here is how that looks:
You can provide a title as you wish, something of your convenience, and leave the
key type as “Authentication Key.” Click Add SSH key.
Now that we have added the SSH public key, let us test our connection from our
shell to a remote GitHub server by using the following command:
ssh -T git@github.com
You may receive a warning. Choose yes. Here is how that may look for you:
Now, create a folder for our first repository and change into it:
mkdir firstrepo
cd firstrepo
Now, within the folder, initialize a git repository by using the following
command:
git init
The preceding command creates a hidden .git folder and initializes the repository
database. You can change the directory into the hidden folder and notice the files.
Here is how that may look for you:
It may be useful to note that HEAD is a text file that points to the current branch,
whereas the config file, another text file, stores local configuration for your
repository. Without changing anything, exit the hidden folder.
If you want to know the history of the repository, then you may use this
command:
git log
You may not obtain any logs just yet as you recently created this repo. Another
useful command is the git status that helps obtain the status of your repo:
git status
This will provide the status of the current git repo. At the moment, we do not
have any files added or changed, so let us try to add a new file.
Here is a Python script that you can use:
import random
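# The body of somefunct is not shown in the original; a sketch consistent with
# the description below ("print five random integers between 0 and 100"):
def somefunct(n):
    for _ in range(n):
        print(random.randint(0, 100))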
somefunct(5)
When you run the code, it will print five random integers between 0 and 100.
Now let us check to see if the status of our git repo has changed:
Now let us add the Python code to git so the git system can track the code’s
progress:
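Assuming the script was saved as mycode.py, the command is:
git add mycode.py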
The git add command enables git tracking on the file “mycode.py”; from the workflow
perspective, the git system has picked this file up from your local workstation and
moved it to a staging area, where its tracking features are enabled.
Up until now, we have just moved the file from the local file system to a staging
area where git can identify this file and start tracking. This file is still not in the git
repository yet. If you type in “git log,” you would get a response something like
this:
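fatal: your current branch 'master' does not have any commits yet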
It is true. The file has not been committed yet. So let us try to commit this file
using the following command:
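git commit -m "adding mycode.py with a random number function"
The commit message here is only an example; use any message that describes your change.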
This commit has moved the code into the actual git repository situated in the local
directory. The message in quotes is a commit message describing what changes have
been made in this commit.
If you would like to push your local git repository to a remote git repository, then
you would use the git push command. This is also called syncing, where you upload the
local git repo and its files to the remote git repo.
Note It is important to know that so far we have been working only on the local
workstation. This repository does not exist in the remote GitHub. Hence, we have
to push the current branch and set it as master. We will look at other examples
where we begin working with a remote branch.
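A typical sequence (the remote URL here is hypothetical) would be:
git remote add origin git@github.com:sandbox-user-1/firstrepo.git
git push -u origin master
Now suppose an existing repository on the GitHub server contains the following code: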
def hello():
print("hello world")
You want to work on this project to add another function. Hence, you need to get
a copy of this existing code from the GitHub server to your local workstation. Well,
there are several ways to accomplish this. Let us look at them in detail.
Cloning
Cloning in git refers to obtaining a copy of a repository from the remote GitHub
server onto your local workstation. The idea is to obtain a local copy and work
locally. This is usually done at the beginning of a new project or when joining an
existing project as a new team member; cloning is typically the first step. Here is
an example:
git clone git@github.com:sandbox-user-1/testrepo.git
Branching
A branch is a parallel version of the codebase from the main codebase called the
main branch (or master branch). Branching, as the name indicates, enables you to
work on new features and troubleshooting code and propose the changes back to the
main branch through pull request. Branching is for temporary changes and
collaboration.
Here’s how to create a new branch:
git branch <name-of-your-branch>
By default, HEAD points to the main branch. If you want to work in your new branch,
you need to switch to it explicitly.
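Either of these standard git commands switches to the branch:
git checkout <name-of-your-branch>
git switch <name-of-your-branch>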
Forking
A fork is also a copy of the repository except that you create your own main or
master branch within your git environment. Imagine the team has a master branch.
When you branch, you get a development branch (or whatever you want to name it).
When you fork, you get a master or main branch in your git account. Forking is
done in special cases, external projects, and independent work done in parallel.
In the GitHub web graphical user interface (GUI), you may open the repository
that needs to be forked and click the “Fork” button that is on the top-right corner.
Here is how that may look for you:
Note The Fork button is grayed out because I cannot “fork” my own repository into my
own account.
Pull Request
Pull requests are a way of proposing changes to the current codebase. By proposing
a pull request, you have developed a feature, or you have completed some work in
the code and are now proposing that your code be merged into the main repo.
To create a pull request
1. You already have a branch you are working with or create your own branch.
2. You make the necessary changes to the code.
3. You are now going to commit the changes:
git add <list of files> or "."
git commit -m "<your message in quotes>"
4. You would push these changes into the origin repository (your branch or your fork):
git push origin <new-feature>
Now you are ready to create a pull request. In the GitHub GUI, you will see a banner
indicating that your branch was recently pushed. Simply click “Compare & pull
request.”
Now you will have to enter the title and message for this pull request. GitHub
will show you the changes made to the file by highlighting the new code in green.
You can click Create pull request.
Here is how that looks:
Once you create a pull request, you now have to merge the pull request to remote
main.
Here is how that may look:
Gitignore
Data pipelines sometimes involve sensitive information like passwords, as well as
temporary files and other artifacts that should not be committed to the repository.
We use the “.gitignore” feature to intentionally ignore certain files.
Create a “.gitignore” file in the root directory of your folder, and specify various
files that GitHub should ignore.
Here is an illustration of a simple .gitignore that you can incorporate in your data
pipelines as well:
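A minimal example (the exact entries depend on your project):
# environment files and secrets
.env
secrets.yaml
# Python artifacts
__pycache__/
*.pyc
# temporary files and logs
*.tmp
*.log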
Once created, let us stage this file and add to the local repository:
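The commands, with an example commit message, are:
git add .gitignore
git commit -m "adding .gitignore"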
We have only scratched the surface on git here. I encourage you to visit
http://git-scm.org to learn more about git features.
SQL Programming
SQL is at the heart of every data engineering profession. All our historical data and
enterprise systems are equipped with relational database systems. SQL is the way to
access, query, and work with data from these relational database systems. SQL
stands for structured query language. Let us look at some basic syntax. We will be
using Postgres database syntax to demonstrate some concepts and functionalities
that I feel would be relevant in everyday tasks.
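The simplest query selects everything from a table; presumably the statement referred to by the sentence that follows is:
SELECT
*
FROM
TableName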
which would return the table with all the rows and columns.
To select only specific columns, list them explicitly:
SELECT
fieldname1,
fieldname2
FROM
TableName
To limit the number of rows returned, add a LIMIT clause:
SELECT
fieldname1,
fieldname2
FROM
TableName
LIMIT
10
Here is a query that combines a string function, pattern matching, filtering, and
ordering:
SELECT
product_id,
LENGTH(product_id) as charlength
FROM
productTable
WHERE
product_id LIKE 'H%' AND
LENGTH(product_id) BETWEEN 2 AND 4
ORDER BY
charlength
Joins combine rows from two tables based on a related column. Here is a left join
between the product and inventory tables:
SELECT
productTable.fieldname1,
productTable.fieldname2,
inventory.fieldname1,
inventory.fieldname4
FROM
productTable
LEFT JOIN
inventory
ON
productTable.id = inventory.product_id
Self-Join in SQL
Another important concept that helps you in your development work is the concept
of self-joins. It is basically joining a given table with itself. Let’s say you have
an employee table with employee id, name, and manager id, where every employee is
assigned an id, their name, and the id of their supervisor. We want to obtain a table
with each employee’s name and their manager’s name.
Here is how that may look:
SELECT
Emp.id,
Emp.name AS employee_name,
Mang.name AS manager_name
FROM
Employee Emp
INNER JOIN
Employee Mang
ON
Emp.manager_id = Mang.id
In the preceding code, we reference the same table under two different aliases and
join the employee’s manager id to the manager’s own id. There is also the concept of a
common table expression (CTE), where the result of one query is temporarily persisted
so that another query can perform a specific operation on it:
WITH productRaw AS (
SELECT
p_name AS name,
p_qty AS qty,
p_price AS price
FROM
productsTable
),
calculateGT AS (
SELECT
name,
qty,
price,
(qty*price) AS grand_total
FROM
productRaw
)
SELECT
name,
qty,
price,
grand_total
FROM
calculateGT
Of course, there may be optimized ways of calculating the grand total given
price and quantity of each product. The idea is to show how you can leverage
persisting results of multiple queries temporarily and utilize the result in generating
the final result table.
Views in SQL
In SQL, we can persist the results of these queries as objects within the database.
These are done in the form of views. There are two types of views, standard views
and materialized views.
Standard View
Standard views are objects you can set up within the database that present the
results of a specific query with multiple joins, filters, functions, etc. When you
query a standard view (instead of running a large blurb of complex SQL code), it
always returns the most recent results because the underlying query is executed each
time. Here is how creating and then querying a standard view may look:
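A sketch of the view definition, using hypothetical column names:
CREATE VIEW prod_inventory_report AS
SELECT
productTable.product_id,
productTable.p_name,
inventory.qty_on_hand
FROM
productTable
LEFT JOIN
inventory
ON
productTable.id = inventory.product_id
Once the view exists, querying it looks like querying any other table: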
SELECT
*
FROM
prod_inventory_report
Materialized View
Materialized views store the results of the view in a physical table (standard views
do not physically store the results but rather run the underlying query every time).
Because a physical table is created in the name of the materialized view, read
performance is significantly better. However, you have to refresh materialized views
every so often, for example, with triggers or a schedule. Here is how a materialized
view may look (the syntax is similar to a standard view):
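A sketch, reusing the hypothetical definition above and standard Postgres syntax:
CREATE MATERIALIZED VIEW prod_inventory_report_mv AS
SELECT
productTable.product_id,
productTable.p_name,
inventory.qty_on_hand
FROM
productTable
LEFT JOIN
inventory
ON
productTable.id = inventory.product_id;

REFRESH MATERIALIZED VIEW prod_inventory_report_mv;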
In the case of temporary tables, both the schema and the data (if you insert data
after creating the table) persist only for the duration of the database session. With
ON COMMIT PRESERVE ROWS, the rows are kept until the session ends; with ON COMMIT
DELETE ROWS, the rows are removed at the end of each transaction.
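A sketch of a temporary table in Postgres syntax:
CREATE TEMPORARY TABLE temp_products (
product_id INT,
p_name TEXT
) ON COMMIT PRESERVE ROWS;
Window functions are another useful feature: they compute aggregates over a group of rows while keeping every individual row in the output. Here is an example: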
SELECT
Department,
EmpID,
Salary,
avg(Salary) OVER (PARTITION BY Department)
FROM
Employees;
This example retrieves the listed fields and calculates the mean salary for each
department while retaining one row per employee.
As you can see, the window function computed the statistical average salary and
grouped by each department while retaining the table structure and adding a new
column for the computed average.
Here is another example that ranks employees by salary within each department using
the RANK() window function:
SELECT
Department,
EmpID,
Salary,
RANK() OVER (PARTITION BY Department ORDER BY
Salary DESC)
FROM
Employees;
Conclusion
So far, we have covered a lot of ground from refreshing Python programming to
getting hands-on with git and GitHub and reviewing the SQL essentials. The
concepts and techniques you have learned here will serve as a foundation to tackle
complex data engineering projects. Python’s data ecosystem will enable you to write
effective data pipelines for data transformation and analysis tasks. Git will allow
you to collaborate with your team, participate in collaborative software
development, and maintain a clean and documented history of your data pipeline as
it evolves. SQL skills will prove indispensable for working with the relational
database systems that form the foundation of almost all data architectures. I
encourage you to practice, explore, and apply what you have learned. Having said
that, let us venture into the exciting topic of data wrangling in the next few
chapters.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2024
P. K. Narayanan, Data Engineering for Machine Learning Pipelines
https://doi.org/10.1007/979-8-8688-0602-5_2
Introduction
Data wrangling is the process of transforming, manipulating, and preparing
a dataset from disparate data sources. Data munging and data manipulation
are other terms that are used to describe the same process. In this chapter,
we will look at Pandas, one of the oldest Python data analytics libraries. My sincere
opinion is that the Pandas library played a significant role in the adoption of
Python language for data analysis and manipulation tasks. In this chapter,
we will look at Pandas 2.0, a major release of Pandas, exploring its data
structures, handling missing values, performing data transformations,
combining multiple data objects, and other relevant topics.
By the end of this chapter, you will understand
The core data structures of Pandas 2.0, Series and DataFrame, and
how they can be used in data wrangling
Pandas is a Python library that was built for data manipulation and
transformation purposes. Pandas is built on top of NumPy, another Python
library for scientific and numerical computations.
Even if one is not using Pandas and NumPy in production, knowledge of their data
structures and methods will certainly help with getting familiar with other data
manipulation libraries. Pandas sets the standard for various other Python libraries.
Data Structures
Pandas essentially provides two data structures, Series and DataFrames. A Series is a
one-dimensional labeled array, whereas a DataFrame is a two-dimensional labeled data
structure that supports a tabular layout like a spreadsheet or the output of a SQL
table.
Series
The series data structure enables one to handle both labeled and unlabeled
arrays of data. The data can be integers, floating point numbers, strings of
text, or any other Python object.
The data structures provided by the Pandas library are exposed as classes. To create
a Series object, one calls the constructor and supplies appropriate parameters. An
empty initialization would look like
pd.Series()
Here is a simple initialization of a series:
import pandas as pd
import numpy as np

a = [1,2,3,4,5]
b = pd.Series(data=np.random.rand(5), index=a)
One can also generate Series using a Python dictionary. Here is an example:
simple_dict1 = {
"library" : "pandas",
"version" : 2.0,
"year" : 2023
}
simple_series = pd.Series(simple_dict1)
One can also get and set values in a given series. Here’s how to get a value:
simple_series["year"]
And we would obtain
2023
To modify a value in a series, one can also do the following:
simple_series["year"] = 2024
And we would obtain
simple_series
library pandas
version 2.0
year 2024
dtype: object
In this output, the dtype is NumPy’s data type object, which contains the
type of the data, in this case, an object.
Data Frame
While writing production code, one may generate a data frame from an
external dataset stored in a different format. However, it may be essential to
understand the key attributes of a data frame. Some of them are as follows:
Data: The data points or the actual data. It could be one or more Series
or dictionary data types. The data may contain integers, floats, strings,
or any other Python object.
Index: The row labels of the data points and the index of the
DataFrame itself.
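A minimal sketch of constructing a data frame with explicit data and index (the column names here are illustrative):
import pandas as pd

df = pd.DataFrame(
    data={"fruit": ["apple", "orange", "melon"], "qty": [3, 5, 2]},
    index=["r1", "r2", "r3"]
)
print(df)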
Indexing
Indexing is simply a row reference in the data frame; it helps identify rows within a
Pandas data structure. One can also say that an index is a pointer that identifies a
specific row or a subset of rows in a Pandas data frame, and it is a way of improving
the performance of data retrieval.
One can use both .loc and .iloc methods to retrieve specific rows of the data
frame. The .iloc method is a positional index and attempts to access rows
and columns by referencing integer-based location to select a position,
whereas the .loc is a label-based index and accesses the rows and columns
by labels.
Let’s say we have a data frame with indexes that are not in a specific order,
for instance:
idx_series = pd.Series(["peach", "caramel", "apples", "melon", "orange",
"grapes"], index=[74, 75, 76, 1, 2, 3])
This would generate a Pandas series as follows:
74 peach
75 caramel
76 apples
1 melon
2 orange
3 grapes
dtype: object
When using the loc method, Pandas would attempt to locate the label that is
passed as a parameter. Here, the label “1” has the value melon:
idx_series.loc[1]
Out[107]: 'melon'
In contrast, the iloc method would attempt to locate the integer position that
is passed as the parameter. The integer referencing starts with 0 and ends in
length of the list – 1. And so, the first row would yield caramel:
idx_series.iloc[1]
Out[108]: 'caramel'
Multi-indexing
A time delta can be defined as the difference in time elapsed between given inputs.
Time deltas are expressed in units of time, namely, seconds, minutes, days, etc.
Let us initialize a series with our newly created time delta index.
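A minimal sketch of a series built on a timedelta index:
import pandas as pd

td_idx = pd.timedelta_range(start="1 day", periods=5, freq="D")
td_series = pd.Series(range(5), index=td_idx)
print(td_series)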
CSV
CSV stands for comma-separated values; the CSV format is one of the
oldest and the most common types of storing data. It is also known as
tabular store, where data is stored in rows and columns.
The data is stored in a text file. Each line of the file corresponds to a row.
As the title suggests, the data points may be separated using commas. There
will be the same number of data points per row. It is expected that the first
line would contain the column definitions also known as header or
metadata. CSV data files are known for their simplicity and ability to parse
quickly.
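The read call that the following sentence refers to presumably looks something like this (the column names are placeholders):
import pandas as pd

df = pd.read_csv(
    "path to file.csv",
    usecols=["fieldname1", "fieldname2"]
)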
where the usecols parameter would help select only the columns required
for the project.
JSON
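At a minimum, Pandas reads and writes JSON with read_json() and to_json(); a minimal sketch:
import pandas as pd

json_df = pd.read_json("path to file.json")
json_df.to_json("path to destination file.json")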
HDF5
HDF5 stands for Hierarchical Data Format, version 5. Unlike CSV and
JSON, HDF5 is a compressed format, where data is optimized for efficient
storage, and provides faster read and write operations.
While working with huge amounts of data in HDF5, one can attempt to
access just the subset of the huge dataset without having to load the entire
dataset in memory. And so, HDF5 is physical memory friendly.
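A minimal sketch, where the key parameter names the dataset stored inside the HDF5 file:
import pandas as pd

hdf_df = pd.read_hdf("path to file.h5", key="dataset_name")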
If the entire file contains a single data frame, then the key parameter can be
ignored.
Feather
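Feather is a fast columnar format built on Apache Arrow; a minimal sketch of reading and writing it with Pandas:
import pandas as pd

feather_df = pd.read_feather("path to file.feather")
feather_df.to_feather("path to destination file.feather")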
Parquet
Here is an example:
import pandas as pd
import pyarrow
new_df = pd.read_parquet("location to file.parquet", engine='pyarrow')
print(new_df)
Here is another example that uses fastparquet as the loading engine:
import pandas as pd
import fastparquet
new_df = pd.read_parquet("location to file.parquet", engine='fastparquet')
print(new_df)
Here’s how to export a data frame to Parquet:
final_df.to_parquet('location to the destination file.parquet',
compression='snappy')
There are options to choose the type of compression, like snappy, gzip, lz4,
zstd, etc. These are just various algorithms that attempt to reduce the file
size.
ORC
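ORC (Optimized Row Columnar) is a columnar format common in the Hadoop ecosystem; Pandas can read it via pyarrow. A minimal sketch:
import pandas as pd

orc_df = pd.read_orc("path to file.orc")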
Avro
Avro is a file format that is row based. It is safe to say that Avro evolved as
part of Apache Hadoop for big data processing. Avro schema is defined in
JSON and supports binary encoding for faster access and efficient storage.
Binary encoding is a process of serializing and deserializing data for
processing and storing large amounts of data efficiently. Avro also provides
a container file, a special kind of a file that contains multiple data inside it
and enables storing various types of data together (image, audio, text, etc.).
Pandas does not have native support for handling Avro files; however, third
party libraries can be used to load the Avro data. Here is an example:
import fastavro
import pandas as pd

with open('location to Avro file', 'rb') as readAvro:
    avro_read = fastavro.reader(readAvro)
    avro_data = list(avro_read)

new_df = pd.DataFrame(avro_data)
print(new_df)
Pickle
Just as one can serialize and deserialize data for efficient storage, Python's pickle
module enables serializing and deserializing a Python object. When a Python object is
pickled, the pickle module serializes the given object before writing it to a file.
When it is unpickled, the file is deserialized, and the object is reconstructed back
to its original form. Unlike other file formats, pickle is limited to the Python
language alone.
Here’s how to export a data frame to pickle:
pd.to_pickle(given_data_frame, "location to file.pkl")
Here’s how to read pickle into a data frame:
unpkl_df = pd.read_pickle("location to file.pkl")
Chunk Loading
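When a file is too large to load at once, Pandas can read it in chunks; a minimal sketch (process() is a hypothetical stand-in for your own logic):
import pandas as pd

for chunk in pd.read_csv("path to large file.csv", chunksize=100_000):
    process(chunk)  # process() is hypothetical; replace with your own handling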
Missing Values
Background
The most common cause is errors that occur during the data entry stages. During
manual data entry, it is not uncommon to forget to enter a value; in other cases, data
values are entered incorrectly and, during quality checks, such values are removed,
leaving a missing value for a given observation and variable.
Such errors can also occur with automated machines or sensor mechanisms. The
equipment performing the reading may be under maintenance, or other issues may prevent
it from capturing a value at a given point in time, leading to missing value
occurrences.
Missing values can also occur during the process of data cleaning and
transportation. When a filter or a conditional transformation is employed on
a dataset, the output of such operation may lead to missing values,
indicating those specific variables did not match the filtering criteria.
Similarly, when data is migrated or transported from a source system to a
staging layer in an analytical system, there may be cases of missing values
due to inconsistencies in data formats or lack of support for a specific
encoding type.
None
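None is Python's own marker for a missing value; when placed in a numeric Pandas Series, it is converted to NaN. A minimal sketch:
import pandas as pd

s = pd.Series([1, None, 3])
print(s)  # the None entry is shown as NaN in a float64 Series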
NaN
NaN stands for not a number and indicates that a value is missing at a given place.
NaN has a float type. Here is an example:
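A minimal sketch:
import numpy as np

print(np.nan)        # nan
print(type(np.nan))  # <class 'float'>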
NaT
NaT means not a time. NaT represents missing values for the DateTime
datatype.
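A minimal sketch:
import pandas as pd

print(pd.NaT)
print(pd.Series([pd.Timestamp("2024-01-01"), pd.NaT]))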
NA
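pd.NA is Pandas' own scalar for missing values, used by the nullable data types; a minimal sketch:
import pandas as pd

s = pd.Series([1, pd.NA, 3], dtype="Int64")
print(s)  # the middle value is shown as <NA>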
The Pandas library offers a few methods and treatments for working with missing
values. Let's look at some of them.
isna( ) Method
You can also use “isnull()” in place of “isna()” for better readability.
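A minimal sketch of isna() on a small data frame:
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, 5, 6]})
print(df.isna())        # boolean mask marking the missing values
print(df.isna().sum())  # count of missing values per column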
notna( ) Method
The isna() method identifies missing values, while the notna() method
identifies non-missing values.
My take is that if you are looking to handle missing data, analyze missing
data, and perform statistical imputations, then my suggestion is the isna()
method. You can sum and count the number of missing values by using the
sum() and count() methods, respectively.
Use multivariate imputation with scikit-learn's IterativeImputer (or the simpler
SimpleImputer) only when you are reasonably sure the missing value stems from
something like a clerical error and can be estimated with confidence: https://scikit-
learn.org/stable/modules/impute.html.
For instance, a study that includes test subjects in a certain age range, along with a
few other known attributes, may allow missing values to be backtracked with reasonable
estimation.
Data Transformation
Data transformation is one of the most crucial operations in building a
quality dataset for machine learning pipelines. In broad terms, it can be
seen as the process of converting data from one shape to another. The goal
of the data transformation is to prepare data or create a dataset for
downstream consumption.
Data Exploration
One can also obtain individual values for transformation purposes. Here are
some examples.
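A few typical calls for exploring a data frame and pulling individual values (using the small df from the isna() example above):
print(df.head())        # first few rows
print(df.dtypes)        # column data types
print(df.at[0, "a"])    # a single value by row label and column name
print(df.iloc[0, 1])    # a single value by integer position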
The merge() method offers left join, right join, inner join, outer join, and
cross join options.
Let us look at them in detail. For illustrative purposes, let us assume we are
strictly focused on data frames with rows and columns.
Left Join
Given two data frames, the left join of these data frames would mean to
keep everything on the data frame that is on the left and add any matching
information from the data frame that is on the right. Let us imagine we
have two data frames, a and b, respectively. The process starts off with
keeping everything on the data frame a and looking for potential matches
from the data frame b. The matching is usually searched based on the
common column or a key in the data frame. When there is a match, then
bring that matched column from b and merge it with a. If there is a
matching row but only partially filled-in information, then the remaining
would be populated with NA.
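The two input frames are not shown in the original; definitions reconstructed to match the output below:
import pandas as pd

left_df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Name": ["Alice", "Bob", "Charlie", "David", "John"],
    "DeptID": [101, 102, 103, 104, 103]
})
right_df1 = pd.DataFrame({
    "DeptID": [101, 102, 103],
    "DeptName": ["HR", "IT", "Marketing"]
})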
Now let us perform the left join using the merge() method:
leftjoin = pd.merge(
left_df1,
right_df1,
on="DeptID",
how="left"
)
print(leftjoin)
would yield us
ID Name DeptID DeptName
0 1 Alice 101 HR
1 2 Bob 102 IT
2 3 Charlie 103 Marketing
3 4 David 104 NaN
4 5 John 103 Marketing
In the preceding example, you can see that everything from the left_df has
been kept, while only the matched columns from right_df have been
merged. There is only one row on the left_df for which there appears to be
no matching rows from the right_df and which is filled with NaN, as a
missing value.
Right Join
In our earlier example we had two data frames, left_df and right_df,
respectively.
The right join would start off with keeping everything on the data frame
right_df and look for potential matches from the data frame left_df. The
matching is usually searched based on the common column or a key in the
data frame. When there is a match, then bring that matched column from
left_df and merge it with right_df. If there is a matching row but only
partially filled-in information, then the remaining would be populated with
NA.
In the case of the right join, the table that is on the right remains
unchanged. The table that is on the left would be scanned to identify
matching information based on the department id column. The matched
columns would be then brought over and be annexed with the table on the
right.
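A sketch mirroring the left-join call:
rightjoin = pd.merge(
    left_df1,
    right_df1,
    on="DeptID",
    how="right"
)
print(rightjoin)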
Outer Join
In the relational database world, these are called full outer joins. Outer join
is where, given two data frames, both of them are kept without losing any
data. Outer joins have the most number of missing values, but they preserve
any and all data that is in both data frames:
outerjoin = pd.merge(
left_df1,
right_df1,
on="DeptID",
how="outer"
)
print(outerjoin)
would yield us
ID Name DeptID DeptName
0 1 Alice 101 HR
1 2 Bob 102 IT
2 3 Charlie 103 Marketing
3 5 John 103 Marketing
4 4 David 104 NaN
The outer join attempted to capture all data from both tables and assigned a
missing value (not a number) where a match is not available.
Inner Join
Inner join is the inverse of full outer join. These joins are where, given two
data frames, only the matched columns would be returned as the resultant
data frame. The partial matches from both the data frames would be
excluded, and only complete matches will be returned.
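A sketch, with the result it produces for the frames reconstructed earlier (row order may vary slightly across Pandas versions):
innerjoin = pd.merge(
    left_df1,
    right_df1,
    on="DeptID",
    how="inner"
)
print(innerjoin)
   ID     Name  DeptID   DeptName
0   1    Alice     101         HR
1   2      Bob     102         IT
2   3  Charlie     103  Marketing
3   5     John     103  Marketing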
As you can see the row where the partial match was available has been
excluded from the resultant dataset.
Cross Join
Cross join is also referred to as cartesian join. It is where every row in the
given data frame is assigned or merged with rows from another data frame.
The common key is not required.
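A sketch (how="cross" requires Pandas 1.2 or later and takes no key column):
crossjoin = pd.merge(
    left_df1,
    right_df1,
    how="cross"
)
print(crossjoin)  # 5 x 3 = 15 rows, one for every pairing of the two frames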
Data Reshaping
Data reshaping is the process of transforming the structure of data while
preserving the values and observations contained in the dataset. The idea of
performing data reshaping is to identify new information and insights while
changing the dimensions of it, without having to change the values.
pivot( )
The most common idea of data reshaping is the concept of pivoting a table.
Made famous by Microsoft Excel, pivoting a table enables rearranging of
rows and column data. Common applications include preparing data for
dashboards, further analysis, etc.
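The original example is not reproduced; a sketch with hypothetical temperature data in long format, pivoted so each city becomes its own column:
import pandas as pd

temps = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "city": ["NY", "WA", "NY", "WA"],
    "temp": [30, 45, 28, 44]
})
wide = temps.pivot(index="date", columns="city", values="temp")
print(wide)  # one row per date, one column per city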
And so, the dates no longer repeat themselves, each of the cities in the
column has become a category in itself, and the temperature data is
displayed in better presentable format. The pivot() function is very
beneficial in cases where you need to reshape the data without having to
perform any type of aggregations.
pivot_table( )
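pivot_table() works like pivot() but applies an aggregation when several rows share the same index/column combination; a sketch reusing the hypothetical temps frame:
pt = temps.pivot_table(index="date", columns="city", values="temp", aggfunc="mean")
print(pt)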
stack( )
The stack() method is somewhat similar and closely related to the idea of
pivoting the table. The stack() method also reshapes the data by converting
the column categories into indexes. Let us look at an example to observe
the functionality of the stack() method.
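A sketch with hypothetical quarterly sales data:
import pandas as pd

sales = pd.DataFrame(
    {"Q1": [100, 200], "Q2": [150, 250]},
    index=["Product A", "Product B"]
)
stacked = sales.stack()
print(stacked)  # a Series with a MultiIndex of (product, quarter)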
The resultant data frame contains an index that was previously column
fields. Now, the resultant data frame has two indexes: one is the product
type and the other is the sales quarter column.
unstack( )
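unstack() performs the reverse operation, moving an index level back into columns; continuing the sketch above:
print(stacked.unstack())  # restores the original wide layout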
melt( )
The melt() method would perform the inverse function of the pivot()
function. The melt() method would convert the wide data format into a
narrow data format.
Let us look at the functionality of the melt() method with the following
example:
melt_data = {
'Product':
['Product A', 'Product B', 'Product C'],
'January':
[300, 200, 400],
'February':
[350, 250, 450],
'March':
[400, 300, 500]
}
melt_df = pd.DataFrame(melt_data)
print(melt_df)
would yield
Product January February March
0 Product A 300 350 400
1 Product B 200 250 300
2 Product C 400 450 500
Now, let us apply the melt() method, where we would like to identify the
product and measure the product against the month and sales variables:
melted_df = melt_df.melt(
id_vars=['Product'],
var_name='Month',
value_name='Sales'
)
print(melted_df)
Product Month Sales
0 Product A January 300
1 Product B January 200
2 Product C January 400
3 Product A February 350
4 Product B February 250
5 Product C February 450
6 Product A March 400
7 Product B March 300
8 Product C March 500
The preceding resultant data frame has successfully converted the wide
data frame into a narrow data frame.
crosstab( )
The crosstab() method computes a cross-tabulation (frequency table) of two or more
factors. Here is an example:
ctdata = {
'Person':
[1, 2, 3, 4,
5, 6, 7, 8, 9],
'State':
['NY', 'WA', 'CO', 'NY',
'WA', 'CO', 'NY', 'WA', 'CO'],
'Likes':
['Mountains', 'Mountains', 'Mountains', 'Mountains',
'Mountains', 'Oceans', 'StateParks', 'StateParks', 'StateParks']
}
ctdf = pd.DataFrame(ctdata)
print(ctdf)
would yield us
Person State Likes
0 1 NY Mountains
1 2 WA Mountains
2 3 CO Mountains
3 4 NY Mountains
4 5 WA Mountains
5 6 CO Oceans
6 7 NY StateParks
7 8 WA StateParks
8 9 CO StateParks
Now, let us build a frequency table with the columns “State” and “Likes,”
respectively.
crossstab_df = pd.crosstab(
ctdf['State'],
ctdf['Likes'],
rownames=["Region"],
colnames=["FavoriteActivity"]
)
print(crossstab_df)
would yield us the following:
FavoriteActivity Mountains Oceans StateParks
Region
CO 1 1 1
NY 2 0 1
WA 2 0 1
The preceding data frame has provided the count of persons. You can also
obtain the percentage of persons who liked a certain activity by using the
“normalize” argument:
normalActMatrix = pd.crosstab(
ctdf['State'],
ctdf['Likes'],
rownames=["Region"],
colnames=["FavoriteActivity"],
normalize=True
)
print(normalActMatrix)
would yield us
FavoriteActivity Mountains Oceans StateParks
Region
CO 0.111111 0.111111 0.111111
NY 0.222222 0.000000 0.111111
WA 0.222222 0.000000 0.111111
You can also add the “margins” parameter to get the row and column sum
of the frequency table:
freqTable = pd.crosstab(
ctdf['State'],
ctdf['Likes'],
rownames=["Region"],
colnames=["FavoriteActivity"],
margins=True
)
print(freqTable)
would yield
FavoriteActivity Mountains Oceans StateParks All
Region
CO 1 1 1 3
NY 2 0 1 3
WA 2 0 1 3
All 5 1 3 9
which can also be used along with the “normalize” parameter:
normalFreqTable = pd.crosstab(
ctdf['State'],
ctdf['Likes'],
rownames=["Region"],
colnames=["FavoriteActivity"],
margins=True,
normalize=True
)
print(normalFreqTable)
would yield us
FavoriteActivity Mountains Oceans StateParks All
Region
CO 0.111111 0.111111 0.111111 0.333333
NY 0.222222 0.000000 0.111111 0.333333
WA 0.222222 0.000000 0.111111 0.333333
All 0.555556 0.111111 0.333333 1.000000
factorize( )
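factorize() encodes a column of values as integer codes plus the array of unique values; a sketch using the ctdf frame from the crosstab example:
codes, uniques = pd.factorize(ctdf["State"])
print(codes)    # [0 1 2 0 1 2 0 1 2]
print(uniques)  # Index(['NY', 'WA', 'CO'], dtype='object')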
compare( )
The compare() method enables you to compare two data frames or series
and identify their differences. Let us take a simple example where we
observe differences between two data frames.
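A sketch with two small, hypothetical frames that differ in two cells:
import pandas as pd

df_old = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df_new = pd.DataFrame({"a": [1, 2, 30], "b": [4, 50, 6]})
print(df_old.compare(df_new))  # shows only the differing cells as self/other columns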
Notice the missing values in the result. During comparison, if the values on the
respective rows and columns are the same, missing values are shown in their place;
only the values that actually differ between the data frames are displayed.
groupby( )
Now let us apply groupby() to a small sales dataset. Here we are looking to obtain
the total sales for each department, which means grouping the dataset and computing
the sum for each category, as sketched below.
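A sketch with a small, hypothetical sales dataset:
import pandas as pd

sales_df = pd.DataFrame({
    "Department": ["Electronics", "Clothing", "Electronics", "Clothing"],
    "Sales": [1200, 800, 600, 400]
})
print(sales_df.groupby("Department")["Sales"].sum())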
Conclusion
So far, you have gained a solid understanding of the core concepts and
techniques that form the pillars of effective data manipulation using
Pandas. Through this chapter you have discovered the core data structures,
indexing methods, data reshaping strategies, advanced data wrangling
methods, combining multiple Pandas objects, and working with missing
values. The concepts and ideas discussed here will serve you well even if you do not
actively use the Pandas library; they lay the foundation for learning various other
data wrangling libraries. You may notice in the upcoming chapters that some of these
methods and methodologies repeat themselves in some fashion.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2024
P. K. Narayanan, Data Engineering for Machine Learning Pipelines
https://doi.org/10.1007/979-8-8688-0602-5_3
Introduction
So far, the Pandas library has been the most popular library for data manipulation
and analysis. However, Pandas may face performance challenges when dealing with
massive datasets. When dealing with large datasets, we need a solution that performs
well without having to add additional memory and compute. This is where Polars excels.
Polars is a fast data manipulation library written in the Rust programming language
that integrates well with Python's data analysis ecosystem. Polars supports lazy
evaluation and automatic query optimization, enabling data engineers to handle even
the most demanding data wrangling tasks easily.
By the end of this chapter, you will
Understand the key concepts of Polars, its architecture, lazy evaluation, and
eager evaluation.
Combine multiple data objects using Polars and investigate advanced topics
like data reshaping, missing values, unique value identification, etc.
Write SQL-like statements in Polars and query flat files on your file
system using the Polars command line interface (CLI).
Introduction to Polars
Polars is a promising, up-and-coming Python library within the data processing
ecosystem of Python language. The data structures of Polars library are based on
the Apache Arrow project, meaning Polars stores its data internally using Apache
Arrow arrays, comparable to how Pandas library uses NumPy internally to store
its data in the form of NumPy arrays.
It is certainly possible to use the Polars library along with other libraries within
the Python data analysis ecosystem. Keep in mind, however, that many Python libraries
are single threaded; if a data processing task can be parallelized, one may leverage
libraries such as multiprocessing to speed up the entire data pipeline.
The syntax of the Polars library in Python can appear very similar to that of the
Pandas library. However, there are a few differences that will be discussed in the
coming topics. To start, Polars does not support an index or multi-index for its data
structures, unlike Pandas. And so, when writing Polars code, one cannot use methods
like loc or iloc that are available in Pandas.
The Polars library supports both lazy evaluation and eager evaluation, whereas Pandas
supports only eager evaluation. Polars automatically optimizes the query plan and
finds opportunities to use memory optimally while accelerating the execution of a
query.
Lazy evaluation is simply a process of running a code if and only if that code
requires it to be executed during runtime.
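The original example is not reproduced; a sketch consistent with the description that follows:
def eager_sum(a, b):
    print("received the inputs")
    c = a + b
    print("computed the sum")
    print(f"the sum is {c}")

eager_sum(2, 3)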
Here, this program is a simple case of eager evaluation. Every line in the function
is processed: when the function is called with parameters, all three print statements
are evaluated and printed.
In contrast, in the case of lazy evaluation, the line that contains c=a+b will be
deferred from execution, until the time it is actually needed. Can you guess when
c=a+b is actually needed? It would be the last line where we are printing the sum.
That is when the addition operation is actually executed. The function requires
the value of c to be computed.
Note
To illustrate this even further, let’s take an example of visiting a restaurant to get
a meal. Eager evaluation is like visiting a buffet restaurant with standard
offerings and no customizations. Every dish is cooked and ready to be served,
regardless of how much the customer would consume or whether anyone would
even consume that dish at all.
Lazy evaluation is like visiting a restaurant, where the server person would ask
questions about which sides and items you like, how you like them, whether you
are allergic to something, and several other things. The chef would then prepare
the dish and include items based on the specific responses of the customer.
Eager evaluation computes all the results up front and consumes system resources to
compute everything regardless of whether it is ultimately needed; the initial
performance may be slower, but subsequent access may be faster. Lazy evaluation
computes only the results that are required and conserves system resources by
executing only what is necessary; it may seem faster initially, but accessing
subsequent results may be slower.
Polars offers data structures similar to those of Pandas library. They are Series,
DataFrames, and LazyFrames (the one with lazy evaluation). It may be safe to
say that data frames are the most commonly used data structure in the Python
ecosystem.
Polars Series
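A Polars Series is a one-dimensional, named column of values; a minimal sketch:
import polars as pl

s = pl.Series("fruits", ["Apple", "Banana", "Melon"])
print(s)
Polars DataFrame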
A data frame in Polars is similar to the concept of a data frame in Pandas. There are
options to reshape the data and perform aggregations and transformations. By default,
Polars prints data frames with aligned rows and columns and with borders, which may
resemble a pretty-print library in Python. Like Pandas, one can view the head and tail
of the data frame simply by calling head(n) and tail(n), respectively. The describe()
method returns the summary statistics for the data frame. Here is how the syntax
looks:
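The frame being described is not defined in the text; a sketch consistent with the summary statistics printed below (fruit names other than Apple and Pineapple are assumptions):
import polars as pl

df = pl.DataFrame({
    "fruits": ["Apple", "Banana", "Mango", "Orange", "Pineapple"],
    "price": [0.75, 1.33, 1.5, 3.0, 3.25],
    "quantity": [1, 2, 3, 4, 6]
})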
print(df.describe())
would yield
shape: (9, 4)
┌─────────┬────────┬───────┬───────┐
│ describe ┆ fruits ┆ price ┆ quantity │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 │
╞═════════╪════════╪═══════╪═══════╡
│ count ┆5 ┆ 5.0 ┆ 5.0 │
│ null_count ┆ 0 ┆ 0.0 ┆ 0.0 │
│ mean ┆ null ┆ 1.966 ┆ 3.2 │
│ std ┆ null ┆ 1.097511 ┆ 1.923538 │
│ min ┆ Apple ┆ 0.75 ┆ 1.0 │
│ 25% ┆ null ┆ 1.33 ┆ 2.0 │
│ 50% ┆ null ┆ 1.5 ┆ 3.0 │
│ 75% ┆ null ┆ 3.0 ┆ 4.0 │
│ max ┆ Pineapple ┆ 3.25 ┆ 6.0 │
└─────────┴────────┴───────┴───────┘
A lazy frame is a data structure that enables lazy evaluation in Polars. A lazy frame
enables the query optimization feature, in addition to parallelism. It is recommended
to leverage lazy evaluation where possible to gain maximum performance from Polars. A
lazy frame is not exactly one of the native data structures provided by Polars;
however, given a data frame, there are ways to convert it into a lazy frame.
Let's revisit the example discussed in the previous section. Though it is a Polars data frame, it is not set up for lazy evaluation. By adding lazy(), the data frame becomes a lazy frame. One can also overwrite the original data frame variable with its lazy counterpart.
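A minimal sketch, reusing the fruits data frame constructed earlier:
lazy_df = df.lazy()   # returns a LazyFrame; nothing is executed yet
print(lazy_df)

# overwrite the original variable with its lazy counterpart
df = df.lazy()
print(df.select(pl.col("fruits")).collect())   # collect() triggers execution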
JSON
Polars supports read and write operations for both JSON and ndJSON.
ndJSON stands for newline-delimited JSON and enables faster inserts and reads
from files.
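A minimal sketch of the JSON and ndJSON calls (the paths are placeholders):
json_df = pl.read_json("path to file.json")
ndjson_df = pl.read_ndjson("path to file.ndjson")

json_df.write_json("path to destination file.json")
ndjson_df.write_ndjson("path to destination file.ndjson")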
Parquet
As noted earlier, Parquet is a columnar data store and it is considered faster than
many of the data sources. Here is how one can read Parquet files from Polars:
parquet_df = pl.read_parquet("path to file.parquet")
Here’s how to use lazy evaluation on a Parquet file:
parquet_df = pl.scan_parquet("path to file.parquet")
Here is how to write a Polars data frame into a Parquet file (note that write_parquet is a method on the data frame itself, not a top-level function):
parquet_df.write_parquet("path to destination file.parquet")
Polars can also read and write data in various other formats like Avro, ORC, Feather, etc., with lazy evaluation available for several of them.
Polars Context
A context is the setting in which expressions are evaluated and executed. There are three main contexts in Polars: the selection context, the filtering context, and the group-by context. Let's look at them in detail.
Selection Context
A selection context is where the idea of selection applies to a given Polars data
frame. Some of the operations that can be performed in a selection context are
dataframe.select([" ... "])
dataframe.with_columns([" ... "])
Let us look at an example:
import polars as pl
df = pl.DataFrame(
{
"A": [1, 2, 3],
"B": [4, 5, 6]
}
)
print(df)
When we look at the data, we see
shape: (3, 2)
┌────┬────┐
│A ┆B │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════╪════╡
│1 ┆4 │
│2 ┆5 │
│3 ┆6 │
└────┴────┘
If we need to select a specific column, we can use
print(df.select(['A']))
which would yield
shape: (3, 1)
┌────┐
│A │
│ --- │
│ i64 │
╞════╡
│1 │
│2 │
│3 │
└────┘
However, this statement has very limited scope. If you want to perform further
transformations on the selected column, you need to include an additional
expression.
Filter Context
A filter context is where a given expression filters a dataset and provides a data
frame that matches the user-supplied conditions. The operations that can be
performed in this context are
dataframe.filter([" ... "])
dataframe.where([" ... "])
Here is an example of using filter in the preceding dataset:
conFilter = df.filter(pl.col('A')>1)
print(conFilter)
would yield us
shape: (2, 2)
┌────┬────┐
│A ┆B │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════╪════╡
│2 ┆5 │
│3 ┆6 │
└────┴────┘
Group-By Context
A group-by context is where rows are grouped by one or more columns and aggregations are computed per group, typically via dataframe.group_by(...).agg(...). You can also leverage lazy evaluation when working with these contexts and expressions. Here is how that looks:
df1_lazy = pl.LazyFrame(
{
"Region":
["CityBranch", "CampusBranch", "CityBranch",
"SuburbBranch", "CampusBranch", "SuburbBranch",
"CityBranch", "CampusBranch"],
"Product":
["A", "B", "B",
"A", "B", "A",
"A", "B"],
"Sales":
[100, 200, 150,
300, 250, 400,
350, 180]
}
)
print(df1_lazy)
would yield
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)
DF ["Region", "Product", "Sales"]; PROJECT */3 COLUMNS; SELECTION:
None
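To actually execute a group-by aggregation on this lazy frame, chain expressions in the group-by context and call collect(); here is a minimal sketch (in older Polars versions the method is spelled groupby):
result = (
    df1_lazy
    .group_by("Region")
    .agg(pl.col("Sales").sum().alias("TotalSales"))
    .collect()   # triggers execution of the optimized plan
)
print(result)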
Basic Operations
String Operations
Polars uses Apache Arrow as its backend, which enables a contiguous memory
model for all its operations. Contiguous memory is where the elements are stored
sequentially so that the CPU can retrieve these elements faster and have more
efficient operations. String processing can get expensive as the data grows over
time.
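Before moving on to aggregations, here is a small illustrative sketch of Polars string expressions, which live under the str namespace (the frame and columns are hypothetical):
names = pl.DataFrame({"fruits": ["Apple", "Pineapple", "Banana"]})
upper = names.with_columns(
    pl.col("fruits").str.to_uppercase().alias("fruits_upper"),
    pl.col("fruits").str.contains("apple").alias("has_apple"),  # case-sensitive substring match
)
print(upper)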
Polars offers powerful aggregations and summarizations, in both the eager API
and lazy API. Let us look at some of the aggregations by loading a dataset:
polars_dataset = pl.read_csv(
"/path-to-the-folder/polars_dataset.csv",
separator=",",
infer_schema_length=0
)\
.with_columns(
pl.col("birthdate")
.str.to_date(strict=False)
)
print(polars_dataset)
would yield
shape: (1_000, 7)
┌───────────┬───────────┬────────┬────────────┬────────────┬──────────────────────┬─────────────────┐
│ firstname ┆ lastname  ┆ gender ┆ birthdate  ┆ type       ┆ state                ┆ occupation      │
│ ---       ┆ ---       ┆ ---    ┆ ---        ┆ ---        ┆ ---                  ┆ ---             │
│ str       ┆ str       ┆ str    ┆ date       ┆ str        ┆ str                  ┆ str             │
╞═══════════╪═══════════╪════════╪════════════╪════════════╪══════════════════════╪═════════════════╡
│ Rosa      ┆ Houseago  ┆ F      ┆ null       ┆ Stateparks ┆ California           ┆ Accountant      │
│ Reese     ┆ Medford   ┆ M      ┆ null       ┆ Stateparks ┆ District of Columbia ┆ Electrician     │
│ Gianna    ┆ Rylands   ┆ F      ┆ 1973-07-02 ┆ Mountains  ┆ Kentucky             ┆ Doctor          │
│ Abran     ┆ McGeown   ┆ M      ┆ 1962-12-08 ┆ Baseball   ┆ Virginia             ┆ Accountant      │
│ Lanita    ┆ Yantsev   ┆ F      ┆ 1976-06-07 ┆ Stateparks ┆ Illinois             ┆ Electrician     │
│ ...       ┆ ...       ┆ ...    ┆ ...        ┆ ...        ┆ ...                  ┆ ...             │
│ Selina    ┆ Heyworth  ┆ F      ┆ 1951-04-10 ┆ Baseball   ┆ Ohio                 ┆ IT Professional │
│ Rubi      ┆ Licari    ┆ F      ┆ null       ┆ Oceans     ┆ Michigan             ┆ Lawyer          │
│ Rubin     ┆ Stanworth ┆ M      ┆ 1956-09-08 ┆ Baseball   ┆ Indiana              ┆ Electrician     │
│ Benjie    ┆ Amort     ┆ M      ┆ 1957-07-07 ┆ Museums    ┆ Kentucky             ┆ Doctor          │
│ Gallard   ┆ Samuels   ┆ M      ┆ 1961-07-09 ┆ Oceans     ┆ California           ┆ Doctor          │
└───────────┴───────────┴────────┴────────────┴────────────┴──────────────────────┴─────────────────┘
Left Join
A left join returns all rows from the left table, scans the rows from the right table,
identifies matches, and annexes them to the left table. Let us look at an example.
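The two department data frames used in this and the following join examples are not shown above; a reconstruction consistent with the outputs below, followed by the left join itself, might look like this:
df1 = pl.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "Name": ["Alice", "Bob", "Charlie", "David", "John"],
        "DeptID": [101, 102, 103, 104, 103],
    }
)
df2 = pl.DataFrame(
    {
        "DeptID": [101, 102, 103],
        "DeptName": ["HR", "IT", "Marketing"],
    }
)
leftjoin = df1.join(df2, on="DeptID", how="left")
print(leftjoin)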
Outer Join
A full outer join returns all rows from both tables; where the join keys match, the rows are combined, and where there is no match, the missing side is filled with nulls:
outerjoin = df1.join(
df2,
on = "DeptID",
how = "full",
coalesce=True
)
print(outerjoin)
would yield us
shape: (5, 4)
┌────┬───────┬──────┬────────┐
│ ID ┆ Name ┆ DeptID ┆ DeptName │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞════╪═══════╪══════╪════════╡
│ 1 ┆ Alice ┆ 101 ┆ HR │
│ 2 ┆ Bob ┆ 102 ┆ IT │
│ 3 ┆ Charlie ┆ 103 ┆ Marketing │
│ 4 ┆ David ┆ 104 ┆ null │
│ 5 ┆ John ┆ 103 ┆ Marketing │
└────┴───────┴──────┴────────┘
Inner Join
An inner join returns only the rows that have matching values in both tables:
innerjoin = df1.join(
df2,
on = "DeptID",
how = "inner",
coalesce=True)
print(innerjoin)
would yield us
shape: (4, 4)
┌────┬───────┬──────┬────────┐
│ ID ┆ Name ┆ DeptID ┆ DeptName │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞════╪═══════╪══════╪════════╡
│ 1 ┆ Alice ┆ 101 ┆ HR │
│ 2 ┆ Bob ┆ 102 ┆ IT │
│ 3 ┆ Charlie ┆ 103 ┆ Marketing │
│ 5 ┆ John ┆ 103 ┆ Marketing │
└────┴───────┴──────┴────────┘
Semi Join
A semi join returns the rows from the left table that have a match in the right table, keeping only the left table's columns:
semijoin = df1.join(
df2,
on = "DeptID",
how = "semi",
coalesce=True)
print(semijoin)
would yield us
shape: (4, 3)
┌────┬───────┬──────┐
│ ID ┆ Name ┆ DeptID │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞════╪═══════╪══════╡
│ 1 ┆ Alice ┆ 101 │
│ 2 ┆ Bob ┆ 102 │
│ 3 ┆ Charlie ┆ 103 │
│ 5 ┆ John ┆ 103 │
└────┴───────┴──────┘
Anti Join
An anti join returns the rows from the left table that do not have a match in the right table:
antijoin = df1.join(
df2,
on = "DeptID",
how = "anti",
coalesce=True)
print(antijoin)
would yield us
shape: (1, 3)
┌────┬─────┬──────┐
│ ID ┆ Name ┆ DeptID │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞════╪═════╪══════╡
│ 4 ┆ David ┆ 104 │
└────┴─────┴──────┘
Cross Join
A cross join produces the Cartesian product of two tables, pairing every row of the left table with every row of the right table. Here is how that would look. Let us initialize two data frames:
df_one = pl.DataFrame(
{
'ID': [1, 2],
'Name': ['John', 'Doe']
}
)
df_two = pl.DataFrame(
{
'ItemID': [101, 102],
'Item': ['Laptop', 'Cable']
}
)
Now let us perform the Cartesian product between these two data frames:
crossjoin = df_one.join(
df_two,
how = "cross",
coalesce=True
)
print(crossjoin)
would yield us
shape: (4, 4)
┌────┬─────┬──────┬──────┐
│ ID ┆ Name ┆ ItemID ┆ Item │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞════╪═════╪══════╪══════╡
│ 1 ┆ John ┆ 101 ┆ Laptop │
│ 1 ┆ John ┆ 102 ┆ Cable │
│ 2 ┆ Doe ┆ 101 ┆ Laptop │
│ 2 ┆ Doe ┆ 102 ┆ Cable │
└────┴─────┴──────┴──────┘
Advanced Operations
Let us look at some advanced methods in Polars data analysis library. For our
analysis purposes, we will leverage our earlier dataset:
import polars as pl
polars_dataset = pl.read_csv(
"/Users/pk/Documents/polars_dataset.csv",
separator=",",
infer_schema_length=0
)\
.with_columns(
pl.col("birthdate")
.str.to_date(strict=False)
)
print(polars_dataset)
Let us try to look at missing values on specified columns within the dataset. This
would enable us to locate rows with missing data:
missingdata = polars_dataset.filter(
pl.col("birthdate").is_null()
)
print(missingdata)
would yield us
shape: (588, 7)
┌───────────┬──────────┬────────┬───────────┬────────────┬──────────────────────┬─────────────────┐
│ firstname ┆ lastname ┆ gender ┆ birthdate ┆ type       ┆ state                ┆ occupation      │
│ ---       ┆ ---      ┆ ---    ┆ ---       ┆ ---        ┆ ---                  ┆ ---             │
│ str       ┆ str      ┆ str    ┆ date      ┆ str        ┆ str                  ┆ str             │
╞═══════════╪══════════╪════════╪═══════════╪════════════╪══════════════════════╪═════════════════╡
│ Rosa      ┆ Houseago ┆ F      ┆ null      ┆ Stateparks ┆ California           ┆ Accountant      │
│ Reese     ┆ Medford  ┆ M      ┆ null      ┆ Stateparks ┆ District of Columbia ┆ Electrician     │
│ Alicia    ┆ Farrant  ┆ F      ┆ null      ┆ Oceans     ┆ Florida              ┆ Accountant      │
│ Shawn     ┆ Fenlon   ┆ M      ┆ null      ┆ Stateparks ┆ Michigan             ┆ IT Professional │
│ Lenette   ┆ Blackly  ┆ F      ┆ null      ┆ Mountains  ┆ Texas                ┆ Accountant      │
│ ...       ┆ ...      ┆ ...    ┆ ...       ┆ ...        ┆ ...                  ┆ ...             │
│ Perla     ┆ Brixey   ┆ F      ┆ null      ┆ Stateparks ┆ Washington           ┆ Doctor          │
│ Bobbi     ┆ Longea   ┆ F      ┆ null      ┆ Mountains  ┆ Massachusetts        ┆ Accountant      │
│ Skip      ┆ Urry     ┆ M      ┆ null      ┆ Oceans     ┆ Arizona              ┆ Doctor          │
│ Amory     ┆ Cromie   ┆ M      ┆ null      ┆ Baseball   ┆ Virginia             ┆ Lawyer          │
│ Rubi      ┆ Licari   ┆ F      ┆ null      ┆ Oceans     ┆ Michigan             ┆ Lawyer          │
└───────────┴──────────┴────────┴───────────┴────────────┴──────────────────────┴─────────────────┘
The pivot function transforms the dataset into a wider format: the unique values of one column become new columns, another column supplies the row labels, and the chosen aggregation metric is populated into the corresponding row and column.
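A minimal sketch of a pivot on a small sales frame (the data is illustrative; in newer Polars versions the columns argument is named on):
sales_df = pl.DataFrame(
    {
        "Region": ["CityBranch", "CampusBranch", "CityBranch", "SuburbBranch"],
        "Product": ["A", "B", "B", "A"],
        "Sales": [100, 200, 150, 300],
    }
)
wide = sales_df.pivot(
    values="Sales",
    index="Region",
    columns="Product",
    aggregate_function="sum",
)
print(wide)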
Polars/SQL Interaction
While the Polars library does not have an underlying SQL engine, it does support interacting with data objects using SQL-like statements. When you issue a SQL-like statement against a Polars data object, Polars converts it into the appropriate expressions and continues to provide fast results (as it always does). You can execute “SQL queries” on flat files, JSON files, and even Pandas data objects. Let us look at an illustration involving a dataset that contains random health data.
Here is how we can remove the special character (the leading currency symbol) from the copay_paid column:
df = df.with_columns(
pl.col("copay_paid") \
.str \
.slice(1) \
.alias("copay_paid")
)
print(df.head())
would yield
shape: (5, 7)
┌─────┬──────────────────┬──────┬───────┬────────┬─────────────────┬────────────┐
│ id  ┆ name             ┆ age  ┆ state ┆ gender ┆ exercise_daily? ┆ copay_paid │
│ --- ┆ ---              ┆ ---  ┆ ---   ┆ ---    ┆ ---             ┆ ---        │
│ i64 ┆ str              ┆ str  ┆ str   ┆ str    ┆ bool            ┆ str        │
╞═════╪══════════════════╪══════╪═══════╪════════╪═════════════════╪════════════╡
│ 1   ┆ Allayne Moffett  ┆ null ┆ SC    ┆ Male   ┆ null            ┆ 206.76     │
│ 2   ┆ Kerby Benjafield ┆ null ┆ NM    ┆ Male   ┆ true            ┆ 21.58      │
│ 3   ┆ Raina Vallentin  ┆ null ┆ MI    ┆ Female ┆ true            ┆ 125.18     │
│ 4   ┆ Kaela Trodden    ┆ null ┆ OH    ┆ Female ┆ false           ┆ 86.20      │
│ 5   ┆ Faber Kloisner   ┆ null ┆ MN    ┆ Male   ┆ false           ┆ 219.38     │
└─────┴──────────────────┴──────┴───────┴────────┴─────────────────┴────────────┘
Now, we can convert the string data type to a decimal data type. Here is how we
can do that:
df = df.with_columns(
pl.col("copay_paid")\
.str \
.to_decimal() \
.alias("copay"))
print(df.head())
would yield
shape: (5, 8)
┌─────┬──────────────────┬──────┬───────┬────────┬─────────────────┬────────────┬──────────────┐
│ id  ┆ name             ┆ age  ┆ state ┆ gender ┆ exercise_daily? ┆ copay_paid ┆ copay        │
│ --- ┆ ---              ┆ ---  ┆ ---   ┆ ---    ┆ ---             ┆ ---        ┆ ---          │
│ i64 ┆ str              ┆ str  ┆ str   ┆ str    ┆ bool            ┆ str        ┆ decimal[*,2] │
╞═════╪══════════════════╪══════╪═══════╪════════╪═════════════════╪════════════╪══════════════╡
│ 1   ┆ Allayne Moffett  ┆ null ┆ SC    ┆ Male   ┆ null            ┆ 206.76     ┆ 206.76       │
│ 2   ┆ Kerby Benjafield ┆ null ┆ NM    ┆ Male   ┆ true            ┆ 21.58      ┆ 21.58        │
│ 3   ┆ Raina Vallentin  ┆ null ┆ MI    ┆ Female ┆ true            ┆ 125.18     ┆ 125.18       │
│ 4   ┆ Kaela Trodden    ┆ null ┆ OH    ┆ Female ┆ false           ┆ 86.20      ┆ 86.20        │
│ 5   ┆ Faber Kloisner   ┆ null ┆ MN    ┆ Male   ┆ false           ┆ 219.38     ┆ 219.38       │
└─────┴──────────────────┴──────┴───────┴────────┴─────────────────┴────────────┴──────────────┘
Let us now initialize the SQL context of Polars. The SQL context object is where
the mapping between SQL-like statements and Polars expressions is stored. And
so, this step is essential.
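A minimal sketch of initializing the SQL context and running a query against the registered frame (the table name health is arbitrary):
ctx = pl.SQLContext(health=df)   # registers the data frame under the name "health"
result = ctx.execute(
    "SELECT gender, COUNT(*) AS n FROM health GROUP BY gender",
    eager=True,
)
print(result)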
You may have noticed “eager=True” during the execution of the SQL query
within the SQL context. You can also set the option to False to enable lazy
execution. This will return a Polars lazy frame. You can continue to use that lazy
frame on downstream data manipulation tasks.
Polars CLI
Polars also has an option to access the library through a command line interface.
One can execute Polars commands and SQL-like commands right from the
command line. Here’s how to get started with using Polars in a command line
interface:
pip install polars-cli
And type in “polars” to access the command line interface. Here is how that may
look:
Figure 3-1
Here is how we can execute a SQL query on a flat file in your file system:
〉select * from read_csv('./person.csv');
would yield
Figure 3-2 A table of person id, name, city, gender, and birth date returned by the query select * from read_csv('./person.csv')
You can even specify the SQL query directly on your shell and use a pipe
operator to call and execute the SQL command in the Polars CLI. Here is how
that looks:
--> echo "select \
concat(firstname, ' ', lastname) as fullname, \
occupation \
from \
read_csv('./polars_dataset.csv') \
where type = 'Baseball' and state = 'New York' " | polars
would yield
┌────────────────────┬─────────────────┐
│ fullname           ┆ occupation      │
│ ---                ┆ ---             │
│ str                ┆ str             │
╞════════════════════╪═════════════════╡
│ Cynthia Paley      ┆ Doctor          │
│ Adam Seaborn       ┆ Doctor          │
│ Lilah Mercik       ┆ IT Professional │
│ Alidia Wreiford    ┆ Lawyer          │
│ Rosita Guswell     ┆ Accountant      │
│ Winston MacCahey   ┆ IT Professional │
│ Thorn Ayrs         ┆ Accountant      │
│ Cleon Stolle       ┆ IT Professional │
│ Gabriello Garshore ┆ Doctor          │
└────────────────────┴─────────────────┘
Figure 3-3
Conclusion
Congratulations on reaching this far! So far, you have covered a lot of ground
with Python, SQL, git, Pandas, and Polars from these chapters. There is immense
potential in Polars. So far, we looked at how to extract and load data in various
formats, perform data transformation operations like filtering and aggregation,
and combine multiple Polars objects using join operations. We also explored
handling missing values, identifying unique values, and reshaping data with pivot
operations. I hope you find it in you to adopt Polars in your data pipelines and
demonstrate greater efficiency and value to the organization. While Polars may
be the perfect solution for CPU-based data wrangling tasks, in the next chapter,
we will look at GPU-based data wrangling libraries and how one can process
massive amounts of data using GPUs. I hope you stay excited about the upcoming chapter.
Introduction
As the volume of data continues to grow, there comes a time where CPU-
based data processing may struggle to provide consistent results within
reasonable time. This is where GPU computing comes into play, where the
graphical processing units are leveraged to revolutionize data manipulation
and analysis. In this chapter, we will explore CuDF, a GPU-accelerated data
manipulation library that integrates well with the existing Python
ecosystem. Before we jump into CuDF, let us also review the architecture
of CPU- and GPU-based computations and concepts of GPU programming.
Note
CPU also has cache, a memory unit that stores data required for a given
iteration of a computation. The cache is supposed to be faster than the
physical memory, and at present there appear to be three levels of cache.
There is also a memory management unit that controls data movement
between CPU and RAM during their communication and coordination for
processing various tasks.
Introduction to CUDA
Note
Kernels
The process begins by the CPU calling the kernel function. In addition to
calling the kernel, the host would also call the execution configuration,
determining how many parallel threads are required for the kernel
execution. The GPU executes the kernel using a grid of blocks, each block
containing a set of threads. The scheduler portion of the GPU block would
send these threads to a streaming multiprocessor on the GPU. Once the
execution is complete, the host will retrieve the results from the GPU’s
memory. This would have completed one full execution cycle.
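As an illustrative sketch of these ideas in Python, here is a simple element-wise addition kernel written with the Numba library (Numba is an assumption used only for illustration here, not part of this chapter's toolchain):
from numba import cuda
import numpy as np

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)          # global thread index within the grid
    if i < x.size:
        out[i] = x[i] + y[i]

n = 1_000_000
x = np.ones(n, dtype=np.float32)
y = np.ones(n, dtype=np.float32)
out = np.zeros(n, dtype=np.float32)

threads_per_block = 256
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
# the execution configuration: a grid of blocks, each containing a set of threads
add_kernel[blocks_per_grid, threads_per_block](x, y, out)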
Memory Management
Introduction to CuDF
CuDF is a Python library that offers various data manipulation and
transformation methods and functions exclusively to be executed on
NVIDIA GPUs. CuDF leverages Apache Arrow for columnar format data
structures. All CuDF’s functions and methods are based on libcudf, a
CUDA library meant for C++ programming.
CuDF provides most of the methods and functions that Pandas does, and they perform similar operations. However, there are some areas where these two Python libraries differ.
To start off, both CuDF and Pandas support similar data structures (Series,
DataFrames, etc.). Both CuDF and Pandas support similar transformational
operations on the data structures like various joins, filtering a data frame,
indexing, group-by, and window operations. CuDF supports many of the
existing Pandas data types and also provides support for special data types.
It may be better to visit the documentation for the same.
Iterating over a data structure is not considered optimal in CuDF. You can
still write code that would loop through a given column in a data frame and
perform some kind of computation; however, it might provide poor
performance. A better option is to either use an inbuilt CuDF method to
perform the transformation or convert the CuDF data frame into either
Pandas or Arrow data frame, perform the necessary operation, and convert
the resultant data frame into a CuDF data frame.
When performing joins or group-by operations, Pandas sorts the result, whereas CuDF does not perform any sorting or ordering of elements by default. If you require ordered results, enable "sort=True" in the call.
Setup
Although the installation page provides more options for installing and
working with CuDF, I recommend the following to be able to get up and
running with CuDF.
You need a NVIDIA GPU with compute capability of 6.0 and above. First
of all, the CuDF library is built on top of CUDA, a parallel processing
framework exclusive for NVIDIA GPUs. NVIDIA offers various GPU
products and you may be able to view them on their product catalog.
Second, compute capability is just a glorified term for saying newer models
of their GPU. My understanding of compute capability is that the higher the
version number, the better the GPU can perform and can process more
complex operations and the more efficient it can do them.
Note
Think of it like an iPhone, where the more recent versions have more
capabilities when compared with the older iPhones. Compute capability is
like saying a newer iPhone model.
You also need a Linux distribution. Though you may be able to work with Windows operating systems via WSL, it is beneficial to have a Linux instance with gcc/g++ 9.0 and above available.
Note
If you have a Windows operating system, then you may need to have a
GPU that has compute capability of 7.0 or higher (an even more recent
model).
You need appropriate NVIDIA drivers and CUDA drivers, where it may be
required to have a CUDA driver version 11.2 or newer.
You also need the latest version of Python 3 installed on your system.
It may be better to uninstall any existing CUDA toolkits and CUDA
packages that are loaded by default.
For those who do not have access to one of the abovementioned, it may be
beneficial to look at some cloud-based options. All the major cloud
providers offer GPU-based computing options. While this will be covered
in later chapters in detail, let us look at one such cloud provider example,
so we can get our hands on with CuDF. I am choosing Google Cloud
Platform-based Google Colab (arbitrarily).
To get started with Google Colab, simply visit the following portal and sign
in with your Google account (or create an account if you do not possess one
yet):
https://colab.google/
Figure 4-1
Google Colab offers a free option and a pro option. For our purposes, the
free option may suffice.
You may get the following screen:
Figure 4-2
At the time of learning, it may be wise to turn off the generate with AI
option. If you do wish to turn it off, click “Tools” on the menu and choose
the “Settings” option.
Figure 4-3
Under the settings, locate “Colab AI” and check “Hide generative AI
features.”
Now, let us change the runtime of our Jupyter notebook from CPU to GPU.
Let us navigate to the main page within Google Colab. From there, select
“Runtime” on the menu and choose “Change runtime type.”
Figure 4-4
Choose “Python 3” for the runtime type and choose an option that contains
the word “GPU” for the hardware accelerator. In this case, select the T4
GPU option. Note that the other GPU options are also NVIDIA offerings.
Figure 4-5
Now, it is time to install CuDF. Copy and paste the following command in
your code box. This might take some time so it may be better to wait for a
little while:
!pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
!rm -rf /usr/local/lib/python3.8/dist-packages/cupy*
!pip install cupy-cuda11x
In some cases, the installation may fail for reasons including conflict
between package versions. If and when installation and setup appears to be
challenging, do consider installing an older and stable version.
Figure 4-6 A screenshot of a terminal interface showing the status of installing CuDF within the Google Colab environment
Once we have the CuDF installed, it is time to test the installation. Please
issue the following code and observe the output:
import cudf
a = cudf.Series([1, 2, 3, 4, 5])
print(a)
would yield
0 1
1 2
2 3
3 4
4 5
dtype: int64
Figure 4-7
File IO Operations
CuDF supports most of the commonly utilized data formats in big data and
machine learning. Here is how the syntax may look:
csv_file = "/path-to-folder/path-to-file.csv"
cudf_df = cudf.read_csv(
    csv_file,
    sep=",",
    header="infer",
)
CSV
If you plan to export a given CuDF data frame into a CSV, here's how to do it (to_csv is a method on the data frame):
cudf_df.to_csv(
    "/path-to-file.csv",
    sep=",",
    header=True,
    index=True,
)
Parquet
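The Parquet calls mirror the CSV ones; a minimal sketch (the paths are placeholders):
parquet_df = cudf.read_parquet("/path-to-folder/path-to-file.parquet")
parquet_df.to_parquet("/path-to-folder/output-file.parquet")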
JSON
CuDF supports JSON file formats; however, many of the attributes for
reading a JSON file do not appear to be GPU accelerated.
Here is an example:
import cudf
filename = "/filepath/file.json"
cudf_dataframe = cudf.read_json(filename)
Here's how to write out a JSON file (again, a method on the data frame):
cudf_dataframe.to_json("/filepath/filename.json")
Similarly, CuDF supports other data file formats like ORC, Avro, etc.
Basic Operations
Let us look at some basic operations with CuDF, by importing an example
dataset. Here is how the code looks:
import cudf
cudf_df = cudf.read_csv("/content/sample_data/polars_dataset.csv")
print(cudf_df.head())
Figure 4-8
Column Filtering
Let us try to select a subset of the data. Here’s how to obtain only certain
columns:
name_and_occupation = cudf_df[
["firstname","occupation"]
].head()
print(name_and_occupation)
would yield
firstname occupation
0 Rosa Accountant
1 Reese Electrician
2 Gianna Doctor
3 Abran Accountant
4 Lanita Electrician
Row Filtering
Figure 4-9
Figure 4-10
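As a minimal sketch of filtering rows with a boolean mask (the column and value are chosen for illustration from the dataset loaded earlier):
doctors = cudf_df[cudf_df["occupation"] == "Doctor"]
print(doctors.head())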
Left Join
Left join is where everything is kept as is on the left table while looking for
potential matches on the right table. We use cudf.merge to perform left
joins on GPU DataFrames.
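The department frames used in these join examples are not shown above; a reconstruction consistent with the outputs below, followed by the left join itself, might look like this:
left_df1 = cudf.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "Name": ["Alice", "Bob", "Charlie", "David", "John"],
        "DeptID": [101, 102, 103, 104, 103],
    }
)
right_df1 = cudf.DataFrame(
    {
        "DeptID": [101, 102, 103],
        "DeptName": ["HR", "IT", "Marketing"],
    }
)
leftjoin = left_df1.merge(right_df1, on="DeptID", how="left")
print(leftjoin)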
Outer Join
Outer join is where the columns or indexes that are used to perform the join
are unionized from both data frames. Here is how that looks:
outerjoin = left_df1.merge(
right_df1,
on=["DeptID"],
how="outer"
)
print(outerjoin)
would yield us
ID Name DeptID DeptName
0 1 Alice 101 HR
1 2 Bob 102 IT
2 3 Charlie 103 Marketing
3 5 John 103 Marketing
4 4 David 104 <NA>
Inner Join
An inner join keeps only the rows whose join columns or indexes are present in both data frames, that is, the intersection of the two key sets.
A left semi join is a type of join that returns only the rows from the left
table that match with the right table; rows from the left table that do not
seem to have a match on the right table are ignored.
A left anti join can be seen as an inverse of a left semi join. Left anti joins
would return the rows from the left table that do not have matches with the
right table.
Advanced Operations
Let us look at some advanced data wrangling operations using CuDF.
Group-By Function
Now let us try to get total sales by department. To accomplish this, we need
to group the dataset by the department and obtain the sum of the sales
numbers within that respective department.
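A minimal sketch with an illustrative sales frame (the column names and values are assumptions):
sales = cudf.DataFrame(
    {
        "Department": ["HR", "IT", "HR", "Marketing", "IT"],
        "Sales": [100, 250, 175, 300, 125],
    }
)
total_by_dept = sales.groupby("Department")["Sales"].sum()
print(total_by_dept)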
Transform Function
apply()
Though CuDF provides many of the commonly used methods and functions in GPU-accelerated mode, there may well be cases where you have to write a custom function, also referred to as a user-defined function (UDF). These user-defined functions are often applied to a series or a data frame object on a given column or set of columns. This is where the apply() method is beneficial: it enables you to apply user-defined functions over a given column.
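A minimal sketch of applying a user-defined function to a numeric column, reusing the sales frame from the sketch above (the function is illustrative; cuDF compiles the UDF for the GPU, so it should stay simple and numeric):
def add_bonus(x):
    return x + 10

sales["SalesWithBonus"] = sales["Sales"].apply(add_bonus)
print(sales)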
Cross Tabulation
Let us look at some advanced functions in CuDF. First, we will try to use
the “crosstab” function. Crosstab stands for cross tabulation that displays
the frequency distribution of variables in a table, where the row and column
belong to a data point and the value in the table denotes how often the
value occurs. Cross tabulation view is useful in exploring data, analyzing
relationships, or even just looking at distribution of a variable across
various categories.
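Assuming your cuDF version provides a pandas-style crosstab function (this is an assumption; if it does not, the same table can be produced by converting to Pandas first), a sketch might look like this:
freq = cudf.crosstab(cudf_df["state"], cudf_df["occupation"])   # assumption: crosstab is available
print(freq)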
The cut function is used for creating bins or buckets with discrete intervals
and dropping the numerical values into appropriate buckets. It is commonly
used to transform continuous data into categorical data. A common
application of this function is to create categorical features from continuous
variables.
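A minimal sketch of binning a numeric column into discrete intervals (the values and bin edges are illustrative):
ages = cudf.Series([15, 22, 37, 45, 63, 71])
age_groups = cudf.cut(ages, bins=[0, 18, 40, 65, 100])
print(age_groups)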
Factorize Function
Window Functions
A window function is another key analytical function, where an operation is performed over a specific window of rows, typically defined by their position in the data. A good example is calculating a moving average over time-ordered data. Let us look at an example:
ts_data = cudf.Series(
[10, 20, 30, 40, 50, 55, 60, 65, 70, 72, 74, 76, 78, 80]
)
moving_avg = ts_data.rolling(window=3).mean()
print("Moving average:", moving_avg)
would yield
Moving average:
0 <NA>
1 <NA>
2 20.0
3 30.0
4 40.0
5 48.33333333
6 55.0
7 60.0
8 65.0
9 69.0
10 72.0
11 74.0
12 76.0
13 78.0
dtype: float64
CuDF Pandas
CuDF provides a Pandas accelerator mode that supports the Pandas API's functions and methods. This means that CuDF can accelerate Pandas code on the GPU without the code having to be entirely rewritten. In cases where an operation cannot leverage the GPU, the Pandas accelerator mode automatically falls back to executing the Pandas code on the CPU.
It may be a good idea to recall that CuDF supports most of the commonly
utilized Pandas functions and methods. If possible, it may be a better idea
to rewrite the Pandas code into CuDF code. Moreover, if you are working
with a large dataset with varying complex operations, it may be better to
test the cudf.pandas with a subset and compare the same with Pandas.
To enable cudf.pandas, simply add the following line in your Pandas code
for the GPU accelerator mode:
%load_ext cudf.pandas
Note
Conclusion
So far, we have gained an appreciation toward data processing in traditional
CPU environments and GPU-accelerated environments. We looked at key
concepts of GPU programming like kernels, threads, and memory
management, which form the foundation of efficient GPU utilization. We
also gained hands-on CuDF programming, performing advanced data
manipulation operations, feature engineering, and complex joins. This
chapter can serve as a foundation for setting up cost-effective and robust
machine learning pipelines. We will see more of these in upcoming
chapters.
Introduction
Data validation has been in existence for a long time. During the early
adoption of relational database management systems (RDBMSs), basic
validation rules were defined within a limited scope. As SQL evolved
within the relational systems, it provided more opportunities with respect to
specifying better validation rules by writing SQL code. In the modern days
of big data and machine learning, data validation has occupied greater
relevance. The quality of data directly affects the insights and intelligence
derived from analytical models that are built using them. In this chapter we
will explore two major data validation libraries, namely, Pydantic and
Pandera, and delve into features, capabilities, and practical applications.
As we have now moved into fully mature machine learning and analytics
development, advanced data validation techniques are now employed as
part of data quality measures. Data quality is an important part of the data
management process.
The entire machine learning modeling and data science analysis process
runs on a few assumptions. Just as general linear models assume that the underlying data is normally distributed, machine learning models and data science analyses are based on assumptions about data. Some of the assumptions in our projects may look like the following:
Our data is current, that is, believed to represent the reality of the
situation.
If the model exceeds the accuracy threshold, then the model is good;
otherwise, models and algorithms need to be retrained.
Models are built and decisions are made based on the idea that underlying
incoming data is good, consistent, and accurate. And so, any bottlenecks
and inconsistencies may be attributed to the model itself and require further
development cycles.
Incomplete data
Inconsistent data
Outdated data
Definition
The data validation process may occur during data entry, data import, or
data processing and involves checking data against predefined rules to
ensure it meets the expected criteria.
Data Accuracy
Data accuracy is the process of verifying that data is free from errors. The goal is to transform and sanitize the data so that it remains acceptable and correct in the context of the project or business. A lack of data accuracy may not immediately break downstream data pipelines, as long as the values are of the data types those processes expect; however, it can certainly have a major impact on the quality of the results and intelligence those processes generate. Hence, it is important to keep the data accurate and current.
Data Uniqueness
Data uniqueness is when each row or a data point in either a row or column
is not a copy of another data point and not a duplicate in any form. In many
cases, a data point may repeat itself in several instances due to the nature
and context (e.g., categorical bins in numerical format). However, in some
cases, where the data point is the output of a transformation that involves
more than one input data point, it may be beneficial to have the uniqueness
tested and validated. Another application is that, when the data point is a
floating point number with mandatory “n” number of digits after the
decimal, it certainly helps to validate the uniqueness of the data point.
Data Completeness
Data Range
Data Consistency
Data Format
Data format validation is the process of verifying whether the data follows a different encoding than expected and whether the data types are in the appropriate format. Data format is an important part of validation. When data is transported between systems, there is a greater likelihood that one system may not support the native encoding of the other. And so, data types are often converted or imposed on data fields. A common issue is a date being read as a string. Another is an identification number that is cast as a string and gets auto-converted into an integer; the problem is that it may lose all its leading zeroes, which is the very reason such identifiers are cast as strings in the first place.
Referential Integrity
Data validation can be performed during the development phase, or you can
set up an automated process of checking various data points. In this chapter,
we are going to look at Pydantic library.
Introduction to Pydantic
Pydantic is the most popular Python library for performing data validation. The way it works is that Pydantic enforces data types and constraints while loading data into Pydantic models. Some of the most effective applications of Pydantic include validating incoming request payloads, managing application settings, and enforcing schemas on data exchanged between services.
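As a minimal sketch of the kind of example being described, consider a function whose annotations promise a float while it explicitly returns an integer:
def addition(a: int, b: int) -> float:
    c: int = a + b
    return c

print(addition(2, 3))   # prints 5, an int; Python raises no error for the mismatched annotation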
Please note that it is not good programming practice to declare that this addition function returns a float while explicitly returning an integer from it. The idea is simply to show that Python does not enforce the annotated data types.
Pydantic is the most widely used library for data validation in Python. Big
tech like Apple and others along with large governmental organizations use
Pydantic (and contribute too). Pydantic recently released a major upgrade
called Pydantic V2. According to the release notes, Pydantic’s validation
and serialization logic has been written in Rust programming language, so
the V2 will be tens of times faster than the earlier version.
Note
Data coercion is the process of converting data from one type to another, so
the data is usable in a given context.
Pydantic Models
Models are how you define schemas in Pydantic. Pydantic models are
basically writing classes that inherit from BaseModel. Upon defining the
class, you can begin to define the schemas for the dataset. Once the data is
passed through the Pydantic model, Pydantic will make sure that every data
field in the source data will conform to the schema defined in the Pydantic
class.
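A minimal sketch of a Pydantic model (the field names are illustrative):
from pydantic import BaseModel

class Employee(BaseModel):
    name: str
    age: int

employee = Employee(name="Rosa", age="42")   # "42" is coerced to the integer 42 at load time
print(employee)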
Nested Models
Both the parent and immediate child get their own Pydantic model class.
The parent Pydantic model class references the immediate child model
class using a List method.
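The model classes being described here appear again in the JSON schema example later in this chapter; together with a small data list consistent with the output below, they look roughly like this:
import pydantic
from typing import List

class Interests(pydantic.BaseModel):
    first: str
    second: str

class Stu(pydantic.BaseModel):
    first_name: str
    last_name: str
    interests: List[Interests]

data = [
    {"first_name": "Angil", "last_name": "Boshers",
     "interests": [{"first": "sports", "second": "mountains"}]},
    {"first_name": "Aggi", "last_name": "Oswick",
     "interests": [{"first": "long-drives", "second": "beaches"}]},
]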
Let’s create the model object and print the model output:
for i in data:
    output = Stu(**i)
    print(output)
would yield
first_name='Angil' last_name='Boshers' interests=[Interests(first='sports',
second='mountains')]
first_name='Aggi' last_name='Oswick' interests=[Interests(first='long-
drives', second='beaches')]
Fields
The field function in Pydantic is used to specify a default value for a given
field, in case it remains missing from source documents.
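A minimal sketch, supplying a default for a field that may be missing from the source data (the default value is illustrative):
from pydantic import BaseModel, Field

class Student(BaseModel):
    first_name: str
    last_name: str = Field(default="Unknown")

print(Student(first_name="Angil"))   # last_name falls back to "Unknown"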
JSON Schemas
Pydantic allows us to create JSON schemas from the Pydantic model class
we created. As you may recall, a JSON schema is simply a JSON data
structure. A JSON schema asserts how the JSON document will be like.
Let us revisit the earlier nested models example. We import the necessary
libraries, create the Pydantic model class, create the Pydantic model object,
and print the JSON schema:
import pydantic, json
from typing import List

with open("/home/parallels/Documents/data2.json") as f:
    data = json.load(f)
print(data[0])

class Interests(pydantic.BaseModel):
    first: str
    second: str

class Stu(pydantic.BaseModel):
    first_name: str
    last_name: str
    interests: List[Interests]

print(Stu.schema_json(indent=3))
# If the schema_json method on BaseModel is deprecated in your Pydantic version,
# use the model_json_schema method instead:
# print(json.dumps(Stu.model_json_schema(), indent=3))
would yield
{
   "title": "Stu",
   "type": "object",
   "properties": {
      "first_name": {
         "title": "First Name",
         "type": "string"
      },
      "last_name": {
         "title": "Last Name",
         "type": "string"
      },
      "interests": {
         "title": "Interests",
         "type": "array",
         "items": {
            "$ref": "#/definitions/Interests"
         }
      }
   },
   "required": [
      "first_name",
      "last_name",
      "interests"
   ],
   "definitions": {
      "Interests": {
         "title": "Interests",
         "type": "object",
         "properties": {
            "first": {
               "title": "First",
               "type": "string"
            },
            "second": {
               "title": "Second",
               "type": "string"
            }
         },
         "required": [
            "first",
            "second"
         ]
      }
   }
}
Constrained Types
A constrained type is a method offered by Pydantic that dictates the values
each field can take on in the Pydantic model class. Pydantic offers the
following types:
For int, there is the conint method; conint can be used to set upper and lower bounds for an integer field.
For str, there is the constr method; constr can enforce constraints such as minimum and maximum length, or uppercase or lowercase form, wherever required.
For floats, there is the confloat method; once again, it helps to set lower and upper bounds for floating point values.
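A minimal sketch using these constrained types (the field names and bounds are illustrative; these helpers work in Pydantic V1 and V2, though V2 also lets you express the same constraints through Field arguments):
from pydantic import BaseModel, conint, confloat, constr

class Course(BaseModel):
    code: constr(min_length=3, max_length=10)
    credits: conint(ge=1, le=15)
    gpa_weight: confloat(gt=0.0, le=4.0)

print(Course(code="CS101", credits=3, gpa_weight=4.0))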
Validators in Pydantic
Pydantic provides a way to validate field data using validator methods. You can even bind validation to a type instead of a model. It is worth noting that validators do not get executed if a default field value is used. There are several types of validators:
Here, let us import necessary libraries and the data file (in this example, a
flat file):
from pydantic import BaseModel, field_validator
import csv
data = csv.DictReader(open("/home/parallels/Downloads/MOCK_DATA.csv"))
The dataset is about persons and contains two variables, name and age.
Name is a string and age is an integer. The idea is to validate how many of
the persons are less than 21 years old. Now, let us define the Pydantic
model class with a validator:
class Person(BaseModel):
    name: str
    age: int

    @field_validator('age')
    def agelimit(cls, value: int) -> int:
        if value < 21:
            raise ValueError("Less than 21 years old")
        return value
And so, when we run the model output this way
for person in data:
    pydantic_model = Person(**person)
    print(pydantic_model)
we get the following output:
name='Ermentrude Azema' age=22
name='Joey Morteo' age=56
name='Jordan Corp' age=56
name='Jehu Kettlesing' age=25
name='Adham Reeson' age=40
name='Ronda Colomb' age=32
name='Bond Dufton' age=47
name='Pearce Mantione' age=47
name='Reginald Goldman' age=39
name='Morris Dalrymple' age=32
name='Lemar Gariff' age=39
name='Leonhard Ruberti' age=63
name='Alyssa Rozier' age=63
name='Cassey Payze' age=22
name='Lonnie Mingaud' age=56
name='Town Quenby' age=39
Traceback (most recent call last):
File ~/anaconda3/lib/python3.11/site-
packages/spyder_kernels/py3compat.py:356 in compat_exec
exec(code, globals, locals)
File ~/Documents/pyd_model.py:25
pydantic_model = Person(**person)
File pydantic/main.py:341 in pydantic.main.BaseModel.__init__
ValidationError: 1 validation error for Person
age
Less than 21 years old (type=value_error)
Introduction to Pandera
Pandera is an MIT-licensed, open source Python library that performs data
validation in the context of cleaning and preprocessing. Pandera works well
with Pandas’ data frames’ data structure. Pandera enables declarative
schema definition, as in you can define the expected data structure
including data types, constraints, and other properties. Once a schema is
defined, Pandera can validate whether the incoming Pandas data frame
object adheres to the user-defined schema (including data types and
constraints), checking whether the fields are present among other things.
Furthermore, Pandera supports custom functions and user-defined
validation checks, which may be very domain specific for the incoming
dataset.
Pandera also supports statistical validation like hypothesis testing. Note that
statistical validation is very different from data validation; statistical
validation is where one hypothesizes that data should have a certain
property or meet a condition, which is then validated by applying various
statistical methods and formulas toward samples of the population data.
Pandera supports lazy validation; and so, when a data frame is set to be
validated against a long list of data validation rules, Pandera will perform
all the defined validation before raising an error.
Pandera offers a data frame schema class that is used to define the
specifications of the schema, which is then used to validate against columns
(and index) of the data frame object provided by Pandas. Pandera’s column
class enables column checks, data type validation, and data coercion and
specifies rules for handling missing values.
Here is an example:
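A minimal sketch consistent with the discussion that follows (the column names and values are illustrative):
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "name": pa.Column(str, coerce=True),       # integer values will be coerced into strings
    "score": pa.Column(float, nullable=True),  # missing values are accepted for this column
})

df = pd.DataFrame({"name": [101, 102], "score": [3.5, None]})
validated = schema.validate(df)
print(validated.dtypes)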
As you can see, the schema enforces itself strictly onto the data frame. To
accept null or missing values, one can specify an argument like
nullable=True.
And so, the integer value has been coerced into a string.
Checks in Pandera
Let us define a check. In this example, we will check whether a given number is odd, by testing membership in the odd values between 1 and 9:
from pandera import (Check,
DataFrameSchema,
Column,
errors)
import pandas as pd
odd = Check.isin([1,3,5,7,9])
Let us define the Pandera schema and data:
schema = DataFrameSchema({
"series": Column(int, odd)
})
data = pd.DataFrame({"series": range(10)})
Let us attempt to validate the schema:
try:
    schema.validate(data)
except errors.SchemaError as e:
    print("\n\ndataframe of schema errors")
    print(e.failure_cases)  # dataframe of schema errors
    print("\n\ninvalid dataframe")
    print(e.data)  # invalid dataframe
would yield
dataframe of schema errors
index failure_case
0 0 0
1 2 2
2 4 4
3 6 6
4 8 8
invalid dataframe
series
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
And so, both the index and element are given in the output.
Note
You can also specify whether you need element-wise check operations. For
our previous example, here is how that looks:
from pandera import (Check,
DataFrameSchema,
Column,
errors)
import pandas as pd
odd = Check(
lambda x: x % 2 != 0,
element_wise=True
)
data = pd.DataFrame(
{
"series": [1,3,5,7,9]
}
)
schema = DataFrameSchema(
{
"series": Column(int, odd)
}
)
try:
    print(schema.validate(data))
except errors.SchemaError as e:
    print("\n\ndataframe of schema errors")
    print(e.failure_cases)  # dataframe of schema errors
    print("\n\ninvalid dataframe")
    print(e.data)  # invalid dataframe
would yield
series
0 1
1 3
2 5
3 7
4 9
I have generated a dataset that contains some kind of score for both iOS
and Android operating systems. It is a randomly generated dataset. A
sample output would look like
os hours
0 ios 4.5
1 Android 5.1
2 ios 5.1
3 Android 5.1
4 ios 5.7
Please note that the data is randomly generated and I needed to split the
column into two columns, followed by rearranging the columns.
My claim is that iOS has a better score than Android. Once again, these are randomly generated floating point numbers and do not represent anything real. Here is how the claim is interpreted in statistical language:
Null hypothesis: There is no significant difference in value scores between iOS and Android.
Alternative hypothesis: iOS has significantly higher value scores than Android.
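Pandera can encode this claim as a hypothesis check on the hours column; the sketch below assumes the hypothesis extra (which relies on scipy) is installed, and the column and group names follow the sample output above:
import pandera as pa
from pandera import Column, DataFrameSchema, Hypothesis

hypothesis_schema = DataFrameSchema({
    "os": Column(str),
    "hours": Column(
        float,
        Hypothesis.two_sample_ttest(
            sample1="ios",
            sample2="Android",
            groupby="os",
            relationship="greater_than",
            alpha=0.05,
        ),
    ),
})
# hypothesis_schema.validate(scores_df) raises a SchemaError if the claim is not supported
# (scores_df stands for the randomly generated data frame described above)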
Lazy Validation
So far we have seen that when we call the “validate” method, the
SchemaError is raised immediately after one of the assumptions specified
in the schema is falsified. And so, the program stops executing other
instructions supplied. In many cases, it may be more useful to see all the
errors raised on various columns and checks. Lazy validation is an option
that enables just that. We have seen lazy evaluation of code in our earlier
chapter. Lazy evaluation is when the code is executed only when it is
actually needed.
Let us validate a simple table with a record in it. First, import the libraries
and method, followed by defining the Pandera schema:
import pandas as pd
import json
from pandera import (Column,
Check,
DataFrameSchema,
DateTime,
errors)
# define the constraints
check_ge_3 = Check(lambda x: x > 3)
check_le_15 = Check(lambda x: x <= 15)
check_gpa_0 = Check(lambda x: x > 0.0)
check_gpa_4 = Check(lambda x: x <= 4.0)
# schema
schema = DataFrameSchema(
columns={
"StudentName" : Column(str, Check.equal_to("John")),
"CreditsTaken" : Column(int, [check_ge_3, check_le_15]),
"GPA" : Column(float, [check_gpa_0, check_gpa_4]),
"Date" : Column(DateTime),
},
strict=True
)
Now, let us load the data:
df = pd.DataFrame(
{
"StudentName" : ["JohnDoe", "JaneDoe","DoeJohn"],
"CreditsTaken" : [9,12,18],
"GPA" : [3.7, 4.01, 3.5],
"Date" : None,
}
)
Let us perform the lazy validation:
try:
    schema.validate(df, lazy=True)
except errors.SchemaErrors as e:
    print(json.dumps(e.message, indent=2))
    print("dataframe of schema errors")
    print(e.failure_cases)  # dataframe of schema errors
    print("invalid dataframe")
    print(e.data)  # invalid dataframe
would yield
{
  "DATA": {
    "DATAFRAME_CHECK": [
      {
        "schema": null,
        "column": "StudentName",
        "check": "equal_to(John)",
        "error": "Column 'StudentName' failed element-wise validator number 0: equal_to(John) failure cases: JohnDoe, JaneDoe, DoeJohn"
      },
      {
        "schema": null,
        "column": "CreditsTaken",
        "check": "<lambda>",
        "error": "Column 'CreditsTaken' failed element-wise validator number 1: <Check <lambda>> failure cases: 18"
      },
      {
        "schema": null,
        "column": "GPA",
        "check": "<lambda>",
        "error": "Column 'GPA' failed element-wise validator number 1: <Check <lambda>> failure cases: 4.01"
      }
    ]
  },
  "SCHEMA": {
    "SERIES_CONTAINS_NULLS": [
      {
        "schema": null,
        "column": "Date",
        "check": "not_nullable",
        "error": "non-nullable series 'Date' contains null values: 0 None 1 None 2 None Name: Date, dtype: object"
      }
    ],
    "WRONG_DATATYPE": [
      {
        "schema": null,
        "column": "Date",
        "check": "dtype('datetime64[ns]')",
        "error": "expected series 'Date' to have type datetime64[ns], got object"
      }
    ]
  }
}
dataframe of schema errors
schema_context column check check_number failure_case index
0 Column StudentName equal_to(John) 0 JohnDoe 0
1 Column StudentName equal_to(John) 0 JaneDoe 1
2 Column StudentName equal_to(John) 0 DoeJohn 2
3 Column CreditsTaken <lambda> 1 18 2
4 Column GPA <lambda> 1 4.01 1
5 Column Date not_nullable None None 0
6 Column Date not_nullable None None 1
7 Column Date not_nullable None None 2
8 Column Date dtype('datetime64[ns]') None object None
invalid dataframe
StudentName CreditsTaken GPA Date
0 JohnDoe 9 3.70 None
1 JaneDoe 12 4.01 None
2 DoeJohn 18 3.50 None
Please note that in the preceding example, SchemaErrors (plural) is raised; it comes from the pandera errors module. The exhaustive output of this "try..except" block is also a good way to see detailed validation results.
Pandera Decorators
Note
Let us look at an example where we have a current Python code that adds
two columns and stores the result in a third column, all within a Pandas
data frame. We will leverage the idea of using decorators. Let’s look at the
code, without Pandera validation parameters:
import pandas as pd
data = pd.DataFrame({
"a": [1,4,7,9,5],
"b": [12,13,15,16,19],
})
def addition(dataframe):
    dataframe["c"] = dataframe["a"] + dataframe["b"]
    return dataframe
final_df = addition(data)
print(final_df)
It is a straightforward example where we add two columns and store their
result in a third column. Now let’s look at Pandera data validators along
with decorator integration. First off, we import the necessary functions,
data, and methods:
import pandas as pd
from pandera import DataFrameSchema, Column, Check, check_input
data = pd.DataFrame({
"a": [1,4,7,9,5],
"b": [12,13,15,16,19],
})
Now we define the Pandera data validation through DataFrameSchema.
Please note the check is defined separately and called within the
DataFrameSchema for purposes of readability. You can incorporate the
check directly within the DataFrameSchema:
check_a = Check(lambda x: x <= 10)
check_b = Check(lambda x: x <= 20)
validateSchema = DataFrameSchema({
"a": Column(int,
check_a
),
"b": Column(int,
check_b
),
})
Now, let’s utilize the Pandera decorators. We are going to use the
@check_input, to validate the input columns. The DataFrameSchema is
passed as parameter to the @check_input decorator:
@check_input(validateSchema)
def addition(dataframe):
    dataframe["c"] = dataframe["a"] + dataframe["b"]
    return dataframe
final_df = addition(data)
print(final_df)
Upon executing the code, we would obtain
a b c
0 1 12 13
1 4 13 17
2 7 15 22
3 9 16 25
4 5 19 24
Now, let’s say we change the validation condition of check_b. Currently it
looks like
check_b = Check(lambda x: x <= 20)
The modified check would look like
check_b = Check(lambda x: x <= 18)
Upon executing the code, we obtain the following:
SchemaError: error in check_input decorator of function 'addition':
<Schema Column(name=b, type=DataType(int64))> failed element-wise
validator 0:
<Check <lambda>>
failure cases:
index failure_case
0 4 19
Conclusion
Data validation remains one of the key aspects of engineering data
pipelines. We discussed various principles of data validation and the need
for good data. We then looked at Pydantic, a data validation library that
focuses on Python objects conforming to a certain data model. We then
discussed Pandera, another data validation library that focuses on applying
validation checks on data frame–like objects; these are very effective
toward the data analysis and data preprocessing tasks. In our next chapter,
we will discuss another data validation library that focuses heavily on data
quality assurance and data validation in data pipelines.
Introduction
So far we looked at the need for and idea behind data validation, principles
of data validation, and what happens when data isn’t validated. We also
looked at key Python data validation libraries Pydantic and Pandera.
Pydantic is one of the most utilized libraries in Python. We looked at
creating a model class and instantiating an object from the class. We also
looked at Pandera, a specialized open source Python library, with advanced
data validation options like checking conditions at column levels and so on.
In this chapter, we will be looking at Great Expectations, an entire data
validation framework that is designed for managing data validation and
testing for several production pipelines.
Great Expectations can help you in defining what you wish to expect from
the data that has been loaded and/or transformed. By employing these
expectations, you can identify bottlenecks and issues within your data. In
addition, Great Expectations generates documentation about data and
reports about data quality assessment of the datasets.
Data Context
The data context is the starting point for the Great Expectations API; this component contains the settings, metadata, configuration parameters, and validation output for a given Great Expectations project.
The metadata is usually stored in an object called Great Expectations
stores. The Great Expectations API provides a way to configure and
interact with methods using Python. You can access them using the Great
Expectations public API. Great Expectations also generates data
documentation automatically that contains where the data is coming from
and the validations that the dataset is subjected to.
Data Sources
The data source component tells Great Expectations about how to connect
to a given dataset. The dataset may be a flat file, JSON document, SQL
database, Pandas data frame located locally, or remote machine or one in a
cloud environment. The data source component contains one or more data
assets. Data assets are collections of records from the source data. Data
assets are stored in batches, where a given batch may contain a unique subset of records. Great Expectations submits a batch request internally to
access the data to perform validation and testing. It may be interesting to
note that a data asset in Great Expectations may contain data from more
than one file, more than one database table, etc., whereas a given batch may
contain data from exactly one data asset.
Expectations
Checkpoints
A checkpoint is how the Great Expectations framework validates data in
production pipelines. Checkpoints provide a framework to group together
one or more batches of data and execute expectations against the said
batched data. A checkpoint takes one or more batches, providing them to
the validator module. The validator module generates results, which are
then passed to a list of validation actions. The validation results tell you
how and where the data stands with respect to the defined expectations.
The final step is validation actions that can be completely configured to
your needs. Actions include sending an email, updating data
documentation, etc.
First, let us make sure you have the latest version of Python, version 3.x.
There are two ways of getting started with a new Great Expectations
project. One is the legacy way of interacting with the command line
interface, and the other way is through the modern fluent data source
method.
Figure 6-1
Once the project structure creation has been completed, then you will see
the following screen:
Figure 6-2
Customizing the Great Expectations configuration page
Figure 6-3
Figure 6-4
This would then open up a Jupyter notebook, where you can complete the
setup of your new data source.
Here we have a dataset that contains the gender of a subject and their
respective blood sugar levels. We are going to write an expectation to see
what types of gender are available.
1.
2. Create a data context, where you initialize the data context component to store metadata and other settings management.
3. Connect to data, where you list all your data sources from Pandas data frames to cloud-based databases, and perform necessary steps for each data source.
4.
5.
6.
7.
We have already seen installing Great Expectations, prior to using the CLI,
in the “Setup and Installation” section. Upon successful installation of
Great Expectations, it is time to set up the data context. The data context
contains configurations for expectations, expectation stores, checkpoints,
and data docs. Initializing a data context is as simple as importing the library and running the get_context() method:
import great_expectations as ge
context = ge.get_context()
print(context)
You will get something like this:
{
"anonymous_usage_statistics": {
"explicit_url": false,
"explicit_id": true,
"data_context_id": "14552788-844b-4270-86d5-9ab238c18e75",
"usage_statistics_url":
"https://stats.greatexpectations.io/great_expectations/v1/usage_statistics",
"enabled": true
},
"checkpoint_store_name": "checkpoint_store",
"config_version": 3,
"data_docs_sites": {
"local_site": {
"class_name": "SiteBuilder",
"show_how_to_buttons": true,
"store_backend": {
"class_name": "TupleFilesystemStoreBackend",
"base_directory": "/tmp/tmp_oj70d5v"
},
"site_index_builder": {
"class_name": "DefaultSiteIndexBuilder"
}
}
},
"datasources": {},
"evaluation_parameter_store_name": "evaluation_parameter_store",
"expectations_store_name": "expectations_store",
"fluent_datasources": {},
"include_rendered_content": {
"expectation_validation_result": false,
"expectation_suite": false,
"globally": false
},
"profiler_store_name": "profiler_store",
"stores": {
"expectations_store": {
"class_name": "ExpectationsStore",
"store_backend": {
"class_name": "InMemoryStoreBackend"
}
},
"validations_store": {
"class_name": "ValidationsStore",
"store_backend": {
"class_name": "InMemoryStoreBackend"
}
},
"evaluation_parameter_store": {
"class_name": "EvaluationParameterStore"
},
"checkpoint_store": {
"class_name": "CheckpointStore",
"store_backend": {
"class_name": "InMemoryStoreBackend"
}
},
"profiler_store": {
"class_name": "ProfilerStore",
"store_backend": {
"class_name": "InMemoryStoreBackend"
}
}
},
"validations_store_name": "validations_store"
}
So far, we have initialized the data context. The next step is to create a data source, define a data asset, and build a batch request that will feed a validator.
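With the fluent API of recent 0.18.x releases, a sketch of creating the data source and batch request, consistent with the batch_id seen later (test2-asset1), might look like this (the CSV path is a placeholder, and exact method names can vary between releases):
datasource = context.sources.add_pandas(name="test2")
asset = datasource.add_csv_asset(
    name="asset1",
    filepath_or_buffer="/path-to/health_data.csv",
)
batch_request = asset.build_batch_request()
context.add_or_update_expectation_suite("my_expectation_suite")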
The validator needs to know the batch request that identifies the data to validate and the expectation suite, which contains the combined list of expectations you create:
validator = context.get_validator(
batch_request=batch_request,
expectation_suite_name="my_expectation_suite",
)
print(validator.head())
We will get
Calculating Metrics: 0%| | 0/1 [00:00<?, ?it/s]
id name age state gender exercise_daily? copay_paid
0 1 Allayne Moffett NaN SC Male NaN $206.76
1 2 Kerby Benjafield NaN NM Male True $21.58
2 3 Raina Vallentin NaN MI Female True $125.18
3 4 Kaela Trodden NaN OH Female False $86.20
4 5 Faber Kloisner NaN MN Male False $219.38
Note
Now, let us run the validator to create and run an expectation. Let us look at
the missing values for one of the columns, say “age.” The expectation we
use for this operation is
expect_column_values_to_not_be_null
Here is how the code looks:
expectation_notnull = validator.expect_column_values_to_not_be_null(
column="age"
)
print(expectation_notnull)
This would yield
Calculating Metrics: 0%| | 0/6 [00:00<?, ?it/s]
{
"success": false,
"expectation_config": {
"expectation_type": "expect_column_values_to_not_be_null",
"kwargs": {
"column": "age",
"batch_id": "test2-asset1"
},
"meta": {}
},
"result": {
"element_count": 1000,
"unexpected_count": 1000,
"unexpected_percent": 100.0,
"partial_unexpected_list": [
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null,
null
]
},
"meta": {},
"exception_info": {
"raised_exception": false,
"exception_traceback": null,
"exception_message": null
}
}
As you can see, the expectation fails: every one of the 1,000 records is missing an age value.
You can run a second expectation of the same type against the copay_paid column in exactly the same way.
So far, we have run two expectations. While the expectations are saved in
the expectation suite on the validator object, validators do not persist
outside the current Python session. To reuse the current expectations, it is
necessary to save them. Here is how that looks:
validator.save_expectation_suite(
    "/home/parallels/Documents/my_expectation_suite.json",
    discard_failed_expectations=False
)
For our current program, here is how that JSON document looks:
{
"data_asset_type": null,
"expectation_suite_name": "my_expectation_suite",
"expectations": [
{
"expectation_type": "expect_column_values_to_not_be_null",
"kwargs": {
"column": "age"
},
"meta": {}
},
{
"expectation_type": "expect_column_values_to_not_be_null",
"kwargs": {
"column": "copay_paid"
},
"meta": {}
}
],
"ge_cloud_id": null,
"meta": {
"great_expectations_version": "0.18.4"
}
}
Creating a Checkpoint
A checkpoint is a means of validating data in production pipelines using
Great Expectations. Checkpoints allow you to link an expectation suite and
data; the idea is to validate expectations against the data.
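Here is a rough sketch of creating and running a checkpoint, assuming the batch_request and the "my_expectation_suite" suite created earlier in this chapter; the method names follow the 0.18.x API and may differ in other releases, and the checkpoint name is arbitrary:
# Hypothetical sketch; assumes the data context, batch_request, and
# "my_expectation_suite" built earlier in this chapter.
checkpoint = context.add_or_update_checkpoint(
    name="my_checkpoint",
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": "my_expectation_suite",
        }
    ],
)
checkpoint_result = checkpoint.run()
print(checkpoint_result)
Running the checkpoint validates the batch against every expectation in the suite and records the results for the data docs.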
Data Documentation
Data documentation, or simply data doc, provides interactive
documentation of the data validation results that Great Expectations
processed.
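Assuming the data context from earlier, building and viewing the data docs is a two-line operation; here is a minimal sketch:
context.build_data_docs()   # render validation results into static HTML pages
context.open_data_docs()    # open the local data docs site in a browser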
Expectation Store
So far we have seen connecting to data assets, building batch requests,
creating validators, running expectations on validators, and saving the
expectations on a JSON document. Now, let us see how to manage the
expectations we have built in an expectation store. An expectation store is a
connector to a store and that helps retrieve collections of expectations about
data.
The process is relatively simple, where you create a folder within your
Great Expectations folder, move all your JSON documents into that folder,
and update your configuration file to include a store for expectation results:
mkdir shared_expectations
mv path-to/my_expectation_suite.json destination-path/shared_expectations/
Conclusion
We have continued our discussion of data validation. Great Expectations
is one of the more innovative data validation libraries and ships with a
growing catalog of predefined expectations, which is also a good resource
for exploring different ways of validating your data. You can also manage
collections of predefined expectations in an expectation store. The Great
Expectations project is evolving rapidly; two different versions were
released while this book was being written, and by the time you read it, a
major 1.0 release may well be available.
Introduction
We have discussed earlier that processing large datasets effectively and efficiently is important. As datasets
continue to grow over time, traditional single-thread processing can be seen as a limitation. Python, by default,
has a global interpreter lock (GIL) that allows only one thread to hold the interpreter at a given point in time.
While this design ensures the integrity of computations submitted, it would be much more effective to use
many threads within a multi-core CPU that is commonly available today. In this chapter, you will learn
concepts of parallel programming, concurrent programming, and Dask, a Python library that supports
distributed processing and works around the global interpreter lock limitation by using multiple processes.
Dask also supports various data processing libraries that we have seen in earlier chapters.
By the end of this chapter, you will learn
– Parallel processing concepts and how they differ from sequential processing
– Fundamentals of concurrent processing
– Dask
– Dask architecture and concepts
– Core data structures of Dask and how they enable distributed computing
– Optimizing Dask computations
History
Parallel processing gained momentum during the time when a CPU started to contain more than one processing
unit within its microprocessor architecture. In general, the electrons carrying the digital signals in the
microprocessor travel very fast; there are upper limits on how quickly a single processor can operate. Such
limitations along with the need for minimizing heat generation, among other things, paved the way for multi-
core processors. Today, we have several processing units (we call them cores) in the microprocessor embedded
in our laptops, phones, computers, etc.
from multiprocessing import cpu_count

numberOfCores = cpu_count()
would return a number that denotes the number of cores that are on the microprocessor of your computer.
Here is a simple example of parallel processing using the “multiprocessing” library:
from multiprocessing import Pool

def cube(a):
    return a * a * a

inputs = [5, 12, 17, 31, 43, 59, 67]
with Pool(processes=4) as pool:          # distribute cube() across 4 worker processes
    cubevalue = pool.map(cube, inputs)
print(cubevalue)
The preceding code will display the cube values for the input supplied. Here is how the output may look for you:
[125, 1728, 4913, 29791, 79507, 205379, 300763]
In the code, you can see the Pool class and its processes parameter, which controls how many worker processes share the work.
Concurrent Processing
Concurrent processing is similar to parallel processing in that both deal with multiple tasks at once. However, the
method of execution is different. Concurrent processing refers to two or more tasks making progress over the same
period of time, even though they may not be executing at the exact same instant.
For instance, within a single core, one task may start and perform operations for a bit of time. Let's say that
the first task may need to wait for some other process to perform another operation. And so it waits; during this
time one or more tasks might start and may even finish, depending upon their compute, memory, and storage
requirements and what is available to them.
Rather than waiting for full completion of a task, concurrency enables dynamic sharing and allocation of
memory, compute, and storage during runtime. It is a way of using resources seamlessly.
Note Imagine you are a chef at a restaurant. There are five orders of food that arrived at the exact same
time. You know that not all items in the menu take equal amounts of time. You have one stove and one
microwave for five orders. For the sake of simplicity, assume you have one deep fryer, grill, or something
like that. Now let us illustrate concurrency and parallelism.
Concurrency
Imagine starting off a slow-cooking dish. While that is cooking on simmer, prep vegetables for another
dish. As the vegetables are chopped and prepped, periodically check the slow-cooking dish, and stir maybe a
bit. Transfer the slow-cooking dish to the oven while placing the prepped vegetables on the stove. As the
vegetables start cooking, prepare eggs for another dish. Check the vegetables and the oven periodically. You
are switching between tasks, making progress on three orders even though you are working alone. This is
concurrency.
Parallelism
Imagine five chefs for five orders. Each chef takes care of one order with their own complete set of
equipment (stove, oven, etc.) and tools (knife, spoon, etc.). They work on their own dish, with their own
equipment and focusing on their own task. This is parallelism.
Concurrency is when one CPU core deals with multiple tasks, using shared resources, and rapidly
switches between tasks. Parallelism is when “n” CPU cores deal with “n” tasks using dedicated resources to
each core.
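As a rough illustration of the chef analogy in code (the order names and timings here are made up), a single pool of threads can make progress on several "orders" at once while each one waits:
from concurrent.futures import ThreadPoolExecutor
import time

def prepare(order):
    time.sleep(0.5)                 # simulates waiting on the stove or oven
    return f"order {order} ready"

# One cook (one process) switching between five orders using threads: concurrency.
# Five separate processes, each pinned to its own core, would be parallelism.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(prepare, range(1, 6)))
print(results)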
Introduction to Dask
Dask is a specialized Python library for enabling parallel and distributed computing for data analysis and
machine learning tasks in Python. Dask is a Python native library and does not depend on non-Python
programming languages for its implementation.
Dask can work well with major Python libraries like Pandas, Polars, CuDF, scikit-learn, and XGBoost, to
name a few. If you are working with data processing and machine learning pipelines with data that exceeds the
compute and memory requirements of a single machine, then you may benefit from adopting Dask in your
technology stack.
Dask has a relatively gentle learning curve, and it is relatively easy to integrate Dask into your production
data pipelines. Dask is used by large banks, major healthcare organizations, and national laboratories.
Dask supports many of the common data formats that are prevalent in big data pipelines such as Avro,
HDF5, Parquet, etc. while also supporting traditional data formats like CSV, JSON, etc.
Dask primarily leverages CPUs for performing its computations and operations. However, Dask can also
work with GPUs. Dask and the CuDF Python library interoperate well with each other. Let us dive in detail on
this topic in the coming sections.
You can install Dask with pip:
pip install "dask[complete]"
or with conda:
conda install dask
One can also install Dask in a distributed computing cluster. Dask works well with Hadoop clusters; you
can deploy Dask with YARN, a resource manager for Hadoop that deploys various services within a given
Hadoop cluster. You can also create a cloud computing Dask cluster on AWS or GCP.
Here is a simple hello world–like implementation using Dask:
import dask
import dask.array as da
arr = da.from_array(
[1,2,3,4,5,6,7,8,9,10]
, chunks=2
)
squared = arr ** 2
result = squared.compute()
print(result)
In the preceding example, we started off by importing the necessary modules. We then created a
Dask array with chunks=2, meaning the array is split into chunks of two elements each (five chunks for this
ten-element array). We then defined a lazy expression called squared that takes the square of each value in the array.
The actual execution happens only when we call the "compute()" method, which triggers the computation and returns the result.
This is due to the lazy evaluation nature of Dask.
When we print the result, it will return
[ 1 4 9 16 25 36 49 64 81 100]
Features of Dask
Tasks and Graphs
In Dask, computation is represented as a collection of tasks as nodes that are connected using edges. These
nodes and edges are organized in a directed acyclic graph (DAG) manner. The tasks that are needed to be
accomplished are represented as nodes. The data dependencies between various tasks (or no data dependency)
are represented as edges. The directed acyclic graph representation enables Dask to effectively execute the
tasks and their underlying dependencies in multiple threads and processes or even across multiple CPUs.
Lazy Evaluation
We have seen the concept of lazy evaluation in previous chapters. To refresh your memory, lazy evaluation is
where the computation does not happen immediately. The execution of a given set of code happens only if it is
actually needed. Until then, this code may be delayed and sometimes it may not be executed at all. As we
already know, Dask creates a directed acyclic graph to represent its execution flow. But it never actually
executes every node, unless it is needed. This way, Dask optimizes the computation.
Partitioning and Chunking
Partitioning means splitting a very large dataset into multiple smaller, more manageable pieces. These
smaller datasets can be processed independently. By leveraging partitioning, Dask can handle
datasets that are larger than the available memory. Chunking is similar to partitioning except that it is
used at a smaller scope. The term "partitioning" is used heavily in the context of databases and data tables, where
a given table is split into multiple smaller tables; upon processing, these tables may be regrouped or joined
back into their original structure (similar to the MapReduce style of programming).
On the other hand, "chunking" is mostly associated with splitting arrays into smaller pieces. You can
specify the size of the chunks when you create a Dask array. In the Dask context, it is easiest to say
that "partitioning" applies to databases, tables, and data frames, while "chunking" applies to
one-dimensional or n-dimensional arrays.
Dask-CuDF
In a previous chapter, we have discussed GPU computing and CuDF in detail. As we know already, CuDF is a
GPU computation framework, which enables data frame processing across one or more GPUs. CuDF provides
various methods and functions that can import, export, and perform complex data transformations, which can
run strictly on GPUs.
Dask, on the other hand, is a general-purpose parallel and distributed computing library. We can use Dask
on top of Pandas and other leading data analysis libraries to scale out our workflow, across multiple CPUs.
Dask-CuDF is a specialized Python package that enables processing data frame partitions using CuDF
instead of Pandas. For instance, if you call Dask-CuDF to process a data frame, it efficiently calls CuDF's
methods to process the data (by sending it to the GPU). If the data in your processing
pipeline is larger than the limits of a single machine, then you may benefit from using Dask-CuDF.
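As a rough sketch (assuming an NVIDIA GPU with the RAPIDS libraries installed, and a hypothetical CSV path), using Dask-CuDF looks almost identical to using a Dask DataFrame:
import dask_cudf

# Each partition is read and held as a CuDF data frame on the GPU.
gdf = dask_cudf.read_csv("/your-folder-path-here/mockdata-*.csv")

# Same lazy API as a Dask DataFrame; compute() triggers GPU execution.
print(gdf["age"].mean().compute())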
Dask Architecture
Let us review the various components in Dask. The architecture of Dask is flexible in that it allows single-
instance computations, single-instance with multi-core computations, and distributed computations.
Core Library
At the heart of the Dask system are the Dask collections, which consist of various data structures: Dask arrays,
which resemble NumPy arrays; Dask data frames, which resemble Pandas data frames;
and Dask bags, which resemble Python lists and iterators. These collections enable you to perform
parallel and distributed computations natively. Just by calling Dask's collections, you will be able to parallelize
or asynchronously execute the computation.
Schedulers
The Dask scheduler is a major component of the Dask execution mechanism. The Dask scheduler sits in the
heart of the cluster, managing the computation, execution of tasks among workers, and communication to
workers and clients. The scheduler component itself is asynchronous, as it can track progress of multiple
workers and respond to various requests from more than one client. The scheduler tracks work as a constantly
changing directed acyclic graph of tasks. In these directed acyclic graphs, each node can be a task and each arc
(edge between nodes) can be a dependency between tasks.
A task is a unit of computation in Dask, essentially a Python function applied to Python
objects. There are different types of schedulers available in Dask. On a single machine there is the
synchronous (single-threaded) scheduler, which is useful for debugging; the threaded scheduler, which executes
tasks using multiple threads of a single process; and the multiprocessing scheduler, which executes tasks using
multiple processes across the cores of the machine. There is also a distributed scheduler, where tasks are
executed using multiple worker processes, either on one machine or across the nodes of a cluster.
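As a small sketch of how you might choose between these schedulers in practice (the array here is just a placeholder workload):
import dask
import dask.array as da

arr = da.random.random((1_000_000,), chunks=100_000)

# Choose a scheduler per call ...
total = arr.sum().compute(scheduler="threads")       # threaded scheduler
total = arr.sum().compute(scheduler="processes")     # multiprocessing scheduler
total = arr.sum().compute(scheduler="synchronous")   # single-threaded, handy for debugging

# ... or set a default for the whole session.
dask.config.set(scheduler="threads")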
Client
Dask’s client is the primary mode of communication with Dask. You communicate with Dask using a client.
When you create a client without pointing it at an existing cluster, it starts a local scheduler and workers for you.
From the computation point of view, the client manages communication with the scheduler process,
and the scheduler coordinates and assigns work to worker processes; these worker processes perform the
necessary computation and return the result back to the client. Dask’s client communicates with the scheduler
and collects results from the workers as well. Here is how you may instantiate the Dask client:
from dask.distributed import Client

client = Client(
    threads_per_worker=4,
    n_workers=2
)
client
Workers
Workers are end computation machines that perform the computation that is assigned to them by the scheduler.
A worker can be a single thread or a separate process or a separate machine in a cluster; each worker has access
to compute, memory, and storage. Workers communicate with the schedulers to receive tasks from them,
execute the tasks, and report the results.
Task Graphs
As discussed in the earlier section, the computation is represented as a constantly changing directed
acyclic graph of tasks, where the edges represent dependencies between tasks. Task graphs define the dependencies
between various tasks. When you write Dask code, you are implicitly creating task graphs by performing
operations on a Dask array, Dask DataFrame, or Dask bag. The Dask system builds the
underlying task graph automatically.
Dask Arrays
A Dask array is a parallel implementation of NumPy's ndarray. Dask arrays are made up of many NumPy arrays,
split into smaller, uniformly sized pieces called chunks; Dask uses blocked algorithms to operate on them.
You can either specify the chunk sizes yourself or let Dask split the array automatically. These chunks are then
processed independently, allowing distributed and parallel data processing across cores or workers within a
cluster. A Dask array supports parallelism and multi-core processing out of the box, so you do not have to
write additional code or add a decorator to enable concurrency.
Let us create a simple Dask array:
import dask.array as da

var1 = da.random.random(
    (10, 10),
    chunks=(2, 2)
)
print(var1)
would return a lazy Dask array representation showing its shape, dtype, and chunk sizes; the values are not computed yet.
Here, we have a random array of (10 × 10) dimensions that is broken into smaller chunks of (2 × 2)
dimensions each.
When we call the T attribute followed by compute() on our variable, we get the transposed random matrix. Here is how that looks:
print(var1.T.compute())
The following is something that is not to be tried at home, but instead in a cloud account. Assuming you have
adequate memory for your dataset, you can try to persist the data in memory.
Here is how we can do that:
var2 = var1.persist()
You can also re-chunk your Dask array. Here is how we can do that:
var1 = var1.rechunk((5,5))
print(var1)
would return the Dask array representation, now re-chunked into (5 × 5) blocks.
We can export this Dask array in various formats. Let us look at HDF5, an open format designed for storing
large, complex data. To do so, we first need to install the Python library "h5py":
import dask.array as da

var3 = da.random.random(
    (10000, 10000),
    chunks=(100, 100)
)
da.to_hdf5(
    'daskarray.hdf5',
    '/dask',
    var3
)
We have imported the necessary library and created the Dask array. We are using the "to_hdf5" method to
export the Dask array as an HDF5 file, specifying the file name, the dataset path within the file, and the
array that we are exporting.
With the Dask array, you can perform many NumPy operations. Here is how you can calculate the mean()
function of the array:
dask_mean = var3 \
    .mean() \
    .compute()
print(dask_mean)
would yield
0.5000587201947724
Dask Bags
Dask bags are another type of data structure, designed to handle and process list-like objects. They are
quite similar to Python lists or iterators, with the added capability of parallel and distributed
computation. The word "bag" is actually a mathematical term. Let us quickly recall what a set is: an
unordered collection with no repeated elements. Bags are basically sets in which elements can repeat;
they are also known as multisets.
from dask import bag

new_bag = bag.from_sequence(
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    npartitions=2
)
print(new_bag)
print(new_bag.compute())
would yield
dask.bag<from_sequence, npartitions=2>
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
import json
from dask import bag
file = "/content/dask_data1.json"
bags = bag \
.read_text(file) \
.map(json.dumps) \
.map(json.loads)
print(bags.take(1))
would yield
('[{"id":1,"name":"Godard
Bloomer","language":"Greek","gender":"Male","occupation":"Financial
Analyst"},\n',)
file = "/your-folder-path-here/mockdata-0.csv"
bags = bag \
.read_text(file) \
.map(
lambda x: next(csv.reader([x]))
)
print(bags.take(5))
would yield the first five rows of the CSV file, each parsed into a list of values. We can also count the elements in the bag:
print(bags.count().compute())
would yield
15
Dask DataFrames
Dask data frames are an implementation of Pandas data frame objects, except that Dask data frames can
parallelize and perform distributed computations across multiple cores of the same processor in a system or
across multiple systems in a given cluster. Dask data frames also support several native Python functions
that can be executed in a parallelized manner. Let us look at some illustrations of Dask data frames:
import dask
import dask.dataframe as dd
import pandas as pd
import numpy as np

dask.config.set(
    {
        'dataframe.query-planning': True
    }
)
df = dd.from_pandas(
    pd.DataFrame(
        np.random.randn(10, 3),
        columns=['A', 'B', 'C']
    ),
    npartitions=3)
print(df)
would yield the lazy Dask DataFrame structure, showing the column dtypes and the partitions rather than the actual values.
"dask-expr" is a Dask expressions library that provides query optimization features on top of Dask data
frames. The "query-planning" property, set to True here, may be enabled by default in upcoming
Dask versions; it provides better optimization and planning for the execution of Dask
data frames.
To convert the lazy Dask collection to in-memory data, we can use the compute() function. The caveat is
that the entire data should fit into the memory:
print(df.compute())
would yield
A B C
0 -0.580477 0.299928 1.396136
1 0.076835 -0.343578 -0.381932
2 -0.092484 -2.525399 1.520019
3 -0.631179 0.215883 0.409627
4 -0.655274 0.117466 -0.525950
5 -0.003654 -0.215435 -0.522669
6 0.947839 -1.978668 0.641037
7 -0.755312 0.247508 -1.881955
8 1.267564 1.342117 0.601445
9 -1.395537 0.413508 -0.896942
Internally, a Dask DataFrame manages several smaller Pandas data frames arranged along an index. You will
see the true performance of a Dask DataFrame when you have a large dataset that requires execution on
multiple cores or even multiple machines. You cannot truly appreciate the value of Dask if you are dealing with
smaller data frames whose computation times are only a few seconds on a single core.
By default, a data frame created with from_pandas is split into one partition. If you specify the number of partitions,
then Dask will partition the dataset accordingly. In our preceding example, we have set the value to 3,
so Dask creates three Pandas data frames, split row-wise along the index. The way we define
our partitions heavily influences the performance of our data munging scripts. For instance, if we needed
aggregations for a specific month, say January, and that month fell across two partitions, then Dask would have to process both
of those partitions. On the other hand, if our partitions were made by month, then the queries would execute
faster and more efficiently.
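As a rough sketch of that idea (the visit_date column here is hypothetical), you can align partitions with your access pattern by setting a sorted datetime index and repartitioning by month:
# Assumes a Dask DataFrame `df` with a hypothetical, already-sorted
# datetime column named visit_date.
df = df.set_index("visit_date", sorted=True)
df = df.repartition(freq="1M")            # one partition per calendar month
january = df.loc["2024-01"].compute()     # now touches a single partition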
As with loading data, you can load data from any number of sources. Let us look at an example:
import dask
import dask.dataframe as dd
import pandas as pd
import numpy as np
dask.config.set(
{
'dataframe.query-planning': True
}
)
csvfile = "/content/mock_health_data.csv"
df1 = dd.read_csv(csvfile)
print(df1)
print(df1.compute())
would yield the lazy Dask DataFrame representation followed by the full contents of the file.
In the preceding example, the compute() function is applied in order to materialize the Dask data frame.
As mentioned earlier, Dask uses lazy evaluation by default, so you need to call the compute() function
explicitly to trigger the execution.
Dask Delayed
There are cases where the problems do not easily fit into the abstraction of series, arrays, or data frames. This is
where the Dask delayed feature comes in. Using Dask delayed, you can parallelize the execution of your own
algorithms. Let us consider the following example:
def add(x):
return x + 1
def square(x):
return x * x
def square_add(a,b):
return a+b
data = [1, 2, 3, 4, 5]
output = []
for x in data:
a = add(x)
b = square(x)
c = square_add(a,b)
output.append(c)
In the preceding example, add and square do not depend on each other, so for each element they could run
in parallel, while square_add only needs their results. But expressing that parallelism by hand is not easy, and
the problem does not naturally fit into a series, array, or data frame.
Now let us use the Dask delayed feature on this problem:
import dask

output = []
for x in data:
    a = dask.delayed(add)(x)
    b = dask.delayed(square)(x)
    c = dask.delayed(square_add)(a, b)
    output.append(c)

results = dask.compute(*output)
In this code, we have wrapped our functions with dask.delayed, which enables their
parallel computation. However, notice that the wrapped calls do not execute immediately; they simply get
added to the task graph that represents the computations in a specific order, and calling dask.compute() on the
collected outputs triggers the execution. You can achieve the same result by decorating the functions with
@dask.delayed:
@dask.delayed
def add(x):
return x + 1
@dask.delayed
def square(x):
return x * x
@dask.delayed
def square_add(a,b):
return a+b
output = []
for x in data:
a = add(x)
b = square(x)
c = square_add(a,b)
output.append(c)
results = dask.compute(*output)
Dask Futures
A future, in Python, represents an operation that has not yet completed. It is basically a placeholder for a
result that has not yet been computed but will become available at some point. In Python, the concurrent.futures
module provides asynchronous execution of functions; it offers a way to submit tasks and
obtain their results concurrently. Dask provides a compatible futures interface, so you can scale your current data pipeline
across multiple systems in a given cluster with minimal code changes.
Dask futures form a part of Dask’s greater asynchronous execution model, which is within Dask’s
distributed scheduler. So when you submit a task to the Dask scheduler, it returns a future representing that
given task’s result. After that, the scheduler may execute the task in the background by using the available
worker (can be within the system or another system in the cluster). It may be a good time for us to recall that
the future allows for concurrent execution and can enable parallel computations of code.
Let us also revisit the concept of eager and lazy evaluation and where futures stand in this point of view.
Eager evaluation executes code immediately regardless of whether a given function or set of code is relevant to
the result or not. Lazy evaluation does not execute anything unless you explicitly specify to execute, upon
which it will execute only the necessary code for the result. You have already seen the “compute()” function
that performs lazy evaluation on Dask. Dask futures can be seen as somewhere in between.
Dask futures are not lazy evaluation; when you submit a future for processing, it will begin the task in the
background. Dask futures are not eager evaluation either, because futures enable parallel and distributed
computation of tasks and functions, natively. One can naively (with caution exercised) view the Dask futures as
eager evaluation with added features of asynchronous execution and parallelism in a distributed environment,
error handling, and task scheduling.
Internally Dask futures are, by default, asynchronous. When you submit a computation, it returns a future
object immediately. Using the future, you can check the status of the computation at a given point in time or
just wait it out till it completes in the background using the workers. Dask futures are beneficial in designing
data pipelines, where you may need finer-level control over task scheduling.
To work with the Dask futures, you need to instantiate the Dask client. You will also be able to view results
in a dashboard:
from dask.distributed import Client

client = Client()
When you instantiate the client without any arguments, it starts local workers as processes; when you
pass the argument processes=False, it starts local workers as threads. Processes offer better
isolation and fault tolerance than threads. Here is how you can submit a task to the Dask
scheduler:
def square(x):
return x ** 2
a = client.submit(square, 10)
c = client.submit(square, 20)
In the preceding example, the submit method returns a future. The task may not have
completed yet; it will complete at some point inside a worker process, and the result stays
in that process unless you ask for it explicitly. Let’s look at the contents of the variable “a”:
print(a)
would return a Future object showing its key and status (for example, pending or finished) rather than the computed value.
You have to explicitly ask it to return the result. Here is how that looks:
print(a.result())
would return
100
There is also the cancel() method that would stop the execution of a future if it is already scheduled to run;
if it has already run, then the cancel() method would delete it. Here is how that looks:
a.cancel()
You can block on a future or a set of futures using the wait() function; it returns only after the given
futures have finished executing. Here is how that
looks:
from dask.distributed import wait

wait(a)
In computing, we have this term called “fire and forget” to describe a process or a task that is executed
without waiting for the response or the outcome. Dask also provides a “fire and forget” feature to tasks, where
the tasks are submitted to be executed without the need for checking their result or the tasks to even complete.
Once the task is fired, the system is told to forget about it, and so the system no longer monitors that task or
even waits for the task to finish:
fire_and_forget is available as a function in dask.distributed:
from dask.distributed import fire_and_forget

fire_and_forget(a)
Optimizing Dask Computations
In Dask distributed computing, the scheduler performs dynamic task scheduling, a process of building a
directed acyclic graph whose nodes change over time. By dynamic, we are talking about making decisions
as we go along, as tasks are getting completed and new tasks are becoming ready to execute. The idea is to
execute the correct order of tasks while consuming the least amount of time. Let us look in detail how each and
every task passes through various stages in Dask computations.
Let us look at the following example:
def square(x):
return x ** 2
a = client.submit(square, 10)
c = client.submit(square, 20)
print(a.result())
print(c.result())
When we invoke the client.submit function, the Dask client sends this function to the Dask scheduler. As
we know already, the Dask client would create a future and return the future object to the developer. This
happens before the scheduler receives the function.
The scheduler then receives this function and updates its state on how to compute the a or c (whichever
comes in first). In addition, it also updates several other states. Once the scheduler gathers all the required
information in the memory, it would then proceed to assign this to a worker. All of these happen in a few
milliseconds.
The scheduler now follows its playbook, a set of criteria for identifying which worker to assign
this work to. It starts by checking whether any workers already have this function and its dependencies in their local
memory. If none exists, it selects the worker that would require the fewest bytes of data transfer to process this
function. If multiple workers share a similarly low value, it selects the worker with the least
workload.
The scheduler has now placed this computation at the top of the stack. A message is generated containing this
function, the assigned worker, and other metadata; the scheduler serializes this message and sends it over to the
worker via TCP socket.
The worker receives this message, deserializes the message, and understands what is required from the task
point of view. The worker starts the execution of this computation. It looks to see whether it requires values
that may be with other workers by investigating the “who_has” dictionary. The who_has dictionary is basically
a list of all the metadata about workers and their tasks and contains other metadata that would enable the
workers to find each other. After this step, the worker launches the actual computation and completes the task.
The worker sends the output to the scheduler.
The scheduler receives this message and repeats the process of sending another task to the worker that is
available. The scheduler now communicates with the client that the computation is ready, and so the future
object that is in waiting mode should wake up and receive the result.
At this point, the Python garbage collection process gets triggered. If there are no further computations that
are based on the current task, the key of this computation is removed from the local scheduler and added to
a list that gets deleted periodically.
Without delving deeply into the lower-level details, this captures a typical process of task scheduling.
Optimizing the task scheduling process may guarantee efficient execution of all tasks within the DAG.
Data Locality
Moving data between workers or unnecessary data transfers can limit the performance of the data pipeline in a
distributed system environment. Data locality is an important term in the broader context of distributed
computing. As we mentioned earlier, the scheduler intelligently assigns tasks to workers that either already have
the data or are located close to it, minimizing data transfer over the network. Let's look at a scenario:
from dask.distributed import Client
import dask.dataframe as dd

client = Client()
df = dd.read_csv("/home/parallels/dask/mockdata-*.csv")
result = df["age"].mean()
final_result = result.compute()
In this example, we are loading multiple CSV files stored in a given folder. Dask would intelligently
partition this dataset across the workers available in the given cluster. Upon loading the dataset, the next
line computes the mean of the age column. Dask will intelligently identify the
workers that already hold partitions of this dataset and schedule the work on them, minimizing the network
transfers and speeding up the computation.
As we discussed in earlier sections, in certain cases (like custom algorithms), you may require more control
over which worker processes which type of data. Certain data might require GPU-heavy workers, and you
can express that using Dask's worker resources feature.
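Here is a rough sketch of that idea; the function, the data partition, and the GPU resource name are hypothetical, and the workers are assumed to have been started with a matching --resources "GPU=1" declaration:
# train_on_gpu and partition are placeholder names, not part of Dask.
future = client.submit(
    train_on_gpu,
    partition,
    resources={"GPU": 1},   # run only on workers that advertise a GPU resource
)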
Prioritizing Work
When there are many items in the queue for the Dask cluster to process, Dask has to decide which task to
process first and the ones to prioritize over others. Dask prefers to process the tasks that were submitted early.
Dask’s policy is first in, first out. You can custom define priority keywords, where tasks with higher priorities
will run earlier when compared with ones with lower priority:
a = client.submit(square, 5, priority=10)
b = client.submit(square, 7, priority=-1)
The operation with priority set to 10 carries the top priority, whereas the operation with priority set to –1
carries relatively lower priority.
Work Stealing
In Dask distributed computing, work stealing is another strategy that is employed as part of performance
optimization. A task may prefer to run on a specific worker because of the nature of processing or task
dependencies that exist in that specific worker. Stealing may be beneficial in cases where the time taken to
perform a computation is longer than the time taken to communicate with the scheduler. Other cases where stealing
is a good choice include taking work from an overloaded worker and replicating data that is used
frequently. In Dask, work stealing is enabled by default, with the goal of minimizing total computation time while
making optimal use of the workers.
Conclusion
We have seen a lot of concepts in this chapter; we learned how breaking down a large operation into several
smaller operations and running them in parallel or concurrently makes better use of available resources than running
everything on a single thread. We looked at Dask, a Python distributed computing library that can perform
concurrent tasks and overcome traditional Python limitations. We also looked at various data structures of Dask
and various optimization concepts. Dask also integrates well with the modern data stack. The tools and
concepts you have learned in this chapter lay a good foundation on parallel and concurrent data processing.
Introduction
Machine learning is a branch of artificial intelligence and computer
science. A machine learning model is an intelligent system that can learn
from data in order to make decisions. Organizations of all nature have
gained significant value by leveraging machine learning models in their
processes. Training a machine learning model is done using algorithms,
which can be computationally expensive. As datasets grow larger, both in
terms of volume and level of complexity of the data points, scalable
machine learning solutions are highly sought after. In this chapter we will
look at Dask-ML, a library that runs ML algorithms in a distributed
computing environment and integrates well with existing modern data
science libraries.
Dask-ML
Data Exploration
Data exploration is all about getting to know a dataset. A dataset has its
own nature and characteristics, depending upon the nature, type, and
context of data points. Data exploration involves understanding distribution
of each data point, understanding correlation between one or more data
points, looking at various patterns, identifying whether missing values
exist, and gaining an initial assessment of how accurate or complete and
consistent the dataset is (even if it is approximate). Data exploration helps
identify critical insights about data points, minimize potential biased results
(if one or more data points represent the same thing; for instance, length of
a flower petal in millimeters, in centimeters, etc.), and most importantly
reduce aimless data analytics work that leads to nowhere.
Data Cleaning
Data cleaning is caring for the hygiene of the dataset. For lack of a better
term, clean data helps with effective data analysis. Data cleaning involves
cleaning up errors and outliers in data points that may negatively affect and
possibly skew the dataset into bias. Data cleaning also is about handling
those missing values that were previously “discovered,” deciding whether
to impute the value that is missing or remove that row completely. Data
cleaning can also involve handling duplicate data, which most likely
means removing it. Clean data leads to accurate analysis and accurate
regression models and makes the resulting judgments more reliable.
Data Wrangling
Data Integration
Data integration is the idea of combining one or more datasets from one or
more data sources into a single dataset that can provide a comprehensive
view. Data integration would also involve further identifying and removing
duplicates that may arise from combining several data sources. Data
integration involves schema alignment, where data from one or more
sources are compatible and aligned. Data integration also deals with data
quality, where one would develop validations and checks to ensure data
quality is good after the data is integrated from several sources. Data
integration helps expand the scope of the entire project. Data integration
also helps improve accuracy and integrity of the dataset.
Feature Engineering
Once we have the dataset, then we can proceed to the next step, that is,
feature engineering. By this time, the dataset may contain several data
points. In addition to the existing data points in the dataset, feature
engineering involves creating new data points based on the existing data
points. Some of the simplest examples are calculating a person's age as of
a specific date from their date of birth, converting Celsius to Fahrenheit, or
calculating the distance between two points using a specific distance
measure, to name a few. Sometimes feature engineering could involve
complex computations like Fourier transformations or other complex
methods that are computationally expensive. The idea is to come up with
new data points that have higher correlation with the data point that the
model is looking to predict or classify.
Feature Selection
Now that you have the dataset you need for the project and engineered new
features based on the data points of the existing dataset, it is time to select
which of these data points enter into the model. There are many ways of
selecting the features for a given machine learning model. Tests like the
chi-square test help determine the relationship between categorical
variables, while metrics like the Fisher score help generate ranks for
various data points. There are approaches like forward selection or
backward selection that iteratively generate various feature combinations
and respective accuracy scores; however, they can be expensive as data and
features grow beyond a lab experiment. Other techniques include
dimensionality reduction using tools like principal component analysis and
autoencoders that encode data into lower dimensions for processing and
decode back to the original shape for output.
Data Splitting
Data splitting is the process of partitioning the entire dataset into smaller
chunks for training a machine learning model. Depending upon the training
strategy, the approach of splitting the data may differ. The most common
approach of splitting the data is the idea of holding out training, validation,
and test datasets. The training set is usually allotted for the model to learn
from the data, the validation set is primarily designed for hyperparameter
tuning purposes, and the test set is to evaluate the performance of the model
with chosen hyperparameters. There is also k-fold cross validation where
the dataset is partitioned into k chunks and iteratively trained by keeping k
– 1 as training and the remaining as the test set. One can even go further by
leveraging advanced statistical methods like stratified sampling to obtain
partitions of similar characteristics and distributions.
Model Selection
Model Training
Based on the characteristics of data and nature and context of the problem
we are trying to solve using the machine learning model, the training
strategy and algorithm vary. The common training strategies are supervised
and unsupervised learning. Supervised learning is the process of training a
model with a labeled dataset, where data points have specific names and
usually the ones with higher correlations with the decision variables are
picked to train. Common examples include regression models and
classification models.
Model Evaluation
Once we have a machine learning model trained, the next step is model
validation. Sometimes testing alone may suffice, and in other cases testing
and validation are required. The idea is to evaluate the accuracy of a model
by having it make decisions on a completely new set of data and assessing
it statistically with metrics like R-squared, F1-score, precision, recall, and
so on.
Hyperparameter Tuning
Final Testing
We have performed various checks and validations in the process of
building a machine learning model up to the process of selecting the best
hyperparameters for our machine learning model. Final testing is basically
testing the final chosen model on a held-out test set. This process exists in
order to obtain realistic expectations when the model is deployed in
production. The main goal behind this step is to get an idea of how this
model may perform when exposed to a new set of data. Once we test out
the final model, establish an understanding of strengths and weaknesses of
the model and document the results for future reference and maintaining
audit trails.
Model Deployment
Once we have the training model, the model is validated against test sets,
and hyperparameters are tuned, we now have a machine learning model
that is ready to be deployed. The concept of deployment changes with
respect to the nature and context of the business and problem statement in
hand. The conceptual definition is that deployment means the usage of a
model in everyday scenarios, where you are leveraging the model to inform
better decisions.
Model Monitoring
Model Retraining
scikit-learn
XGBoost
XGBoost is another popular library in the machine learning space for
gradient-boosted trees. Gradient-boosted trees are a machine learning
technique that is used for both continuous and categorical decision
variables (regression and classification, respectively). It belongs to the
ensemble learning paradigm, where multiple decision tree models minimize
the error and combine to make the final prediction model. Dask integrates
well with the XGBoost model, where Dask sets up the master process of
XGBoost onto the Dask scheduler and the worker process of XGBoost to
the workers. Then it moves the Dask data frames to XGBoost and let it
train. When XGBoost finishes the task, Dask would clean up XGBoost’s
infrastructure and return to its original state.
PyTorch
Other Libraries
RobustScaler( )
MinMaxScaler( )
It may be good to note that if the column had negative values, then it would
attempt to fit the data in the range of [–1,1]. MinMaxScaler is a good fit
when you are attempting to fit a neural network with a Tanh or Sigmoid
activation function. Please note that MinMaxScaler may not play well with
outliers. Either you remove the outliers or use the RobustScaler function,
instead.
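As a small sketch (with a made-up Dask array of features), Dask-ML's scalers follow the familiar scikit-learn fit/transform pattern:
import dask.array as da
from dask_ml.preprocessing import MinMaxScaler

# Hypothetical feature matrix: 1,000 rows, 3 columns, values in [0, 50).
X = da.random.random((1000, 3), chunks=(100, 3)) * 50

scaler = MinMaxScaler()            # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)
print(X_scaled.min().compute(), X_scaled.max().compute())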
Cross Validation
For instance, let us say we have five partitions of a dataset, calling them d1,
d2, d3, d4, and d5:
During the first iteration, d1 will be the test set and d2, d3, d4, and d5
will be training sets.
During the second iteration, d2 will be the test set and d1, d3, d4, and
d5 will be training sets.
During the third iteration, d3 will be the test set and d1, d2, d4, and d5
will be training sets.
Dask-ML also provides a method to split one or more Dask arrays. Let us
look at the following code:
import dask.array as da
from dask_ml.datasets import make_regression
from dask_ml.model_selection import train_test_split
X, y = make_regression(
n_samples=200,
n_features=5,
random_state=0,
chunks=20
)
# note the use of Dask-ML's train_test_split, which mirrors
# scikit-learn's API
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X)
would return
dask.array<normal, shape=(200, 5), dtype=float64, chunksize=(20, 5),
chunktype=numpy.ndarray>
After splitting
print(X_train)
would return
dask.array<concatenate, shape=(180, 5), dtype=float64, chunksize=(18, 5),
chunktype=numpy.ndarray>
print(X_train.compute()[:5])
results may look like
array(
[[ 0.77523133, -0.92051055, -1.62842789, -0.11035289, 0.68137204],
[ 0.02295746, 1.05580299, -0.24068297, 0.33288785, 2.14109094],
[-0.40522438, -0.59160913, 0.0704159 , -1.35938371, -0.70304657],
[-1.0048465 , 0.02732673, 0.02413585, 0.48857595, -1.22351207],
[-1.64604251, 1.44954215, -0.9111602 , -2.27627772, -0.02227428]]
)
When you are expecting a good model for production, it might take several
hours to compute the best hyperparameters for a given problem. We do
encounter challenges when we scale the hyperparameter searching effort.
The most common issues that we may come across are compute constraint
and memory constraint.
Grid Search
You define a grid of values (a matrix of values). The grid search would
exhaustively search through all the combinations of hyperparameters. It
would evaluate the performance of every possible combination of
hyperparameters within the grid using a cross validation method. While it
will deterministically search and provide the best possible hyperparameters
for your model, it gets expensive as the model grows and a lot depends on
the grid you have defined.
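Here is a rough sketch of a grid search using Dask-ML's drop-in GridSearchCV; the toy dataset, estimator, and parameter grid are made up purely for illustration:
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from dask_ml.model_selection import GridSearchCV

# Small synthetic classification problem, just to exercise the search.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

param_grid = {
    "alpha": [1e-4, 1e-3, 1e-2],
    "penalty": ["l2", "l1"],
}
search = GridSearchCV(SGDClassifier(max_iter=1000), param_grid, cv=3)
search.fit(X, y)                 # each grid cell is evaluated as a Dask task
print(search.best_params_)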
Random Search
You define the distribution for each hyperparameter. Random search
randomly samples hyperparameter values from predefined distributions.
Once again, this method evaluates the performance of the model using
cross validation. While this might sound like a more efficient approach,
there is likelihood that statistical distribution may not assist with finding
the global optimal values for these hyperparameters.
When you are facing memory constraints but your machine has enough
compute to train the model, you can use an incremental search method.
Incremental Search
When compute and memory are both constrained and posing a challenge to
obtain good hyperparameters for your machine learning model, you may
choose to explore Hyperband algorithms for hyperparameter tuning.
First, we will import the necessary classes and methods for our application:
import dask.dataframe as dd
import pandas as pd
from dask_ml.preprocessing import OneHotEncoder
from dask_ml.impute import SimpleImputer
from dask_ml.model_selection import train_test_split
from dask_ml.linear_model import LogisticRegression
from dask_ml.metrics import accuracy_score
Now let us create a sample dataset for us to work with the Dask-ML
methods. We are keeping it a simple n-dimensional array that can be
converted into a data frame for downstream processing purposes:
data = {
'color':
['red', 'orange', 'green', None, 'yellow', 'green'],
'weight':
[150, 180, 200, 160, None, 220],
'taste':
['sweet', 'sweet', 'sour', 'sweet', 'sweet', 'sour'],
'fruit':
['apple', 'orange', 'apple', 'apple', 'orange', 'melon']
}
Now, let us create the Dask data frame for the preceding data with two
partitions:
df = dd.from_pandas(
pd.DataFrame(data),
npartitions=2
)
Let us initialize the SimpleImputer. You can specify the strategy that can be
used to fill the missing value. The most common approach for a categorical
variable is to fill in with a category that is most commonly occurring. For a
continuous variable, you can choose the option of taking the statistical
average and fill in the missing value:
imputer1 = SimpleImputer(strategy='most_frequent')
df[['color']] = imputer1.fit_transform(df[['color']])
Let us perform the imputation for the weight of the fruit:
imputer2 = SimpleImputer(strategy='mean')
df[["weight"]] = imputer2.fit_transform(df[['weight']])
print(df.compute())
would yield
color weight taste fruit
0 red 150.0 sweet apple
1 orange 180.0 sweet orange
2 green 200.0 sour apple
3 green 160.0 sweet apple
4 yellow 182.0 sweet orange
5 green 220.0 sour melon
Notice that the first imputer assigned the most commonly occurring
category to the missing categorical value, while the second imputer
assigned the statistical mean of the column (182.0) to the missing weight.
As it happens, that mean is close to the actual weight of the other orange
in the dataset.
Conclusion
Running machine learning algorithms with distributed computing in the
Python-based ecosystem is a significant milestone for applied work.
Dask-ML integrates well with scikit-learn and other leading libraries. In
this chapter we looked at various preprocessing and tuning applications
using Dask-ML. By knowing Dask-ML and being able to use it in a
production environment, organizations can gain high-value insights and
significant savings, pushing their data-driven decision making to the next
level.
Introduction
So far we have looked at batch processing pipelines. In this chapter, we will be looking
at real-time stream processing pipelines and a platform that is incredibly rich in
functionality: Apache Kafka. Apache Kafka is a
distributed and fault-tolerant streaming and messaging platform. Kafka helps
build event streaming pipelines that can capture creation of new data and
modification of existing data in real time and route it appropriately for various
downstream consumption purposes. In this chapter, we will look at Kafka, its
architecture, and how Kafka can help build real-time data pipelines.
ksqlDB
Introduction to Distributed Computing
Now is a good time to briefly touch on the evolution of distributed computing.
Traditionally, computing helped in developing software applications,
programming heavy computations that utilized data. Here, the data is stored in
relational tables, while the application and programming are stored in a
production computer, which may be bigger when compared with end user
computers. As adoption of computing increased, the amount of data generated at
a given point in time also increased. In addition, the type of data and the speed in
which the data was generated were also on the rise. Given the need for more
computing, distributed systems and the concept of distributed computing came
into play.
One of the big data evolutions is called event processing or streaming data. The
idea behind event processing is that the data is captured at the time of event and
written to a data store as events. The relational data models and data-capturing
methods focus on capturing data as entities. Entities can be seen as nouns; it
could be a person, an organization, a project, an experiment, a vehicle, or even a
light bulb (by definition of entities). On the other hand, streaming data processing
is basically an action or an event that happened in a given time interval. They can
denote anything.
Kafka lets you read and write streams of event data and continuously import or export
data from or to other systems.
“Distributed” means the ability to run on one or more computers; the software
would come with an abstraction layer that would sit on top of one or more
commercial-grade computers so that the abstraction would mimic one single
computer, where the resources (computing, memory, storage) are shared and
utilized as and when they are needed.
“Secure” means that the data, systems, and resources are protected from
unauthorized access, damages, and attacks from external entities. A secure
system may contain access control, two-factor authorization, encryption in
motion and at rest (disk encryption and transporting data in secure tunnels), and
regular security updates to protect the system from potential threats and other
vulnerabilities that may exist.
Note
Encryption at Rest
Encryption at rest means the idea of disk encryption. Disk encryption makes sure
that when an unauthorized entity gains access to disk and attempts to read, it
won’t be able to access the data. The data is encrypted and it requires a “key” to
decrypt.
Encryption in Motion
Encryption in motion means the idea of encrypting the data in transit. The
transport layer encryption refers to securing data while it is in transit over a given
network. There are network protocols (industry-approved instructions) that
implement encryption in motion. Some of them are HTTPS, TLS, and SSL
protocols. And so, when the data is intercepted in between transit, the information
does not make sense (requires some sort of a key to decrypt).
Now, let us talk about the pub–sub messaging system. Pub–sub stands for
publisher and subscriber, respectively. In this architecture, the messages are
“published” into what is called a topic. The topic can then be read by another
program, called “subscriber.” These subscribers can receive messages from the
topic. You can have multiple subscribers receiving messages from topics; and so,
these messages are broadcasted to multiple subscribers.
Kafka Architecture
Kafka can be deployed in virtual machines, cloud environments, containers, and
even bare metal appliances. Kafka consists of servers and clients within its
distributed system. Kafka servers provide broker and Kafka Connect
services. The brokers form the storage layer, whereas Kafka Connect provides
continuous import or export of data as event streams. The Kafka client provides
you with libraries and APIs to write distributed applications and microservices
that can process streams of events, function in parallel, scale up where needed,
and be fault tolerant. Let us look at some of the components of Kafka.
Events
Topics
Topics are where events are organized and stored securely. The concept is similar
to that of files and folders. A folder is where one or more files are stored;
similarly, one or more events are stored in topics. In the preceding example, John
withdrawing a sum of $100 is considered an event. If similar events were
grouped, then it would be referred to as the “Cash withdrawal” topic. Events can
be read as many times as needed and do not get deleted after consumption. You
can define how long to retain these events, and these retention periods can be
set on a per-topic basis.
Partitions
A partition divides a large block of data into multiple smaller ones to make
reading and access easier. Topics are partitioned, meaning a topic is spread
over a number of “buckets.” Events with the same event key (in the preceding
example, John is an event key) are written to the same partition. Partitions
can also be replicated, so that if a given partition fails, another copy of the
same partition is still available. Replication is performed at the topic
partition level. For instance, if the replication factor is 3 and a topic has
four partitions, there will be 12 partition replicas in total (three copies of
each of the four partitions).
Broker
Brokers are the storage servers of a Kafka cluster. The abovementioned topic
partitions are stored across a number of brokers. This way, when one broker
fails, Kafka locates another broker that contains a replica of the same topic
partition and continues processing. Brokers communicate with each other for
data replication and work to ensure consistency of data across the entire
Kafka cluster.
Replication
Kafka automatically replicates topic partitions, and each partition gets stored
on more than one broker, depending upon the replication factor set by the Kafka
administrator. This enables fault tolerance, so that even when a broker fails,
the messages remain available. The main partition is called the leader replica,
and the copies of the topic partition are called follower replicas. Every
partition that gets replicated has one leader and n – 1 follower replicas. When
you read and write data, you always interact with the leader replica. The
leader and followers work together to carry out the replication; the whole
process is automated and happens in the background.
Producers
Producers are part of the Kafka client, enabling the process of publishing events
to Kafka topics. A Kafka producer sends messages to a topic, and these
messages are assigned to partitions based on a mechanism such as key hashing.
For a message to be successfully written to a topic, the producer specifies a
level of acknowledgment (acks). Some examples of Kafka producers include web
servers, IoT devices, monitoring and logging applications, etc. Producers may
serialize outgoing data using built-in serializers (e.g., for strings and
integers) or schema-based formats such as Apache Avro and Protobuf.
Note
You connect to the cluster using the Kafka producer, which is supplied with
configuration parameters including brokers’ addresses in the cluster, among other
things.
Consumers
Once producers have started sending events to the Kafka cluster, we need to get
the events back out into other parts of the system. Consumers are the clients
that subscribe to (read) these events. They read events from topics and
generate a response with regard to each event. Note that consumers only read
the topics and do not “consume” them in any other way; the events remain
available in the Kafka cluster. The data coming from producers needs to be
accurately deserialized in order to be consumed. Consumers read a topic from
oldest to newest.
Schema Registry
Now that we have producers and consumers ready, the producers will start
publishing messages into topics, while the consumers will begin to subscribe to
these same messages that the producers publish. The consumers will start
consuming only if they identify and understand the structure of the data the
producers publish. The schema registry maintains a database of all schemas that
producers have published into topics. The schema registry is a standalone process
that is external to Kafka brokers. By utilizing the schema registry, you are
ensuring data integrity among other great benefits.
With Avro
Apache Avro is a compact, binary serialization format whose schemas are
defined in JSON. It is widely used with the schema registry because Avro
schemas are easy to evolve over time while remaining readable.
With Protobuf
Protobuf stands for Protocol Buffers. It is a way of serializing structured
data and comes with its own schema definition language. Protobuf was developed
by Google; its schema definition files are text based and arguably easier to
read than JSON schemas, while the serialized messages themselves are binary,
language neutral, and efficient in terms of speed and size.
Kafka Streams
Kafka Streams is a service that enables the idea of streaming data pipelines. A
streaming data pipeline is one that processes data as it is being generated, as
events happen. Streaming data pipelines do not necessarily extract data;
rather, they receive data from a perennial flow coming out of a given system.
The easiest way to visualize this is a weather forecasting system: it receives
atmospheric data from a satellite continuously, so there is no “extraction”
happening. Instead, the pipeline is configured to receive the data as it
arrives.
Kafka Streams is scalable and fault tolerant and can be deployed in leading
operating systems, bare metal appliances, containers, etc. Kafka Streams
primarily caters to the Java programming language, both for writing streaming
pipelines and for enriching existing Java applications with stream processing
functionality. For other programming languages, there is another service,
ksqlDB, that enables writing Kafka streaming functionality.
ksqlDB is a database that integrates with Kafka, enabling you to build streaming
data pipelines, similar to building traditional data applications. More specifically,
ksqlDB is a computational layer that uses SQL-like syntax to interact with the
storage layer, which is Kafka. You can interact with ksqlDB through its web
interface, the command line, and REST APIs. To work with the ksqlDB REST API
from Python, you can install
pip install ksql
You can create a new table or stream within the ksqlDB and stream your existing
data in relational tables or other file systems into topics, for instance:
CREATE TABLE campaign (
    CampaignID INT PRIMARY KEY,
    AdvertiserID INT,
    CampaignName STRING,
    StartDate DATE,
    EndDate DATE,
    Budget DOUBLE
) WITH (
    kafka_topic = 'Campaign',
    partitions = 3,
    value_format = 'PROTOBUF'
);
If you have existing Kafka topics, you can use ksqlDB to create a stream or table
and then begin streaming data into ksqlDB:
CREATE STREAM ads
WITH (
    kafka_topic = 'campaign',
    value_format = 'PROTOBUF'
);
Just like relational SQL, you can perform lookups, joins, filtering,
transformations, and aggregations to derive real-time analytic insights from data.
Figure 9-1
Kafka architecture
Figure 9-2
Choose the appropriate cluster type. For learning purposes, the basic option
should suffice. Here is how that may look:
Figure 9-3
You will have the option to specify where your cluster should be hosted. Please
feel free to select your favorite cloud service provider and a region closest to
where you are located. You will see the estimated costs at the bottom. Here is
how this may look for you:
Figure 9-4
This will be followed by a payment page. If you are registering with Confluent
for the first time, you may get free credits; so you can skip the payments and
proceed to launch the cluster.
Figure 9-5
Once you have a Kafka cluster ready, proceed to create a topic within that cluster.
A topic is an append-only log that records messages.
At this time, navigate to your cluster within the Kafka dashboard, locate the
“Connect to your systems” section, choose Python programming language, and
create a cluster API key and a schema registry API key.
Figure 9-6
For development and sandbox purposes, create a config.py file within your
development environment and store the following:
config = {
'bootstrap.servers': '<bootstrap-server-endpoint>',
'security.protocol': 'SASL_SSL',
'sasl.mechanisms': 'PLAIN',
'sasl.username': '<CLUSTER_API_KEY>',
'sasl.password': '<CLUSTER_API_SECRET>'}
sr_config = {
'url': '<schema.registry.url>',
'basic.auth.user.info':'<SR_API_KEY>:<SR_API_SECRET>'
}
Make sure to replace the endpoints, key, and secret with appropriate data that is
obtained during the process. This is essential for your local Kafka code to interact
with your Kafka cluster in the cloud.
Now, let us set up the Kafka development environment in your local workstation.
We will start off by revising the creation of the virtual environment. The venv
module ships with Python 3, so no separate installation is needed:
python3 -m venv kafkadev
source kafkadev/bin/activate
If you are on MS Windows operating system, use the following command to
activate your Python virtual environment:
kafkadev\Scripts\activate
To deactivate this environment, go to the command line and enter
deactivate
Let us now set up the Kafka development environment, from the command line:
pip3 install confluent-kafka
Now let us create a simple Kafka publisher–subscriber program. To get started,
create a work directory:
mkdir Kafka-dev
cd Kafka-dev
Create a new Python script named “config.py”. Copy the earlier configuration
into the script and comment out the schema registry section that is not needed for
our program.
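The producer listing itself appears as a figure in the book; as a reference, here is a minimal sketch of what such a publisher script might look like, assuming the topic is named "hello_topic" (the same name the consumer below subscribes to) and a hypothetical file name producer.py:
# producer.py (hypothetical file name) -- a minimal publisher sketch
from config import config
from confluent_kafka import Producer

producer = Producer(config)

def callback(err, msg):
    # Reports delivery success or failure for each message
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Produced to {msg.topic()} [partition {msg.partition()}]")

for i in range(5):
    # produce() is asynchronous; messages are buffered on the client side
    producer.produce("hello_topic", value=f"hello world {i}", callback=callback)

# flush() blocks until all buffered messages have been delivered
producer.flush()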
Now let us examine the code. We start off by importing the necessary libraries,
namely, Confluent distribution of Kafka and config. We then instantiate the
producer class with the config. The most important methods of the producer class
are produce and flush.
Let us now create a Kafka consumer that would read the messages from the topic
within the Kafka cluster. Please note that the Kafka consumer only reads the
messages; it does not alter them or remove them from the topic. These messages
will continue to persist in the topic within the Kafka cluster.
Create a new Python script named “consumer.py” and enter the following code:
from config import config
from confluent_kafka import Consumer
def consumer_config():
    config['group.id'] = 'hello_group'
    config['auto.offset.reset'] = 'earliest'
    config['enable.auto.commit'] = False  # type: ignore

consumer_config()
subscriber = Consumer(config)
subscriber.subscribe(["hello_topic"])

try:
    while True:
        event = subscriber.poll(1.0)
        if event is None:
            continue
        if event.error():
            print(event.error())
            continue
        val = event.value().decode('utf8')
        partition = event.partition()
        print(f'Received: {val} from partition {partition}')
except KeyboardInterrupt:
    pass
finally:
    subscriber.close()
We start off, once again, by importing the necessary classes. In addition to the
configuration supplied by config.py, we also need to mention the consumer group
id, whether to start consuming messages that arrived earliest or latest, etc. A
consumer class is instantiated in the name of the subscriber, and the same is
subscribed to consume messages from the given topic. Finally, the subscriber will
operate on an infinite while loop to output the messages, unless the user
interrupts with a key combination. As a consumer program, the idea here is to
read the messages from the topic and, if possible, print the same to the console
screen.
Let us look at an example. Here we have student data, and we are using
JSONSerializer to serialize and deserialize our data in publisher and subscriber
scripts, respectively.
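The publisher script itself is shown as a listing in the book; here is a minimal sketch of what it might look like, assuming the "student_data" topic, the Student data class, and the JSON schema that also appear in the consumer script later in this section (the helper names and sample data are assumptions):
# publisher.py -- a minimal JSON Schema publisher sketch
from confluent_kafka import Producer
from confluent_kafka.serialization import SerializationContext, MessageField
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.json_schema import JSONSerializer
from config import config, sr_config

class Student(object):
    def __init__(self, student_id, student_name, course):
        self.student_id = student_id
        self.student_name = student_name
        self.course = course

# Same JSON schema as in consumer.py (abbreviated here for space)
schema_str = """{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "Student",
    "type": "object",
    "properties": {
        "student_id": {"type": "number"},
        "student_name": {"type": "string"},
        "course": {"type": "string"}
    }
}"""

def student_to_dict(student, ctx):
    # ctx carries serialization metadata (topic and message field)
    return {"student_id": student.student_id,
            "student_name": student.student_name,
            "course": student.course}

def callback(err, msg):
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Student reading for {msg.key().decode()} produced to {msg.topic()}")

if __name__ == '__main__':
    topic = 'student_data'
    schema_registry_client = SchemaRegistryClient(sr_config)
    json_serializer = JSONSerializer(schema_str, schema_registry_client, student_to_dict)
    producer = Producer(config)
    students = [Student(55054, "JaneDoe", "Biology"),
                Student(55052, "Johndoe", "Chem101"),
                Student(55053, "TestStudent", "Physics")]
    for s in students:
        producer.produce(topic=topic,
                         key=str(s.student_id),
                         value=json_serializer(s, SerializationContext(topic, MessageField.VALUE)),
                         on_delivery=callback)
    producer.flush()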
We start off with loading the various classes and methods required for this
publisher script. We also have loaded a data class and a JSON schema for
demonstrating the student data example. We have defined a function that converts
student data to a dictionary. The remaining portion of the program is quite
familiar to you, as we are instantiating the producer class, schema registry, and
JSON serializer, supplying various parameters required for the producer to
serialize the data and iterate over the list to write to the topic within the Kafka
cluster.
The JSON serializer accepts the schema both in string format and as a schema
instance. The student_to_dict function converts a Student object to a
dictionary. The “ctx” parameter provides metadata relevant to
serialization.
When this program is run
python3 publisher.py
we would get
Student reading for 55054 produced to student_data
Student reading for 55052 produced to student_data
Student reading for 55053 produced to student_data
Now let us focus on the consumer portion of this project. Create a file, called
consumer.py, and load the following code into the file:
from confluent_kafka import Consumer
from confluent_kafka.serialization import SerializationContext, MessageField
from confluent_kafka.schema_registry.json_schema import JSONDeserializer
from config import config

class Student(object):
    def __init__(self, student_id, student_name, course):
        self.student_id = student_id
        self.student_name = student_name
        self.course = course

schema_str = """{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "Student",
    "description": "Student course data",
    "type": "object",
    "properties": {
        "student_id": {
            "description": "id of the student",
            "type": "number"
        },
        "student_name": {
            "description": "name of the student",
            "type": "string"
        },
        "course": {
            "description": "course of the student",
            "type": "string"
        }
    }
}
"""

def set_consumer_configs():
    config['group.id'] = 'student_group'
    config['auto.offset.reset'] = 'earliest'

def dict_to_student(dict, ctx):
    return Student(dict['student_id'], dict['student_name'], dict['course'])

if __name__ == '__main__':
    topic = 'student_data'
    print(type(schema_str))
    json_deserializer = JSONDeserializer(schema_str, from_dict=dict_to_student)
    set_consumer_configs()
    consumer = Consumer(config)
    consumer.subscribe([topic])
    while True:
        try:
            event = consumer.poll(1.0)
            if event is None:
                continue
            student = json_deserializer(event.value(),
                                        SerializationContext(topic, MessageField.VALUE))
            if student is not None:
                print(f'Student with id {student.student_id} is {student.student_name}.')
        except KeyboardInterrupt:
            break
    consumer.close()
Once again, the program loads all the required classes and methods to the file,
followed by the data class for the student data schema and JSON configuration
defining the schema string. We then add the consumer configuration required in
addition to the existing configuration and a function that converts a
dictionary into a Student object. From here on, we proceed to instantiate our consumer class,
mention the specific topic, instantiate the JSON deserializer, and create a loop
that would print the student data unless interrupted by keyboard keys.
Protobuf Serializer
As we saw earlier, Protobuf stands for Protocol Buffers. Protobuf is developed by
Google and considered to be more efficient in terms of speed and size of
serialized data. To be able to use the Protobuf serializer, you need to follow
these steps:
First off, define your Protobuf schema in your schema definition file. You
need to define the schema and save the file with “.proto” file format. These
proto schema files are very much human readable and present information in a
much simpler manner than JSON documents.
Second, compile the “.proto” file into your programming language, in our
case a Python script. As mentioned earlier, Protobuf is language agnostic
and supports many programming languages.
Finally, you have to import the generated file into your main code. In our
case, we would obtain a Python file. This Python file needs to be imported
in our main publisher or subscriber program.
As you can see, the schema reads much easier when compared with JSON. You
can define the data types and names of the fields and also mark a field as not
required by using the keyword “optional.” The numbers
that are assigned at the end of each line are field numbers.
Now, let us compile this proto model using the Protobuf compiler:
./protoc --python_out=. model.proto
This would yield us
model_pb2.py
Let us examine the contents of this generated Python file:
# -*- coding: utf-8 -*-
# Generated by the protocol buffer compiler. DO NOT EDIT!
# source: model.proto
# Protobuf Python Version: 4.25.2
"""Generated protocol buffer code."""
from google.protobuf import descriptor as _descriptor
from google.protobuf import descriptor_pool as _descriptor_pool
from google.protobuf import symbol_database as _symbol_database
from google.protobuf.internal import builder as _builder
# @@protoc_insertion_point(imports)
_sym_db = _symbol_database.Default()
DESCRIPTOR =
_descriptor_pool.Default().AddSerializedFile(b'\n\x0bmodel.proto\x12\x16\x63o
m.kafka.studentmodel\"C\n\x07student\x12\x12\n\nstudent_id\x18\x01
\x01(\x05\x12\x14\n\x0cstudent_name\x18\x02
\x01(\t\x12\x0e\n\x06\x63ourse\x18\x03
\x01(\t\"H\n\x0cstudent_data\x12\x38\n\x0fstudent_records\x18\x01
\x03(\x0b\x32\x1f.com.kafka.studentmodel.studentb\x06proto3')
_globals = globals()
_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, _globals)
_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'model_pb2', _globals)
if _descriptor._USE_C_DESCRIPTORS == False:
    DESCRIPTOR._options = None
    _globals['_STUDENT']._serialized_start=39
    _globals['_STUDENT']._serialized_end=106
    _globals['_STUDENT_DATA']._serialized_start=108
    _globals['_STUDENT_DATA']._serialized_end=180
# @@protoc_insertion_point(module_scope)
Now, let us look at our producer script:
from confluent_kafka import Producer
import model_pb2
from config import config

class student:
    def __init__(self, student_id, student_name, course):
        self.student_id = student_id
        self.student_name = student_name
        self.course = course

def callback(error, msg):
    if error:
        print(f"Message delivery did not succeed due to {error}")
    else:
        print(f"Messages of {msg.topic()} delivered at {msg.partition()}")

producer = Producer(config)
student_data = model_pb2.student_data()

data = [
    student(55052, "Johndoe", "Chem101"),
    student(55053, "TestStudent", "Physics"),
    student(55054, "JaneDoe", "Biology")
]

for i in data:
    student_record = student_data.student_records.add()
    student_record.student_id = i.student_id
    student_record.student_name = i.student_name
    student_record.course = i.course

serialized_data = student_data.SerializeToString()

# Send serialized data
producer.produce('student_data', serialized_data, callback=callback)
producer.flush()
We have imported the necessary classes and methods, including the Python file
generated by the Protobuf compiler. Then we create a student data class, followed
by a callback function that would write out messages to console in case of
success or failure of publishing messages to the topic within the Kafka cluster.
We then instantiate a producer class and student data. This is succeeded by
iteratively adding the test bed of data. Upon serialization, the data is published in
the “student_data” topic within the Kafka cluster.
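The consumer listing for this Protobuf example appears as a figure in the book; here is a minimal sketch of what it might look like, assuming the same "student_data" topic and the generated model_pb2 module (the consumer group name is an assumption):
# consumer.py -- a minimal Protobuf consumer sketch
from confluent_kafka import Consumer
import model_pb2
from config import config

config['group.id'] = 'student_group'
config['auto.offset.reset'] = 'earliest'

consumer = Consumer(config)
consumer.subscribe(['student_data'])

try:
    while True:
        event = consumer.poll(1.0)
        if event is None:
            continue
        if event.error():
            print(event.error())
            continue
        # ParseFromString() deserializes the binary payload back into
        # the generated student_data message
        student_data = model_pb2.student_data()
        student_data.ParseFromString(event.value())
        for record in student_data.student_records:
            print(f'{record.student_id}: {record.student_name} ({record.course})')
except KeyboardInterrupt:
    pass
finally:
    consumer.close()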
In this example, we have imported necessary classes and methods, set additional
required configurations for the consumer class, and started listening on the given
topic. We then write an infinite loop that would print the messages that are read
from the topic within the Kafka cluster. The “ParseFromString()” method is from
Protobuf Python library, which enables you to deserialize the binary object to
structured data. I hope these illustrations help strengthen the foundational
thinking and most commonly used classes and methods in Kafka programming.
Stream Processing
So far we have seen how Kafka can publish and subscribe to events. We also saw
that Kafka stores events in a Kafka cluster. In this section, we will be looking at
how Kafka can process and analyze these events that are stored in the Kafka
cluster. This is what we call stream processing in Kafka. Let us define an event. It
is simply something that occurred at a given point in time. It could be anything
from a student signing up for a course in an online system to a television viewer
who is voting for their favorite performer in a reality TV show. Events are
stored in an immutable commit log; only append operations are permitted. You
cannot undo an event. In the case of the student example, you can withdraw from
the course that you signed up for, but that withdrawal becomes another event.
ksqlDB is not a relational database, and it does not replace relational
databases at all. There is no concept of indexes in ksqlDB. Tables in ksqlDB
are different from tables in relational database management systems: a ksqlDB
table can be seen more as a materialized view than a table. A stream can be
defined as a set of events that you can derive a table from, for instance, the
prices of a company's stock listed/traded on a stock exchange over a given time interval. The
sequence of stock prices over the time period is a stream. The same stream can be
seen as a table to describe the most recent stock price of that company.
There are two types of stream processing that can be performed within ksqlDB,
namely, stateful and stateless processing. Stateless processing is where an
event is processed without relying on information from previous messages.
Because stateless processing does not require any knowledge of messages that
occurred in the past or will arrive in the future, it is simpler and faster.
Here are some examples:
Transformation of a value that does not involve any knowledge from past or
future messages, for instance, converting a temperature from Celsius to Fahrenheit
Creating branches where the input streams are branched into multiple output
streams based on criteria applied to the message
A stream processing application would typically involve ingesting data from one
or more streams, applying transformations where applicable, and loading them
into storage. An average stream processing application can contain many moving
parts, from data ingestion through transformations (which might mean data is
written to a staging area, that is, yet another store) to writing to the final
storage. ksqlDB simplifies the entire stream processing application and narrows
it down to only two services: ksqlDB, which performs computation, and Kafka,
which constitutes storage. From ksqlDB, you can also use Kafka Connect to
stream data out of other systems into Kafka and from Kafka into other
downstream systems.
It may be good to reiterate that ksqlDB is a separate package from Apache Kafka.
And so, we will continue to use our existing clusters and may create new streams.
Let us try to set up a ksqlDB cluster for us to write streaming data pipelines. To
get started, log in to Confluent Cloud, select the appropriate environment, choose
the right cluster, and click ksqlDB.
Figure 9-7 The ksqlDB option in the Confluent Cloud cluster menu
You may choose to create a cluster on your own or start with the tutorial. The
goal is to set up a development instance to get started with the stream data
processing pipelines. You may choose to enable global access for your ksqlDB
cluster and ensure appropriate privileges to the schema registry.
Figure 9-8 New cluster access control options: global access or granular access
You may choose the lowest Confluent streaming units and choose to launch
clusters. Here is how that looks:
Figure 9-9 New cluster name and sizing options
Do wait a few minutes for your cluster to be provisioned. You’ll know the cluster
is ready when the status changes to “Up” from Provisioning. Once we have the
ksqlDB up and running, we can try to set up the command line interface for both
Confluent and ksqlDB.
The easiest and most straightforward method is to spin up a Docker container
containing the Confluent platform. You can visit the appropriate page, copy the
Docker Compose file, and run the commands to get up and running. However, it is
also worth installing these packages from the terminal, since we will get
hands-on experience installing a package that does not ship with an installer.
Let us start by looking at installing the Confluent CLI on a Debian operating system.
Let us import the public key of the ksqlDB package. This is to verify the
authenticity of the package, part of a security mechanism known as cryptographic
signing and verification.
Note
Let us obtain the source code and extract the tarball at your preferred destination:
curl -O https://packages.confluent.io/archive/7.5/confluent-7.5.3.tar.gz
tar xzvf confluent-7.5.3.tar.gz
Copy the file path to set the environment variable and path:
export CONFLUENT_HOME=<your local installation path>
export PATH=$PATH:$CONFLUENT_HOME/bin
Enter the following help command on the terminal to check whether installation
succeeded:
confluent --help
Figure 9-10
If you get something similar to this screen, then you have succeeded in
installing the Confluent platform.
Now it is time to set up the environment so you can interact with the ksqlDB
cluster in the cloud. Within the dashboard, click ksqlDB and choose CLI
instructions. You will see instructions on logging in to Confluent Cloud CLI and
setting the context to your environment and cluster:
confluent login
confluent environment use env-0jvd89
confluent kafka cluster use lkc-555vmz
Create an API key and secret for your ksqlDB cluster:
confluent api-key create --resource lksqlc-8ggjyq
Figure 9-11
Now, enter the following so you can access the ksqlDB command line interface;
don’t forget to enter the appropriate key and secret:
$CONFLUENT_HOME/bin/ksql -u <KSQL_API_KEY> -p <KSQL_API_SECRET> https://pksqldb-pgkn0m.us-east-2.aws.confluent.cloud:443
Figure 9-12
Now let us start with some basic stream applications. A stream associates a
schema with an underlying Kafka topic. Let us begin by creating a topic in the
Kafka cluster. Visit the cluster’s dashboard and click “Topics.” Here is how that
looks:
Figure 9-13 The Topics section of the Kafka cluster dashboard
Select “Add topic” and name the topic as “students.” Leave other fields as is and
select “Create with defaults.” Here is how that looks:
Figure 9-14 Creating the new “students” topic with default settings
Select the option of creating a schema for message values. The option for
creating a schema for message keys may help in the production environment
when it comes to achieving consistency on structure of keys and routing to
appropriate partitions (messages with the same key are sent to the same partition).
Figure 9-15 Creating a schema for message values after the topic has been created
Figure 9-16
There is also the option of creating a stream from the ksqlDB command line
interface. Let us look at that approach in detail. Using the CLI, we will
create a new stream called students. Locate the Editor tab within ksqlDB and
enter the stream-creation statement shown in the following figures. Ensure you
set the “auto.offset.reset” property to earliest.
Figure 9-17
Figure 9-18
Click “Run query” and notice the terminal pane where the results of the
persistent query appear. You will likely see the following:
Figure 9-19 Output of the persistent query: three rows with IDs 5, 7, and 8 (John, Jane, and Jan, all in the Physics course)
We have just witnessed how these queries are not exactly relational database
queries but rather continuous queries over streams. As soon as we insert data
into the stream, the same data pops up in the output of our persistent query.
Kafka Connect
We saw in our earlier discussion how Kafka Connect helps us scale and stream
data reliably between Apache Kafka and other data systems. Kafka Connect
offers components called connectors that connect to other data systems to
enable and coordinate data streaming. There are two types of connectors,
namely, source connectors and sink connectors. Source connectors ingest data
from external systems and stream it into Kafka topics, whereas sink connectors
deliver data from Kafka topics to external systems.
Best Practices
As you can see, we have illustrated Kafka streaming data pipelines using a
commercial distribution, Confluent. At the time of writing this chapter, the
vendor provided a few hundred dollars in cloud credits that can be applied
toward its services, so anyone looking to get their feet wet with Kafka can run
sample clusters at little or no cost. At the same time, it is good practice to
be mindful of unused Kafka clusters running in the background. Here are some
commands that can be used to clean up unused Kafka resources:
DROP STREAM IF EXISTS customers;
You can also asynchronously delete the respective topic backing the table or
stream:
DROP TABLE IF EXISTS t_customers DELETE TOPIC;
You can remove Kafka connectors as well:
DROP CONNECTOR IF EXISTS customerDB;
Conclusion
It is safe to say that Apache Kafka represents a paradigm shift in how we
approach building real-time data pipelines and stream analytics solutions. The
architecture is scalable, distributed, and fault tolerant, which provides a
reliable foundation for building and managing complex data pipelines that can
handle modern data streams. We discussed the Kafka architecture, its various
components, and how they all work together. We looked at publisher–subscriber
models and how topics are stored across brokers. We then looked at streaming
data pipelines using ksqlDB and how the data can be analyzed to gain business
insights. I hope the skills and knowledge you gained from this chapter serve
you well in building real-time data pipelines.
© The Author(s), under exclusive license to APress Media, LLC, part of
Springer Nature 2024
P. K. Narayanan, Data Engineering for Machine Learning Pipelines
https://doi.org/10.1007/979-8-8688-0602-5_10
Introduction
In the complex world of data engineering and machine learning modeling,
where there are several moving parts (as you have seen so far), it is
essential to deliver data services and deploy machine learning models for
consumption. In this chapter, we will be discussing FastAPI. FastAPI is a
Python library that primarily enables web application development and
microservice development. We will look at FastAPI with the sole intention
of using the library for delivering data services and machine learning
models as services. Though Django has won many hearts in the world of
Python-based application development, FastAPI is simple, strong, and very
powerful. It is important for an ML engineer or data engineer to understand
the concept of application programming interfaces.
OpenWeather API
https://openweathermap.org/api
Types of APIs
Not all APIs are made the same way, nor do they serve the same constituents.
The most common class of APIs is open APIs. These are provided by various
organizations, and they are free to access; engineers can use the services
without restriction. The other class is internal APIs. These also deliver
services; however, they are internal to an organization and are not exposed to
the public. Instead, they are used to fetch data and functionality by various
teams within that organization only. There is another variation of internal
APIs, called partner APIs. These retain some level of confidentiality; however,
they may cater to the organization and some of its partner organizations as
well.
In addition to various classes, and the purposes they serve, there are various
methods of accessing APIs; some of them are as follows.
SOAP APIs
SOAP stands for simple object access protocol. SOAP APIs have long been
in existence and used in enterprise-grade applications. SOAP is primarily
meant for exchanging structured data between two applications. SOAP uses
either HTTP or SMTP to transport the message and an XML template for message
formatting.
REST APIs
GraphQL APIs
GraphQL stands for graph query language. As the name suggests, GraphQL uses
query-based requests for fetching data, making sure clients fetch just the data
they need. The emphasis of GraphQL is on avoiding overfetching. While that may
be more detail than you need (depending upon your background), it is worth
noting that such an API style exists.
Webhooks
Webhooks are sometimes described as reverse APIs: rather than the client
repeatedly requesting data, the server sends an HTTP callback to a URL the
client has registered whenever a relevant event occurs.
Now, let us step back and look at how a typical API interaction works. It all
starts off when one of the parties, let us call them the client, sends a
request to the host through the API. The API carries the message and submits it
to the host. The host processes that message, performs the necessary
procedures, and prepares the requested information.
Now, the host (let’s call them server) responds to the client. The response
contains the information intended for the corresponding request. This
interaction or communication, whichever you prefer to call it, happens through
an exchange of data. The data is structured in a certain way so that APIs and
servers can understand it; this structure is commonly referred to as a data
format. The most common data formats are JSON and XML. These data formats
ensure that all three entities (client, API, and server) can understand the
interaction with each other.
Endpoints
Jane Doe,
Accounting Department,
ABC Company,
Say one sends a physical mail to this address; it would go to the respective
town, to the concerned organization (a hypothetical one, in this case), to the
respective department, and finally to the person addressed within that
department.
In here
This whole URL is called an endpoint. Endpoints are addresses just like the
physical address mentioned previously. By stating the correct base URL,
correct path, and appropriate query parameters, the movie theatre server
will be able to process your request and retrieve the response.
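To make this concrete, here is a hypothetical call to such an endpoint from Python using the requests library; the base URL, path, and query parameters are made up for illustration:
import requests

# Hypothetical endpoint: base URL + path + query parameters
response = requests.get(
    "https://app.multiplexapi.com/multiplexapi/movies",
    params={"city": "Chicago", "date": "2025-01-31"},
)
print(response.status_code)
print(response.json())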
APIs move data from source to sink and provide an easier way to ingest
data.
To build an API, we begin by defining the API requirements like the data it
may need to access, operations it will perform, and outcomes it will
produce, to name a few. Once you have the requirements defined, you can
move toward building data models, methods, and endpoints and defining
response data formats for requests, authentication, authorization, etc. Using
the requirements, implement the API, where you set up the required servers
and their infrastructure, implement the code and dependencies (if any), and
other relevant tasks. Performing adequate testing on features and
functionalities may be a good step, either during the implementation or
post-implementation phase. The final step would be to deploy the API in
production servers; performance and usage monitoring may benefit
addressing any improvements that may be needed.
REST API
As discussed, REST expands to representational state transfer. It is one of
the types of web service that complies with REST architectural standards.
REST architectural standards leverage client–server communication over a
given network. REST architecture is stateless, meaning that the server will
not maintain session information for a request from a client. Instead, the
request message itself would contain just enough information for the server
to process the request.
RESTful services listen for HTTP methods and retrieve the response from
the server. HTTP stands for HyperText Transfer Protocol. The HTTP
methods are such that they tell the server which operation to perform in
order to retrieve the response for a request. Two important properties of HTTP
methods are safety and idempotency. Safe HTTP methods do not change data on the
server and therefore return the same or similar response regardless of how many
times they are called. Idempotent methods may change data on the server, but
repeating the same request has the same effect as making it once. Some of the
more common HTTP methods (or verbs) include
GET: Reads data from the server and does not alter or change
resources in any way
GET /multiplexapi/movies/
HEAD: Similar to GET, retrieves only the header data without the
content
Here is a HEAD request to see the last update time for movies:
HEAD /multiplexapi/theatreslist
PUT: Replaces current data with new data; can be seen as performing
an update operation
OPTIONS: Describes the communication options (allowed methods) for a
given resource
OPTIONS /multiplexapi/bookings
CONNECT: Establishes a tunnel to the server identified by the target
resource
CONNECT app.multiplexapi.com:3690
Here are some common HTTP status codes and their descriptions:
HTTP Status Code: Description
201 Created: Request successful; a new resource was created as a result.
301 Permanent Redirect: The requested page has permanently moved to a new URL.
400 Bad Request: The server cannot process the request due to an error from the client.
404 Not Found: Whatever is requested cannot be found on the server.
500 Internal Server Error: Due to some unexpected issue, the server cannot return what was requested.
503 Service Unavailable: The server is unable to handle the request at the moment.
These web services expose their data or a service to the outside world
through an API.
To build APIs in Python, there are quite a few packages available. Flask is a
Python library known for building simple, easy-to-set-up, smaller APIs. Two
other popular options are Django and FastAPI. Django has been around for a long
time, has more functionality, and offers more out of the box than Flask. In
this chapter, we will take a deep dive into FastAPI.
FastAPI
FastAPI provides certain advantages over other Python libraries; first off,
FastAPI supports Python’s asynchronous programming through asyncio.
FastAPI automatically generates documentation that is based on Open API
standards, allowing the users of the API to explore the documentation and
test various endpoints. Most important of all, as the name suggests, FastAPI
is high performance and faster than many other Python libraries.
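A common way to install FastAPI together with its optional extras (assuming you want the full set) is
pip install "fastapi[all]"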
This will install all the optional dependencies used by the FastAPI, Pydantic,
and Starlette libraries, which include the “uvicorn” web server and the
asynchronous HTTP client “httpx.”
Now let us consider a simple hello world application using FastAPI. Here
is how that looks:
from fastapi import FastAPI, APIRouter
app = FastAPI()
api_router = APIRouter()
@api_router.get("/")
def root() -> dict:
    return {"msg": "Hello, World!"}

app.include_router(api_router)
Core Concepts
Starlette, on the other hand, is an ASGI-standard framework that provides the
core functionality of FastAPI. Starlette provides the core routing
infrastructure, request and response handling, database integration, and
many other features that are leveraged by FastAPI. On top of Starlette and
the ASGI standard, FastAPI natively provides data validation, documentation
generation, etc., so engineers can focus on the business logic.
https://www.amazon.com/
Figure 10-1
Figure 10-2
Note
You can also specify the data type of the path parameter. In the earlier
example, where the get_student() function takes a path parameter, the data type
of that path parameter is also defined.
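That earlier example is not reproduced here; a minimal sketch of a typed path parameter might look like the following (the route path is an assumption):
from fastapi import FastAPI

app = FastAPI()

@app.get("/students/{student_id}")
def get_student(student_id: int) -> dict:
    # FastAPI parses and validates student_id as an integer;
    # a request such as /students/abc produces a validation error
    return {"student_id": student_id}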
If you pass a value that does not match the declared data type, FastAPI throws
an HTTP validation error, like the following one:
Figure 10-3
Here is an example:
from fastapi import FastAPI
app = FastAPI()
Students = [
{
"Student_id": 1,
"Student_name": "StudentPerson1",
"Course" : "BIO 101",
"Department" : "Biology"
},
{
"Student_id": 2,
"Student_name": "StudentPersonTwo",
"Course" : "CHEM 201",
"Department" : "Chemistry"
},
{
"Student_id": 3,
"Student_name": "StudentPerson3",
"Course" : "MATH 102",
"Department" : "Mathematics"
},
]
@app.get("/students/")
def read_item(skip: int = 0, limit: int = 10):
    return Students[skip : skip + limit]
http://127.0.0.1:8000/students/?skip=1&limit=2
The query parameters are skip and limit; we intend to skip the first record
and limit the total results to two records only. So the output would look
something like this:
Figure 10-4
Pydantic Integration
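The listing for this section appears as a figure in the book; here is a minimal sketch of the pattern it describes, with a Pydantic model, a get method, and a post method (the model fields, route paths, and in-memory store are assumptions):
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class Student(BaseModel):
    student_id: int
    student_name: str
    course: str

# In-memory store for illustration only; nothing persists to disk
students = {}

@app.get("/students/{student_id}", response_model=Student)
def get_student(student_id: int):
    if student_id not in students:
        raise HTTPException(status_code=404, detail="Student not found")
    return students[student_id]

@app.post("/students/", response_model=Student)
def create_student(student: Student):
    # The request body is validated against the Student model
    students[student.student_id] = student
    return student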
The code utilizes examples similar to previous instances. The get method
retrieves students by their student_id. The post method creates a new student
from the supplied parameters. Please note that this example is for illustrative
purposes only; the posted data does not persist to disk.
Response Model
In the get and post functions we have defined, one can see the use of the
“response_model” attribute. The response_model parameter does not actually
change the return type of the function; it tells FastAPI to convert the
returned data into the given Pydantic model and perform the necessary
validation and filtering (wherever required). You can pass the
“response_model” parameter to the FastAPI decorator methods. Some of these
methods are
app.get()
app.post()
app.delete()
… and so on.
There is also an option to set the response_model to None. This will enable
FastAPI to skip performing the validation or filtering step that
response_model intends to do. By specifying the response model to None,
you can still keep the return type annotation and continue to receive type
checking support from your IDE or type checkers.
Note
Figure 10-5
Within that window, do expand any of the HTTP methods to try out their
functionality. Here is how that looks for “Get Student”:
Figure 10-6
Figure 10-7
When you are writing a data processing pipeline, you might come across a piece
of functionality that is basically the same code but applies to multiple data
pipelines. The easiest and quickest example is connecting to a specific
database. It is possible that there may be “n” data pipelines that require a
connection to a specific database; obviously, each such pipeline depends on
connecting to this database. The process of supplying your code with this
database-connectivity dependency is called dependency injection.
Note
In order for the GET method to retrieve and respond with that piece of
information, it requires a connection to a database. As you may notice, the
database connection is defined as a separate function. That database connection
then gets passed as a parameter to the get_users function with the FastAPI
method “Depends”. This way, the “get_users” function calls the “dbConnect()”
function, which establishes a connection to the database.
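Here is a minimal sketch of the pattern just described; the names dbConnect and get_users follow the text, while the connection details and the query are placeholders:
from fastapi import FastAPI, Depends
from mysql import connector

app = FastAPI()

def dbConnect():
    # Each request gets its own connection; it is closed when the request ends
    conn = connector.connect(host="localhost", user="<user>",
                             password="<password>", database="stu")
    try:
        yield conn
    finally:
        conn.close()

@app.get("/users")
def get_users(db=Depends(dbConnect)) -> dict:
    cursor = db.cursor()
    cursor.execute("SELECT id, name FROM students")
    return {"users": cursor.fetchall()}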
FastAPI integrates with practically any relational database. There are different
ways of interacting with databases. Let us begin by looking at a simple and
straightforward way of interacting with a relational database. For our
illustration, we will be looking at working with an open source relational
database called MySQL on a Debian workstation. Let us execute the
following commands:
sudo apt update
sudo apt install mysql-server
Check whether MySQL has been installed correctly:
mysqld --version
Let us log in as root:
mysql -u root -p
Enter the appropriate password and you will get the mysql prompt:
mysql>
Let us create a new database user:
create user 'newscott'@'localhost' identified with caching_sha2_password
by 'Tiger@#@!!!9999';
To list all the databases, use
show databases;
To select a database, use
use your-database-name;
To create a database, use
create database your-database-name;
Let us create a database, create a table within that database, and insert data
into that table:
create database stu;
Use stu;
CREATE TABLE students (
id INT(11) NOT NULL AUTO_INCREMENT,
name VARCHAR(100) NOT NULL,
course VARCHAR(100) NOT NULL,
PRIMARY KEY (id)
);
The following command is also helpful:
show tables;
would yield
+---------------+
| Tables_in_stu |
+---------------+
| students |
+---------------+
insert into students (id, name, course) values (1, "test", "tescourse");
insert into students (id, name, course) values (2, "JohnDoe", "Physics101");
insert into students (id, name, course) values (3, "Joe", "Chemistry101");
Let us look at the table now:
select * from stu.students;
+----+---------+--------------+
| id | name | course |
+----+---------+--------------+
| 1 | test | tescourse |
| 2 | JohnDoe | Physics101 |
| 3 | Joe | Chemistry101 |
+----+---------+--------------+
Let us also grant privileges to the new database user we have created:
grant all privileges on stu.* to 'newscott'@'localhost';
flush privileges;
Note
These commands are meant for simple database setup for development
purposes. You would have to use mysql_secure_installation and follow
DBA best practices for setting up production instances of databases.
Now create a new Python script called “db1.py” and enter the following
code:
from fastapi import FastAPI
from mysql import connector

dbapp = FastAPI()

database_creds = connector.connect(
    host="localhost",
    user="newscott",
    password="Tiger@#@!!!9999",
    database="stu",
    auth_plugin='mysql_native_password')

# GET request to list all students
@dbapp.get("/AllStudents")
def get_students() -> dict:
    cursor = database_creds.cursor()
    cursor.execute("SELECT * FROM students")
    result = cursor.fetchall()
    return {"students": result}

# GET request to obtain a name from student id
@dbapp.get("/student/{id}")
def get_student(id: int) -> dict:
    cursor = database_creds.cursor()
    # Parameterized query avoids SQL injection
    cursor.execute("SELECT name FROM students WHERE id = %s", (id,))
    result = cursor.fetchone()
    return {"student": result}

# GET request to obtain a course from student name
@dbapp.get("/stucourse/{name}")
def stucourse(name: str) -> dict:
    cursor = database_creds.cursor()
    cursor.execute("SELECT course FROM students WHERE name = %s", (name,))
    result = cursor.fetchall()
    return {"student": result}

# POST request to insert a new student
@dbapp.post("/newstudent")
def newstudent(name: str, course: str) -> dict:
    cursor = database_creds.cursor()
    sql = "INSERT INTO students (name, course) VALUES (%s, %s)"
    val = (name, course)
    cursor.execute(sql, val)
    database_creds.commit()
    return {"message": "student added successfully"}
Some of the ORM libraries that are utilized more often are Peewee,
SQLAlchemy, and Django ORM. Django, being a web framework, has its
own ORM library built in. Currently, the most adopted or widely used
ORM library is SQLAlchemy.
SQLAlchemy
SQLAlchemy consists of two major modules, core and ORM. The core
module provides a SQL abstraction that enables smooth interaction with a
variety of database application programming interfaces (also known as
DBAPI). The core module does not require ORM to be adopted to connect
and interact with a database. The ORM module builds on top of the core
module to provide object relational mapping functionality to SQLAlchemy.
Let us look at some key concepts in SQLAlchemy.
Engine
The engine is the starting point of any SQLAlchemy application. It manages a
pool of connections to the database and the dialect used to talk to that
specific database, and it is typically created once per database URL using the
create_engine() function (used in the snippet that follows).
The other key concept is the session. As the name suggests, the session holds
the conversation with the database after a connection is established. The
session provides a workspace for all your queries and transactions (commits,
rollbacks, etc.). If you are building a bigger application, it is recommended
to utilize the sessionmaker class to create sessions. Here is an illustration
of how to utilize “sessionmaker” to create a new session:
from sqlalchemy import URL, create_engine
from sqlalchemy.orm import sessionmaker

databaseURL = URL.create(
    "mysql+mysqlconnector",
    username="your_username_here",
    password="your_password_here",
    host="localhost",
    database="testdatabase",
)
engine = create_engine(databaseURL)
SampleSession = sessionmaker(autoflush=False, bind=engine)
session = SampleSession()
# queries, transactions here
session.close()
The argument “bind” associates an engine with the session. The argument
“autoflush” controls whether pending changes are automatically flushed to the
database before each query is executed (and before certain session methods
run).
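As a quick illustration of issuing a query through a session, here is a minimal, self-contained sketch (the connection string and table name are placeholders):
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

engine = create_engine(
    "mysql+mysqlconnector://your_username_here:your_password_here@localhost/testdatabase")
SampleSession = sessionmaker(autoflush=False, bind=engine)

# The session is opened, used, and closed via the context manager
with SampleSession() as session:
    rows = session.execute(text("SELECT id, name FROM students")).all()
    for row in rows:
        print(row.id, row.name)
    session.commit()  # commit any pending changes before the session closes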
Query API
Alembic
Figure 10-8
Middleware in FastAPI
Note
Please note that the focus is more on the FastAPI implementation for our
machine learning app illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import pickle
import numpy as np

df = pd.read_csv("/<your-path-to>/iris.csv")
x = df.iloc[:, :4]
y = df.iloc[:, 4]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

model = LogisticRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
# print(confusion_matrix(y_test, y_pred))
# print(accuracy_score(y_test, y_pred) * 100)

# Persist the trained model to disk
pickle.dump(model, open('irismodel1.pkl', 'wb'))

# Quick sanity check: load the model back and predict on one example sample
model1 = pickle.load(open('irismodel1.pkl', 'rb'))
input_data = [5.1, 3.5, 1.4, 0.2]  # hypothetical measurements
input_data_reshaped = np.asarray(input_data).reshape(1, -1)
prediction = model1.predict(input_data_reshaped)
Once we have the model pickle file, we can proceed to build our application in
FastAPI with a custom middleware that limits access to the API after a certain
number of requests. We start off by importing the necessary classes and
methods, followed by instantiating FastAPI, loading the model, and creating a
dictionary to store IP addresses and their number of hits. Then we define the
middleware function, where we obtain the IP address from the request object and
increment the hit count for each IP address. The middleware is coded to ensure
that if the number of hits goes beyond 5 in a given hour, the request is
rejected:
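The application listing appears as a figure in the book; here is a minimal sketch of the approach just described (the file name, request limit, time window, and input field names are assumptions):
import pickle
import time
import numpy as np
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from pydantic import BaseModel

app = FastAPI()
model = pickle.load(open("irismodel1.pkl", "rb"))

hits = {}      # maps an IP address to (hit count, window start time)
LIMIT = 5      # allowed requests
WINDOW = 3600  # per hour, in seconds

@app.middleware("http")
async def rate_limiter(request: Request, call_next):
    ip = request.client.host
    count, start = hits.get(ip, (0, time.time()))
    if time.time() - start > WINDOW:
        count, start = 0, time.time()  # reset the hourly window
    count += 1
    hits[ip] = (count, start)
    if count > LIMIT:
        # Reject the request instead of passing it on to the endpoint
        return JSONResponse(status_code=429,
                            content={"detail": "Too many requests, try again later"})
    return await call_next(request)

class IrisInput(BaseModel):
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float

@app.post("/predict")
def predict(data: IrisInput):
    features = np.asarray([[data.sepal_length, data.sepal_width,
                            data.petal_length, data.petal_width]])
    prediction = model.predict(features)[0]
    return {"prediction": str(prediction)}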
Note
To access or test the API, it is suggested to open a browser tab and enter the
FastAPI documentation link:
http://127.0.0.1:8000/docs#/default/predict_predict__post
or simply
http://127.0.0.1:8000/docs
would yield
Figure 10-9 The Swagger UI documentation for the POST /predict endpoint, showing the JSON request body it expects
You can start inputting the values and let the system classify (predict) the
iris plant. Here is how that looks:
Figure 10-10
Conclusion
APIs, or application programming interfaces, are one of the growing ways of
consuming data analytics and machine learning services. API-based development
and deployment is an ocean of its own; it is good for an ML engineer or a data
engineer to know how to set up and build an API. In this chapter, we looked at
the concepts and architecture of APIs. We went into greater detail on the
FastAPI architecture and looked at a variety of illustrations. We moved on to
database integration, building a data service REST API. We also looked at
deploying a machine learning model as a REST API. Not only have you learned the
core concepts of FastAPI, but you have also acquired a skillset that bridges
the gap between complex engineering and service delivery. Overall, I hope this
chapter has served you well in understanding how to engineer and deploy data
pipelines using FastAPI.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2024
P. K. Narayanan, Data Engineering for Machine Learning Pipelines
https://doi.org/10.1007/979-8-8688-0602-5_11
Introduction
Organizations are increasingly leveraging data. In addition to building and
deploying products and services, companies are creating more data teams and
performing various data projects, from data discovery to extracting value to
ensuring sound data security and privacy practices. Data projects can be
based on IT products, IT services, shared services, and other tasks and
activities in organizations. Depending upon the nature and type of data
projects, the data pipelines can get complex and sophisticated. It is
important to efficiently manage these complex tasks, making sure to
orchestrate and manage the workflow accurately to yield desired results.
Every data project requires transparency and observability as to what
the project does, how it is working out, and what systems it is sourcing
from and feeding to. Every data job or data pipeline is essential enough that
in case of a job failure, one is expected to restart the job and troubleshoot
whatever the systems that caused failure. With the big data revolution,
organizations are increasingly looking to leverage various kinds of
machine-generated data. As the number of data pipelines increases, it
presents another complexity of managing them.
By the end of this chapter, you will learn
– Fundamentals of workflow management and its importance to
engineering data pipelines
– Concepts of workflow orchestration and how it optimizes the data
pipeline operations
– The cron scheduler and its role in task automation
– Creating, monitoring, and managing cron jobs and cron logging
– Cron logging and practical applications of scheduling and orchestration
of cron jobs
Slicing and packaging, where the bread that is made is sliced, packaged
in a paper bag, and stamped with when it was baked and how long it remains
safe for consumption.
6.
Distribution, where the bread packed in paper bags is once again
packed into boxes and loaded into various trucks. These trucks then
transport the bread boxes to various destinations that include grocery
stores and restaurants.
Here is how this may look:
Figure 11-1 An example of an assembly line workflow
This is the most basic example of a workflow. The steps listed earlier cannot
be interchanged; they have to be followed in exactly this order. Furthermore,
the output of a given step constitutes the input of the next step, and so on
down the chain.
A recent variant of ETL is referred to as ELT. ELT stands for extract, load,
and transform. In this case, data from various sources is extracted and loaded
directly into the target system. From there on, various transformations and
business rules are applied to the data in place. There may or may not be a
staging layer in an ELT workflow. ELT workflows are more common with cloud data
stores and are better suited when you are dealing with diverse data types. Most
cloud providers offer serverless data lake/warehouse solutions that serve ELT
pipelines, making them cost effective and easy to maintain.
Let us take an example of what an ELT data pipeline may look like.
Let's say relational data is being sourced from an inventory system covering
various products, along with JSON data from log sensor readings in the
warehouse and a few other datasets. Obtaining this data at frequent intervals
constitutes the extract stage. For the load stage, imagine a Google BigQuery
dataset into which you load all of this data. The transform stage is where the
data gets cleaned, wrangled, and either integrated into a unified dataset or
loaded into dimensional tables for further analysis.
Here is how this may look:
Figure 11-3 A sample ELT architecture for inventory data processing pipelines
Workflow Configuration
Workflow configuration is another key concept to understand. Workflow
configuration is basically a set of configuration parameters that would drive
the workflow processes. Workflow configuration is essential and crucial to
the success of delivering a product or solution, in this case, bread.
In this example, workflow configuration might involve the following
aspects:
1.
Ingredient mixing: The exact quantity of each ingredient, with all
measurements on the same scale.
2.
Proofing: The controlled temperature of the proofing room, the exact
amount of oil that is applied on the proofing sheet, and an automatic
timer that will alarm once the number of hours of proofing has been
completed.
3.
Baking: The calibration of the temperature sensor, the total amount of
time needed to be spent in the oven, and the exact temperature that the
oven will be in during the time spent in it (first 10 minutes, 170 °C;
second 10 minutes, 200 °C; cooling 10 minutes, 150 °C).
4.
Quality control: The number of defects should be reduced to 3 loaves
for every 1,000 produced and so on.
5.
Slicing and packaging: The slicers should be configured to specific
measurement so the bread is cut evenly; packing paper should be clean
and without defects.
6.
Distribution: The various types of boxes and how many loaves they can
fit in; temperature within the truck; delivery time should be within x
hours.
The preceding workflow configuration helps the organization streamline the
production process, establish evenness and consistency, plan procurement
and manpower needs, and make optimal use of resources. This level of
detailed configuration also makes clear exactly which steps must be followed
to produce quality bread.
Similarly, in the case of a data engineering or machine learning pipeline,
workflow configuration parameters exist. Some of them could be as
follows:
1. If a job is scheduled to run periodically, the job may require the most
recent or fresh data; the date and time of the most recent source system
refresh is therefore an essential parameter.
2.
Organizations are starting to adopt data governance best practices. One
such practice is to create a service account or specialized user account
(depending upon the size, nature, and complexity of the data shop). The
appropriate credentials must be configured in order to access the right
data, which is crucial for building better data products.
3.
A data pipeline may consist of several transformation and business
rules. These transformations are basically functions that are applied to
the data that was sourced from the system. Such transformation logic
and business rules are workflow configurations as well.
4.
While attempting to load the final data product into an appropriate
system, there may be constraints that would enforce loading the data
into a schema. A schema is simply a table structure where all the fields
and their data types are defined. The schema can also be another
workflow configuration parameter.
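Such parameters are often captured in a simple configuration structure that the pipeline reads at startup. Here is a hypothetical sketch; every key and value below is illustrative only.
pipeline_config = {
    "source_refresh": {
        "last_refresh_at": "2024-02-05T02:00:00Z",  # freshness of the source system
        "require_fresh_data": True,
    },
    "credentials": {
        "service_account": "svc_data_pipeline",      # governed service account
        "secret_name": "warehouse/service-account",  # resolved at runtime, never hard-coded
    },
    "transformations": [                             # ordered business rules to apply
        "deduplicate_products",
        "standardize_units",
        "apply_discount_rules",
    ],
    "target_schema": {                               # schema enforced at load time
        "product_id": "STRING",
        "quantity": "INTEGER",
        "recorded_at": "TIMESTAMP",
    },
}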
Workflow Orchestration
Workflow orchestration refers to the centralized administration and
management of the entire process. Workflow orchestration involves
planning, scheduling, and controlling of each step of the production process
and their respective workflow configuration parameters, in order to ensure
that all tasks are executed within the controls defined and in the appropriate
order. Workflow orchestration optimizes the utilization of resources, helps
plan procurement of raw materials, and establishes high quality and
consistency among each of the products developed, eventually leading to
seamless functioning of the organization. In this context, workflow
orchestration would refer to
1.
Defining the entire workflow: Where each step of the processes is
defined with configuration parameters baked in.
2.
Scheduling: Where the defined workflow will be scheduled, wherein a
process will start upon completion of a process in the assembly chain.
3.
A central control room: Overseeing the entire shop floor, from ingredient
mixing to packing bread into boxes, where a forklift operator places these
boxes in trucks.
4.
Automations: Where a computer specifies and controls configuration
parameters for each of the industrial appliances like ovens, proofing
rooms, etc.
5.
Error handling and routing: In case a machine in the assembly chain
fails to process its task, the workflow orchestration system would
implement retries (automatic restart of the machine) for specific
appliances and trigger certain events to avoid bottlenecks.
6.
Integration with external systems: So the workflow orchestration keeps
track of the number of boxes shipped, the number of loaves produced,
which cluster of market received how many boxes, the number of units
of raw materials consumed, the number of hours the machines were
switched on, and other such data and feeds them to external systems.
Workflow orchestration enables a factory floor to run in a streamlined
fashion, with fewer errors, greater efficiency, and more predictability.
Similarly, in the context of data pipelines, some of the workflow
orchestration can be seen as the following:
1.
Defining the entire workflow, where each step of the data engineering
pipeline is configured with appropriate parameters to successfully
execute.
2.
In case a task fails to connect to a source or target system, a mechanism
is available to be able to automatically re-execute a function after a few
seconds with the hopes of connecting to that system.
3.
In-built or well-integrated secrets management, so the development or
production data pipeline is leveraging the appropriate credentials to
execute the job.
4.
Automatically logging the events of the job execution that could later
be utilized for data analysis, especially identification of any patterns in
certain types of events.
5. Centralized monitoring and observability, so one can view, manage,
monitor, and control all the data pipelines under one roof.
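Point 2 in the list above, automatic re-execution of a failed connection attempt, can be as simple as a retry loop. Here is a minimal sketch; the function names and retry counts are illustrative, and orchestration platforms typically provide this behavior out of the box.
import time

def connect_with_retries(connect_fn, max_attempts=3, delay_seconds=5):
    """Call connect_fn, retrying on failure with a short pause between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect_fn()
        except ConnectionError as exc:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay_seconds}s")
            time.sleep(delay_seconds)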
Concepts
At the heart of the cron scheduler, we have something called the cron
daemon process. This process runs on system startup and in the
background. The cron daemon executes commands that are mentioned in
the file called crontab. There are two files that specify the users and their
access to the cron scheduler environment.
These may be located at
/etc/cron.allow
/etc/cron.deny
/etc/cron.d/cron.allow
/etc/cron.d/cron.deny
Note In the case of Ubuntu and Debian operating systems, cron.allow
and cron.deny files do not exist, meaning all users of Ubuntu and Debian
operating systems are provided access to the cron scheduler by default.
The most basic unit of a cron scheduler environment is a cron job. A cron
job represents a task or job that is scheduled to be executed. A cron job
occupies one line in a crontab and specifies a shell command along with the
schedule on which it should run.
Crontab File
Crontab is short for cron table. crontab is the utility that maintains crontab
files for users. A crontab file is generated for a given Linux user. The cron
daemon reads these crontab files every minute, regardless of whether a task
or a job is due to run or not. A crontab file can be seen as a registry that
contains the list of scheduled jobs. The crontab spool directory is owned by
root, so you either use the crontab utility or become a superuser to create a
crontab file or edit an existing one. The default location for crontab files is
/var/spool/cron/crontabs/
To create or edit your crontab file, use “crontab -e” from the command line.
You will be presented with a commented template in your default editor
where you can write your own cron jobs.
To list all the current crontab files, you can try the following:
ls -l /var/spool/cron/crontabs/
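Each crontab entry follows the standard five-field layout, with the schedule fields followed by the command to run:
# minute   hour   day-of-month   month   weekday   <shell command>
# (0-59)  (0-23)     (1-31)     (1-12)    (0-6)
    *       *          *           *        *      /path/to/script.sh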
The minute column represents the specific minute when the job should
run. It takes numeric values only, ranging from 0 to 59.
The hour column represents the hour at which the job should run and
also takes only numeric values, ranging from 0 to 23.
The day-of-month column represents the day in a given calendar month.
As you know, months contain 28, 29, 30, or 31 days depending upon the
month and whether it is a leap year. You can specify a numeric value
between 1 and 31.
The month column represents a calendar month and takes numeric values
between 1 and 12, representing the calendar months from January to
December, respectively.
The weekday column represents a given day of the week, from Sunday
through Saturday. It takes values between 0 and 6, where 0 represents
Sunday and 6 represents Saturday.
The final field holds the shell command that should be executed.
Note In the crontab file, we have an asterisk operator. The “*” operator
can be applied in the first five columns of the crontab entry. When a “*”
is mentioned for the appropriate column, it means that it can run on all
the intervals of the given allowable range. Here is an example:
15 6 * * * /user/parallels/bin/helloworld
The preceding crontab entry would run the hello world script every day at
6:15 AM. Because the last three fields are asterisks, the script runs on every
day of the month, in every month, and on every weekday.
Cron Logging
The cron daemon process writes a log entry every time the system starts up,
every time it executes a job, and whenever it has trouble interpreting a
command or encounters an error during the execution of a job.
Unfortunately, the log entry does not contain the source of the error. On
Debian-based operating systems (Ubuntu and others), cron messages are
written to /var/log/syslog; on Red Hat-based systems, they typically appear
in /var/log/cron.
You can obtain the most recent cron log entries by the following
command:
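On a Debian-based system, for example, you might filter the syslog for cron entries (assuming the default syslog configuration):
grep CRON /var/log/syslog | tail -n 20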
Logging in cron stores a lot of key information. The log entry starts off
with a date timestamp when the job was started, the host of the workstation,
the user account of that workstation, and the command that is executed.
Here is an example of a log entry:
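A syslog-style cron entry might look something like the following; the exact format varies by distribution, and the host, user, and script shown are the ones discussed below:
Feb  5 06:15:01 linux CRON[2357]: (parallels) CMD (python3 /home/parallels/Documents/helloworld.py)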
In the preceding example, the log entry begins with the date timestamp,
followed by the hostname (in this case, linux), followed by the user on the
host (in this case, parallels) and finally the command that is executed (in
this case, executing a Python script named helloworld.py).
nano cronexample.sh
And enter the following code in the nano editor:
#!/bin/bash
Save the file and exit the editor. Let us provide execute permissions to this
file:
chmod +x cronexample.sh
Now enter the crontab to schedule the execution of this shell script:
crontab -u parallels -e
What we have is
* * * * * /home/parallels/Documents/cronexample.sh
This crontab entry would execute the script at every minute of every hour,
on every day of every month.
Now let us introduce another operator through another example:
*/6 * * * *
/home/parallels/Documents/cronexample.sh
The “/” operator specifies a step value. In this case we have applied a step
value to the minute field, so the script runs every six minutes (at minutes 0,
6, 12, and so on), regardless of the hour, day of month, month, or weekday.
The “/” step operator can be applied to the other fields as well, enabling
much more complex schedules.
Here is an example:
*/24 */4 * * *
/home/parallels/Documents/cronexample.sh
This would make the cron scheduler run the script at minutes 0, 24, and 48
of every fourth hour. Here is another sample schedule:
0 22 * */6 1-5
/home/parallels/Documents/cronexample.sh
This entry runs the script at 10 PM on weekdays (Monday through Friday),
in every sixth month (January and July).
Now, let us try to combine the range operator and the step value
operator:
36 9-17/2 * * *
/home/parallels/Documents/cronexample.sh
In this example, we have scheduled the script to run at the 36th minute of
every second hour between 9 AM and 5 PM (that is, at 9:36, 11:36, 13:36,
15:36, and 17:36).
Some cron implementations additionally support the character “L” and the
hash symbol “#”. The “L” character indicates the last of the values for a
given field, and the “#” operator indicates the nth occurrence of a weekday
in the month. Using these, one could, for example, schedule a script to be
executed every 15 minutes between 3:00 AM and 3:59 AM on the last day of
the month, if that last day happens to be the fourth Friday of the month, in
the months of June, August, and September.
Database Backup
Database backup is the process of creating a copy of an organization’s master
files and transactional tables. This is a critical operation: data is among the
most valuable assets an organization has, and it is required to run everyday
business operations. Various backup methods exist, differing in scope and
frequency: differential backups, incremental backups, and complete (full)
backups.
Let us look at a simple example where a shell script executing database
backup operation is scheduled:
0 2 * * * /home/parallels/mysql_backup.sh
Data Processing
Many organizations work with a variety of systems that generate data. The
data team within the organization typically generates reports every day or
transfers certain files between systems. Such recurring data processing tasks
can be automated using the cron scheduler utility.
Let us look at scheduling a batch job that performs data processing or a
report generation every day:
45 3 * * *
/home/parallels/Documents/data_processing.sh
Email Notification
Email notification is the process of notifying one or more persons with a
certain message. In the world of data systems and engineering, one can set
up email notifications about the status of a processed task (backing up a
database, preparing a dataset, data mining, etc.). Several libraries exist for
configuring an email server and sending messages to the appropriate
recipients.
Let us look at a simple example:
45 7 * * MON-FRI
/home/parallels/email_notification.sh
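The shell script behind such an entry might simply invoke a small Python program. Here is a minimal sketch using Python's standard smtplib and email modules; the SMTP host, port, credentials, and addresses are placeholders, not values from this setup.
# email_notify.py
import smtplib
from email.message import EmailMessage

def send_status_email(subject: str, body: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "pipeline@example.com"
    msg["To"] = "data-team@example.com"
    msg.set_content(body)

    # Connect to the mail server, upgrade to TLS, authenticate, and send.
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("pipeline@example.com", "app-password-here")
        server.send_message(msg)

if __name__ == "__main__":
    send_status_email("Nightly backup", "The database backup completed successfully.")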
Conclusion
Workflow management and orchestration are essential components in data
engineering. In this chapter, we looked at crontab and cron scheduling and
how we can orchestrate complex jobs and automate routine tasks with them.
What appears to be a simple concept is in fact a powerful scheduling tool for
learned in this chapter may serve you well when looking at modern data
orchestration platforms, which we will see in the next couple of chapters.
Efficient workflow management, scheduling, and orchestration are key to
unlocking the complete potential of your data infrastructure.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer
Nature 2024
P. K. NarayananData Engineering for Machine Learning
Pipelineshttps://doi.org/10.1007/979-8-8688-0602-5_12
Introduction
Apache Airflow is an open source workflow orchestration platform. It enables
engineers to define, schedule, and orchestrate workflows as directed acyclic
graphs (DAGs). Airflow supports a wide range of integrations with popular data
storage and processing systems and is suitable for a variety of use cases, from
ETL (extract, transform, load) processes to sophisticated machine learning
pipelines. With its scalable and extensible architecture, great graphical user
interface, and collection of plugins and extensions, Airflow is widely used to
improve productivity and ensure data reliability and quality. We will have a deep
dive into the architecture of Apache Airflow, various components, and key
concepts that make Apache Airflow a powerful workflow orchestrator.
Here is how we can get set up with Apache Airflow. The version I am working
with is Apache Airflow 2.8.1. Let us begin by creating a new virtual
environment:
python -m venv airflow-dev
Here’s how to activate the virtual environment in Linux:
source airflow-dev/bin/activate
If you are using MS Windows, then please use
airflow-dev/Scripts/activate
Create a new folder somewhere in your development workspace:
mkdir airflow_sandbox
Now, copy the folder path and set the AIRFLOW_HOME environment variable.
Ensure that you are performing this prior to installing Apache Airflow in your
system:
export AIRFLOW_HOME=/home/parallels/development/airflow_sandbox
The following instructions are extracted from the Apache Airflow website. We
shall obtain the Python version using the following command:
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
As mentioned earlier, we are looking to install Apache Airflow version 2.8.1.
Here is the command to perform this step:
pip install "apache-airflow==2.8.1" \
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-${PYTHON_VERSION}.txt"
This command will kick-start installing Apache Airflow and its dependencies.
Here is how that may look:
Figure 12-1 Installing Apache Airflow and its dependencies
Now, let us initialize the backend database, create an admin user, and start all
components by entering the following:
airflow standalone
Figure 12-2 Initializing the backend database, creating an admin user, and starting all components
Once this step is completed, let us initialize the Apache Airflow web server by
the following command:
airflow webserver -D
This command will start the Airflow web server as a background process.
Figure 12-3 Starting the Airflow web server as a background process
Once you have the web server running, let us try to run the scheduler. The
following command will kick-start the Airflow scheduler:
airflow scheduler -D
You may open your browser and enter the following URL to access the Apache
Airflow graphical user interface.
http://localhost:8080
Figure 12-4
When you ran the “airflow standalone” command, it must have created an admin
account with the password stored in a text file within the location of
AIRFLOW_HOME. Obtain the password from the text file
“standalone_admin_password.txt”. Here is how that may look:
Figure 12-5 Obtaining the admin password from the standalone_admin_password.txt file
Once your login is successful, you may see the following screen with prebuilt
jobs:
Figure 12-6 Home screen of Apache Airflow
Airflow Architecture
Airflow provides you a platform to define, schedule, and orchestrate jobs and
workflows. As discussed earlier, a workflow can be a series of tasks where the
output of a given task serves as the input of the subsequent task. In Apache
Airflow, a workflow is represented by a directed acyclic graph. A directed
acyclic graph is a mathematical structure consisting of a series of nodes
connected by edges such that no closed loop is formed anywhere in the
structure. In the context of Apache Airflow, a directed acyclic graph represents
a series of tasks that are connected to and dependent on each other but never
form a loop. They are simply called DAGs.
A DAG workflow has tasks represented as nodes; data sources and dependencies
are represented as edges. The DAG represents the order in which the tasks should
be executed. The tasks themselves contain programmatic instructions that
describe what to do. The concept of DAG is at the heart of Apache Airflow’s
functionalities. In addition to that, there are some core components that are
essential to Apache Airflow functionalities. Let us look at these core features in
detail, starting with the components we are already familiar with.
Web Server
The web server component provides the user interface functionality of the
Apache Airflow software. As we have seen in the previous section, the web
server runs as a daemon process. The installation shown earlier is a local,
single-machine standalone setup; you can also deploy Apache Airflow in a
distributed fashion, where the web server has its own instance with
heightened security parameters. The web server provides the user interface for a
developer to look at various jobs that are currently scheduled, the schedules they
were configured to run, the last time they were run, and the various tasks that are
contained in the workflow of a given job.
Database
Apache Airflow relies on a metadata database behind the scenes. The
scheduler, web server, and executor read from and write to this database to
record DAG runs, task instances, variables, connections, and other state. A
SQLite database is used by default for standalone installations, and it can be
swapped for MySQL or Postgres, as discussed later in the configuration
section.
Executor
The executor is how the tasks within a given workflow are run. The executor is
actually part of the scheduler component. It comes with several pluggable
executors. These executors provide various different mechanisms for running
your workflows. There are two major types of pluggable executors, namely, the
local executors and remote executors. As their names suggest, local executors
run tasks on your workstation (locally), whereas remote executors run tasks
across multiple worker nodes (a distributed environment). For instance, the
LocalExecutor runs on your local workstation and supports running several
tasks in parallel as separate processes. Another honorable mention is the
KubernetesExecutor, which
runs each task in a separate pod in a Kubernetes cluster. A pod is like a container,
and Kubernetes is a container orchestration platform. You have to have a
Kubernetes cluster set up to utilize the KubernetesExecutor.
Scheduler
This is the most important component of the Apache Airflow architecture. The
scheduler component monitors all the workflows and all the tasks within each
workflow. The scheduler triggers execution of a task once all its underlying
dependencies are complete. The scheduler is a separate service that needs to be
started manually, as we have seen in the “Setup and Installation” section. Apache
Airflow supports more than one scheduler component running concurrently to
improve performance. The scheduler component makes use of the executor to
execute tasks that are ready to go.
Configuration Files
The Airflow configuration file is not itself a component of the Apache Airflow
architecture; however, it controls how the components work. Apache Airflow comes
with an option to fine-tune various options through the “airflow.cfg” file. This
file gets created when you install Apache Airflow in your workstation. Each
property that is configurable has been provided with descriptions and default
configuration options as well. You can open the Airflow configuration file simply
using a notepad or text editor. You can also get the list of options from the
command line as well by entering the following command:
airflow config list
Figure 12-7
If you want to change the location where you keep your DAGs, then here is
where you would specify the location of the folder:
[core]
dags_folder = /user/parallels/<location-to-your-DAGs-folder>/
If you want to specify or change the executor that Airflow intends to use, then
here is where you would do so:
[core]
executor = LocalExecutor
If you wish to change the default SQLite backend to either a MySQL or Postgres
database instance, then change the connection string from the SQLite driver and
credentials to a MySQL driver or the psycopg2 driver along with the respective
credentials. Here is where you would replace the database string:
[database]
sql_alchemy_conn = sqlite:////
If you would like to gain visibility into your development or sandbox instance of
Apache Airflow and be able to view its configuration from the user interface,
then set the following parameter to True:
[webserver]
expose_config = True
Once you have made the necessary changes, it is essential to restart the Airflow
web server. A good production practice is to create a systemd service to start,
stop, and restart the Airflow server; you may find a sample systemd service
configuration file in the GitHub repository of Apache Airflow. For development
instances, however, it is easier to identify the process id (also known as the PID),
kill the process, and start the Apache Airflow server again by using the
following command:
airflow webserver -p 8080 -D
Note
The “-p” refers to specifying the port, whereas “-D” is to run the server as a
daemon process (commonly known as background process).
In addition, you can use the following command from the terminal to obtain a list
of all current DAGs loaded in the environment:
airflow dags list
A Simple Example
We import the required classes and methods, to begin with. We have two
functions in this code: the first function would print a string, and the other
function would display the current date and time. We then define the DAG where
we specify various details like name, description, and schedule. These DAGs can
also take in much more complex configurations. Then, we define the tasks for
each of those functions and associate them with our DAG. Finally, we define the
order of the tasks to be executed. While this may not print anything in the
command line, this is helpful to understand the various parts of workflow
definition using Apache Airflow. To get the DAG displayed on your user
interface, make sure you place the Python script in the appropriate folder.
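A minimal sketch along the lines of this description might look like the following; the DAG id, task ids, and schedule are illustrative.
import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def print_hello():
    print("Hello, World!")

def print_current_datetime():
    print(datetime.datetime.now())

with DAG(
    dag_id="simple_example",
    description="Prints a string and the current date and time",
    start_date=datetime.datetime(2024, 2, 5),
    schedule="@daily",
    catchup=False,
) as dag:
    hello_task = PythonOperator(
        task_id="hello_world_task",
        python_callable=print_hello,
    )
    datetime_task = PythonOperator(
        task_id="print_current_date",
        python_callable=print_current_datetime,
    )

    # Define the order of execution: print the greeting, then the timestamp.
    hello_task >> datetime_task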
Note
You have to place your Python script in the folder that is referenced in the
airflow.cfg configuration file in order to view and monitor your DAG. Make sure
your “dags_folder” property is pointing to the appropriate folder where you have
your scripts that need to be run. Moreover, locate the configuration parameter
named “expose_config” and set that parameter as “True”. This will expose all the
configuration parameters in the web server. In development or sandbox instances,
these may be helpful to troubleshoot.
Airflow DAGs
The directed acyclic graph, or simply the DAG, is the heart of a workflow
management and orchestration system, in this case Apache Airflow. A
typical DAG consists of various tasks connected to each other through
dependencies in such a way that no loop is formed anywhere in the graph.
Figure 12-8
These DAGs will define and control how these tasks will be executed; however,
they have no information about what these tasks actually do. In a way, these
DAGs are task agnostic.
There are three ways of declaring a DAG within a data pipeline. The most basic
and familiar way is using a constructor and passing the DAG into an operator that
you plan to use. Here is how that looks:
import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

my_test_airflow_dag = DAG(
    dag_id="hello_world",
    description="A simple DAG that prints Hello, World!",
    start_date=datetime.datetime(2024, 2, 5),
    schedule="@weekly",
)
EmptyOperator(task_id="my_task", dag=my_test_airflow_dag)
You can also declare a DAG by using a Python context manager and adding the
operators within the context:
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="hello_world",
    description="A simple DAG that prints Hello, World!",
    start_date=datetime.datetime(2024, 2, 5),
    schedule="@weekly",
):
    PythonOperator(
        task_id="hello_world_task",
        python_callable=lambda: print("Hello, World!"),
    )
Lastly, you can define a function, where you specify all the operators within that
function; then you can use a decorator to turn that function into a DAG generator.
Here is how that looks:
from airflow.decorators import dag

@dag(
    dag_id="hello_world",
    description="A simple DAG that prints Hello, World!",
    start_date=datetime.datetime(2024, 2, 5),
    schedule="@weekly",
)
def some_random_function():
    EmptyOperator(task_id="hello_world_task")
    PythonOperator(
        task_id="print_current_date",
        python_callable=lambda: print(datetime.datetime.now()),
    )

some_random_function()
Whenever you run a DAG, Apache Airflow will create a new instance of a DAG
and run that instance. These are called DAG runs. Similarly, new instances of the
tasks defined within DAGs will also be created during runtime. These are called
task instances.
Tasks
DAGs require tasks to run. Tasks are at the heart of any given DAG. They can be
seen as a unit of work. During a run, you will be creating a new instance of a
task, namely, task instance. These task instances have a begin date, an end date,
and a status. Tasks have states, representing the stage of the lifecycle. Here is a
sample lifecycle:
None ➤ Scheduled ➤ Queued ➤ Running ➤ Success
There are three kinds of tasks, namely, operators, sensors, and task flows. As
discussed in previous sections, tasks are interconnected using relationships. In
this context, the dependencies between tasks are represented as relationships.
Here is an example of how relationships are represented in Apache Airflow. Let’s
assume we have three tasks, namely, extract, transform, and load. The output of
an extract task serves as the input of a transform task, and the output of a
transform task serves as the input of a load task. Here is how these tasks can be
represented in relationships:
extract_data >> transform_data >> load_data
Task1 >> Task2 >> Task3
Operators
Once the script is placed in the DAGs folder, the scheduler will pick up the DAG,
and you can see it listed in the user interface. Here is how that looks:
Figure 12-9 List of DAGs active in Apache Airflow
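Operators are prebuilt templates for common kinds of tasks. As a brief sketch, a shell command can be wrapped in a BashOperator in just a few lines; the DAG id and task id below are illustrative.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="operator_example",
    start_date=datetime(2024, 2, 5),
    schedule="@daily",
    catchup=False,
):
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command='echo "Hello from a BashOperator"',
    )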
Sensors
SQLSensor: The “SQLSensor” executes a SQL query and waits till the
query returns the results.
Let us look at a simple example of a sensor. We have a file sensor that continues
to wait till a file arrives in a folder/file system for a specified amount of time.
Whenever the file arrives into the respective folder location, the DAG would run
the downstream task, which echoes that the file has arrived. In this context, it is
assumed that the file will be placed in the respective folder by another process:
from datetime import datetime
from airflow.decorators import dag
from airflow.sensors.filesystem import FileSensor
from airflow.operators.bash import BashOperator

@dag(
    dag_id="file_sensor_example",
    start_date=datetime(2024, 2, 3),
)
def file_sensor_example():
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/Users/<your-location-here>/airflow/testfile.txt",
        mode="poke",
        timeout=300,
        poke_interval=25,
    )
    file_notification = BashOperator(
        task_id="file_notification_task",
        bash_command='echo "file is available here"',
    )
    # Run the notification task only after the sensor detects the file.
    wait_for_file >> file_notification

file_sensor_example()
Task Flow
As you have seen in previous sections, you will have to start planning tasks and
DAGs as you develop your data pipeline. If you have data pipelines that are
mostly written in pure Python code without using Airflow operators, then task
flow decorators will help create DAGs without having to write extra operator
code:
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 2, 2))
def taskflow_example():
    @task(task_id="extract", retries=5)
    def extract():
        """
        Extract data from the data source.
        If there is a connection issue, attempt to
        establish the connection again.
        """
        dataset = []  # placeholder for the extracted data
        return dataset

    @task(task_id="transform")
    def transform(dataset):
        """
        Perform the necessary transformation on
        the extracted dataset.
        """
        processed_dataset = dataset  # placeholder transformation
        return processed_dataset

    @task(task_id="loaddata", retries=5)
    def load(processed_dataset):
        """
        Obtain the dataset and transport it to the file system.
        Retry the task if there is an issue with the connection.
        """
        pass

    extract_data = extract()
    transform_data = transform(extract_data)
    load(transform_data)

taskflow_example()
Xcom
Tasks, by default, perform or execute the instructions they have been provided in
isolation and do not communicate with other tasks or processes. XCom, short
for cross-communication, is an Airflow-specific concept that enables tasks to
exchange small pieces of data. With XCom, you can obtain information or a
value from one task instance
and utilize that in another task instance. While this can also be programmatically
accomplished by exporting the value to a file or a database and then reading the
file or database to obtain the value, Xcom functionality is native and can be
considered as more secure.
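Here is a minimal sketch of XCom with the TaskFlow API; the DAG id and values are illustrative. When a TaskFlow task returns a value, Airflow stores it as an XCom behind the scenes and passes it to the downstream task as a regular argument.
from datetime import datetime
from airflow.decorators import dag, task

@dag(dag_id="xcom_example", start_date=datetime(2024, 2, 2), schedule=None, catchup=False)
def xcom_example():
    @task
    def produce_value():
        return 42  # stored as an XCom

    @task
    def consume_value(value):
        print(f"Received {value} from the upstream task")  # pulled from XCom

    consume_value(produce_value())

xcom_example()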
Hooks
Apache Airflow lets you connect to external data platforms and systems with
ease through the concept of hooks. Hooks in Airflow let you connect to
external systems without having to write the connection code yourself. Hooks
also have optional retry logic embedded in them. Airflow offers many built-in
hooks, and there is also a community of developers building custom hooks
for various applications. Let us look at a simple illustration.
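Here is a minimal sketch along those lines; it assumes the apache-airflow-providers-postgres package is installed, that a Postgres connection (the default postgres_default id is used here) has already been configured in Airflow, and that the table name is illustrative.
from airflow.providers.postgres.hooks.postgres import PostgresHook

def fetch_recent_orders():
    # The host, credentials, and database come from the Airflow connection,
    # not from the code itself.
    hook = PostgresHook(postgres_conn_id="postgres_default")
    rows = hook.get_records(
        "SELECT * FROM orders ORDER BY created_at DESC LIMIT 10"
    )
    return rows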
Notice the use of the Postgres database hook while connecting and retrieving
rows; the database connection parameters are not required to be specified here.
Note
Prior to using hooks in Airflow, you must set up the credentials of the external
data system in Airflow separately, as an Airflow connection.
Once you have the external data system configured, you can then use its
connection id in the hook.
Variables
Apache Airflow variables are key–value pairs that can be defined and used within
an Airflow DAG. These variables are scoped globally and can be accessed from
any of the tasks. The tasks within a DAG can query these variables. You can also
use these variables within your templates. A good application is storing
database or access credentials that remain static during the course of a
project. When Airflow variables are used to store API secrets, keys,
passwords, and other sensitive information, Airflow automatically masks the
values of variables whose names indicate sensitive content (for example,
names containing “secret” or “password”). In addition, Airflow can encrypt
variable values at rest in the metadata database when a Fernet key is
configured. It is therefore preferable to store sensitive information in
variables rather than in options that lack masking and encryption features.
You can create Apache Airflow variables programmatically within the pipeline,
through the command line interface, or even through the user interface (web
server). Here is an example of setting up Airflow variables in the command line
interface:
airflow variables set server_ip 192.0.0.1
airflow variables set -j aws_creds '{"AWS_SECRET": "your_secret_here"}'
You can also set up or bulk upload JSON from the Apache Airflow web server.
To get started, navigate to the Admin menu in the web server and click Variables:
Figure 12-10 The Variables page under the Admin menu of the Airflow web server
You can click the “+” button to create a new variable. It will render a new screen
where you can enter the key, value, and description.
Figure 12-11
Params
Params are short for parameters. Similar to variables, params are arguments you
can pass to Airflow DAG runs or task instances during runtime. The difference is
that values passed to params are not encrypted and they should not be used for
storing sensitive information. You can pass params at the DAG level or at the
individual task level. Task-level params take precedence over DAG-level params.
Params are more localized when compared with variables in Apache Airflow. If
you want to pass different values for a task at each run, params would be a good
choice.
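As a brief sketch, a DAG-level param with a default value can be read inside a task at runtime; the param name and value here are illustrative.
from datetime import datetime
from airflow.decorators import dag, task
from airflow.operators.python import get_current_context

@dag(
    dag_id="params_example",
    start_date=datetime(2024, 2, 2),
    schedule=None,
    catchup=False,
    params={"region": "us-east-1"},  # DAG-level default; can be overridden per run
)
def params_example():
    @task
    def show_region():
        # Read the resolved param from the runtime context.
        context = get_current_context()
        print(context["params"]["region"])

    show_region()

params_example()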
Templates
Apache Airflow supports Jinja templating: templated operator arguments are
rendered at runtime with access to runtime variables (such as the logical date)
and helper functions.
Macros
Macros are exposed to templates through the macros namespace, which makes
selected Python modules and helper functions available at render time. In the
example below, a method from the Python standard library is called through
the macros namespace; if you try to call it without the macros namespace, the
command won’t return the datetime.
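As a brief sketch, a templated Bash command can use the macros namespace like this; the task id is illustrative.
from airflow.operators.bash import BashOperator

# The macros namespace exposes datetime helpers to Jinja templates at render time.
print_time = BashOperator(
    task_id="print_render_time",
    bash_command='echo "rendered at: {{ macros.datetime.now() }}"',
)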
You can control the execution of a DAG through two different concepts:
branching and triggering. In branching, you make use of operators to define
conditional execution. In this section we will look at two commonly used
operators, the BranchPythonOperator and the ShortCircuitOperator.
The older syntax creates an instance of the respective operator class, whereas
the newer syntax makes use of a decorator that handles the instantiation and
operation of the respective operator.
With task decorators, a condition is checked on the value a given task
generates, and based on the evaluation, a task id is returned, which
determines which downstream task runs. The “@task.branch” decorator is the
decorator version of the “BranchPythonOperator”.
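Here is a minimal sketch of branching with the decorator syntax; the DAG id, task ids, and the random condition are illustrative.
import random
from datetime import datetime
from airflow.decorators import dag, task
from airflow.operators.empty import EmptyOperator

@dag(dag_id="branching_example", start_date=datetime(2024, 2, 2), schedule=None, catchup=False)
def branching_example():
    @task.branch
    def choose_path():
        # Return the task_id of the branch that should run downstream.
        return "path_a" if random.random() > 0.5 else "path_b"

    path_a = EmptyOperator(task_id="path_a")
    path_b = EmptyOperator(task_id="path_b")

    choose_path() >> [path_a, path_b]

branching_example()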
The other operator that can be used in controlling the workflow is called the
ShortCircuitOperator. The ShortCircuitOperator, if it satisfies a given condition,
will skip all the downstream tasks. Let us look at an example:
from airflow.decorators import task, dag
from airflow.models.baseoperator import chain
from datetime import datetime
from random import getrandbits
@dag(
start_date=datetime(2024, 2, 2),
schedule='@daily',
catchup=False,
)
def sc1():
@task
def begin():
return bool(getrandbits(1))
@task.short_circuit
def decision_loop(value):
return value
@task
def success():
print('Not skipping the execution')
chain(decision_loop(begin()), success())
sc1()
In this example, we have a simple function that returns a random true or false,
which is provided to the decision loop function. If the random function returns
true, then the downstream task success() gets executed; if the random function
does not return true, then the downstream task is skipped. And so, every time this
DAG is run, you would obtain different results.
Triggers
As we discussed in earlier sections, a given task will execute only after all its
preceding tasks have been successfully completed. We have seen ways in which
we can control execution based on conditions. Let us imagine a scenario where
we have “n” number of tasks linearly scheduled with one beginning to execute
after its preceding task successfully completed.
Let us say we have a requirement where if a task fails, all the other succeeding
tasks are ignored and only the conclusion task must be run. You cannot short-
circuit the remaining workflow as that would enable skipping all the tasks
including the conclusion task. This is where the concept of triggers is beneficial.
You can configure triggers within the task itself, by adding the “trigger_rule”
argument to the task. Let us look at an example:
from airflow.decorators import dag, task
from airflow.utils.trigger_rule import TriggerRule
from datetime import datetime
import random
@dag(
start_date=datetime(2024,2,2),
dag_id="trigger_illustration",
schedule="@daily"
)
def dag_trigger():
@task(task_id="begin")
def begin():
print("Task execution begins")
@task(task_id="random_task")
def random_task():
if random.choice([True,False]):
raise ValueError("Task 2nd in queue unsuccessful")
else:
print("Task 2nd in queue executed")
@task(
task_id="task3",
trigger_rule=TriggerRule.NONE_FAILED_MIN_ONE_SUCCESS
)
def task3():
print("Task 3rd in queue executed")
@task(
task_id="task4",
trigger_rule=TriggerRule.NONE_FAILED_MIN_ONE_SUCCESS
)
def task4():
print("Task 4th in queue executed")
@task(
task_id="conclusion_task",
trigger_rule=TriggerRule.ONE_FAILED
)
def conclusion_task():
print("One of the tasks unsuccessful")
#begin() >> random_task()
#random_task() >> [task3(), task4()] >> conclusion_task()
begin() >> random_task() >> [task3(), task4()] >> conclusion_task()
dag_trigger()
In the preceding example, we have five tasks, and one of them (random_task)
fails at random. When it fails, the trigger rules cause the downstream tasks
task3 and task4 to be skipped, and the conclusion task, which is configured to
fire on a failure, runs instead. By specifying appropriate trigger rules, we are
able to skip the downstream tasks and still execute the conclusion task in the
event of a failure.
Conclusion
As we explored throughout this chapter, Apache Airflow is a powerful workflow
orchestration tool and a good solution for orchestrating data engineering and
machine learning jobs. Apache Airflow has been around for close to ten years,
steadily adding advanced features and customizations. Major cloud vendors
offer Apache Airflow as a managed service, reflecting its wide adoption in the
industry. We looked at setting up
Airflow, the architecture of the Airflow system, components, and how they all
work together. We then looked at DAGs, hooks, operators, variables, macros,
params, triggers, and other concepts and how you can leverage them in
orchestrating a data pipeline. In the next chapter, we will look at another
workflow orchestration tool that is more robust and enables you to work without
a DAG. The choice of workflow orchestration tool depends on the project
requirements, current infrastructure, and team.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2024
P. K. Narayanan, Data Engineering for Machine Learning Pipelines
https://doi.org/10.1007/979-8-8688-0602-5_13
Introduction
Earlier we looked at Apache Airflow, the most widely adopted workflow
orchestration solution, which is well supported by its community. In this
chapter, we will look at another workflow orchestration solution called
Prefect. We will look at its architecture and how you can create a workflow
without DAGs; you can still define DAGs in Prefect, though. The features,
components, and typical workflow of Prefect appear different from what we
have seen with Apache Airflow. As for which tool is the better choice, it
depends on many factors like the team, requirements, and infrastructure,
among other things.
By the end of this chapter, you will learn
– Fundamentals of Prefect workflow orchestration and key concepts
– Setup, installation, and configuration of a Prefect development
environment
– Core concepts of Prefect like flows, tasks, and results
– Artifacts, variables, and persisting results
– States, blocks, and state change hooks
– Task runners
Introduction to Prefect
Prefect is a second-generation, open source workflow orchestration
platform, designed to help automate, orchestrate, execute, and manage
various Python-based tasks and processes. Prefect can generate dynamic
workflows at runtime. The learning curve for Prefect is shallow, meaning it
takes relatively little time to become productive with it. Using Prefect, one
can turn Python functions into units of work that can be orchestrated,
monitored, and observed. Some of the highlighting features that Prefect
offers are
Scheduling: Scheduling a job or a task is usually done through cron
scheduling or using a dedicated orchestration service. With Prefect, one can
create, modify, and view the schedules in a minimalistic web UI. Prefect
offers cron scheduling, interval scheduling, and recurrence rule scheduling
for more complex event scheduling.
Retries: Data engineers often write code to retry a certain operation in
case of a failure. Often, this is used in cases where one needs to connect to a
database, as part of a data integration/engineering pipeline that runs
periodically. With Prefect, one can simply use a Python decorator to specify
the number of retries, eliminating the need for several lines of code and
their associated maintainability.
Logging: Logging is a process of registering or recording information
when an event happens during an execution of a data pipeline. Logging can
include high-level information, like registering the status of connecting to a
database, and lower-level information, like capturing the value of a variable
at a given point in time during execution. Prefect automatically logs events
for various operations without requiring any additional configuration.
Caching: Caching is a method of storing copies of data in specific
locations, often referred to as cache. The idea is to efficiently retrieve such
items to reuse the same for other jobs. With Prefect, one can cache the
results of a task and persist the same to a specified location.
Async: Async is short for asynchronous and is a method of executing
tasks, outside the main flow. With Prefect, one can run tasks
asynchronously, meaning that Prefect can start another task while waiting
for another task to finish, without stopping the execution of the program.
Notifications: Notifications are messages that are sent by the data
pipelines about an event or an update before, during, or after the execution
of a data pipeline. Prefect Cloud enables notifications about the data
pipeline by providing various alerts by various communication mechanisms
like emails, text messages, and platforms like Slack, MS Teams, etc.
Observability: Prefect Cloud offers a dashboard-like user interface,
where users can observe and monitor various tasks and flows that are
scheduled, currently running, or finished.
In a traditional data stack, enabling such features requires writing a
considerable amount of code and, in some cases, using another tool or piece
of software. The fact that Prefect offers these features out of the box is
considered appealing when compared with similar orchestration platforms.
Prefect comes in two different editions, namely, Prefect Core and
Prefect Cloud. Prefect Cloud is a commercial offering of Prefect
Technologies, Inc., that comes with a dashboard-like user interface where
one can observe and monitor the workflows along with few other features
like role-based access control and integrations with other tools and
technologies, to name a few.
Prefect Core is an open source framework that provides the core
functionality. One can run Prefect Core within their own infrastructure.
Prefect Core is a Python package that provides the core components of
Prefect. One can obtain Prefect from the Python package installer.
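Mirroring the earlier Airflow setup, you would first create a dedicated virtual environment; the environment name prefect matches the activation path shown next:
python -m venv prefect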
source prefect/bin/activate
If you are using MS Windows operating system, then please use the
following command:
prefect/Scripts/activate
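With the environment active, Prefect can be installed from PyPI:
pip install -U prefect
You can then verify the installation: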
prefect version
Prefect Server
Prefect Server is an open source backend that helps monitor and execute the
Prefect flows. Prefect Server is part of the Prefect Core package. Prefect
Server offers the following options in its user interface:
Flow runs
Flows
Deployments
Work pools
Blocks
Variables
Notifications
Task run concurrency
Artifacts
Here is how you can set up a local server:
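With Prefect 2.x, the local server is typically started with the following command:
prefect server start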
You would get the following screen if the web server successfully
started:
Navigate to a browser and enter this URL to access the user interface:
http://127.0.0.1:4200
Figure 13-3 Prefect user interface in the web browser
Prefect Development
Flows
Flows are the most basic unit of Prefect development. A flow is a container
for workflow logic and allows users to manage and understand the status of
their workflows. It is represented in Python as a single function. To convert
a Python function into a Prefect flow, one needs to add the following
decorator to the function:
@flow
Flow Runs
Flow runs represent a single instance of the flow execution. One can create
a flow run, simply by running the Python script that contains the flow. A
flow run can also be initiated by setting up deployment on Prefect Cloud or
self-hosted Prefect Server. If a flow run consists of tasks or subflows, then
Prefect will track the relationship of each child flow run to the parent flow
run.
Here is an example of a basic Prefect flow:
from prefect import flow
@flow()
def hello_world():
print("Hello World of Prefect!")
hello_world()
@flow(name = "MyFirstFlow",
description = "This flow prints Hello World",
retries = 3,
retry_delay_seconds = 10,
flow_run_name = "Hello World Flow",
timeout_seconds = 300
)
def hello_world():
print("Hello World of Prefect!")
hello_world()
As you can see, Prefect has executed an instance of the flow, and this flow
run has been given a name. You can also see that the flow finished in the
completed state. Let us look at the user interface:
Recall that we had assigned a name for both the flow and the instance of the
flow as “MyFirstFlow” and “Hello World Flow,” respectively. You can
obtain the same log trace and few other details when you click the flow run
name. Here is how that looks:
Tasks
A task is a function that represents a single discrete unit of work in Prefect.
It is entirely possible to create a workflow, solely with flows, where the
process is encapsulated as a Python function with a flow decorator.
However, tasks enable breaking down the workflow logic in smaller
packages that may be reused in various other flows and subflows. Tasks are
similar to functions, where they take inputs, perform functions, and return
an output. Tasks can be defined within the same Python file, where the flow
is also defined. Tasks can also be defined within a module and can be
imported for use in flow definitions. It is essential that all tasks be called
from within a flow.
Tasks receive metadata about upstream dependencies, including the state
of those dependencies prior to execution, even when they receive no data
inputs from them. This enables developers to make a task wait for the
completion of another task. Usage of tasks is optional in Prefect. One
can write all their code in a giant function and use a @flow decorator for
that function. Prefect can execute that function. However, organizing the
code into as many smaller tasks as possible may help in gaining visibility
into the runtime state and also in debugging the code. In the absence of
tasks, if a job runs in one flow and any line of code fails, then the entire
flow will fail. If tasks are used to break the flow into several tasks, then it
can be debugged relatively easily.
Each Prefect workflow must contain one primary, entry point @flow
function. From that flow function, a number of tasks, subflows, and other
Python functions can be called. Tasks, by default, cannot be called in other
tasks; special syntax must be employed while calling tasks from other tasks:
from prefect import flow, task

@task
def task1():
print("I am task1")
@task
def task2():
task1.fn()
print("I am task2")
@flow
def flow1():
task1()
task2()
flow1()
In the preceding example, we have seen two tasks that are incorporated
within a flow. While Prefect does not allow triggering tasks from other
tasks, there is an option to call the function of the task directly by using the
“fn()” method. You can see the function “task1” has been called within the
“task2()” function. It will only be a one-time execution of the function, and
you cannot utilize retries among other Prefect functionalities when calling
this function from a task.
Let us look at the output of this program.
When you run this code from the terminal, you may receive a log trace
similar to the following:
Figure 13-7 Observing the output of a Prefect flow in the command line
This Prefect flow has two tasks, namely, task1 and task2, where task2 calls
the function of task1 directly. And so, you will see the task1() execution and
the task2() execution, which includes the print function of task1.
Tasks accept parameters for further customization, such as a name, a
description, tags, the number of retries, and a retry delay. Here is an example
of tasks with optional parameters included:
from prefect import flow, task

@task(
    name="Task1",
    description="This is task1; it prints a string",
    tags=["test"],
    retries=2,
    retry_delay_seconds=10,
)
def task1():
    print("I am task1")

@task(
    name="Task2",
    description="This is task2; it prints a string and calls the task1 function directly",
    tags=["test"],
    retries=2,
    retry_delay_seconds=10,
)
def task2():
    task1.fn()
    print("I am task2")

@flow(
    name="MainFlow",
    description="Main flow with tasks included",
    retries=1,
)
def flow1():
    task1()
    task2()

flow1()
In addition to providing the name and description of the task, you can
also specify the number of times to retry the specific task upon the
unsuccessful execution of the task and the duration to wait before retrying
the task. There is also a Boolean parameter specifying whether to persist the
task run results to storage.
Results
Results are quite straightforward. They represent data returned by a flow or
a task. As mentioned previously, tasks or flows are basically Python
functions that may return a value.
Here is a simple example of how results are returned:
from prefect import flow, task
@task
def my_task():
return 1
@flow
def my_flow():
task_result = my_task()
return task_result + 1
result = my_flow()
assert result == 2
In the preceding example, you can see the output of a task (function
with the task decorator) is saved to a variable. And the output of a flow is
also saved to a variable. While this program is able to save the results of
tasks and flows to variables, the program cannot persist them.
Persisting Results
By default, Prefect does not persist results unless a feature requires it; this
helps Prefect reduce the overhead of reading from and writing to result
storage. When results are persisted, they are written to a storage location,
and Prefect stores a reference to the result.
The following features are dependent on results being persisted:
– Task cache keys
– Flow run retries
– Disabling in-memory caching
To persist the results in Prefect, you need to enable the result persistence
parameter, specify the location of storage, and also have the option to
serialize the output either to JSON or using pickle serializer. Depending
upon the parameters used in the program, results can either be persisted to a
given task or all tasks or to a flow. Consider the following example:
from prefect import flow, task

@task
def task1(a: int, b: int) -> int:
    c = a + b
    return c

@flow(retries=2)
def my_flow():
    ans = task1(2, 3)
    print(ans)

my_flow()
In this example, the flow has the retries parameter enabled, which requires
the results of all tasks to be persisted. However, flow retries do not require
the flow's own result to be persisted. Let us try to run this program:
Figure 13-8 Observing the output persistence of the Prefect flow in the command line
We can observe the program executed successfully. When you examine the
results in the Prefect server, you may be able to see the results persisted,
whereas the flow didn’t. Here is how that may look:
Figure 13-9 Results persist in storage
One can also manually toggle the result persistence. Toggling persistence
will override any action that Prefect may arrive on its own. Here is an
example of that:
from prefect import flow, task

@task(persist_result=False)
def my_task(a: int, b: int) -> int:
c = a * b
return c
@flow(persist_result=True)
def my_flow() -> int:
result = my_task(5,6)
return result
my_flow()
In the preceding example, persist_result has been set to False for the task,
which means the task result will never be persisted; any feature that depends
on persistence of that result may not work and may raise an error. For the
flow, however, persist_result has been set to True, so the flow will persist its
result even when no feature requires it.
Let us run this code:
The code ran without errors. Recall that we have asked the task not to
persist the result and the flow to persist the result. Let us look at the flow
run output in the Prefect server:
Figure 13-11 Flow run output in the Prefect server
Here, we see that while the task did not persist the result, the flow stored the
result in a local storage.
Upon further examination, here is the serialized output of that file:
{
"serializer": {
"type": "pickle",
"picklelib": "cloudpickle",
"picklelib_version": "2.2.1"
},
"data": "gAVLHi4=\n",
"prefect_version": "2.14.21"
}
You can choose to specify a specific location where you would like the
results to be stored.
Artifacts in Prefect
Artifacts, in Prefect, refer to persisted outputs like links, reports, tables, or
files. They are stored on Prefect Cloud or Prefect Server instances and are
rendered in the user interface. Artifacts enable easy tracking and monitoring
of objects that flows produce and update over time. Artifacts that are
published may be associated with the task run, task name, or flow run.
Artifacts provide the ability to display tables, markdown reports, and links
to various external data.
Artifacts can help manage and share information with the team or with
the organization, providing helpful insights and context. Some of the
common use cases of artifacts are:
Documentation: Documentation is the most common use of Prefect
artifacts. One can publish various reports and documentation about specific
processes, which may include sample data to share information with your
team and help keep track of one’s work. One can also track artifacts over
time, to help see the progress and updates on the data. To enable that, the
artifacts must share the same key.
Data quality reports: Artifacts can help publish data quality checks
during a task. In the case of training a large machine learning model, one
may leverage using artifacts to publish performance graphs. By specifying
the same key for multiple artifacts, one can also track something very
specific, like irregularities on the data pipeline.
Debugging: Using artifacts, one can publish various information about
when and where the results were written. One can also add external or
internal links to storage locations.
Link Artifacts
Link artifacts can be created by calling the create_link_artifact() function.
Each call to create_link_artifact() produces a distinct artifact and accepts
parameters such as a key, the link itself, optional link text, and a description.
Here is a simple example of an artifact containing a link:
from prefect import flow
from prefect.artifacts import create_link_artifact

@flow
def my_flow():
create_link_artifact(
key="artifcatkey1",
link="https://www.google.com/",
link_text="Google",
description="Here is a search engine"
)
if __name__ == "__main__":
my_flow()
When we run this code, it will generate an artifact for us. Let us try to
locate the same. In your Prefect server, click Flow Runs and navigate to the
Artifacts column:
When you click the artifact, here is how that may render the link artifact:
Figure 13-13 Rendering the link artifact
Markdown Artifacts
Markdown is a lightweight markup language that can be used to add
formatting elements to plain text documents. The markdown syntax needs
to be added to various texts to indicate which words or phrases look
different and how. To create a markdown artifact, the
create_markdown_artifact() function needs to be used:
@task(name="Manufacturing Report")
def markdown_task():
markdown_report = f"""
# Manufacturing Report
## Summary
| Items | Quantity |
|:--------------|-------:|
| Portable PC | 60000 |
| Tablet | 70000 |
| Phone | 80000 |
## Conclusion
"""
create_markdown_artifact(
key="manufacturing-report-1a",
markdown=markdown_report,
description="Quarterly Manufacturing
Report",
)
@flow()
def my_flow():
markdown_task()
if __name__ == "__main__":
my_flow()
Upon running the Python code, we may be able to access the markdown
artifact in the same location. Here is how the artifact looks like:
Figure 13-14 Output of the markdown artifact
Table Artifacts
One can create a table artifact by using the function create_table_artifact().
Here is an example:
from prefect import flow
from prefect.artifacts import create_table_artifact

@flow
def my_fn():
highest_churn_possibility = [
{'customer_id':'12345', 'name': 'John Doe',
'churn_probability': 0.85 },
{'customer_id':'56789', 'name': 'Jane Doe',
'churn_probability': 0.65 }
]
create_table_artifact(
key="table-artifact-001",
table=highest_churn_possibility,
description= "## please reach out to these
customers today!"
)
if __name__ == "__main__":
my_fn()
Upon running this program, we can locate the artifact at a similar location within the Prefect server. Here is how that may look:
Furthermore, one can list all artifacts from the command line by entering “prefect artifact ls”:
Figure 13-16 Listing all artifacts in the terminal
In addition, you can inspect the contents of these artifacts using the artifact key in the terminal. The command is "prefect artifact inspect <artifact-key>".
States in Prefect
A state is the status or condition of an object at a given point in time in the system. In Prefect, states contain information about the status of a particular task run or flow run. A reference to the state of a flow or a task therefore means the state of a flow run or task run; for instance, saying a flow is successful, completed, or running refers to the state of its flow run. Prefect states describe where a run currently is in its execution lifecycle. There are terminal and non-terminal states in Prefect.
Here are the states and their descriptions:
Table 13-1 Table of various states and their descriptions
In Prefect, tasks and flows can return data objects, Prefect state objects, or Prefect futures (which combine both data and state). Let us look at an illustration of returning a data object:
from prefect import flow, task

@task
def add_one(x):
    return x + 1

@flow
def my_flow():
    result = add_one(1)  # returns an int (a data object)
A state object is a Prefect object that indicates the status of a given flow run or task run. You can return a Prefect state by setting the “return_state” parameter to True. Here is how that may look:
from prefect import flow, task

@task
def square(x):
    return x * x

@flow
def my_flow():
    state = square(5, return_state=True)
    result = state.result()
    return result

my_flow()
A Prefect future is a Prefect object that contains both data and state. You can obtain a Prefect future by calling the task with “.submit()”:
from prefect import flow, task

@task
def square(x):
    return x * x

@flow
def my_flow():
    future = square.submit(5)
    data = future.result()
    pstate = future.wait()
    return data, pstate

print(my_flow())
Prefect also supports state change hooks: functions that run when a task or flow transitions into a given state, such as completion, failure, cancellation, or crash. Here is an example; the hook functions are simple illustrative stubs (task hooks receive the task, task run, and state, while flow hooks receive the flow, flow run, and state):
from prefect import flow, task

def success_hook(task, task_run, state): print("task succeeded")
def failure_hook(task, task_run, state): print("task failed")
def conclusive_hook(*args, **kwargs): print("run reached a terminal state")
def flow_completion_hook(flow, flow_run, state): print("flow completed")
def flow_cancellation_hook(flow, flow_run, state): print("flow cancelled")
def flow_crashed_hook(flow, flow_run, state): print("flow crashed")

@task(
    on_completion=[success_hook, conclusive_hook],
    on_failure=[failure_hook, conclusive_hook]
)
def my_task():
    print("this task is successful")

@flow(
    on_completion=[flow_completion_hook],
    on_cancellation=[flow_cancellation_hook, conclusive_hook],
    on_crashed=[flow_crashed_hook]
)
def my_flow():
    my_task()

my_flow()
You can see from the preceding code that, for a given state transition, more than one state change hook can be specified as a list. You can also attach the same state hook to multiple tasks and flows for the same state transition. This way, when you have several tasks and flows, you can easily control the workflow with respect to state changes in tasks and flows.
Blocks
Blocks are one of the important aspects of Prefect. Blocks help store
configuration and authentication information and provide an interface to
interact with other systems.
With Prefect blocks, one can securely store authentication and configuration information for various source systems, downstream systems, communication channels, and other systems, and pull values from environment variables. Using blocks, one can query data from a database, upload or
download data from a cloud data store like AWS S3, send a message to
Microsoft Teams channels, and access source code management tools like
GitHub. Blocks are available both in Prefect Server and Prefect Cloud.
It is important to note that blocks are used for configurations that do not
change or modify during runtime. What is defined in blocks may not be
altered during the runtime of flows. Blocks can be created and existing
blocks can be modified using Prefect UI.
Here is how it appears on the Prefect UI:
Figure 13-18 Blocks page in the Prefect server
Prefect provides blocks for so many cloud data stores and other external
systems. It is also possible to create a custom block in Prefect. Prefect’s custom blocks are built on top of the BaseModel class from the Pydantic library. While all blocks are encrypted before being stored, one can also obfuscate fields using Pydantic’s SecretStr field type.
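As a minimal sketch of a custom block (the class and field names here are hypothetical), one might subclass Block and use SecretStr for sensitive fields:
from prefect.blocks.core import Block
from pydantic import SecretStr

class WarehouseCredentials(Block):
    """Hypothetical custom block holding warehouse connection details."""
    host: str
    user: str
    password: SecretStr  # obfuscated in the UI and stored encrypted

# Save an instance so flows can reuse it, then load it by name inside a flow:
# WarehouseCredentials(host="db.example.com", user="etl", password="...").save("warehouse-dev")
# creds = WarehouseCredentials.load("warehouse-dev")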
Let us now create a new block from the UI. On the Blocks page, click Add block; you will see a list of available blocks. Here is an example of creating an AWS S3 block. S3
is an object storage cloud service, offered by Amazon Web Services. To
create a new S3 block, simply type in “s3” in the search bar and you will
get the S3 block. Enter the required details in the block and save it. Here is
how that may look:
One can also register custom blocks with the “prefect block register” command (pointing it at the module or file that defines the block) so they become available in the Prefect UI.
Prefect Variables
In Prefect, variables are named and mutable string values. Variables can be
created or modified at any time and they may be cached for faster retrieval.
Variables can be used to store information of nonsensitive nature, such as
configuration information. It is important to note that Prefect variables are
not encrypted. To store sensitive information like credentials, one would use
blocks, as opposed to variables.
One can create, read, write, and delete variables via the Prefect user
interface, Prefect API, and command line interface. To access Prefect
variables using the command line interface, here’s what you need to do:
To list all the Prefect variables:
prefect variable ls
To read a variable from Python code (assuming a variable named prefect_var_1 has already been created), use the variables module:
from prefect import variables

variable_from_prefect = variables.get('prefect_var_1')
print(variable_from_prefect)
Variables can also be referenced in a deployment definition, for example in the pull step of a prefect.yaml file:
pull:
  - prefect.deployments.steps.git_clone:
      repository: https://github.com/PrefectHQ/hello-projects.git
      branch: "{{ prefect.variables.deployment_branch }}"
Task Runners
In Prefect, tasks can be run in different ways: concurrently, in parallel, or distributed across machines. Task runners are not required for executing tasks; a task simply executes as a Python function and produces that function’s return value. With default settings, when a task is called within a flow, Prefect executes the function sequentially.
A sequential execution is a method where functions or tasks are
executed one after the other, only after the previous task or function is
completed. This is the simplest model of task execution and it performs
tasks one at a time, in a specific order. Prefect provides a built-in task
runner, called the SequentialTaskRunner.
This SequentialTaskRunner works both with synchronous and
asynchronous functions in Python. Here is an example:
from prefect import flow, task
from prefect.task_runners import SequentialTaskRunner

@task
def task1():
    print("I am task1")

@task
def task2():
    task1.fn()  # call task1's underlying function directly
    print("I am task2")

@flow(task_runner=SequentialTaskRunner())
def flow1():
    task1()
    task2()

if __name__ == "__main__":
    flow1()
Prefect also provides a ConcurrentTaskRunner, which allows tasks submitted with .submit() to run concurrently. Here is an example:
import time

from prefect import flow, task
from prefect.task_runners import ConcurrentTaskRunner

@task
def stop_at_floor(floor):
    print(f"elevator moving to floor {floor}")
    time.sleep(floor)
    print(f"elevator stops on floor {floor}")

@flow(task_runner=ConcurrentTaskRunner())
def elevator():
    for floor in range(10, 0, -1):
        stop_at_floor.submit(floor)

if __name__ == "__main__":
    elevator()
For parallel and distributed execution, Prefect integrates with Dask through the prefect-dask package and its DaskTaskRunner. Here is an example that annotates a task with a Dask resource requirement:
import dask
from prefect import flow, task
from prefect_dask.task_runners import DaskTaskRunner

@task
def show(x):
    print(x)

@flow(task_runner=DaskTaskRunner())
def my_flow():
    with dask.annotate(resources={'GPU': 1}):
        # this task requires 1 GPU resource on a worker
        future = show.submit(0)

if __name__ == "__main__":
    my_flow()
Conclusion
In this chapter, we looked at Prefect, another open source workflow
management and orchestration tool. Prefect offers a flexible and convenient
approach to workflow orchestration that addresses many of the challenges
of an engineering team. Prefect is Python native, and the concepts of flows
and tasks and the idea of simply using a decorator to create a workflow are
convenient. So far, we looked at the components of the workflow
orchestrator, followed by setup and installation. We gained an appreciation
for flows and flow instances, tasks, and task instances. We further looked
into results, persisting results, and various artifacts. We also discussed
states, state change hooks, and variables. We looked at blocks for managing secrets and configuration; integration with HashiCorp Vault is also available. It may be worth noting that the choice of a workflow
orchestration solution may depend on several factors (current infrastructure,
team’s expertise, requirements of the project, etc.). Overall, you have
gained higher-level skills by learning Airflow and Prefect, enabling you to
build scalable and more efficient data pipelines.
© The Author(s), under exclusive license to APress Media, LLC, part of
Springer Nature 2024
P. K. Narayanan, Data Engineering for Machine Learning Pipelines
https://doi.org/10.1007/979-8-8688-0602-5_14
Introduction
The landscape of technology in general, and of data processing in particular, has transformed significantly in the past decade. The evolution of cloud computing alone has revolutionized how data pipelines are built in the cloud. This chapter will serve as an introduction or
preamble to engineering data pipelines using major cloud technologies. We
will focus on the early adopters of cloud computing, namely, Amazon,
Google, and Microsoft, and their cloud computing stack. In this chapter, we
will discuss how cloud computing is packaged and delivered, along with
some technologies and their underlying principles. Although many of these
are now automated at present, it is essential to have an understanding of
these concepts.
IP address
DNS
DNS stands for Domain Name System. It can be seen as an address book for the Internet. Instead of remembering a complex string of IP address digits and characters, you can use an easily remembered domain name to access a website. DNS is a database where IP addresses are mapped to domain names.
Ports
Ports are logical entities that serve as gates to a server. To reach a service on a server, you need not only the appropriate IP address but also the correct port. Port numbers range from 0 to 65535. Imagine an airport with 65,535 gates: to board your flight, you have to reach the correct airport and then the correct gate. For instance, port 1521 is used by the Oracle database, port 22 is used by SSH, etc.
Firewalls
A firewall is like a gated system where only certain IP addresses are given access. Think of it as an invite-only launch event that admits only the guests on its allowlist.
Virtualization
The most widely accepted system in the big data space is Hadoop. Hadoop
is an ecosystem of tools and technologies that supports processing and
delivering insights using big data. After its first release in the early 2000s, Hadoop was open-sourced and has been contributed to by people all over the world.
Hadoop has its roots in distributed computing. Hadoop enables the creation
of a distributed computing platform that runs on a large number of
consumer-grade machines. The Hadoop ecosystem consists of a cluster that
has a name node and one or more worker nodes, wherein the name node
would be responsible for assigning and coordinating tasks and the worker
nodes would carry out the tasks that are assigned to them.
The Hadoop ecosystem has its own file system called Hadoop Distributed
File System (HDFS). HDFS supports the MapReduce paradigm, which
enables parallel computation on big data. The Hadoop ecosystem also makes it quick and easy to write MapReduce jobs through Apache Pig and its scripting language, Pig Latin. Pig Latin is helpful for writing complex data transformations and for exploratory data analysis on big datasets. Furthermore, Hadoop also
supports Hive, a data warehousing solution on top of Hadoop’s file system.
The Hive tool provides a SQL-like interface for querying data stored in
Hadoop’s file system. Hive translates this SQL-like syntax into MapReduce
jobs and enables distributed computation on the data.
Spark
The biggest leap in the evolution of big data systems happened with
Apache Spark. Apache Spark is an analytics engine that supports big data
processing and possesses the ability to run in-memory computations. The
idea of performing in-memory computations enabled Apache Spark to be up to 100 times faster than on-disk MapReduce processing. Furthermore, Spark supports multiple programming languages like Java, Scala, Python, and R, offering extensive analytics capabilities that include SQL processing, real-time stream processing, machine learning pipeline development, and graph processing.
Spark played a significant role in shaping the evolution of modern data platforms such as the Delta Lake architecture and the data lakehouse. Delta Lake is an open source storage layer that runs on top of data lakes and is built on Spark; it provides capabilities like ACID transactions on data lakes and unified streaming and batch data processing.
The growth of big data, leading to greater need for big data systems and
their closely associated tools and technologies, certainly contributed to
increased adoption of cloud computing technologies. As organizations realized they could gather analytics from vast amounts of data to gain competitive intelligence, among other insights, migrating to the cloud became a relatively easy choice. Cloud computing systems, regardless of vendor, offer
compute, memory, and storage at scale. It is much quicker to provision and
set up a data processing cluster; even better, cloud computing vendors offer
to charge only for the time you consume their resources. This is called
serverless computing and you will see more of this in later sections and
chapters.
One of the biggest and clearest advantages is the speed at which basic infrastructure can be set up. In the traditional model, computers need to be procured and set up, storage, memory, and networking need to be configured, and suitable office space is required to house the physical infrastructure. All of this may take longer than anticipated. With cloud computing, it may take no more than an hour to provision compute, storage, memory, and other components related to the IT
infrastructure. This provides a significant advantage of time to market the
product for an organization.
Second, you can go global with cloud computing. Because you can easily scale your application, and because the cloud computing vendor can dynamically add computing resources to meet demand, you can access new markets quickly. As mentioned earlier, you no longer have to invest in dedicated physical servers or spend several weeks getting set up and running.
Public Cloud
Public cloud is the most common method of deploying cloud computing for a customer and is also considered the most economical option. In the public cloud, services are delivered over the Internet, so the customer just needs a secured workstation and sufficient Internet bandwidth to seamlessly access the services rendered by cloud computing vendors. It is also the easiest to set up: it may take no more than a few minutes to create an account and begin using cloud services.
Private Cloud
Private cloud is where an organizational customer has the exclusive rights
and control of the IT infrastructure. This is similar to running a traditional
IT infrastructure. You can have a third-party cloud vendor provide and
manage the exclusive IT infrastructure, or it can be the case that the
organization has deployed their own exclusive cloud that is managed by
them. You can still take advantage of recent technologies, automation, virtualization, etc. and reap the benefits of cloud computing. However, the
cost may be higher when compared with public cloud offerings.
Hybrid Cloud
As the name suggests, it is a mix of both private and public cloud concepts.
In hybrid cloud, some resources are hosted on-premises, while some are
hosted in the public cloud. This is often seen where organizations keep certain applications on-premises and move other applications to the cloud. Organizations can thus keep their data within their own premises while procuring and utilizing certain services in the cloud.
Community Cloud
Government Cloud
Government cloud is made for government bodies and entities; it is a sub-variant of community cloud, built to cater to specific compliance and regulatory requirements. This involves cloud computing vendors providing extra layers of security and policies that comply with the specific requirements government entities may have.
Multi-cloud
Let us review some concepts that directly or indirectly touch on designing data pipelines for data engineering and machine learning projects. You may have seen or read about the seven essential “-ilities” or the ten “-ilities” of software architecture. In this section, we will look at similar concepts to keep in mind when designing the data architecture for a data engineering or machine learning project. The volume, veracity, and variety of data have gradually grown over the past several years, and that is the motivation behind discussing the following concepts.
Scalability
Every data pipeline is meant to process and/or transport data. The amount
of computational resources consumed is directly proportional to the volume
of data that is processed. As the data pipeline processes more data, the entire pipeline comes under greater load. Scalability is the ability to handle the stress caused by increasing usage while ensuring your data pipeline performs its functions smoothly and consistently. You should have a rough idea of the maximum workload your pipeline can process and whether you may encounter memory leaks or stack overflows.
Scalability can be achieved by two approaches, namely, horizontal scaling
and vertical scaling. Horizontal scaling is adding more compute nodes to a
cluster that can handle an increase in the processing traffic. Vertical scaling
is increasing the capacity of a given compute node to meet the traffic.
Elasticity
High Availability
Fault Tolerance
Disaster Recovery
Caching
There are three major cloud computing vendors currently in the industry: Amazon, Google, and Microsoft. Amazon brought cloud computing into the spotlight back in 2006, when it started offering Amazon Web Services. Google later launched its cloud computing platform, Google Cloud Platform. In 2010, Microsoft launched its cloud computing platform, Azure. In addition to these three major cloud vendors, there are also
several other companies offering cloud computing services. Some of them
are Oracle, DigitalOcean, IBM, and a few regional providers, catering to
specific nations.
Figure 14-1
Platform as a Service
Platform as a Service is where the cloud computing vendor provides you all
the infrastructure and the environment to develop your application or work
with the application. Imagine being able to rent a multi-core processor with memory and storage configured, an operating system installed with all updates and patches applied, and the software development tools installed and managed, so you can jump right into developing or deploying your code. Platform as a Service is very helpful to
developers who are looking to develop something from the ground up
without having to worry about administrative efforts.
Software as a Service
Software as a Service is something you see all around you every day. You check your free email account provided by Google by entering its domain address, your user ID, and your password, or you use a social media provider like Instagram or Facebook, free of charge. Basically, these companies provide you with everything: hardware, operating system, memory, storage, and the software application (Gmail, Dropbox, etc.). As an end user, you access their service, consume it to your heart’s content, and, perhaps, return to it another time.
Compute
Object Storage
Object storage is a specialized type of storage that can be used to store any kind of file. Object storage manages data as objects, as opposed to file systems, where data is managed as files. Each piece of data is an object and comes with metadata and a unique identifier associated with it. Object storage is used in a variety of applications, including cold storage of data (where data is infrequently accessed) and data archives.
Examples of object storage are Red Hat Ceph, MinIO, Cloudflare R2,
Azure Blob Storage (if you are in the Microsoft ecosystem), etc.
Databases
NoSQL
In addition to relational databases, cloud computing vendors also provide
NoSQL database solutions. NoSQL stands for Not Only SQL, meaning that
these are database management systems that are designed to handle a
variety of data models and data structures. The idea here is NoSQL
database systems complement and enable efficient storage of data that may
not fit into the cookie-cutter relational data models. However, NoSQL can
never replace the traditional relational database systems, the data integrity
they impose using various normalization forms, and the effectiveness
relational databases deliver in transactional systems. NoSQL systems allow
data to be inserted without having to predefine a schema.
Note
Document Databases
Document databases are data systems that are designed for managing
semistructured data. A good example is JSON data, where data is captured
in JSON documents. These document databases have flexible data models
and they are not defined prior to loading. Even though the JSON
documents may have complex structure and nested structures within each
document, they will always be consistent—as long as they come from a
single source. And so, the document databases will accommodate the
incoming structure of these documents. A famous example of a document
database is MongoDB. It is a fast, lightweight, simple-to-use database that
comes with a querying language that can be used to treat the JSON
documents as relational data and write queries on top of them.
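For instance, here is a small sketch of querying JSON documents with MongoDB's Python driver, pymongo (the connection string, database, and collection names are placeholders):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["crm"]

# Find customers older than 30, similar in spirit to a SQL WHERE clause
for doc in db["customers"].find({"age": {"$gt": 30}}, {"name": 1, "age": 1}):
    print(doc)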
Column-Oriented Databases
Key–Value Stores
Key–value stores are types of NoSQL database systems where the data is stored as a collection of key–value pairs (very similar to dictionary data types in Python). The data model is simple and straightforward: keys are identifiers that can be used to retrieve the corresponding values. Hence, these systems are incredibly fast for retrieval, as lookups do not involve scanning tables and matching filters and conditions. Good examples are Amazon DynamoDB, Redis, Azure Cosmos DB, etc.
Graph Databases
Figure 14-2
Vector Databases
Vector databases are newer entrants among NoSQL database systems, designed to store, index, and quickly search vector data. Given a query vector, the vector database performs a fast search to retrieve the stored vectors most similar to it. This is also known as similarity search, which is very effective in image recognition and other artificial intelligence applications. Here, vectors are simply arrays of numbers; depending on the context, these arrays may carry certain meaning. Given a vector, one can specify the similarity measure used to obtain its nearest neighbors, such as Euclidean, Tanimoto, or Manhattan distance. All these vectors are indexed for faster similarity search and retrieval.
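As a plain-NumPy illustration of the idea (not a vector database, just the underlying nearest-neighbor search using Euclidean distance):
import numpy as np

# A small "index" of stored vectors and a query vector
stored_vectors = np.array([[0.1, 0.9], [0.8, 0.2], [0.4, 0.5]])
query = np.array([0.7, 0.3])

# Euclidean distance from the query to every stored vector
distances = np.linalg.norm(stored_vectors - query, axis=1)
nearest_index = int(np.argmin(distances))
print("nearest vector:", stored_vectors[nearest_index], "distance:", distances[nearest_index])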
Data Warehouses
Serverless Functions
Serverless means that you have the option of consuming a service and paying only for the resources that have been utilized. In a major cloud, serverless means one can run functions that interact with one or more components and set up tasks to run without having to provision the underlying compute instance. Amazon offers a serverless compute service called AWS Lambda,
Google Cloud Platform has Cloud Functions, and Microsoft has Azure
Functions—all three offer serverless compute services.
Containerization
Data Governance
Data Catalog
Compliance is the process of adhering to the rules and regulations that have
been set by an appropriate body. Data protection is the process of
identifying and protecting personally identifiable information that an
organization may currently have. For instance, in the healthcare industry,
there is HIPAA compliance that ensures that the personally identifiable
information is protected from fraud and theft. Amazon offers AWS Artifact
that generates audit and compliance reports that are helpful with various
regulatory agencies and policies. Microsoft offers Azure Policy, which
offers a drill-down report to assess compliance and enforce standards.
Google offers Security Command Center that helps monitor compliance
and other security standards.
Machine Learning
All the major cloud vendors offer a machine learning platform as a service
where you can build machine learning models, train these models, and be
able to connect with other cloud services for deployment or reporting the
results. Amazon offers AWS SageMaker, Microsoft offers Azure ML, and
Google offers Vertex AI. These services cover the complete lifecycle of machine learning models and offer the option to either program your own custom model or use low-code tooling to build one.
Conclusion
Cloud computing has revolutionized the technology landscape. The amount
of flexibility and cost-effective procurement of services has enabled
engineering teams to deliver more value. We have seen how the rise in big
data paved the way for various evolutions of tools and technologies. You
can now provision a Hadoop cluster in a relatively shorter amount of time,
compared with a decade ago. However, it is important to know how we got here so we can appreciate the sophistication of the technology and data systems we have today. Though there may be reasons an organization may or may not choose to go to the cloud, it remains evident that cloud computing vendors offer ease of use and the latest technologies, tools, and processes to gain insights from data.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2024
P. K. Narayanan, Data Engineering for Machine Learning Pipelines
https://doi.org/10.1007/979-8-8688-0602-5_15
Introduction
Cloud computing is a concept of obtaining the necessary IT infrastructure for
development and deployment of data and software services. By providing compute
power, memory, storage, and various software installed on top of physical machines,
an organization can get up to speed faster than before. This way an organization
eliminates the capital expenses for buying physical equipment as cloud computing
providers offer pay-as-you-go pricing models for their services.
In this chapter we will look at Amazon Web Services in detail. Amazon is one of
the early adopters of cloud computing and one of the first few companies to provide
IT infrastructure services to businesses, as early as 2006. Cloud computing was
referred to as web services. Today, Amazon offers more than 200 products in cloud
computing including but not limited to compute, storage, analytics, developer tools,
etc.
Let us look at the AWS service offerings in the context of developing data and
machine learning pipelines in detail. AWS groups the regions where it hosts its data centers into three partitions: standard regions, China regions, and US GovCloud regions. AWS suggests choosing a region that is geographically close to your physical presence to minimize latency.
As you embark on your journey exploring Amazon Web Services, please be mindful of managing your cloud resources, especially when you are done using them. With traditional infrastructure, you may simply switch off a computer when you are done. With cloud infrastructure, where you often procure a plan based on a pay-as-you-go model, you will be billed for your resources even if they are not being used. Signing out of AWS does not power off the services you have provisioned. This is particularly important for virtual machines, databases, and other compute-intensive resources. Always make it a habit to review your active resources and shut down the ones you do not need. If you are no longer planning to use a service, you may wish to delete it. This way, you pay only for the services you actively use and keep your costs under control. I would also like to thank AWS for providing free credits for new users and for actively supporting new learners in keeping their costs down.
By the end of the chapter, you will learn
– Fundamentals of Amazon Web Services and the AWS management console
– How to set up an AWS account and configure various parameters
– AWS’s object storage, database, and warehouse solutions
– Data processing and analytics tools
– How to build, train, and deploy machine learning models using AWS SageMaker
It may be possible to provision and access services directly from your root account,
but I recommend you create an admin account, to provision and access AWS services.
To accomplish that, let us utilize AWS Organizations service. AWS Organizations is
an account management service that enables one to manage multiple accounts. The
process is that, within your organization (the root account), you would set up
organizational units. Organizational units would basically be various departments
within an organization. This is followed by creating and assigning service control
policies to various accounts in various organizational units. For our illustration
purposes, we will create two organizational units, namely, “dev” and “prod.” We will
invite the team members for our organizational units.
Note To simulate team members, creating temporary email addresses and adding
them to your organizational units is an option.
Once you have completed the registration and have signed in, you would get the AWS
management console page. Here is how that looks:
Figure 15-2 Home page of the AWS console
Navigate your cursor to the search section, and enter “AWS Organizations” and click
“Create a new organization.” You will get the following screen:
Now that we have service control policies enabled, let us create a new policy for our
organization by clicking “Create policy”:
You will have the option of selecting one of many AWS services and choosing the appropriate privileges and resources for the policy. Once you have chosen the appropriate policy settings, choose “Save changes.” AWS will automatically create a JSON policy document based on what we have chosen here. Here is how policy creation would look:
Figure 15-7 Configuration of a new service control policy
Once these policies are created, it is time to attach a policy with an account. To
accomplish that, go to the main policies page, choose the appropriate policy, and click
“Actions” followed by “Attach Policy,” attaching the policy to the appropriate
organizational unit.
Here is how that looks:
Figure 15-8 Attaching the new service control policy to an AWS organizational unit
Once the respective policies have been attached, you can sign in to the user’s account
to see if the policies have been enforced. Now visit the management console and look
for “IAM Identity Center.” Please ensure that you have logged in with the account
used to create the AWS organization. AWS IAM Identity Center provides a single
sign-on (SSO) service within AWS. You need to enable AWS IAM Identity Center.
Here is how the page looks:
SSO is considered secure, is auditable as it provides log trails, and helps organizations meet regulatory compliance requirements. Well-known examples include protocols like SAML and OAuth and providers like Okta.
Once you have enabled the identity center, you need to create permission sets.
Permission sets are templates for creating policies of identity and access management.
Permission sets enable granting access to various services for groups and users within
the organization. Here is how that looks:
Now we will create a new permission set. Let us try to work with a permission set
that is predefined by Amazon. For our illustration, we have chosen the
“DatabaseAdministrator” permission set.
Here is how that looks:
Figure 15-11 Choosing the predefined permission set
This will enable you to enter further details about the permission set. Once this step is
successful, you will obtain the following screen:
Figure 15-12 Creation of a new permission set
Now we will create a user. The idea is that we will create users and assign various
users with various permission sets. You can also assign users to groups and assign
permission sets to the groups as well. It is an optional step though. Here is how that
looks:
Figure 15-13 Creation of a new user within IAM Identity Center
Now we need to connect the users, permission sets, and organizations: revisit AWS Organizations, select the appropriate organization, and assign the permission sets to the respective user. Here is how AWS Organizations looks:
Figure 15-14 Connecting new users, permission sets, and organizations
Click Assign users or groups, and choose the user we created. Upon clicking Next,
choose the permission sets for the respective user for the respective organization.
Click Submit and wait a few seconds for AWS to create the permission sets for the
user for the organization. Here is how that final screen looks:
Once you have completed the steps, click “Dashboard” in AWS IAM Identity Center
and obtain the AWS access portal for that respective user. Here is how that looks:
Figure 15-16 Obtaining the access portal for the respective user
You can use the AWS access portal URL to log in. Please use the user credentials we
have set up using AWS IAM Identity Center and not the root credentials. It may also
be beneficial to bookmark the link.
In addition to the console, you can interact with AWS from a terminal using the AWS command line interface (CLI). For example:
aws configure
aws s3 ls
aws s3 ls s3://<name-of-your-bucket>
The first command configures your credentials and default region, the second lists your S3 buckets, and the third lists the objects in a specific bucket.
AWS S3
AWS S3 stands for Simple Storage Service, which is an object storage service. S3 allows for retention of unstructured data and is optimized for use cases where you write once and read many times. S3 stores data in the form of
objects, as compared with file systems where data is stored as files. It is a great option
to store photos, videos, and similar content. AWS provides a programmatic interface
to allow apps to perform create, update, read, and delete operations on the data stored
on the S3 object storage.
Amazon offers a variety of storage types within their S3 service. Amazon S3
Standard is for data that has a high volume of reads and is an ideal option for
applications that seek high throughput and demand low latency. Amazon S3 Standard-IA (Infrequent Access) is for data that is accessed less frequently; ideal applications include business recovery and disaster planning, long-term backups, etc. Amazon S3 Glacier is for data that may rarely be accessed but must be retained; applications include end-of-lifecycle data, backups kept for regulatory and compliance purposes, etc.
Note Let us learn a couple of terms. In the context of cloud object storage, two terms describe access patterns and storage requirements: “hot data” and “cold data.”
Hot data is frequently accessed or used by one or more users at the same time. The data needs to be readily available and is in constant demand.
Cold data is accessed rarely; it includes periodic backups and historical data. It may be wise to choose cost-effective storage for it rather than performance-optimized storage (as used for hot data). A lifecycle rule, sketched after this note, can move cold objects to a cheaper storage class automatically.
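As a sketch of how cold data can be transitioned automatically (the bucket name, prefix, and 90-day threshold are assumptions), an S3 lifecycle rule configured via boto3 might look like this:
import boto3

s3 = boto3.client("s3")

# Move objects under the archive/ prefix to Glacier after 90 days
s3.put_bucket_lifecycle_configuration(
    Bucket="bpro-dev-bucket-1",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-cold-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "archive/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)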
AWS S3 uses a global bucket namespace: a bucket’s name must be unique within a given partition. There is also an option to create a public bucket that is accessible to everyone. Here is how to create an AWS S3 bucket:
Click Create bucket and you will be presented with a bucket creation page. While it is
best to leave the options preselected as is, here are some options you wish to consider
choosing that work best for you:
– AWS region: Choose a region that is geographically close to your location.
– Bucket type: If you are new to AWS, it may be better to choose general-purpose
buckets.
– Bucket name: In addition to following AWS guidelines for naming buckets, follow a naming convention that includes your organization and department codes, project type, and whether it is a development sandbox or a production bucket; it will help you greatly when you have multiple buckets to sift through.
Here is how that looks:
Figure 15-18 Creation of a new bucket
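If you prefer to script bucket creation instead of using the console, here is a minimal boto3 sketch (the bucket name and region are placeholders; credentials are assumed to be configured via aws configure):
import boto3

s3_client = boto3.client("s3", region_name="us-east-2")

# Bucket names must be unique within the partition
s3_client.create_bucket(
    Bucket="bpro-dev-bucket-1",
    CreateBucketConfiguration={"LocationConstraint": "us-east-2"},
)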
Uploading Files
Here is an example of uploading objects (files) to a given bucket using the AWS SDK for Python (boto3):
import boto3
import configparser
config = configparser.ConfigParser()
path =
'/Users/pk/Documents/de/aws/s3/.aws/credentials.cfg'
config.read(path)
aws_access_key_id = config.get('aws','access_key')
aws_secret_access_key = config.get('aws','access_secret')
aws_region = 'us-east-1'
session = boto3.Session(
aws_access_key_id=aws_access_key_id,
aws_secret_access_key=aws_secret_access_key,
region_name=aws_region
)
# create a S3 client
s3_client = session.client('s3')
file_to_upload = '/Users/pk/Documents/abcd.csv'
bucket_name = 'bpro-dev-bucket-1'
try:
s3_client.upload_file(
file_to_upload,
bucket_name,
'uploaded_abcd.csv'
)
print(f"The file {file_to_upload} has been uploaded
successfully \
in {bucket_name} S3 bucket")
except Exception as e:
print(f"Exception {e} has occurred while uploading \
{file_to_upload}")
Don’t forget to place a .gitignore file at the root of the repository (including entries such as the credentials file used above) and commit it so others who fork the project also benefit. Please ensure that ignoring those files does not affect the code in any way.
Amazon RDS
RDS stands for Relational Database Service. As its name suggests, Amazon RDS provides relational databases with several options for database engines, including but not limited to MySQL, PostgreSQL, Oracle, MS SQL Server, etc. AWS RDS can be set up within a few minutes, is easy to operate, and can scale up when necessary. AWS manages common database administration tasks, so you do not have to worry about routine maintenance.
AWS RDS is a primary option for those looking to migrate their databases to the cloud from an on-premises environment. By choosing the appropriate database engine for the migration, one may achieve faster and more reliable transfers of relational database systems.
Let us look at creating a new database instance. You may initiate the process by
searching for “RDS” in the AWS console, opening the appropriate page, and selecting
“Create a database.” You have the option of creating a new database either by
customizing individual options for availability, security, maintenance, etc. or choosing
the AWS-recommended configurations for your use case.
You can also specify the nature of the instance: whether you need a highly available production database instance or whether a development sandbox with limited performance will suffice. Here is how this may look for you:
Figure 15-19 Creating a new database instance in AWS RDS
In addition, make sure to add an inbound network rule in the security configuration that allows all incoming IPs to connect over TCP/IP. You will find this option under the VPC tab. Here is how that may look for you:
Figure 15-20 Enabling all incoming IPs to talk through TCP/IP (for learning and development purposes only)
Here is a Python application that connects to AWS RDS and uploads data to a table:
import csv
import psycopg2
import configparser

# CONFIG (the credentials file path and its section/key names are assumptions)
config = configparser.ConfigParser()
config.read('/Users/pk/Documents/de/aws/rds/credentials.cfg')

# Connect to the RDS PostgreSQL instance using values from the config file
conn = psycopg2.connect(
    host=config.get('rds', 'host'),
    port=config.get('rds', 'port'),
    dbname=config.get('rds', 'dbname'),
    user=config.get('rds', 'user'),
    password=config.get('rds', 'password')
)
cur = conn.cursor()

# Path to the CSV file to load (assumption)
csv_file_path = '/Users/pk/Documents/users.csv'

try:
    with open(csv_file_path, 'r') as f:
        reader = csv.reader(f)
        next(reader)  # Skip the header row
        for row in reader:
            cur.execute(
                "INSERT INTO users (id, name, age) VALUES (%s, %s, %s)",
                row
            )
    conn.commit()
    print("Data uploaded successfully")
except Exception as e:
    print(f"Error loading data into AWS RDS: {e}")
finally:
    cur.close()
    conn.close()
Amazon Redshift
Redshift is a massively scalable cloud data warehouse. It allows you to load petabytes
of data, execute complex analytical queries to retrieve data using SQL, and connect
business intelligence applications to render reports and dashboards. With Redshift Serverless, you do not have to pay for compute when the data warehouse is idle, as charges occur only when you are actively interacting with Redshift by running a query or loading data.
Amazon Redshift is based on open standard PostgreSQL; however, it has been
customized to support online analytical processing applications and other read-heavy
applications.
Amazon Redshift is a columnar data warehouse that is scalable and fully managed by AWS. Storing data in a column-oriented fashion reduces the number of input–output operations on Redshift and optimizes the performance of analytical queries. Furthermore, you can specify that data in Redshift be compressed, which further reduces storage requirements and can also speed up queries. When you execute a query against compressed data, the data is decompressed as the query runs.
Amazon Redshift instances are spun off in clusters. A cluster has a leader node
and compute nodes. The leader node manages communications, both with client
applications and the compute nodes. For an analytical query supplied by the user, the
leader node creates an execution plan and lists the execution steps necessary to return
the result. The leader node compiles the query and assigns a portion of the execution
to the compute node. The leader node would assign a query to a compute node that
has the underlying table mentioned in the query.
The compute nodes are themselves partitioned into slices. A portion of physical
memory and storage is allocated to each slice within a compute node. So, when the
leader node distributes the work to compute nodes, the work really goes to a given
slice within a given compute node. The compute node would then process the task in
parallel within the cluster. A cluster can consist of one or more databases where the
data is stored in compute nodes.
Let us try to set up a new Amazon Redshift cluster. To begin, we have to create a
workgroup. A workgroup in Redshift is a collection of compute resources that also
include VPC configuration, subnet groups, security groups, and an endpoint. Let us
visit the AWS management console and search for Redshift. Once we are on the main
page, let us choose to create a workgroup. You will be directed to a page where you
can specify the RPUs for the workgroup. RPUs are Redshift Processing Units, a
measure of available computing resources. You may also choose the VPC
configuration and subnet groups. Here is how that may look for you:
Figure 15-21 Creating workgroups in Amazon Redshift
Once you have specified the various parameters, you may wish to navigate to the
provisioned cluster dashboard to create a new cluster. When creating a new cluster,
you may specify the name of the cluster, the number of nodes you wish to have on the
cluster, and the type of compute, RAM, and storage you wish to obtain for your data
warehouse cluster.
Here is how that may look for you:
Figure 15-22 Creating a new cluster within Amazon Redshift
You may also specify the database credentials when creating the Redshift cluster. If
you wish to get started on querying the Redshift cluster using sample data, you may
click Load sample data. In addition, you can also choose to create a new role or associate an existing IAM role with the cluster.
Figure 15-23 Configuring the database credentials for the Redshift cluster
Once you have chosen other options, then you are ready to create a new Redshift
cluster. After you click Create the new cluster, you may have to wait a few minutes
for the cluster to be created. Once you have the cluster ready and loaded with sample
data, you may wish to click the query editor on the cluster dashboard page. We have
the option of choosing query editor version 2, which is a much cleaner interface. Here
is how the query editor may look:
Figure 15-24 Redshift query editor 2 with a new interface
You can write queries, access various databases, and also explore AWS Glue Data
Catalog from your query editor. Furthermore, you can create views, stored
procedures, and functions within the query editor against the data in the Redshift
cluster. In the top-left corner, you have options to create a new database or table using
the graphical user interface or load data directly from one of your S3 buckets. Let us
look at how we may load the data from a S3 bucket to Redshift using Python. Here is
a sample code that picks up a flat file from a S3 bucket and loads it into a Redshift
database:
import boto3
import configparser
config = configparser.ConfigParser()
path = '/Users/<your-path-here>/credentials.cfg'
config.read(path)
aws_access_key_id = config.get('aws','access_key')
aws_secret_access_key = config.get('aws','access_secret')
aws_region = 'us-east-1'
s3_bucket_name = 'your-bucket'
s3_key = 'key'  # object key (path) of the source file in the bucket
table_to_be_copied = 'nba_shots'  # target table in Redshift (not a file name)
redshift_cluster_identifier = config.get('redshift','cluster_id')
redshift_database = config.get('redshift','r_dbname')
redshift_user = config.get('redshift','r_username')
redshift_password = config.get('redshift','r_password')
redshift_copy_role_arn = config.get('redshift','redshift_arn')
s3 = boto3.client(
's3',
aws_access_key_id=aws_access_key_id,
aws_secret_access_key=aws_secret_access_key,
region_name=aws_region
)
redshift = boto3.client(
'redshift',
aws_access_key_id=aws_access_key_id,
aws_secret_access_key=aws_secret_access_key,
region_name=aws_region
)
copy_query = f"""
COPY {table_to_be_copied}
FROM 's3://{s3_bucket_name}/{s3_key}'
IAM_ROLE '{redshift_copy_role_arn}'
FORMAT AS CSV;
"""
def execute_redshift_query(query):
try:
redshift_data = boto3.client(
'redshift-data',
aws_access_key_id=aws_access_key_id,
aws_secret_access_key=aws_secret_access_key,
region_name=aws_region
)
response = redshift_data.execute_statement(
ClusterIdentifier=redshift_cluster_identifier,
Database=redshift_database,
DbUser=redshift_user,
Sql=query,
WithEvent=True
)
return response
except Exception as e:
print(f"Error executing Redshift query: {e}")
return None
response = execute_redshift_query(copy_query)
print(response)
Amazon Athena
Amazon Athena is an interactive querying service to analyze the data in the object
storage AWS S3 directly using standard SQL syntax. With Amazon Athena, you also
have the option to run data analytics using Apache Spark, directly on the S3 buckets.
There is no need to set up the underlying Spark infrastructure or set up a database
connection driver. Amazon will take care of all of this for you. Amazon Athena is a serverless service, so you are charged only for what you use, based on the amount of data scanned by your queries.
Because you can query the buckets in your object storage directly, you can analyze the unstructured, semistructured, and structured data within them. The data can be in any of several formats, including but not limited to CSV, JSON, Parquet, and Apache ORC, and all of these can be analyzed using Athena. Because Amazon Athena can query various kinds of data within S3, you can use S3 as your data lake, with Athena as the query layer on top of it. Queries are automatically parallelized and executed, offering fast performance.
To get started with Amazon Athena, search for Athena in the AWS management
console and choose an option to query your S3 bucket data (either SQL or Spark).
Here is how that may look for you:
Figure 15-25 Home page of AWS Athena
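In addition to the console, Athena can be driven programmatically. Here is a hedged sketch using boto3 (the database name, table, and output location are assumptions):
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start a query against a table registered in the Glue Data Catalog;
# Athena writes the results to the S3 output location
response = athena.start_query_execution(
    QueryString="SELECT year, season, city FROM events_csv LIMIT 10",
    QueryExecutionContext={"Database": "my_datalake_db"},
    ResultConfiguration={"OutputLocation": "s3://my-datalake-bucket/output-datasets/"},
)
print(response["QueryExecutionId"])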
You can also use Spark SQL on top of the objects in your S3 bucket, by interacting
with Jupyter Notebook. Here is how that may look for you:
AWS Glue
AWS Glue is a serverless data integration tool. You can use Glue to discover data sources, integrate data, and prepare datasets for data analytics, machine learning, and application development. AWS Glue has Glue Studio, which lets you prepare your
data integration jobs visually, where you can drag and drop components into a
workspace. Once you have designed your jobs, you can run and monitor these jobs
directly from the Glue environment. You can trigger a job based on a scheduled time
or an event.
Let us create two folders, one for the source data and the other for running Glue-
based reports within the data lake bucket. Here is how that may look for you:
Figure 15-28 Creating folders within the object storage
Let us now configure the data lake. We will navigate to the AWS management
console and search for Lake Formation. We will create a database within AWS Lake
Formation and configure our data sources folder from the S3 bucket. Create a
database, assign a name, and choose the data source location for the database. Here is
how that may look for you:
Figure 15-29 Creating a new database in Lake Formation
Now let us configure AWS Glue, so we can automatically catalog the tables in the
data sources bucket. Before that, we need to create an IAM role within AWS that
AWS Glue will use to create a crawler. A crawler in AWS Glue is a program that scans one or more data sources, infers their schema, and populates the Glue Data Catalog with table metadata. First, let us look for IAM within the
AWS management console, navigate to the service, and click Roles, to create a role.
Here is how that may look:
Figure 15-30 Creating a new role in IAM
For the trusted entity type, choose “AWS service,” and for the use case, click the
drop-down and select “Glue.” Here is how that may look:
Now, we will grant permission for this newly created role to create a catalog in our
Lake Formation. Navigate to AWS Lake Formation, choose our data lake, and grant
permission from Actions:
Here is how this may look for you:
Figure 15-33 Grant permission for the newly created role in Lake Formation
Select the newly created IAM role and select the data lake under the named data
catalog resources. Within the database permissions, provide the create table access
option as well. Let us now locate the AWS Glue service.
You can search for Glue within the AWS management console or click the crawler
option in AWS Lake Formation to directly navigate to AWS Glue service.
Figure 15-34 Choosing the IAM user within the permissions page
We are going to create a new crawler, where we will specify the “data sources” folder
from the data lake bucket as the input, assign the newly created “dl-glue” IAM role,
and choose the “data lake” bucket for the target database.
Here is how we can create a crawler:
Figure 15-35 Creating a new crawler with specified sources and IAM roles
Once you have created the crawler (may take a couple of minutes), you can see your
newly created crawler in AWS Glue.
Choose the crawler and click Run. You may have to wait a few more minutes for
the crawler to run.
During this process, the AWS Glue crawler would infer the structure of the input
dataset from the data sources folder and store the metadata in the data catalog.
Here is how that may look:
So far, we have successfully set up AWS Lake Formation, pointed the data sources at
the AWS S3 bucket, created the necessary service account using AWS IAM,
configured and ran the AWS Glue crawler, and populated the data catalog. We will
now use AWS Athena to query the data to extract insights. Visit the AWS
management console to navigate to AWS Athena service. Select the option of
querying the data with Trino SQL and click Launch the query editor. Here is how that
may look:
Let us now set up an output folder where we can store results of the query. At the top,
click the Settings option and then the Manage settings option. Click the Browse S3
option, where you navigate to the data lake S3 bucket and select the Output datasets
option. Save the settings.
Now, we are ready to write queries into AWS Athena and extract the insights from
the dataset. Let us write a simple query:
select
distinct year, season, city
from
events_csv
where
year < 2020
order by year desc
As mentioned, the results of this query are automatically stored in the S3 bucket under the output folder we specified earlier. Let us navigate to that folder to look at the query
results. You will see a complex hierarchy of folder structures created with date,
month, and year. Locate the CSV file type and click “Query with S3 Select.” You will
have a simple query already populated, and so just choose to run the query to obtain
the results.
Here is how that may look:
AWS SageMaker
Amazon SageMaker is a fully managed machine learning service by AWS. It enables
you to build, train, test, and deploy machine learning models at scale. You have the
option of developing a no-code or low-code machine learning model as well. AWS
SageMaker supports industry-leading machine learning toolkits like scikit-learn,
TensorFlow, PyTorch, Hugging Face, etc.
AWS SageMaker comes in a few variants. AWS SageMaker Studio comes with a
fully functioning ML integrated development environment. It comes with a Jupyter
notebook, which gives you control over every step of the model being built. AWS
SageMaker Studio Lab is a user-friendly, simplified version of AWS SageMaker
Studio, but it comes with limited hardware support. You would get 16 gigabytes of
physical memory and 15 gigabytes of storage, and it cannot be extended, compared
with SageMaker Studio where you can specify larger compute and greater memory.
There is also SageMaker Canvas, which is a low-code environment for developing
predictive models.
To set up AWS SageMaker Studio, go to the service page and choose the option “Set up SageMaker Domain.” Here is how that may look:
Once the SageMaker Domain is created for you, you can proceed to launch
SageMaker Studio from the Domain. Here is how that may look:
Let us try to build and deploy a machine learning model using Amazon SageMaker.
We will build a simple machine learning model, host it on an instance, create an API
that can be used to interact with the model, and create a lambda function that would
access the endpoint.
To get started, let us open JupyterLab and create a new JupyterLab space. You will provide a name for the JupyterLab instance and proceed to choose the underlying hardware. My suggestion is to use the “ml.t3.medium” instance, as you get 250 free hours of compute when you initially start using SageMaker.
The Image option is basically a Docker image appropriately configured for
machine learning tasks.
Here is how the settings may look, for instance:
Figure 15-44 Creating and configuring a new Jupyter space on AWS SageMaker
For our simple illustration, we will choose the Titanic dataset, where the outcome is the likelihood of a passenger surviving the ship’s collision with the iceberg. To keep the illustration simple, we have also removed a few columns. We import the necessary libraries, preprocess the dataset, run SageMaker’s linear learner model, and create an endpoint that can be hosted on another instance. We are using Amazon SageMaker’s linear learner to obtain the model.
Here is the code:
import sagemaker
from sagemaker.sklearn.model import SKLearnModel
from sagemaker import get_execution_role
import numpy as np
import pandas as pd
df = pd.read_csv("titanic.csv")
df = df.dropna(
subset=["Age"]
)
df = df.drop(
['Name','Sex','Ticket','Cabin','Embarked'],
axis=1
)
df.to_csv("titanic_subset.csv", index=False)  # index=False keeps the column layout predictable
rawdata = np.genfromtxt(
    "titanic_subset.csv",
    delimiter=',',
    skip_header=1
)
# Split into labels and features (assumption: after dropping columns, the
# remaining layout is PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare)
Ytr = rawdata[:, 1]   # Survived column as the label
Xtr = rawdata[:, 2:]  # remaining columns as features
linear = sagemaker.LinearLearner(
role = get_execution_role(),
instance_count = 1,
instance_type = 'ml.m5.large',
predictor_type='regressor',
sagemaker_session=sagemaker.Session()
)
train_data_records = linear.record_set(
Xtr.astype(np.float32),
labels=Ytr.astype(np.float32),
channel='train'
)
linear.fit(train_data_records)
predictor = linear.deploy(
initial_instance_count=1,
instance_type='ml.t3.medium'
)
Let us run this code in a JupyterLab notebook and make a note of the endpoint that SageMaker generates; the endpoint is hosted on the instance type specified in deploy(). We will then have a lambda function that calls this hosted endpoint. Lambda functions usually have an IAM role associated with them in order to work with other services.
Let us create a new IAM role that can be used to interact with machine learning
models. Let us find IAM in the AWS management console, locate the Identity and
Access Management option on the left, and click Roles; create a new role and assign
full access to Lambda, SageMaker, and CloudWatch services. Here is how it may
look for you:
Figure 15-45 Creating a new IAM role for ML
Let us now create a lambda function. Navigate to the AWS management console and
search for “Lambda” service. Choose to create a function, select the Author from
scratch option, choose Python as the runtime configuration, and leave the architecture
selection as is.
Here is how the Create function page in AWS Lambda looks:
Ensure that you change the default execution role by using an existing role, and select
the IAM role you created from the drop-down option.
The AWS lambda function comes with starter code, where the handler has two
parameters: an event object that carries the payload that triggered the function and a
context object containing runtime information.
Here is how the dashboard in AWS Lambda looks:
We parse the JSON string to obtain the actual data that will be sent to SageMaker for
prediction. We then request the SageMaker endpoint and return the response. This is
again followed by JSON parsing of the response and returning the prediction score.
Here is the core of the lambda function, which reads the endpoint name from an environment variable and creates a SageMaker runtime client:
endpoint = os.environ['endpoint']
runtime = boto3.client('runtime.sagemaker')
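A fuller sketch of the handler might look like the following. This is only an illustration: the “data” field in the request and the parsing of the Linear Learner response are assumptions about how you shape the payload, so adjust them to your integration.
import json
import os
import boto3

# The endpoint name is supplied through an environment variable (set below)
endpoint = os.environ['endpoint']
runtime = boto3.client('runtime.sagemaker')

def lambda_handler(event, context):
    # Parse the incoming JSON payload; 'data' is an assumed field name holding
    # a CSV-formatted feature vector
    payload = json.loads(event['body'])['data']
    # Invoke the hosted SageMaker endpoint
    response = runtime.invoke_endpoint(
        EndpointName=endpoint,
        ContentType='text/csv',
        Body=payload
    )
    # Parse the model response and return the prediction score
    result = json.loads(response['Body'].read().decode())
    score = result['predictions'][0]['score']
    return {
        'statusCode': 200,
        'body': json.dumps({'prediction': score})
    }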
In addition, you may wish to set the environment variable for this lambda function
by navigating to the Configuration option and choosing Environment variables. Here
is how that may look:
Figure 15-48 Setting the environment variables for the lambda function
Let us now deploy this function. From the AWS management console, navigate to the
API Gateway service and choose to create an API. Choose to build a REST API that
will provide you more control over the request and response. Enter a name for the
new API, leave the endpoint type as is, and choose to create an API.
Here is how this may look:
Since we will be sending data to the API to obtain a prediction score, we will create a
POST method that would post the data points to the machine learning API and return
a prediction score. And so, choose to create a method, select POST as method type,
and select the lambda function as the integration method.
From the Method type drop-down, please choose the POST method and not the
PUT method.
Here is how that may look:
Once you have created this POST method, you can test the API functionality to see if
it is working as expected. On API Gateway, navigate to the Test option and enter the
data to be tested within the Request Body section and click Test. You will get the
response code and the response that the API returns.
Once everything looks good, we can proceed to deploy the API that is publicly
accessible. In the top-right corner, choose the “Deploy API” option, choose “New
stage” for the Stage drop-down, and enter a name for the stage. You could name the
stage Development, Testing, or Production, or use labels such as Alpha, Beta, and so on.
Here is how that may look for you:
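After the stage is deployed, API Gateway shows an invoke URL for it. Here is a hedged sketch of calling the deployed API from Python; the URL, resource path, and payload shape are placeholders and assume the lambda reads a JSON body with a “data” field, as in the handler sketch earlier:
import json
import requests

# Placeholder invoke URL: replace it with the one API Gateway displays for your stage
url = "https://your-api-id.execute-api.us-east-1.amazonaws.com/dev"

# Example feature vector for the reduced Titanic dataset, CSV formatted
payload = {"data": "3,22.0,1,0,7.25"}

response = requests.post(url, data=json.dumps(payload))
print(response.status_code, response.text)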
Conclusion
So far, I hope you gained a comprehensive view of Amazon Web Services, focusing
on its data engineering and machine learning capabilities. We have covered a wide
range of services starting from simple object storage to hosting your own machine
learning models on the cloud. We created IAM roles for secured access to services;
stood up databases with RDS, data warehouses with Amazon Redshift, data lakes
with Lake Formation, and machine learning models with AWS SageMaker; and also
gained experience in working with AWS Lambda, AWS Athena, and AWS Glue.
The examples provided here may seem to use comparatively generous IAM role
privileges. This is only for illustration purposes. In production, you should practice
the principle of least privilege, granting only the specific permissions required for a
given task. Overall the topics we covered may serve you well in your projects.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2024
P. K. Narayanan, Data Engineering for Machine Learning Pipelines
https://doi.org/10.1007/979-8-8688-0602-5_16
Introduction
In this chapter, we will look at Google Cloud Platform (GCP), one of the early providers of cloud computing. We will
explore various GCP services, focusing on data engineering and machine learning capabilities. We will start
by understanding core concepts of Google Cloud Platform and looking at key services. This is followed by a
detailed look at the various data system services offered by Google Cloud Platform. We will finally look at Google
Vertex AI, a fully managed machine learning platform for building, training, and deploying machine learning
models within one service.
By the end of the chapter, you will learn
Fundamentals of Google Cloud Platform
How to set up Google CLI for managing resources
Google’s object storage and compute engine for virtual machines and data systems
Google’s data processing tool Dataproc for running Spark and Hadoop workloads
Developing, training, and deploying machine learning models using Google Vertex AI
As you embark on your journey exploring Google Cloud Platform, I ask that you be mindful of
managing your cloud resources, especially when you are done using them. With traditional infrastructure,
you can simply switch off a computer when you are done. With cloud infrastructure, however, where you typically
procure services on a pay-as-you-go model, you will be billed for provisioned resources even if they are not
being used. Signing out of GCP does not mean powering off the services you have provisioned. This is particularly
important for virtual machines, databases, and any other compute-intensive resources. Please make it a
habit to review your active resources and shut down the ones you are not using. If you are no longer planning on
using a service, you may delete it altogether. That way, you pay only for the services you actively use and keep
your costs under control. I would also like to thank GCP for providing free credits to new users, which actively
supports new learners in keeping their costs down.
2. Import the Google Cloud CLI public key and add the GCloud CLI distribution as a package source:
3. Run the update command again and install GCloud CLI:
4. Run Google Cloud CLI to get started:
gcloud init
You will have to authenticate in a browser by entering your credentials and allowing the Google Cloud SDK to
access your account.
Here is how that may look:
Figure 16-2 Google Cloud SDK authentication page
If you are working on MS Windows operating system, then open a PowerShell tab and enter the following:
(New-Object Net.WebClient).DownloadFile("https://dl.google.com/dl/cloudsdk/channels/rapid/GoogleCloudSDKInstaller.exe", "$env:Temp\GoogleCloudSDKInstaller.exe")
& $env:Temp\GoogleCloudSDKInstaller.exe
Follow the prompts that the Google installer provides. Open Google Cloud CLI and enter the following to get
started with GCP CLI:
gcloud init
You are all set to utilize Google Cloud from your command line.
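For reference, on a Debian or Ubuntu workstation, steps 2 and 3 above typically correspond to commands along the following lines; verify them against the current Google Cloud CLI documentation, since the package source and keyring location can change:
# Prerequisites for adding an HTTPS package source
sudo apt-get update && sudo apt-get install -y apt-transport-https ca-certificates gnupg curl
# Step 2: import the public key and register the gcloud CLI package source
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee /etc/apt/sources.list.d/google-cloud-sdk.list
# Step 3: refresh the package index and install the CLI
sudo apt-get update && sudo apt-get install -y google-cloud-cli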
Click Create a VM, which will lead us to the following page, where you would be provided the option to enable
the Google Compute Engine API.
Click ENABLE to enable the Compute Engine API.
Once enabled, you would be able to manage the API. It would appear like this:
With the Compute Engine API enabled, you may proceed to create a new virtual machine.
Now, we are going to create a Linux virtual machine. We will begin by creating a new compute instance,
selecting the appropriate Linux distribution and the version, and enabling access by allowing incoming HTTP
traffic.
First, let us visit the menu on the left, locate the Compute Engine option, and click VM Instances and create a
new instance. You would be provided with a page, where you can specify the name, location, and type of instance
to begin with. You can either choose a predefined machine type with a specific number of
cores and amount of memory, or you can customize your own configuration.
Based on the choices you make, Google would present you costs that may be incurred every month. Here is
how this may look for you:
You also have the option of specifying the access control, firewall rules, and other security configurations within
the same page. Once you have chosen your desired configuration, you may click Create. You may also choose the
minimum configuration and go with the bare essentials. Here is how the dashboard may look:
Figure 16-7 List of all VM instances in Google Cloud
Please wait for a few minutes for the instance to be created. Once it is created, Google will provide an SSH-in-
browser session. Here is how that may look for you:
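If you prefer the command line instead of the console, a roughly equivalent gcloud command is sketched below; the instance name, zone, machine type, and image family are placeholders, so adjust them to your project:
gcloud compute instances create my-first-vm \
    --zone=us-central1-a \
    --machine-type=e2-medium \
    --image-family=debian-12 \
    --image-project=debian-cloud \
    --tags=http-server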
Google Cloud SQL
Cloud SQL is Google's managed relational database service. Google automatically performs backups, applies patches to databases, and scales up the storage capacity as
required, although the criteria and settings, among other things, still need to be defined by you.
Google provides the option of choosing the database driver for your database instance. The current options are
Postgres, MySQL, and MS SQL Server. When you create a new instance of Cloud SQL with your choice of a
database driver, Google will provision the instance in a single zone by default. Let us visit the Google console,
select the Cloud SQL service, and create a new database.
Figure 16-10 Creating a new database instance in Google Cloud SQL
As we discussed earlier, Cloud SQL will provide you an option to choose the database engine. Once you have
chosen the database engine, proceed to provide the compute instance name, a password for the root user, a sandbox
for the preset, a region close to your location, and a single zone, and choose to create the instance.
Please wait a few minutes for the instance to be created. When it is created, here is how the screen would look:
Figure 16-11 A live database instance in Google Cloud SQL
Once you have provisioned the Cloud SQL service, you can proceed to create a database and enable shell access to
your Cloud SQL instance among other things. Let us create a database here. Go to your instance, choose Databases
in the left menu, and click “Create a database.” Provide a database name, leave the character set to UTF-8, and
click Create.
Here is a simple application to load data programmatically from your local workstation to your newly created
Cloud SQL database:
# requires: pip install "cloud-sql-python-connector[pymysql]" sqlalchemy pandas
import configparser
import sqlalchemy
import pandas as pd
from google.cloud.sql.connector import Connector

config = configparser.ConfigParser()
path = '/Users/<your-folder-path-here>/credentials.cfg'
config.read(path)
connector = Connector()

def getconn():
    # Connect through the Cloud SQL Python Connector using pymysql;
    # the section and key names in credentials.cfg are illustrative
    conn = connector.connect(
        config.get('cloudsql', 'instance_connection_name'),
        "pymysql",
        user=config.get('cloudsql', 'user'),
        password=config.get('cloudsql', 'password'),
        db=config.get('cloudsql', 'database'),
    )
    return conn

pool = sqlalchemy.create_engine(
    "mysql+pymysql://",
    creator=getconn,
)
filepath = "/Users/<your-folder-path-here>/gcp/polars_dataset.csv"

with pool.connect() as db_conn:
    db_conn.execute(sqlalchemy.text(' \
        CREATE TABLE onedata( \
        firstname varchar(200), \
        lastname varchar(200), \
        gender varchar(10), \
        birthdate varchar(100), \
        type varchar(20), \
        state varchar(100), \
        occupation varchar(100) \
        ) \
    '))
    db_conn.commit()
# load the flat file into the new table
pd.read_csv(filepath).to_sql("onedata", pool, if_exists="append", index=False)
Google Bigtable
Bigtable is a NoSQL offering of Google Cloud. This NoSQL data store is designed for high-velocity incoming
data and low-latency database applications. Bigtable is used where data must be served rapidly, in
applications such as stock trading, real-time gaming, and streaming media platforms. Bigtable is a denormalized
database; there is no support for joins between tables.
Note A low-latency database is a database management system that provides very fast response times,
on the order of milliseconds.
Google Bigtable is a key–value store with no support for secondary indexes (other than the row key), and
different rows need not contain the same set of columns. You still store the data in tables, and Bigtable can have
several tables within a given instance. Since there is no enforced schema, it is good practice to store data of similar
structure in the same table. Moreover, operations are atomic only at the row level: when an operation that writes
multiple rows to Bigtable fails partway through, some rows may have been written while others may not have
been written at all.
When designing Bigtable row keys, it is important to think through how the data will be queried and
retrieved. You can incorporate multiple identifiers within the row key, separated by delimiters.
Rows are sorted lexicographically by row key in Bigtable, so it is usually beneficial to design a row key that
starts with the most common (coarsest) value and ends with the most granular value.
Google Bigtable defines a table as a logical organization of values indexed by row key. Let us create a Bigtable
instance. Navigate to the GCP console and type in “Bigtable.” Once you are able to locate the service, choose to
create an instance.
You can also install the Google Bigtable client for Python on your workstation, as
per the documentation
(https://cloud.google.com/python/docs/reference/bigtable/latest):
pip install google-cloud-bigtable
Here is a sample record from the JSON dataset (randomdata.json) that we will load into Bigtable:
{
"id": 2,
"first_name": "Rube",
"last_name": "Mackett",
"email": "rmackett1@japanpost.jp",
"gender": "Male",
"ip_address": "153.211.80.48"
}
import configparser
import json
from google.cloud import bigtable
from google.cloud.bigtable import column_family

config = configparser.ConfigParser()
path = '/Users/pk/Documents/de/gcp/credentials.cfg'
config.read(path)
# Configuration
project_id = config.get('bigtable', 'project_id')
instance_id = config.get('bigtable', 'instance_id')
table_id = config.get('bigtable', 'table_id')
json_file_path = "/Users/pk/Downloads/randomdata.json"
# Connect to the Bigtable instance and table
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
table = instance.table(table_id)
max_versions_rule = column_family.MaxVersionsGCRule(2)
column_family_id = "cf1"
column_families = {column_family_id: max_versions_rule}
if not table.exists():
    table.create(column_families=column_families)
# Write each JSON record as a row, keyed by the record id
with open(json_file_path) as f:
    records = json.load(f)
for record in records:
    row = table.direct_row(str(record["id"]).encode())
    for field, value in record.items():
        row.set_cell(column_family_id, field, str(value))
    # commit the row
    row.commit()
You may observe the output at Bigtable Studio by running a simple query. Here is how that may look for you:
Figure 16-14 Loading a JSON dataset in Google Bigtable
For reference, here is how the credentials.cfg file referenced in the code may look (configparser treats quotes as part of the value, so leave them out):
[bigtable]
project_id = value-here
instance_id = value-here
table_id = value-here
Google BigQuery
Google BigQuery is a fully managed analytical data warehousing solution by Google. As it is a fully managed
service, there is no need to set up, provision, or maintain the underlying infrastructure. The storage for BigQuery is
automatically provisioned and also automatically scaled when there is more data coming in. BigQuery supports
high-throughput streaming ingestion in addition to high-throughput reads. In Google BigQuery, the storage and
compute are separated. Both the storage and compute can be scaled independently, on demand.
Google BigQuery is a columnar store, where data is automatically replicated across multiple availability zones
to prevent data loss from machine failure or zone failure. In BigQuery, data is stored mostly in tabular format.
Tabular data includes standard data, table clones, materialized views, and snapshots. Standard tabular data is
table data that has a schema, with an assigned data type for each column. Table clones are lightweight copies of
standard tables, where only the delta between the clone and the base table is stored (which is what makes them lightweight).
Table snapshots are point-in-time copies of tables, where you have the option to restore a given table from a
given point in time. The last one is materialized views, where the results of a query are precomputed and cached
and updated periodically. In Google BigQuery, the cached results of a query are stored in temporary tables. Google
does not charge for cached query results.
Google BigQuery lets you manage the security and quality of data throughout its lifecycle through access
control, data stewardship, and data quality modules. You can manage and control the access of BigQuery users and
the level of access they have with respect to tables, views, snapshots, and so on. There are also provisions to
control and manage access to certain rows and columns, enabling fine-grained access to sensitive data. You have
the choice of obtaining the detailed user profiles and system activity by obtaining audit logs.
In addition, you can safeguard personally identifiable information in the BigQuery tables and views, by
applying data masking to the PII fields. This will safeguard the datasets against accidental data disclosure.
BigQuery automatically encrypts all data, both during the transit and at rest. You also have the option to check the
data quality by running statistical metrics (mean, median, unique values, etc.) on your dataset. You can also
validate your data against predefined rules and check data quality and troubleshoot data issues.
To get started using Google BigQuery, we need to locate the service within Google Cloud and enable it. Here
is how that process may look:
Figure 16-15 Enabling the Google BigQuery API
Once you click ENABLE, you may wait a few more minutes till you get automatically redirected to Google
BigQuery. There is no need to provision any infrastructure, compute, or even storage. When you start using
BigQuery, depending upon your usage, you will be billed for storage and compute.
Here is how that may look for you:
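Once BigQuery is enabled, you can also query it programmatically with the google-cloud-bigquery client library. Here is a minimal sketch that queries one of Google's public datasets; the project ID is a placeholder, and the snippet assumes you have authenticated locally (for example, via "gcloud auth application-default login"):
from google.cloud import bigquery

# Billing project for the query; replace with your own project ID
client = bigquery.Client(project="your-project-id")

# Illustrative aggregation against a BigQuery public dataset
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row.name, row.total)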
Google Dataproc
Google Dataproc is a fully managed service that provides you a Hadoop cluster or a Spark cluster for you to run
big data processing in a distributed environment to take advantage of big data tools. The Dataproc service also
integrates well with various other Google Cloud services like BigQuery and others, enabling you to set up a
complete data platform as well.
To get started with Google Dataproc, enable the API from the cloud console:
Google Dataproc’s Spark cluster is a serverless model, which means you pay only for your consumption. The
Hadoop cluster is also relatively less expensive. Google Dataproc takes very little time to set up a new cluster. It
also has an autoscaling feature, which would scale up or down depending upon the memory usage.
There are three types of clusters you can set up using Google Dataproc. They are single-node cluster, standard
cluster, and high-availability cluster. The single-node cluster consists of one master and no worker nodes. The
standard cluster consists of one master and “n” workers, whereas the high-availability cluster consists of three
master nodes and “n” worker nodes. For a single-node cluster, the autoscaling feature is not available.
Let us create a simple cluster in Google Dataproc with Compute Engine. To get started, let us click Create a
cluster and choose the Compute Engine option:
In our illustration, we will create a single-node cluster with 1 master and 0 workers. Let us provide a cluster name,
region, and type of cluster and choose a disk image for the compute engine. Here is how that looks:
Figure 16-19 Creating a new Dataproc cluster
As you scroll down, make sure to check the Enable component gateway checkbox and also the Jupyter Notebook
option. You may wish to choose other options depending upon your needs. Here is how this may look:
Figure 16-20 Specifying the components for the new Dataproc cluster
Now, click Configure nodes, where you have the option to specify the type of compute you wish to use. If you are
creating a cluster for a development environment, then you may be able to work comfortably with a less powerful
compute.
In that case, you may wish to choose a general-purpose compute and select the “N1 series” option if you see it.
You may also select similar options for the worker node as well.
Here is how that may look for you:
Figure 16-21 Configure the node for the Dataproc cluster
You can also create a cluster from your Google Cloud command line interface.
Here is the command for that:
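For the single-node setup described above, the command may look roughly like the following; the cluster name, region, and image version are placeholders, and you can confirm the available flags with "gcloud dataproc clusters create --help":
gcloud dataproc clusters create my-single-node-cluster \
    --region=us-central1 \
    --single-node \
    --optional-components=JUPYTER \
    --enable-component-gateway \
    --image-version=2.1-debian11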
When the cluster is created, you may be able to see it in the dashboard. Here is how this may look for you:
Figure 16-22 Dashboard of active clusters in Dataproc
Let us create a Spark data pipeline and use Cloud Dataproc to execute it. We will use the Vertex AI
Workbench service to create a Jupyter notebook.
Vertex AI Workbench integrates well with Cloud Storage and BigQuery services. It comes with various machine
learning libraries preinstalled and supports both PyTorch and TensorFlow frameworks. You have the option of
using CPU-only instances or GPU-accelerated instances.
Here is a sample code for you that you can run on Google Cloud Storage data:
from pyspark.sql import SparkSession

bucket = "your-bucket"
filepath = "gs://your-bucket/polars_dataset.csv"

spark = SparkSession.builder\
    .appName("PySpark-my-cluster")\
    .config("spark.jars", "gs://spark-lib/bigquery/spark-bigquery-latest.jar")\
    .getOrCreate()

# Read the flat file from Cloud Storage into a Spark dataframe
df = spark.read.csv(filepath, header=True, inferSchema=True)

# Cross-tabulate state against occupation and show a few rows
occ_report = df.crosstab(col1="state", col2="occupation")
occ_report.show(5)
Note The “spark.jars” configuration points to the Spark–BigQuery connector JAR that lets Spark read and
write data in Google BigQuery. When you run the pipeline through the “Submit a job” page within the Dataproc
dashboard (or the equivalent gcloud command), you can supply this JAR in the Jar files field of the job
submission.
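If you prefer the command line, a job submission might look roughly like this; the script path, bucket, cluster name, and region are placeholders:
gcloud dataproc jobs submit pyspark gs://your-bucket/spark_pipeline.py \
    --cluster=my-single-node-cluster \
    --region=us-central1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar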
Google Vertex AI
Google’s Vertex AI is a fully managed machine learning platform that lets you train and deploy machine learning
models. Once you train a model, you can deploy it on Vertex, enabling the model consumers to receive real-time
intelligence and insights aimed at generating value for the business. This machine learning platform has a rich
feature set and is comparable to the leading machine learning platforms offered by AWS and
Microsoft.
Let us build a custom machine learning model from scratch, and this time we will use the AutoML feature.
AutoML stands for automated machine learning and is the process of automating machine learning model
development, including preprocessing, feature engineering, model selection, and hyperparameter tuning.
Google Vertex AI offers two development environments for building machine learning models; they are Vertex
AI Workbench and Colab notebooks. We have already seen Vertex AI Workbench in the previous section and
utilized the notebook to build machine learning models. These workbench environments are provided through
virtual machines, and they are customizable with support for GPU as well.
Google Colab notebooks are a managed notebook environment that lets you collaborate with others, while
Google manages the underlying infrastructure for you. You have the option of customizing the underlying compute
and configuring runtimes. The upload size for a tabular file cannot exceed 20 megabytes in Google Colab, whereas
in the case of Vertex AI Workbench, you can work with a tabular dataset with size up to 100 megabytes.
Let us look at building a machine learning model and host the model as an endpoint using Google Vertex AI.
We are using the iris flowers dataset for a multi-class classification problem. Here is the machine learning code for
your reference. I encourage you to use this code as a starter and experiment with other classifiers:
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import pickle

training_dataset = pd.read_csv('iris.csv')
dataset = training_dataset.values
# split data into X and y (cast the features to float, since .values is an object array)
X = dataset[:, 0:4].astype(float)
Y = dataset[:, 4]
# encode the string class labels as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(
    X,
    label_encoded_y,
    test_size=test_size,
    random_state=seed
)
xgmodel = XGBClassifier()
xgmodel.fit(
    X_train,
    y_train
)
print(xgmodel)
# evaluate on the held-out split
predictions = xgmodel.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
# serialize the trained model for the Vertex AI model registry
filename = "model.pkl"
pickle.dump(xgmodel, open(filename, 'wb'))
Note Pickle files are binary files used to serialize Python objects like dictionaries, lists, and trained models. The
process of converting a Python object to a byte stream is called pickling or serialization; the inverse, unpickling or
deserialization, reconstructs the object. Keep in mind that pickle is not a secure format: unpickling data from an
untrusted source can execute arbitrary code, so only load pickle files you trust. Pickling is nevertheless
commonly used for machine learning models in the Python ecosystem.
Once we have the model available in pickle file format, we can now add this model in Google Vertex AI model
registry. To get started, visit Vertex AI from the Google console, navigate to deployment, and click Model
Registry.
Here is how that may look for you:
You can either create a new machine learning model interactively or import the pickle file you have generated. In
this case, we will import our model as a pickle file. You have the option of choosing the region for your model.
Figure 16-26 Importing the model, naming, and specifying the region
You may specify a name, description, and the region for this model. You can choose to import as a new model. If
by any chance you have retrained this same model (with the iris dataset) and generated another model pickle file,
then you may import the same as a newer version of the existing model. We have trained our model using the
XGBoost model framework. If you have used scikit-learn or TensorFlow as a model framework, then choose the
appropriate model framework and model framework version from the drop-down box. And for the model artifact
location, provide the cloud storage folder location, either by browsing visually or just by entering the URI.
For the model artifact location, please specify only the folder and not the exact model path. The field
auto-populates the “gs://” prefix, so be mindful of the Cloud Storage path you enter.
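For instance, here is a hedged sketch of staging the pickle file in Cloud Storage with the google-cloud-storage client; the bucket and folder names are placeholders:
from google.cloud import storage

# The model registry expects the folder URI, e.g., gs://your-model-bucket/iris-model/
client = storage.Client()
bucket = client.bucket("your-model-bucket")
blob = bucket.blob("iris-model/model.pkl")
blob.upload_from_filename("model.pkl")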
Once you have deployed the machine learning model pickle file in the model registry, then it is time to create
an endpoint and expose the same for consumption. Let us navigate to the “DEPLOY AND USE” tab within Vertex
AI and click Online prediction. Here is how that may look:
Figure 16-28 Creating an endpoint for our ML model
Online prediction applies when you want to consume a machine learning model by
submitting requests to an HTTP endpoint. Let us deploy our iris machine learning model as an endpoint. You may
choose a region where you wish to host this endpoint and click Create new endpoint.
Once you click to create an endpoint, provide the name for the endpoint and make a note of that name separately.
To access the endpoint, we will choose the standard option here.
The traffic split setting basically specifies how much of the incoming traffic should be routed to this
deployed model. For our illustrative purposes, we can leave the value as is. This parameter can be changed at a later point
if we deploy multiple models (or model versions) behind the same endpoint.
Furthermore, you can specify the number of compute nodes for this one endpoint. Also, you can specify the
maximum number of compute nodes you would like to have should there be higher traffic. You can also specify
custom machines with specific compute and physical memory; there are also options to choose from specialized
machines like CPU intensive, memory intensive, etc.
Here is how that may look:
Figure 16-30 Model settings for deploying to an endpoint
For the remaining options, you may leave as is and choose to deploy the model. The process may take several
minutes. Once the model is deployed, you may be able to see the entry in your deployment dashboard.
Here is how that may look for you:
Figure 16-31 A deployed model that can be seen in the Google Vertex AI dashboard
You can test the endpoint in various ways; the Postman client would be the simplest. Here is an illustration using the
Google Cloud command line interface:
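A hedged sketch of the command is shown below; the endpoint ID and region are placeholders, and request.json is a file you prepare with the feature values in the format expected by the serving container:
# request.json might contain, e.g.: {"instances": [[5.1, 3.5, 1.4, 0.2]]}
gcloud ai endpoints predict ENDPOINT_ID \
    --region=us-central1 \
    --json-request=request.json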
Conclusion
So far, we were able to look at Google Cloud Platform with a focus on data engineering and machine learning
projects. We explored a wide range of services from Cloud Storage to BigQuery. GCP offers flexible compute
options through the Compute Engine API and managed database services like Bigtable and Cloud SQL. We looked
at Dataproc and BigQuery for running distributed computing workloads and also Vertex AI Workbench for
developing machine learning models. This was followed by Google Vertex AI, where we looked at building a
custom machine learning model, deploying the same to model registry, and exposing it as an endpoint for real-time
predictions. As GCP evolves and adds more services, its managed offerings continue to reduce the operational
overhead for data and ML engineers, enabling them to build data solutions that are stable
and reliable.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2024
P. K. Narayanan, Data Engineering for Machine Learning Pipelines
https://doi.org/10.1007/979-8-8688-0602-5_17
Introduction
Microsoft Azure is a cloud computing service operated by Microsoft. The Azure concept was introduced
in 2008 at a developer conference, and Microsoft launched the Windows Azure service in 2010; it was later rebranded
as Microsoft Azure. Currently it provides a comprehensive suite of computing, database, analytics, and other
services through various deployment modes. Microsoft has data centers all across the globe and serves various
industries. In this chapter, we will look at some of Azure’s key components and services, with a focus on data
engineering and machine learning.
By the end of this chapter, you will learn
– How to set up a Microsoft Azure account and Azure’s resource hierarchy
– Microsoft’s object storage called Azure Blob Storage and its concepts and setup
– The relational database service called Azure SQL and its concepts and installation
– Cosmos DB, a NoSQL database, and its architecture and concepts
– A fully managed data warehousing environment
– Serverless functions and data integration services
– Azure Machine Learning Studio and its concepts
As you embark on your journey exploring the Microsoft Azure platform, I ask that you be
mindful of managing your cloud resources, especially when you are done using them. With traditional
infrastructure, you can simply switch off a computer when you are done. With cloud infrastructure, however,
where you typically procure services on a pay-as-you-go model, you will be billed for provisioned resources even
if they are not being used. Signing out of Azure does not mean powering off the services you have provisioned. This is
particularly important for virtual machines, databases, and any other compute-intensive resources. Please make it
a habit to review your active resources and shut down the ones you are not using. If you are no longer
planning on using a service, you may delete it altogether. That way, you pay only for the services you actively
use and keep your costs under control. I would also like to thank Azure for providing free credits to new users,
which actively supports new learners in keeping their costs down.
Introduction to Azure
Azure is Microsoft’s cloud computing platform offering several of the services that are comparable to Google’s
and Amazon’s cloud service offerings. Azure’s offerings are particularly attractive to organizations who are
heavily invested in Microsoft-based technologies.
Microsoft has had a long-standing reputation in the data management field. Microsoft is one of the early
adopters of building ETL tools, which were instrumental in facilitating data movements between various sources.
To get started with Microsoft Azure, you need to visit Microsoft’s website to sign up for a Microsoft account,
before attempting to obtain access to Azure. Once you have a Microsoft account, you may wish to sign up for free
Azure services. Like AWS and Google Cloud, you will enter various information to Microsoft to obtain access to
Azure services.
Let us get ourselves familiar with the resource hierarchy of Microsoft Azure. It helps to configure and govern
the provisioning of services and workload by a given account, especially an organizational account. The highest
level is called the management group. The management group governs utilization of various services by a given
account. Management groups enable an organization to apply various policies across all users within a
management group and apply access control to one or more Azure service offerings. From the root management
group, you can create one or more management groups that may correspond to a business group or a certain type
of billing model. Simply search for the term “management group” from your Azure dashboard to get into that
section, and you may wish to create one for your organization. You can name your management group as your
organization or your department within the organization or a specific project name.
If you visit a service offering page often, it may be convenient to pin that page to your dashboard.
To do that, simply click the pin icon next to the title of your management group. Here is how that looks:
Within a given management group, you can have one or more subscriptions. You can see the option of
subscriptions in the Management groups section. A subscription is a scheme that enables one or more users to
access one or more Azure service offerings. For instance, you can have two subscriptions within a management
group for a specific team, namely, development and production. You may allow more resources on the production
subscription and enable certain constraints on the development subscription. Another example could be creating a
subscription for every product or service that the organization offers so that one may be able to calculate the costs.
Within any given subscription, we have a concept called resource group. This is a project-based grouping of
resources or one or more Azure services. For instance, if you are looking to set up a data warehousing project and
need databases, data warehouses, data integration services, secrets management, data modeling tool services, and
logging and monitoring services, then you can create one resource group that can contain these services. This can
be beneficial in the context of governance and setting up access policies.
Finally, the lowest level of abstraction is the resource. Resources are the individual services that Azure offers: a
database, a storage account, or any other tool that you pay to use. A resource can be any networking, storage,
or compute entity. For instance, a serverless data integration service that uses compute to move data from source
to sink is a resource.
From a data products and services point of view, let us explore the storage systems, database services, and data
ingestion and integration services among others.
Azure Blob Storage
Note “Blob” is a database term that expands to binary large object. A blob can store any type of data as a
binary object, which makes it well suited for images, audio, video, or PDF content, although blob data is not
indexable. There is also the clob, which expands to character large object and stores large amounts of text data
such as documents, large code snippets, and so on.
To create a new blob storage account, choose Storage from the main dashboard and create a new storage account. Provide the name
of the account and choose a nearby region. In addition, choose standard performance and low-cost
replication. You may be able to see other options like choosing customer-managed encryption keys among other
critically important options. Microsoft will take a moment to run validation to review your choices before enabling
the “Create” button. Once you have created it, wait a few moments for deployment of the resource in your account.
Once deployed, here is how that looks:
You now have a storage account configured for your use. Let us now create a container by clicking “Upload” and
choosing “Create new.”
Here is how that looks:
Let us now programmatically upload a flat file into Azure Blob Storage using Python. To begin with, we need to
install the Azure Blob Storage SDK for the Python programming language. Open a terminal,
activate a virtual environment or conda environment session, and install the package:
source azure/bin/activate      # macOS or Linux
azure/Scripts/activate         # Windows
pip install azure-storage-blob
To interact with the cloud storage service programmatically, we need three inputs: the connection string, the
container name, and the local file path. We point the local file path at the file that needs to be uploaded and define
the container name and connection string. We then initialize the blob service client, the container client, and
finally a blob client for the target object. A container is a logical grouping of objects in the blob storage service, and you
can enforce container-level access control as well. Once the blob client is initialized, we can upload our file.
Here is how this all may look:
import os
from azure.storage.blob import BlobServiceClient

# Illustrative connection string; copy yours from the storage account's "Access keys" page
connection_string = 'DefaultEndpointsProtocol=https;\
AccountName=testblobforsampleusage;\
AccountKey=YjsfMmriMCRCn5Po/DdzCQAiidxPsfgfsdgs45SFGS;\
EndpointSuffix=core.windows.net'
container_name = "firstcontainer"
local_file_path = "/Users/username/hello.csv"

blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client(container_name)

# I am just using the file name from the local file path here
blob_name = os.path.basename(local_file_path)
blob_client = container_client.get_blob_client(blob_name)

# upload blob
with open(local_file_path, "rb") as data:
    blob_client.upload_blob(data)
Similar to AWS and Google Cloud, Microsoft Azure offers various storage classes where you can organize
data in a specific tier based on the frequency of accessing that data. Currently Azure provides hot tier, cool tier,
cold tier, and archive tier for storage. The hot tier is the storage class that is designed for most frequent access
(reading or writing). The hot tier has the highest storage cost of all. The cool tier is for data that is infrequently accessed;
however, the data should be stored for a minimum of 30 days. The cold tier is similar in idea to the cool tier,
except the data should be stored for a minimum of 90 days. The minimum storage period is 180 days for the
archive tier, which is intended for long-term backup and archival.
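You can also change a blob's tier programmatically. Here is a hedged sketch using the azure-storage-blob client; the connection string, container, and blob names are placeholders:
from azure.storage.blob import BlobServiceClient

# Copy the connection string from the storage account's "Access keys" page
connection_string = "your-connection-string-here"
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = blob_service_client.get_blob_client("firstcontainer", "hello.csv")

# Move an infrequently accessed blob from the hot tier to the cool tier
blob_client.set_standard_blob_tier("Cool")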
Azure SQL
Azure SQL is a managed database service that can be availed in three different methods. Recall our discussion on
Infrastructure as a Service, Platform as a Service, and Software as a Service.
You can deploy SQL Server on an Azure virtual machine. Azure provides you images of SQL
Server preinstalled on an operating system. You can pick the one you prefer, though keep in mind you will be responsible for
operating system upgrades and patches. You have the option to choose a memory-optimized or storage-optimized
virtual machine. This offering is named “SQL Server on Azure Virtual Machines,” and this type of setup is Infrastructure as a Service.
You can also avail an Azure managed database instance, where the fully managed Azure SQL database service
automates most of the software updates, security updates, and patches. Here we just choose the appropriate
database version and not worry about the underlying operating system. You can still configure the underlying
compute (and have the option to scale up or down the number of virtual cores currently allocated to the managed
instance). This is referred to as a Platform as a Service model.
The other option is the Azure SQL Database service, which operates closest to a SaaS (Software as a Service) model. In
this option, you do not need to worry about the underlying infrastructure; Microsoft manages both the
database engine and the underlying operating system. You have the option to go with the serverless compute tier, where you
pay only for what you consume and Azure bills you based on the number of seconds of compute you have utilized.
You can also choose various tiers for the underlying infrastructure, with the flexibility to scale the compute up
or down as well.
Let us create an Azure SQL database. In the process of creating a new SQL database, we will also create an
Azure SQL logical server. The Azure SQL logical server is like a parent resource for the SQL database. With the
Azure SQL logical server, you can enable role-based access control, logins, threat protection, and firewalls to
name a few. It also provides an endpoint for accessing databases.
Let us search for “SQL database” and choose to create a new database instance. You will have to mention the
subscription and resource group details. You will also have to enter a name for the database and choose a server.
The server option is for the Azure SQL logical server. If no server currently exists, you may create a new one.
Here is how this may look:
Figure 17-4 Creating a new SQL database
You can also choose whether to use an elastic SQL pool. An elastic SQL pool is basically the underlying compute,
storage, and IO for the database that can be shared with other database instances as well. If you create “n” single
database instances, then you would have “n” underlying compute and other resources. If you use an elastic SQL
pool, then you can have less than “n” compute and storage resources and “share” them among “n” database
instances.
In addition to the option of a SQL elastic pool, we also have the choice of specifying whether the instance we
are creating is for development or production purposes. SQL elastic pools are billed in eDTUs. An eDTU (elastic
database transaction unit) is a unit of measure that bundles a certain amount of compute, storage, and IO resources.
You may wish to buy a certain number of eDTUs in advance and set limits on how many such units the databases
can consume in a given amount of time.
Before we create a new SQL database, we may have to create a new Azure SQL logical server. Let us click
Create SQL database server, which would provide the page for creating a new logical server. Here is how that may
look:
Figure 17-5 Creating a SQL database server in Azure
You may specify a name and a location for the logical server; also, you have the option of specifying the
authentication method. Once the information is supplied, you may click OK to create a new logical server for your
SQL database. For our instances, we are choosing to create a single instance of a SQL database. Hence, we are
choosing not to use an elastic SQL pool.
Once the logical server is ready, you may have the option of choosing the compute and storage for the SQL
database. Here is where you can choose the underlying hardware for your SQL database; you may wish to explore
various hardware that Azure offers along with their pricing, which will be computed and provided to you instantly.
In addition, you have the option of choosing the compute tier, whether you want pre-allocated compute resources
or you wish to choose serverless, where you will be billed only on the resources you have consumed.
Here is how this may look:
Figure 17-6 Allocation of hardware resources for our database server
When it comes to choosing backup storage redundancy, you may wish to choose either locally redundant backup
storage or the zone-redundant option for your development purposes. You may leave other options as is and choose
to create the database.
Here is a sample Python application that would connect to a database, create a table, and insert rows into that
table. If you do not have the “pyodbc” package already, please issue the following command in your terminal:
pip install pyodbc
And here is the application; the insert statement and the dotenv loading are illustrative, so adapt them to your setup:
import os
import pyodbc
from dotenv import load_dotenv

# Load the connection details from the environment file
load_dotenv()
server = os.getenv("server")
database = os.getenv("database")
username = os.getenv("username")
password = os.getenv("password")
driver = os.getenv("driver")

connection_string = (
    f"DRIVER={driver};SERVER={server};DATABASE={database};"
    f"UID={username};PWD={password}"
)
try:
    cnxn = pyodbc.connect(connection_string)
    cursor = cnxn.cursor()
    print('Connection established')
    try:
        query1 = "CREATE TABLE SampleTable \
            (Id INT PRIMARY KEY, \
            Name NVARCHAR(100));"
        cursor.execute(query1)
        cnxn.commit()
        print("Table created")
        try:
            # example insert statement
            query2 = "INSERT INTO SampleTable (Id, Name) VALUES (1, 'Alice'), (2, 'Bob');"
            cursor.execute(query2)
            cnxn.commit()
            print("Rows inserted")
        except Exception as e:
            print(f"Unable to insert rows: {e}")
    except Exception as e:
        print(f"Unable to create the table: {e}")
    finally:
        cursor.close()
        cnxn.close()
except Exception as e:
    print(f"Unable to connect to the database: {e}")
Here is how the contents of the environment file may look; the driver value should match the ODBC driver installed on your machine:
server = "your-azure-database-server-instance"
database = "your-database"
username = "your-username"
password = "password"
driver = "{ODBC Driver 18 for SQL Server}"
Azure Cosmos DB
Azure Cosmos DB is a managed NoSQL database service provided by Microsoft. It is a versatile service offering
that can provide both SQL and NoSQL database development including relational data models, document stores,
graph databases, and vector databases as well. Cosmos DB offers multiple APIs, each one simulating a different
database engine. These include relational SQL, MongoDB for document databases, Cassandra DB for column-
oriented databases, Gremlin for graph databases, and Table for key–value databases. In effect, Cosmos DB is a
multimodel database service, where you use the API appropriate for your database application needs. In the case
of a database migration, you may choose the API corresponding to the relational or NoSQL
database that you intend to migrate.
You can avail Cosmos DB in two modes, namely, provisioned throughput and serverless. In
provisioned throughput, as the name suggests, you provision how much throughput you expect the database to
deliver, thereby defining the performance level you want for your database application. The unit of
measurement here is called a request unit. The other way of consuming Cosmos DB is serverless.
The serverless model charges you only for the requests it has received and
processed. As far as billing is concerned, provisioned throughput is billed on how many request
units have been provisioned (not consumed), whereas serverless is billed on how many request units are
consumed per hour of database operations.
Note A request unit is a way of measuring how you have consumed the service. Every operation that you
perform with Cosmos DB will consume a certain amount of compute, storage, and physical memory. The
request unit is calculated based on such consumption. For instance, a query that is sent to Cosmos DB will cost
a certain number of request units to retrieve the results. The exact number of request units required to obtain the
results may depend on the size of the table, query compilation time, and number of rows, to name a few
parameters.
Let us look at creating a Cosmos DB instance with the MongoDB API. Once you make an API selection, it
remains the same for that instance. Should you choose to work with another API, you may wish to create another
instance with that respective API. In our case, we choose the MongoDB API. If you are familiar with MongoDB,
then the usage will be very similar to Cosmos DB. Here is where you may choose the API within Cosmos DB:
Figure 17-7 Choosing the right API for your Azure Cosmos DB instance
Next, you will be prompted to choose the resource for your instance. There are two categories: one is based on
request units, while the other is based on virtual cores. Both the options are fully managed by Azure and scalable
as the workload increases. Let us choose the virtual core option, where we have the option to choose the
underlying hardware.
Here is how this might look:
Figure 17-8 Choosing the hardware resource for Cosmos DB
Once you choose the virtual core option, you will be directed to configure your Cosmos DB. Here, you will have
to specify the subscription or billing account and appropriate resource group. A resource group is something that
shares similar policies and permissions. In addition, choose an administrator username and password and name and
location of the cluster, among other details.
Figure 17-9 Creating a new Cosmos DB instance for a MongoDB cluster
You have the option of choosing the free tier for the underlying cluster. Let us look at configuring our underlying
cluster for this Cosmos DB instance. Under configuration, you will have various development- and production-
level hardware instances, where you can customize your compute, storage, and physical memory. In addition, you
also have the option to choose high availability and database shards as well.
Database sharding in MongoDB is a method of distributing data across multiple worker nodes
within a given cluster, and you can also replicate your data across nodes, which speeds up read/write
operations. For instance, if a node holding one shard can process “n” operations per second, then spreading the
workload across “m” shards (or “m” replicas of the same data) may allow you to approach “m × n” operations per second.
Here is how cluster configuration may look:
Figure 17-10 Hardware configuration for Cosmos DB for a MongoDB cluster
You also have the option of specifying the firewall rules and connectivity methods within the networking
option. Here is how that may look for you:
Figure 17-11 Firewall rules specification for Cosmos DB
You may leave the networking settings and other settings as is, and click Create. It might take several minutes for
Azure to provision the new cluster. Let us now look at connecting and uploading a JSON document to Cosmos DB. The
snippet below uses the azure-cosmos package, which targets the Cosmos DB NoSQL (Core) API; if you created a
MongoDB vCore cluster as above, you would instead connect with a standard MongoDB client and its connection string:
import json
from azure.cosmos import CosmosClient
import config  # local config.py holding a settings dictionary with your account details

url = config.settings['host']
key = config.settings['master_key']
database_name = config.settings['database_id']
container_name = config.settings['container_id']
path_to_json = "/Users/pk/Documents/mock_data.json"
def connect():
try:
client = CosmosClient(url,credential=key)
print("Established connection with the client")
try:
database = client.get_database_client(database_name)
print("Database obtained")
try:
container = database.get_container_client(container_name)
print("Container obtained")
except:
print(f"unable to select the container {container_name}")
except:
print(f"unable to select the database, {database_name}")
except:
print("Unable to connect to the database")
return container
def upsert(container):
with open(path_to_json, "rb") as file:
json_doc = json.load(file)
for js in json_doc:
upload_json_doc = container.upsert_item(js)
print(f"id uploaded is : {js['id']}")
def primaryfunc():
c_name = connect()
upsert(c_name)
primaryfunc()
Azure Synapse Analytics
Azure Synapse Analytics is Microsoft's fully managed data warehousing and analytics service. When creating a
Synapse workspace, we have to define the name and location of the workspace; in addition, we have to either specify an existing
data lake or create a new data lake as part of configuring the workspace. Once we have entered the basics, let us
specify a local user in the security settings.
Once Synapse Analytics is up and running, you can view it in your dashboard. Here is how that may look:
Figure 17-13 Dashboard of the Azure Synapse workspace
Azure Data Factory
Azure Data Factory is Azure's managed data integration service. Using Azure Data Factory, it is possible to compress the data while loading it from one system to another, thereby
consuming less bandwidth. To get started with Azure Data Factory, visit the link
https://adf.azure.com/en/ or find the same link by searching for Azure Data Factory in the console. A
new tab will open where you can specify a name for your new data factory and associate your subscription account
as well.
Here is how that may look:
Figure 17-15 Creating an Azure data factory page
When you submit the information, you may wait a few minutes for Azure to prepare the Azure Data Factory
service for you. Here is how it may look when it is completed:
On the left side, you will see five options, where Home is the main page, Author is your data integration IDE,
Monitor is where you can orchestrate and observe the job runs, Manage is where you can control the git integration
and other settings, and Learning Center is where you can obtain documentation and learning support for Azure
Data Factory.
Figure 17-17 Azure Data Factory console
In designing a data integration job using Azure Data Factory, you can author a data integration job, internally
called a pipeline, which is a group of activities logically arranged to be executed. Activities inside a pipeline are
what perform a single task. You have the option of performing various activities that would connect source and
sink data systems, transport data, and also perform various transformations.
To connect to a data system, you first have to define its connection information. To do so,
navigate to the Manage page and click “Linked Services” under the Connection tab.
Let us try to create or add Azure Blob Storage as a new connection in Linked Services. You can click a new
connection, search for Azure Blob on the data store, and click Continue. You may enter a name and description for
the connection, followed by choosing an Azure subscription under Account selection method; select your
appropriate subscription and container name.
You can test the connection and if it returns successful then you have successfully added Azure Blob Storage in
your Azure Data Factory. Here is how that may look for you:
Figure 17-18 Creating a new linked service in Azure Data Factory
To get acquainted with Azure Data Factory, let us look at copying data from Azure Blob Storage to an Azure SQL
database. We will obtain a comma-delimited data file from a source and upload it to Azure Blob Storage. We
would then use Azure Data Factory to initiate a copying pipeline from Blob Storage to an Azure SQL database. To
begin, navigate to the main page and click the Ingest option and choose Built-in copy task.
Here is how that may look:
We will specify the source data system in our next page, which is Azure Blob Storage. From the Source type drop-
down menu, choose Azure Blob Storage, and from the Connection drop-down, choose your blob storage. This
should be available if you have utilized the Linked Services option earlier. If not, the procedures are the same.
Once your blob storage is linked, you can browse the object storage to choose the appropriate flat file or a
folder that contains the flat file. Choose the “Recursive” option if you plan to upload multiple flat files from one
folder. The system would identify the type of flat file that is being selected and populate file format settings by
itself.
Here is how this may look:
Figure 17-20 Specifying the source data store in Azure Data Factory
Now, let us specify the destination data system. You may choose an Azure SQL database and specify the
connection, for which you may wish to create a new linked service, or choose the one that may have been already
created. Since this is a relational database, we do have the option of creating a table schema on the target database
and populating the flat file directly to the database. If not, we can choose to automatically create a data table and
populate this data table as well. Here is how that may look:
Figure 17-21 Specifying the destination data store in Azure Data Factory
Once this step is completed, let us name this data pipeline under Settings. Here is how this may look for you:
Figure 17-22 Adding the name and description of this data pipeline
Once all these steps are completed successfully, you will see a review screen that lists all the parameters of this
data pipeline. If there are any errors or you may find something that needs to be changed, you have the option of
editing those respective fields. If everything looks good, then you may click Next. Azure Data Factory will run the
validation for this entire data pipeline. Once the validation process finishes successfully, you may get the following
screen:
You may choose to click Finish, which will take you to the main page. On the dashboard of Azure Data Factory,
click the Monitor page, which will list the current data pipelines that are in your repository. You can click “Rerun”
for running the data pipeline. When the job completes successfully, you may get something like the following:
Figure 17-24 Monitoring and rerunning of data pipelines in Azure Data Factory
Azure Functions
Azure Functions is a serverless compute option that enables executing functions that perform certain tasks without
having to worry about the underlying hardware. Azure automatically provides updated servers for these functions
to execute and also scale up if required. These compute resources are allocated dynamically and autoscaled during
runtime. You would get charged only for the code you execute (duration of the execution and compute resources
used). You can use Azure Functions to build APIs, respond to an event happening on another service, etc.
The highlight of Azure Functions is the use of triggers and bindings. Using triggers and bindings, you can
create functions that interact with other services. A trigger is basically an event or a condition that initiates the
execution of a function. You can only have one trigger per function in Azure Functions. For instance, you can have
an Azure function that has an HTTP trigger that receives a request and triggers the function to provide a response;
you can have a service trigger that activates an Azure function when an event happens—new file uploaded in Blob
Storage or even a simple timer that sets the function execution on a specific schedule.
Bindings in Azure Functions provide a method to connect to various data sources; thereby, you can interact
(write and read data) with services. You can have more than one binding in a given Azure function. There are two
types of bindings that Azure Functions supports, namely, input and output binding. Input bindings may include
reading from data storage systems like Blob Storage, databases, file systems, etc., and output bindings could be
sending data to another service. Azure Functions actually supports multiple input and output bindings in a given
function.
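To make triggers and bindings concrete, here is a hedged sketch using the Python v2 programming model; the container names and the storage connection setting are placeholders, and we will set up the local development workflow for functions like this next:
import azure.functions as func

app = func.FunctionApp()

# Trigger: runs whenever a new blob lands in the "incoming" container.
# Output binding: writes a copy of its contents into the "processed" container.
@app.blob_trigger(arg_name="inblob", path="incoming/{name}",
                  connection="AzureWebJobsStorage")
@app.blob_output(arg_name="outblob", path="processed/{name}",
                 connection="AzureWebJobsStorage")
def copy_new_blob(inblob: func.InputStream, outblob: func.Out[str]):
    outblob.set(inblob.read().decode("utf-8"))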
To create a new Azure function, navigate to the Azure dashboard, and search for “Function app.” Here is how
that may look for you:
Once you have created the function app instance, you can create your Azure functions within it. Let us explore how to develop Azure functions locally using the Core Tools package. Depending on your operating system, you may wish to refer to the following instructions to install Core Tools.
Here is the link: https://learn.microsoft.com/en-us/azure/azure-
functions/functions-run-local?tabs=macos%2Cisolated-process%2Cnode-
v4%2Cpython-v2%2Chttp-trigger%2Ccontainer-apps&pivots=programming-language-
python
To check that the installation is complete, open a terminal and type "func"; it will print usage information for the Azure Functions Core Tools. To create a new Azure Functions project locally, run "func init" (followed by "func new" to scaffold a function from a template).
This sets up the entire project structure with the appropriate JSON configuration files. Here is how that may look for you:
The folder structure on the left is created by the Azure Functions Core Tools utility. Navigate to function_app.py and write your Azure function there. Here is a sample Azure function:
import logging

import azure.functions as func

app = func.FunctionApp()

@app.function_name('helloworld')
@app.route(route="route66")
def helloworld(req: func.HttpRequest) -> func.HttpResponse:
    # Log the incoming request and greet the caller by the optional
    # "username" query parameter.
    logging.info('received request for helloworld function')
    user = req.params.get('username')
    if user:
        message = f"Hello, {user}"
    else:
        message = "welcome to Azure functions"
    return func.HttpResponse(
        message,
        status_code=200
    )
Once you have the Azure function ready, you may run the function using “func start” from the command line.
Here is how that looks:
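With "func start" running, Core Tools serves the function locally, by default on port 7071 with an /api route prefix, so you can exercise the sample with a plain HTTP request. Here is a small illustration using the requests library (the username value is arbitrary):
import requests

# Call the locally hosted helloworld function started with `func start`.
resp = requests.get(
    "http://localhost:7071/api/route66",
    params={"username": "Ada"},
)
print(resp.status_code, resp.text)  # expected: 200 Hello, Ada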
Azure Machine Learning Studio
The Azure Machine Learning workspace provides a centralized view of all the resources that Azure Machine Learning offers. When you provision a new Azure Machine Learning workspace, Azure automatically creates a set of supporting services dedicated to it: a storage account, a key vault, a container registry, and Application Insights.
The storage account is a newly created storage space for Azure Machine Learning data assets, logs, and reports, and it can also hold blobs. The key vault is a secrets management resource used to securely store sensitive information that the machine learning workspace relies on. Application Insights provides operational and diagnostic telemetry while you monitor machine learning projects. The container registry stores the Docker images used for machine learning models, development environments, and so on.
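As an illustration of how you can reach the workspace from code, here is a minimal sketch that connects to an existing workspace with the Azure Machine Learning Python SDK v2; the subscription, resource group, and workspace names are placeholders to replace with your own:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Authenticate with the default Azure credential chain and point the
# client at an existing workspace (placeholder identifiers).
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

ws = ml_client.workspaces.get("<workspace-name>")
print(ws.name, ws.location)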
Once the service is created and available, here is how this may look for you:
Once you are on this page, click Create. You can specify the name and description of the dataset and choose either
File or Tabular from the “Type” drop-down.
Here is how this may look:
Azure ML Job
An Azure ML job is essentially a set of machine learning tasks that runs on an underlying compute cluster. You can build a training job, which consists of a set of tasks that trains a machine learning model, or a sweep job, which runs a model across multiple hyperparameter combinations and selects the one that best minimizes the cost function. These jobs also generate log entries, and Azure ML Studio tracks the metrics of each job.
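Here is a sketch of what submitting such a training job can look like with the Azure Machine Learning Python SDK v2; the source folder, script name, curated environment, and compute target below are illustrative placeholders rather than values from this chapter, and ml_client is the client from the earlier sketch:
from azure.ai.ml import command

# Package a training script as a command job and hand it to the workspace.
job = command(
    code="./src",                       # folder containing train.py
    command="python train.py",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="cpu-cluster",
    display_name="diabetes-training",
)

returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)          # follow the run in Azure ML Studio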
To get started with developing machine learning models, you need a machine learning workspace and a
compute instance. In previous sections, we have already created a workspace. To create a new compute instance,
you may either choose the “Compute” option under “Manage” or simply click “Add compute instance” on the
dashboard.
Here is how they may look for you:
You can specify a name for the compute instance and choose whether you need a CPU-based or a GPU-based instance. The right choice depends on the nature of the machine learning task you intend to perform.
GPUs are built for massive parallelism and high-performance computing, so if your machine learning workload leverages parallelism (for example, through Dask or Ray), a GPU instance is a good choice. Workloads such as training a deep learning model with many layers or a computer vision (image processing) task are especially well suited to GPUs. Applications such as text modeling, large models, or recommender engines, where the underlying tasks benefit from parallelism, can be trained on both CPU and GPU architectures. For tabular data, CPU-based computation is often sufficient.
Upon choosing the virtual machine, you may leave other options as is and proceed to create the compute
instance. Here is how this may look for you:
Once you have created the compute instance, you can start to utilize the compute for developing your machine
learning model. You may choose the Jupyter Notebook option or the web-based VS Code editor as well. Here is
how that may look for you:
Let us build a classifier model that classifies whether a person is diabetic or not based on the given set of
observations. Depending upon your convenience, you may either choose to work with a Jupyter notebook or
access Visual Studio Code. The goal for us is to train the model using Azure ML Studio and deploy the model
using Azure Functions.
Here is the model code for you:
import pandas as pd
import numpy as np
import pickle
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.tsv", sep="\t")
#print("Top records of the dataset")
#print(df.head())
#print("Analytics on dataset")
#print(df.describe())

# Basic cleanup: remove duplicate rows and fill missing values
df = df.drop_duplicates()
df = df.fillna(0)

# Split features and target ("Y" is the outcome column in diabetes.tsv)
x = df.drop(["Y"], axis=1)
y = df["Y"]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Train the model; RandomForestRegressor models the continuous target Y.
# Swap in RandomForestClassifier if your label is a binary diabetic flag.
model = RandomForestRegressor()
model.fit(X_train, y_train)

# Persist the trained model for deployment
file = 'model.pkl'
pickle.dump(model, open(file, 'wb'))
You can use this code as a starter, build your own features, and try out other classifiers as well. Once you have
the pickle file exported, we can proceed to create a new Azure function to deploy the model. If you have followed
along with the “Azure Functions” section earlier, then navigate to the terminal and create a new Azure function
named diabetes.
Here is how that may look for you:
We are going to deploy the model as a serverless API using Azure Functions, leveraging an HTTP trigger. Change directory into the diabetes folder and open the function_app.py file. So far, this function has only been run locally; let us now deploy it to the Azure function app.
Here is the key scoring logic for the deployed model:
app = func.FunctionApp()

# Inside the HTTP-triggered scoring function:
data = req.get_json()["data"]                    # feature rows posted by the caller
y_pred = model.predict(data)[0]                  # prediction from the unpickled model
classifier = {0: "not diabetic", 1: "diabetic"}  # label lookup (when using a classifier)
pred_class = classifier[y_pred]
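Since the snippet above shows only the scoring lines, here is a fuller sketch of the complete HTTP-triggered function. It assumes that model.pkl sits next to function_app.py and that callers POST a JSON body of the form {"data": [[...feature values...]]}; the route name and the response shape are illustrative:
import json
import os
import pickle

import azure.functions as func

app = func.FunctionApp()

# Load the pickled model once, relative to this file, at cold start.
_model_path = os.path.join(os.path.dirname(__file__), "model.pkl")
with open(_model_path, "rb") as f:
    model = pickle.load(f)

@app.function_name('diabetes')
@app.route(route="diabetes", auth_level=func.AuthLevel.ANONYMOUS)
def diabetes(req: func.HttpRequest) -> func.HttpResponse:
    data = req.get_json()["data"]        # list of feature rows
    y_pred = model.predict(data)[0]      # score the first row
    body = json.dumps({"prediction": float(y_pred)})
    return func.HttpResponse(body, mimetype="application/json", status_code=200)
From here, running "func azure functionapp publish <function-app-name>" from the project folder pushes the code to the function app created earlier.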
Conclusion
In this chapter, we explored a wide range of services, from object storage to advanced analytics, NoSQL solutions, data integration with Azure Data Factory, Azure Functions, and Azure ML Studio. Azure Functions demonstrated how serverless computing can be used to execute code on demand. From creating object storage solutions to deploying machine learning models, we examined these services in detail along with working code. While it is difficult to cover the entire range of cloud computing offerings in a single chapter, I hope the ideas, services, and practical implementations we explored will serve as a solid foundation for your cloud computing journey.
Index
A
Advanced data wrangling operations, CuDF
apply() method
crosstab
cut function
factorize function
group-by function
transform function
window function
ASGI
See Asynchronous Server Gateway Interface (ASGI)
Airflow user
Alpha
Alternate hypothesis
Amazon
Amazon Athena
Amazon Glue
Amazon RDS
Amazon Redshift
Amazon SageMaker
AWS SageMaker Studio
create and configure, new Jupyter space
dashboard page, AWS Lambda
deploy API
home page
IAM role
lambda function
linear methods
machine learning toolkits
POST method
REST API endpoint
SageMaker Domain
set environment variables
Amazon’s DynamoDB
Amazon S3 Glacier
Amazon S3 Standard
Amazon Web Services (AWS)
Amazon SageMaker
See Amazon SageMaker
data systems
Amazon Athena
Amazon Glue
Amazon RDS
Amazon Redshift
Lake Formation
See AWS Lake Formation
Apache Airflow
architecture
configuration file
database
example
executor
scheduler
web server
background process
DAGs
See Directed acyclic graph (DAGs)
data storage and processing systems
definition
GUI
home screen
initialization
initialize the backend database
pip installation
scheduler
set up
user credentials, text file
workflow
workflow management and orchestration platform
Apache Arrow
Apache Cassandra
Apache Hadoop
Apache Kafka
architecture
admin client
brokers
consumers
events
Kafka Connect
Kafka Streams
ksqlDB
partitions
producers
replication
schema registry
topics
best practices
callback function
config.py file
consumer.py
consumers
dashboard
development environment
distributions
elastic
fault-tolerant
functionalities
Kafka Connect
message brokers
messaging queue systems architecture
messaging systems architectures
print
produce method
producer.py file
producers
Protobuf serializer
compiler
consumer/subscriber portion
guidelines
library
optional
ParseFromString() method
pb2.py file
producer script
protoc
.proto file
Python file contents
SerializeToString() method
student data
student model
publisher–subscriber models
publisher–subscriber program
pub–sub messaging system
Python virtual environment
scalable
schema registry
consumer configuration
consumer.py file
JSONSerializer
publisher.py
student data
secure
servers and clients
setup/development
API key/secrets
appropriate region
cluster dashboard
cluster type
config.py file
Confluent Kafka environment
topic
uses
Apache Spark
APIs
See Application programming interfaces (APIs)
Application programming interfaces (APIs)
development process
endpoints
GraphQL
internal APIs
OpenWeather API
partner APIs
REST
SOAP
typical process
Webhooks
apply() method
Artifacts
artifact key
data quality reports
debugging
documentation
link
markdown
persisted outputs
Prefect Cloud
Prefect Server
share information
table
track and monitor, objects
Async
Asynchronous Server Gateway Interface (ASGI)
Automated machine learning (AutoML)
AutoML
See Automated machine learning (AutoML)
Avro
AWS
See Amazon Web Services (AWS)
AWS Athena
AWS console
access portal for respective user
assign users/groups
assign users to organizations
AWS organizational unit
configuration, new service control policy
create new organization
create organizational unit
enable service control policies
home page
IAM Identity Center
installation, AWS CLI
organizational units
permission sets
predefined permission set
service control policy
sign up, new account
AWS Glue
AWS Glue Data Catalog
AWS IAM Identity Center
AWS Lake Formation
add permissions, IAM role
AWS Athena query editor
create database
create IAM role
create new bucket
create new crawler
data lake service
grant permission, newly created role
new crawler initialization in AWS Glue
query data lake using AWS Athena
query results
retrieval of results, AWS Athena
S3 buckets
trusted entity type
AWS Lambda
AWS Organizations
AWS RDS
AWS S3 (Simple Storage Service)
AWS region
Bucket name
Bucket type
creation, new bucket
global bucket system
home page
stores data
uploading objects (files)
AWS SageMaker Studio
Azure functions
binding
core tools
create
diabetes
execution
function app
sample
triggers
URLs
B
Big data
Apache Spark
Hadoop
BigQuery
Bigtable
Blob
Blocks
Bootstrapping
Branching
BranchPythonOperator
C
Caching
cancel() method
Capital expenditure
CD
See Continuous development (CD)
Central processing unit (CPU)
@check_input decorator
@check_output decorator
Checkpoints
Checks
Chi-square test
Chunking
Chunk loading
Chunks
CI
See Continuous integration (CI)
Clients
Client–server computing
Client–server interaction
client.submit function
Cloud computing
advantages
architecture concepts
caching
cloud computing vendors
disaster recovery
elasticity
fault tolerance
high availability
scalability
deployment models
community cloud
government cloud
hybrid cloud
multi-cloud
private cloud
public cloud
GCP
See Google Cloud Platform (GCP)
networking concepts and terminologies
See Networking concepts
security
as web services
Cloud computing services
CI
Cloud storage
compliance
Compute
containerization
databases
data catalog
data governance
data integration services
data lakes
data lifecycle management
data protection
data warehouses
identity and access management
machine learning
Not Only SQL (NoSQL)
See NoSQL database systems
object storage
real-time/streaming processing service
serverless functions
Cloud computing vendors
Cloud Functions
Cloud infrastructure
Cloud service models
categories
infrastructure as a service
platform as a service
software as a service
Cloud SQL
Cloud storage
Cloud vendors
Colab notebooks
Column-oriented databases
Combining multiple CuDF objects
inner join
left anti join
left join
left semi join
outer join
Combining multiple Pandas objects
cross join
Cross join
full outer joins
inner join
left join
merge() method
Pandas library
right join
Right join
Combining multiple Polars objects
cross join
inner join
left join
outer join
semi join
Comma-separated values (CSV)
Common table expression
Community cloud
compare() method
Compliance
Compute
compute() function
compute() method
Compute Unified Device Architecture (CUDA)
Concurrency
Concurrent processing
ConcurrentTaskRunner
Confluent’s Kafka website
Constrained types
Containerization
Container registry
Content Delivery Network
Context
Continuous development (CD)
Continuous integration (CI)
CPU
See Central processing unit (CPU)
CPU-based data processing
CPU environments
CPU vs. GPU
create_link_artifact() method
create_markdown_artifact() function
Cron daemon process
cronexample.sh
Cron job scheduler
applications
cron alternatives
database backup
data processing
email notification
concepts
cron job usage
cron logging
crontab file
in 1970s
shell script
Unix operating system
Cron logging
crontab
crontab file
crontab script
Cross join
crosstab() function
Cross tabulation
Cross validation
Cryptographic signing and verification
CSV
See Comma-separated values (CSV)
CUDA
See Compute Unified Device Architecture (CUDA)
CuDF
advanced operations
See Advanced data wrangling operations, CuDF
basic operations
column filtering
dataset
dataset sorting
row filtering
combining multiple objects
See Combining multiple CuDF objects
cudf.pandas
data formats
description
file IO operations
CSV
JSON
Parquet
GPU-accelerated data manipulation library
installation
vs. Pandas
testing installation
cut function
D
DaaS
See Data as a service (DaaS)
DAGs
See Directed acyclic graph (DAGs)
Dask
architecture
client
core library
initialization, Dask client
scheduler
task graphs
workers
banks
data structures
dask array
dask bags
Dask data frames
dask delayed
dask futures
features
chunking
Dask-CuDF
graph
lazy evaluation
partitioning
Pickling
serialization
tasks
GPUs
healthcare
installation
national laboratories and organizations
optimize dask computations
client.submit function
data locality
priority work
Python garbage collection process
scheduler
worker
work stealing
Python library
set up
supports
Dask-CuDF
Dask data frames
Dask distributed computing
Dask-ML
data preprocessing techniques
cross validation
MinMaxScaler()
one hot encoding
RobustScaler()
hyperparameter tuning
compute constraint
grid search
incremental search
memory constraint
random search
installation
ML libraries
Keras
PyTorch
scikit-learn
TensorFlow
XGBoost
setup
SimpleImputer
statistical imputation
DaskTaskRunner
Data accuracy
Data analysis
Data as a service (DaaS)
Data assets
Database backup
Databases
Data catalog
Data cleaning
Data coercion
Data completeness
Data context
Data documentation
Data engineering tasks
Data exploration
Data extraction and loading
Avro
CSV
feather
features
HDF5
ORC
Parquet
pickle
Data formats
Data frames
DataFrameSchema
Data governance
Data integration
Data lakes
Data lifecycle management
Data locality
Data manipulation
Data munging
Data pipelines
Data preprocessing techniques
cross validation
MinMaxScaler()
one hot encoding
RobustScaler()
Data Processing
Dataproc service
Data products
Data projects
Data protection
Data quality reports
Data range
Data reshaping
crosstab()
melt() method
pivot() function
pivot table
stack() method
unstack()
Data sources
Data splitting
Data transformation
aggregations and group-by
basic operations
arithmetic operations
df.select() method
df.with_columns() method
expressions
with_columns() method
context
filter
group-by context
selection
dataset
machine learning pipelines
set of products
String operations
Data uniqueness
Data validation
advantages
definition
disadvantages
machine learning models
Pandera
See Pandera
principles
See Principles, data validation
Pydantic library
See Pydantic
RDBMSs
specifications and rules
Data validation workflow
copay_paid value
data context
data pipeline
data source
expectation suite
get_context() method
JSON document
validator
Data warehouses
Data wrangling
data structures
data frame
series
spreadsheets
indexing
Pandas
Debian-based operating system
Debugging
Decorators
Delta Lake
Deployment
df.select() method
df.with_columns() method
Directed acyclic graph (DAGs)
create variables
DAG runs
example
function
list
macros
nodes
operator
params
Python context manager
sensor
task flow
tasks
templates
variables
view
workflow control
branching
BranchPythonOperator
ShortCircuitOperator
triggers
typical scenario
workflow management and orchestration system
Xcom
Dict
Directed acyclic graph
Disaster recovery
Distributed computing
Django
DNS
See Domain Name System (DNS)
Docker
Documentation
Document databases
Domain Name System (DNS)
E
Eager evaluation
Edges
Elasticity
Email notification
Encryption
Endpoints
Engineering data pipelines
Enumerate functions
ETL
See Extract, transform, and load (ETL)
Event processing
Executor
Expectations
Expectation store
Expressions
Extract, transform, and load (ETL)
F
Factorize function
factorize() method
FastAPI
advantages
APIRouter objects
browser
core concepts
create
Curl command
database integration
commands
create database
database user
db1.py
MySQL
mysql prompt
table
dependency injection
documentation link
executing
filters
GET request documentation
GET request illustration
get_student() function
HTTP error
middleware
ML API endpoint
middleware
pickle file model
POST request
Open API standards documentation
pass variables
prediction
Pydantic integration
query parameters
response_model attribute
RESTful data API
database connection, SQLAlchemy
data validation class
define class
new student/transaction
setup/installation
students and course data
Fault tolerance
Feature engineering
Field function
Filter context
Firewalls
Flask
Flow runs
Flows
G
GCP
See Google Cloud Platform (GCP)
Generator
get_student() function
GIL
See Global interpreter lock (GIL)
Git
branching
cloning
code database
features
forking
GitHub
pull request
Python code
repository
and Secure Shell
SSH daemon process
tracking features
GitHub account’s settings
GitHub server
.gitignore file
Global interpreter lock (GIL)
Google BigQuery
Google Bigtable
Google Cloud
Google Cloud CLI
Google Cloud console page
Google Cloud Platform (GCP)
and AWS
cloud computing services
Cloud SQL
Cloud Storage
compute engine
See Google Compute Engine
Google BigQuery
Google Bigtable
Google Dataproc
Google Vertex AI Workbench service
new Google account/log
organizational node
Vertex AI
See Google’s Vertex AI
Google Cloud SDK
Google Cloud SQL
Google Colab
Google Colab notebooks
Google Compute Engine
accelerator optimized
access VM instance through SSH
API enabled
compute optimized
create virtual machine
enable Google Compute Engine API
general-purpose engine
Linux virtual machine
memory optimized
provision new virtual machine
storage optimized
types
VM instances list
Google Dataproc
Google Dataproc’s Spark cluster
Google IAM
Google’s Cloud Storage
traffic split
Google’s Vertex AI
AutoML feature
Colab notebooks
deployment dashboard
deploy to endpoint
endpoint, model
import new model
machine learning model
machine learning platform
model registry
online prediction
traffic split
and Vertex AI Workbench
Government cloud
GPU-accelerated environments
GPU programming
CPU instructions
Kernels
memory management
Gradient-boosted trees
Graph databases
Graphical user interface (GUI)
GraphQL
See Graph query language (GraphQL)
Graph query language (GraphQL)
Great Expectations
checkpoints
components
data documentation
data validation libraries
definition
Expectations store
functionality and environment
setup and installation
CLI
data source
data sources
project
project structure creation
Python packages
relational databases
SQLAlchemy package
stores
virtual environments
writing expectations
Grid search
Group-by context
Group-by function
groupby() method
GUI
See Graphical user interface (GUI)
H
Hadoop
Hadoop Distributed File System (HDFS)
Hadoop ecosystem
HDFS
See Hadoop Distributed File System (HDFS)
Hierarchical Data Format
High availability
Hive tool
Horizontal scaling
Host
h5py
HTTP
See HyperText Transfer Protocol (HTTP)
HTTP methods
HTTP status code
Hybrid cloud
Hyperparameters
Hyperparameter tuning
grid search
incremental search
random search
HyperText Transfer Protocol (HTTP)
Hypervisor software
Hypothesis testing
I
IAM Identity Center
Identity and access management
Incremental search
Indexing
loc and .iloc methods
multi-indexing
in Pandas data frame
parameter
query
rows and columns
time delta
Infrastructure as a service
Inner join
Internet
Internet protocol (IP) address
isna() method
J
JavaScript Object Notation (JSON)
Jinja
“joblib”-based algorithms
JSON
See JavaScript Object Notation (JSON)
JSON file formats
JSON schemas
K
Keras
Kernels
Key hashing
Key–value stores
Kubernetes cluster
L
Lambdas
Lazy evaluation
Lazy frame
Left anti join
Left join
Left semi join
Link artifact
Linux distribution
List comprehension
Logging
M
Machine learning (ML)
classifier model
compute instance
adding
create
illustration
naming
data assets
data engineering tasks
diabetes
function_app.py file
GPUs
home page
jobs
model code
model deployment
storage account
workspace
Machine learning data pipeline workflow
data clean
data exploration
data integration
data source
data splitting
data wrangling
deployment
feature engineering
feature selection
final test
hyperparameter tuning
model evaluation
model selection
monitoring
retraining
training
Macros
Markdown artifacts
mean() function
melt() method
Memory management
Merge() method
Microsoft
Microsoft Azure
authentication
blob storage
container
create
deployment
sessions
uploading data
Cosmos DB
APIs
categories
choosing API
cluster configuration
configuration
database sharding
firewall rules
hardware resources
instance
JSON document
MongoDB cluster
provisioned throughput
resource group
serverless model
units
data factory
blob storage
console
copy data tool
creation
data integration
data pipeline
data system connection
destination data system
home page
linked service
monitoring/rerunning
source data system
validation process
management group
ML
See Machine learning (ML)
resource group
resources
SQL
backup storage redundancy
database
database server
data warehouse
eDTU
elastic pool
hardware resources
logical server
Python application
SaaS
server
serverless pools
storage classes
subscriptions
synapse analytics
Microsoft’s Cosmos DB
MinMaxScaler()
Missing values
data cleaning and transportation
data entry
methods and treatments
isna() method
notna()
NA
NaN
NaT
None
ML
See Machine learning (ML)
Multi-cloud
Multi-indexing
arrays
columns
data frame
Multiprocessing
Multi-thread scheduler
myVariable
N
NA
See Not available (NA)
NAN
See Not a number (NAN)
NaT
See Not a time (NaT)
ndJSON
Networking concepts
DNS
Firewalls
IP address
Ports
virtualization
Virtual Private Cloud
Nodes
NoSQL database systems
column-oriented databases
document databases
graph databases
key–value stores
Schema on Write vs. Schema on Read
time series databases
vector databases
NoSQL data store
Not a number (NAN)
Not a time (NaT)
Not available (NA)
Notifications
NumPy arrays
NumPy’s data type object
NVIDIA drivers
NVIDIA GPU
NVIDIA offerings
O
Object relational mapping (ORM)
Alembic
SQLAlchemy
engine
query API
session
Object storage
Observability
One hot encoding
OpenWeather API
Operational expenditure
Optimized Row Columnar (ORC)
ORC
See Optimized Row Columnar (ORC)
Organizations
ORM
See Object relational mapping (ORM)
Outer join
P
Pandas
and NumPy
objects
Pandera
Checks
data coercion
data frame schema
declarative schema definition
definition
installation
lazy validation
statistical validation
Pandera decorators
@check_input
@check_output
DataFrameSchema
data validators
decoratortest
hello_world() function
validation condition
validation parameters
Parallelism
Parallel processing
concepts
core
GIL
history
identify CPU Cores
large programming task into several smaller tasks
multiprocessing library
process
Python
thread
Params
Parquet
Partitioning
Path parameters
Peer-to-peer computing
Pickle files
Pickling
Pig Latin
Pivot_table() method
Platform as a service
Polars
advanced methods
dataset
missing values identification
pivot function
unique values identification
combining objects
See Combining multiple Polars objects
data extraction and loading
CSV file
JSON
Parquet
data structures
data transformation
See Data transformation, Polars
definition
lazy evaluation and eager evaluation
lazy vs. eager evaluation
multi-threading and parallelization
objects
Python data analysis ecosystem
Rust
syntax
Polars CLI
SQL query
structure
Polars/SQL interaction
copay_paid
data type conversions
random dataset
random_health_data
SQL context object
SQL queries
Ports
Postgres database syntax
POST method
Preceding data frame
Prefect
async
backend Prefect database
caching
development
See also Prefect development
future
installation
logging
notifications
observability
open source workflow orchestration system
Prefect Cloud
Prefect Core
Python functions
retries
scheduling
second-generation
secrets management system
server
set up
shallow
user interface, web browser
Prefect development
artifacts
See Artifacts
blocks
flow run output
flow runs
flows
interface
log trace, prefect flow
persisting results
results
state change hooks
states
task runners
tasks
variables
Principles, data validation
data accuracy
data completeness
data consistency
data format
data range
data uniqueness
referential integrity
Private cloud
Protocol Buffers (Protobuf)
Public cloud
Pull requests
Pydantic
applications
constrained type
data schemas declaring
definition
field function
installation
JSON schemas
Pydantic models
type annotations
validation and serialization logic
validators
Pydantic V2
Python
Python 3
Python codes
Python decorators
Python library
Python object
Python programming language
concepts
f string
function arguments
args parameter
kwargs
functions
preceding code
Python script
Python type annotations
PyTorch
Q
Query parameters
R
Random module
choice function
getrandbits() method
randint()
range
sample method
seed() method
shuffle() method
Real-time data pipelines
Redshift
Referential integrity
Relational database service (RDS)
See Amazon RDS
Remote computing
Representational state transfer (REST)
REST
See Representational state transfer (REST)
RESTful services
RobustScaler()
Rust
S
Scalability
Scheduler
Scheduling
SchemaError
scikit-learn
Selection context
Semi join
Sensors
Sequential execution
SequentialTaskRunner
Serialization
Series
Serverless computing
ShortCircuitOperator
Similarity search
SimpleImputer
Simple object access protocol (SOAP)
Single sign-on (SSO)
Single-thread scheduler
Skorch library
Software as a Service (SaaS)
SOAP
See Simple object access protocol (SOAP)
Spark
“spark.jars” file
SQL
See Structured query language (SQL)
SQLAlchemy
SQL Server
SSH connection
SSO
See Single sign-on (SSO)
Stacking
Stack() method
Starlette
State change hooks
State object
Streaming data
Stream processing
Kafka Streams
kSQL
ksqlDB
API key and secret
apt repository
auto.offset.reset
CLI instructions
command line interface
Confluent Kafka
Confluent platform
creation
environment variable and path
help command
insert data
Java
naming/sizing
prerequisites
public key
query
Stateful processing
Stateless processing
topics
create schema
creation
naming/partitions
Structured query language (SQL)
queries
self-joins
tables
temporary tables
view
materialized
standard
window functions
sum() method
Systemd
T
Table artifact
Task graphs
Task runners
Tasks
T.compute()
Templates
TensorFlow
Time delta indexes
single value
subset
Time series databases
“to_hdf5” method
transform() function
Type annotations refresher
Type hinting
Typing module
U
Uniform resource identifier (URI)
Uniform resource locator (URL)
unstack() method
Unsupervised learning
URI
See Uniform resource identifier (URI)
URL
See Uniform resource locator (URL)
User-defined function
V
Validate method
Validators
Variables
Vector databases
Vertex AI Workbench
Vertical scaling
Virtualization
Virtual Private Cloud
Virtual Private Network
W
wait() function
Webhooks
Web server
Window function
Workflow configuration
example
organization
parameters
product/solution
Workflow orchestration
automations
centralized administration and management of process
data pipelines
define workflow
error handling and routing
ETL data pipeline workflow
example
integration with external systems
scheduling
tasks/activities
utilization, resources
workflow configuration
World Wide Web
X, Y, Z
Xcom
XGBoost