Data Analysis With Python & Pandas
This is a project-based course for students looking for a practical, hands-on, and highly engaging approach to learning data analysis with Python using the Pandas library.
Quizzes & Assignments to test and reinforce key concepts, with step-by-step solutions
Interactive demos to keep you engaged and apply your skills throughout the course
1. Intro to Pandas & NumPy – Introduce Pandas & NumPy, two critical Python libraries that help structure data in arrays & DataFrames and contain built-in functions for data analysis
2. Pandas Series – Introduce Pandas Series, the Python equivalent of a column of data, and cover their basic properties, creation, manipulation, and useful functions for analysis
3. Pandas DataFrames – Introduce Pandas DataFrames, the Python equivalent of an Excel or SQL table, which we'll use to store and analyze data
4. Aggregating & Reshaping – Aggregate & reshape data in DataFrames by grouping columns, performing aggregation calculations, and pivoting & unpivoting data
5. Data Visualization – Learn the basics of data visualization in Pandas, and use the plot method to create & customize line charts, bar charts, scatterplots, and histograms
6. Mid-Course Project – Put your skills to the test on a new dataset by analyzing key metrics for a new retailer and share insights into its potential acquisition by Maven MegaMart
7. Time Series – Learn how to work with the datetime data type in Pandas to extract date components, group by dates, and perform calculations like moving averages
8. Importing & Exporting Data – Read in data from flat files and apply processing steps during import, create DataFrames by querying SQL tables, and write data back out to its source
9. Combining DataFrames – Combine multiple DataFrames by joining data from related fields to add new columns, and appending data with the same fields to add new rows
10. Final Project – Put the finishing touches on your project by joining a new table, performing time series analysis, optimizing your workflow, and writing out your results
THE SITUATION: You've just been hired as a Data Analyst for Maven MegaMart, a multinational corporation that operates a chain of retail and grocery stores. They recently received a sample of data from a new retailer they're looking to acquire, and they need you to identify and deliver key insights about their sales history.
1) Once inside the Jupyter interface, create a folder to store your notebooks for the course
NOTE: You can rename your folder by clicking “Rename” in the top left corner
2) Open your new coursework folder and launch your first Jupyter notebook!
NOTE: You can rename your notebook by clicking on the title at the top of the screen
NOTE: When you launch a Jupyter notebook, a terminal window may pop up as
well; this is called a notebook server, and it powers the notebook interface
In this section we’ll introduce Pandas & NumPy, two critical Python libraries that help
structure data in arrays & DataFrames and contain built-in functions for data analysis
Pandas is Python’s most widely used library for data analysis, and contains
functions for accessing, aggregating, joining, and analyzing data
Its data structure, the DataFrame, is analogous to SQL tables or Excel worksheets
NumPy is an open-source library that is the universal standard for working with
numerical data in Python, and forms the foundation of other libraries like Pandas
Pandas DataFrames are built on NumPy arrays and can leverage NumPy functions
NumPy arrays are fixed-size containers of items that are more efficient than
Python lists or tuples for data processing
• They only store a single data type (mixed data types are stored as a string)
• They can be one dimensional or multi-dimensional
• Array elements can be modified, but the array size cannot change
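As a quick illustration of these properties (a minimal sketch; the values are arbitrary):

    import numpy as np

    arr = np.array([1, 2, 3, 4])           # one-dimensional integer array
    print(arr.dtype, arr.ndim, arr.shape)  # int64 1 (4,)

    mixed = np.array([1, "two", 3.0])      # mixed data types are coerced to strings
    print(mixed.dtype)                     # <U32 (a NumPy string type)

    arr[0] = 10                            # elements can be modified...
    # ...but there is no append; an array's size cannot change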
NEW MESSAGE (June 16, 2022)
Can you convert a Python list into a NumPy array and help me get familiar with their properties?
Thanks!
numpy_assignments.ipynb
• np.ones((rows, cols), dtype) – creates an array of ones of a given size, as float by default
• np.zeros((rows, cols), dtype) – creates an array of zeros of a given size, as float by default
• np.arange(start, stop, step) – creates an array of integers with given start & stop values and a step size (only stop is required; start is 0 and step is 1 by default, and stop is not inclusive)
• np.linspace(start, stop, n) – creates an array of n floats between given start & stop values, separated by a consistent step size (stop is inclusive)
• array.reshape(rows, cols) – reshapes an array to the given dimensions
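A short sketch of each creation function (output shown in comments):

    import numpy as np

    np.ones((2, 3))              # 2x3 array of 1.0 (float by default)
    np.zeros((2, 3), dtype=int)  # 2x3 array of integer zeros
    np.arange(1, 10, 2)          # array([1, 3, 5, 7, 9]) -- stop is not inclusive
    np.linspace(0, 1, 5)         # array([0., 0.25, 0.5, 0.75, 1.]) -- stop is inclusive
    np.arange(12).reshape(3, 4)  # 12 elements reshaped into 3 rows x 4 columns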
You can create random number arrays from a variety of distributions using
NumPy functions and methods (great for sampling and simulation!)
PRO TIP: Even though it’s optional, make sure to set a seed when generating random numbers to
ensure you and others can recreate the work you’ve done (the value for the seed is less important)
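A minimal sketch using NumPy's generator interface (np.random.default_rng is one way to set a seed; the course notebooks may use a different random function, but the idea is the same):

    import numpy as np

    rng = np.random.default_rng(616)  # the seed value itself doesn't matter

    rng.random(5)                          # 5 floats from a uniform [0, 1) distribution
    rng.integers(0, 10, size=5)            # 5 random integers from 0 to 9
    rng.normal(loc=100, scale=15, size=5)  # 5 draws from a normal distribution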
NEW MESSAGE (June 17, 2022) – numpy_assignments.ipynb
NEW MESSAGE (June 17, 2022) – numpy_assignments.ipynb / section01_NumPy.ipynb
NEW MESSAGE (June 18, 2022)
Ok, so now that we've gotten the basics down, we can start using NumPy for our first tasks. As part of a promotion, we want to apply a random discount to surprise our customers and generate social media buzz.
numpy_assignments.ipynb
The where() NumPy function performs a logical test and returns a given value if the test is True, or another if the test is False:

np.where(logical test, value if True, value if False)
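A sketch of the random discount idea from the message above (the array values and the 20-dollar threshold are assumptions for illustration):

    import numpy as np

    prices = np.array([5.99, 22.50, 14.00, 31.25])

    # 10% off prices above 20; everything else is unchanged
    sale_prices = np.where(prices > 20, prices * 0.9, prices)
    # array([ 5.99 , 20.25 , 14.   , 28.125])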
NEW MESSAGE (June 19, 2022)
Hey there,
Once you've done that, modify your logic to force cola into the list. Call this array 'fancy_feast_special'.
Thanks!
numpy_assignments.ipynb
Array aggregation methods let you calculate metrics like sum, mean, and max:
• array.sum() – returns the sum of all values in an array (array.sum(axis=0) aggregates across rows)
• array.mean() – returns the average of the values in an array
• array.max() – returns the largest value in an array
• array.min() – returns the smallest value in an array

Array functions let you perform other aggregations, like np.median() and np.percentile(); you can also return a unique list of values with np.unique() or the square root for each number with np.sqrt().
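A compact sketch of the aggregation methods and functions above:

    import numpy as np

    sales = np.array([[5, 2, 9],
                      [4, 7, 1]])

    sales.sum()               # 28 -- every value in the array
    sales.sum(axis=0)         # array([ 9,  9, 10]) -- aggregates across rows
    sales.mean()              # 4.666...
    np.median(sales)          # 4.5
    np.percentile(sales, 90)  # 8.0
    np.unique(sales)          # array([1, 2, 4, 5, 7, 9]) -- sorted unique values
    np.sqrt(sales)            # element-wise square root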
NEW MESSAGE (June 19, 2022)
From: Ross Retail (Head of Analytics)
Subject: Top Tier Products
Hey there,
Thanks for all your hard work. I know we're working with small sample sizes, but we're proving that analysis can be done in Python!
Can you calculate the mean, min, max, and median of our 3 most expensive product prices? Sorting the array first should help!
Thanks!
numpy_assignments.ipynb
(Results preview: Top 3 Price Stats and Unique Price Tiers)
[Figure: NumPy broadcasting examples – adding a scalar or a compatible 1-D array to a 2-D array applies the operation element-wise across rows or columns, while incompatible shapes raise an error]
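A sketch of the same broadcasting rules in code:

    import numpy as np

    m = np.ones((3, 3), dtype=int)

    m + 2                    # the scalar is broadcast to every element
    m + np.array([1, 2, 3])  # the 1-D array is broadcast across each row
    # m + np.array([1, 2])   # ValueError -- shapes (3,3) and (2,) don't align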
NEW MESSAGE (June 20, 2022)
Alright, our new data scientist set up a little test case for us. She provided code to read in data from a csv and convert two columns to arrays.
Thanks!
numpy_assignments.ipynb
NumPy arrays are more efficient than base Python lists and tuples
• They are semi-mutable data structures that store a single data type
• Their values can be modified, but their size cannot
In this section we’ll introduce Pandas Series, the Python equivalent of a column of data,
and cover their basic properties, creation, manipulation, and useful functions for analysis
The index is a range of integers from 0 to 10
Pandas data types mostly expand on their base Python and NumPy equivalents
You can convert the data type in a Pandas Series by using the .astype() method
and specifying the desired data type (if compatible)
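A minimal sketch of a Series, its core properties, and .astype() (values assumed for illustration):

    import pandas as pd

    prices = pd.Series([49.99, 52.50, 51.25], name="oil_price")

    prices.name, prices.dtype, prices.size  # ('oil_price', dtype('float64'), 3)
    prices.index                            # RangeIndex(start=0, stop=3, step=1)
    prices.mean()                           # 51.246...

    prices.astype("int64")                  # 49, 52, 51 -- decimals are truncated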
NEW MESSAGE (June 21, 2022)
Make sure to include name, dtype, size, index, then take the mean of the values array. Finally, convert the series to an integer data type and recalculate the mean.
Thanks!
section02_Series.ipynb
The index lets you easily access “rows” in a Pandas Series or DataFrame
There are cases where it’s applicable to use a custom index for accessing rows
The .iloc[] method is the preferred way to access values by their positional index
• This method works even when Series have a custom, non-integer index
• It is more efficient than slicing and is recommended by Pandas’ creators
series.iloc[row positions, column positions]
• The object before .iloc[] is the Series or DataFrame to access values from
• The first argument is the row position(s) for the value(s) you want to access, e.g. 0 (single row), [5, 9] (multiple rows), or 0:11 (range of rows)
• We'll use the column position argument once we start working with Pandas DataFrames
For example, slicing positions 2:4 returns the values in the 3rd and 4th positions (the stop position is non-inclusive)
The .loc[] method is the preferred way to access values by their custom labels
series.loc[row labels, column labels]
• The object before .loc[] is the Series or DataFrame to access values from
• The first argument is the custom row index for the value(s) you want to access, e.g. "pizza" (single row), ["mike", "ike"] (multiple rows), or "jan":"dec" (range of rows)
• The second argument is the custom column index, used once we start working with DataFrames
Note that 'coffee' is used as an index value twice. Warning! Duplicate index values are generally not advised, but there are some edge cases where they are useful.
You can reset the index in a Pandas Series or DataFrame back to the default
range of integers by using the .reset_index() method
• By default, the existing index will become a new column in a DataFrame
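A sketch pulling these pieces together (the index labels are arbitrary):

    import pandas as pd

    s = pd.Series([10, 20, 30], index=["jan", "feb", "mar"])

    s.iloc[0]                 # 10 -- by position, even with a custom index
    s.iloc[0:2]               # jan & feb -- the stop position is NOT inclusive
    s.loc["feb"]              # 20 -- by label
    s.loc["jan":"mar"]        # all three rows -- label slices ARE inclusive
    s.reset_index(drop=True)  # back to a default 0-based integer index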
NEW MESSAGE (June 21, 2022)
Thanks for picking up this work, but this data isn't really useful without dates, since I need to understand trends over time to improve my forecasts.
Then, take the mean of the first 10 and last 10 prices. After that, can you grab all oil prices from January 1st, 2017 to January 7th, 2017 and revert the index of this slice back to integers?
Thanks!
section02_Series.ipynb
You can filter a Series by passing a logical test into the .loc[] accessor (like arrays!)
You can use these operators & methods to create Boolean filters for logical tests:
• Equal: == or .eq()
• Not equal: != or .ne()
• Less than (or equal): < or .lt(), <= or .le()
• Greater than (or equal): > or .gt(), >= or .ge()

When sorting with .sort_values(), specify ascending=False to sort in descending order.
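A short sketch of filtering and sorting a Series (values assumed):

    import pandas as pd

    prices = pd.Series([51.2, 48.7, 53.4, 47.9], name="oil_price")

    prices.loc[prices > 50]                    # Boolean filter keeps matching rows
    prices.loc[prices.gt(50) & prices.lt(53)]  # combine tests with &
    prices.sort_values(ascending=False)        # highest to lowest
    prices.sort_index()                        # sort by the index instead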
NEW MESSAGE (June 22, 2022)
First, can you get me the 10 lowest prices from the data, sorted by date, starting with the most recent and ending with the oldest?
Thanks!
section02_Series.ipynb
You can use these operators & methods to perform numeric operations on Series:
• Addition: + or .add()
• Subtraction: - or .sub() / .subtract()
• Multiplication: * or .mul() / .multiply()
• Division: / or .div() / .truediv() / .divide()

For example, series + 2 and series.add(2) both add two to every row.
The Pandas str accessor lets you access many string methods
• These methods all return a Series (.split() returns multiple Series)
• .count("string") – counts all instances of a given string
• .contains("string") – returns True if a given string is found; False if not
• .len() – returns the length of each string in a Series
• .startswith("string") / .endswith("string") – returns True if a string starts or ends with the given string; False if not
For example, you can remove a dollar sign with .replace() and then convert the result to float.
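A sketch of the str accessor, including the dollar-sign example mentioned above:

    import pandas as pd

    raw = pd.Series(["$5.99", "$12.50", "$7.25"])

    raw.str.len()          # length of each string: 5, 6, 5
    raw.str.contains("5")  # True for rows containing a '5'
    # remove the dollar sign, then convert to float
    raw.str.replace("$", "", regex=False).astype(float)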
NEW MESSAGE (June 22, 2022)
Hey there,
Finally, extract the month from the string dates in the index, and store them as an integer.
Thanks!
section02_Series.ipynb
• .argmax() / .argmin() – return the positional index of the largest or smallest value
• .value_counts() – counts occurrences of each unique value; specify normalize=True to return the percentage of total for each category
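A quick sketch (values assumed):

    import pandas as pd

    prices = pd.Series([51, 52, 51, 53, 51])

    prices.argmax()                      # 3 -- position of the largest value
    prices.value_counts()                # 51 appears 3 times; 52 and 53 once each
    prices.value_counts(normalize=True)  # 51 -> 0.6, 52 -> 0.2, 53 -> 0.2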
NEW MESSAGE (June 23, 2022)
Hi again!
I need a few more metrics. Can you calculate the sum and mean of prices in the month of March? Next, how many prices did we have in Jan and Feb?
Then, calculate the 10th and 90th percentiles across all data.
Finally, how often did integer dollar values (e.g. 51, 52) occur in the data? Normalize the results to a percentage.
Thanks!
section02_Series.ipynb
MISSING DATA
Pandas released its own missing data type, NA, in December 2020
• This allows missing values to be stored as integers, instead of needing to convert to float
• This is still a new feature with some lingering bugs, and most operations end up converting the data back to NumPy's NaN
The .isna() and .value_counts() methods let you identify missing data in a Series
• The .isna() method returns True if a value is missing, and False otherwise
The .dropna() and .fillna() methods let you handle missing data in a Series
• The .dropna() method removes NaN values from your Series or DataFrame
It’s important to be thoughtful and deliberate in how you handle missing data
Do you keep them? Do you remove them? Do you replace them with zeros?
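A sketch of identifying and handling missing values in a Series:

    import pandas as pd
    import numpy as np

    s = pd.Series([4.0, np.nan, 6.0, np.nan])

    s.isna()            # True where values are missing
    s.isna().sum()      # 2
    s.dropna()          # remove the NaN rows entirely
    s.fillna(0)         # ...or replace them with zeros
    s.fillna(s.mean())  # ...or with the mean of the non-missing values (5.0)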
NEW MESSAGE (June 27, 2022) – section02_Series.ipynb
The .apply() method lets you apply custom functions to Pandas Series
• This function will not be vectorized, so it’s not as efficient as native functions
Pandas' .where() method lets you manipulate data based on a logical condition:

series.where(logical test, value if False, inplace=False)
• The second argument is the value to return when the expression is False (values where the test is True are kept as-is)
The first where method applies a 90% discount if a price is greater than 20
The second applies a value of 0 when a price is NOT greater than 10
NumPy’s where function is often more convenient & useful than Pandas’ method
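A sketch matching the two .where() examples described above, plus the np.where() equivalent:

    import pandas as pd
    import numpy as np

    prices = pd.Series([8, 25, 12, 40])

    # .where() KEEPS values where the test is True and replaces the rest
    prices.where(prices <= 20, prices * 0.1)  # 90% discount when price > 20
    prices.where(prices > 10, 0)              # 0 when NOT greater than 10

    # np.where() returns one value when True and another when False
    pd.Series(np.where(prices > 10, prices, 0))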
NEW MESSAGE (June 29, 2022) – section02_Series.ipynb
The .loc[] & .iloc[] accessors are key in working with Pandas data structures
• These methods allow you to access rows in Series (and later columns in DataFrames), either by their
positional index or by their labels
Pandas & NumPy have similar operations for filtering, sorting & aggregating
• Use built-in Pandas and NumPy functions and methods to take advantage of vectorization, which is
much more efficient than writing for loops in base Python
In this section we’ll introduce Pandas DataFrames, the Python equivalent of an Excel or
SQL table which we’ll use to store and analyze data
You can create a DataFrame from a Python dictionary or NumPy array by using
the Pandas DataFrame() function
You'll more likely create a DataFrame by reading in a flat file (csv, txt, or tsv) with Pandas' read_csv() function
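A sketch of both creation paths (transactions.csv is the file referenced in the message below; the dictionary values are arbitrary):

    import pandas as pd

    # from a dictionary -- keys become column names
    df = pd.DataFrame({"date": ["2022-07-01", "2022-07-02"],
                       "transactions": [2111, 1833]})

    # more commonly, from a flat file
    df = pd.read_csv("transactions.csv")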
NEW MESSAGE (July 5, 2022)
Hi there,
I heard you did some great work for our finance department. I need some help analyzing our transactions to make sure there aren't any irregularities in the numbers. The data is stored in transactions.csv.
Can you read this data in and remind me how many rows are in the data, which columns are in the data, and their datatypes? More to come soon!
Thanks!
section03_DataFrames.ipynb
The .head() and .tail() methods return the top or bottom rows in a DataFrame
• This is a great way to QA data upon import!

The .info() method returns key details on a DataFrame's size, columns, and memory usage:
• Position, name, data type, and non-null count for each column
• Total memory usage

The .describe() method returns summary statistics, including quartile values, for the numeric columns
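A quick exploration sketch, assuming the transactions.csv file from the message above:

    import pandas as pd

    df = pd.read_csv("transactions.csv")

    df.head()      # first 5 rows -- a quick QA after import
    df.tail(3)     # last 3 rows
    df.info()      # columns, data types, non-null counts, memory usage
    df.describe()  # count, mean, std, min, quartiles, max for numeric columns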
NEW MESSAGE (July 7, 2022)
Hi there,
Can you dive a bit more deeply into the retail data and check if there are any missing values?
Finally, can you report the mean, median, min and max of "transactions"? I want to check for any anomalies in our data.
Thanks!
section03_DataFrames.ipynb
PRO TIP: Even though you’ll see many examples of dot notation in use, stick
to bracket notation for single columns of data as it is less likely to cause issues
You can use Series operations on DataFrame columns (each column is a Series!), such as returning the first 5 unique values in a column with their frequencies, or the rounded sum of the values in a column
You can select multiple columns with a list of column names between brackets
• This is ideal for selecting non-consecutive columns in a DataFrame
The .iloc[] accessor filters DataFrames by their row and column positions
• The first argument accesses rows, and the second accesses columns (e.g. the first 5 rows & all columns, all rows & a slice of columns, or the first 5 rows of a slice of columns)

The .loc[] accessor filters DataFrames by their row and column labels
• The first argument accesses rows, and the second accesses columns (e.g. all rows of the "date" column)
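A sketch of 2-D access, assuming transactions.csv has date, store_nbr, and transactions columns:

    import pandas as pd

    df = pd.read_csv("transactions.csv")

    df.iloc[:5]        # first 5 rows, all columns
    df.iloc[:, 1:3]    # all rows, a slice of columns by position
    df.iloc[:5, 1:3]   # first 5 rows of that same column slice
    df.loc[:, "date"]  # all rows, 'date' column by label
    df.loc[df["store_nbr"] == 47, ["date", "transactions"]]  # labels + a filter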
NEW MESSAGE (July 9, 2022)
Hi there,
I noticed that the first row is the only one from 2013-01-01.
Thanks!
section03_DataFrames.ipynb
NEW MESSAGE (July 9, 2022)
Hi there,
Can you drop the first row of data this time? A slice is fine for viewing but we want this permanently removed. Also, drop the date column, but not in place.
Then, can you get me a DataFrame that includes only the last row for each of our stores? I want to take a look at the most recent data for each.
Thanks!
section03_DataFrames.ipynb
You can identify missing data by column using the .isna() and .sum() methods
• The .info() method can also help identify null values
Like with Series, the .dropna() and .fillna() methods let you handle missing data
in a DataFrame by either removing them or replacing them with other values
With .fillna(), you can use a dictionary to specify a fill value for each column
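A sketch of per-column fills with a dictionary (column names assumed):

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"price": [93.1, np.nan], "volume": [10.0, np.nan]})

    df.isna().sum()  # missing values per column
    df.fillna({"price": df["price"].mean(), "volume": 0})  # a value per column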
NEW MESSAGE (July 12, 2022)
Hi again,
I'd like to see the mean oil price if we fill in the missing values with 0 and compare it to the mean oil price if we fill them in with the mean.
Thanks!
section03_DataFrames.ipynb
You can filter the rows in a DataFrame by passing a logical test into the .loc[]
accessor, just like filtering a Series or a NumPy array
You can filter the columns in a DataFrame by passing them into the .loc[]
accessor as a list or a slice
You can apply multiple filters by joining the logical tests with an "&" operator
• Try creating a Boolean mask when building filters with complex logic

The .query() method lets you use SQL-like syntax to filter DataFrames
• You can specify any number of filtering conditions by using the "and" & "or" keywords
• You can reference variables by using the "@" symbol
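A sketch of .query(), assuming the transactions columns used earlier:

    import pandas as pd

    df = pd.read_csv("transactions.csv")
    min_txns = 2000

    df.query("store_nbr == 47 and transactions > 2500")
    df.query("transactions > @min_txns")  # '@' references the Python variable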
NEW MESSAGE (July 12, 2022) – section03_DataFrames.ipynb
You can sort a DataFrame by its indices using the .sort_index() method
• This sorts rows (axis=0) by default, but you can specify axis=1 to sort the columns

You can sort a DataFrame by its values using the .sort_values() method
• You can sort by a single column or by multiple columns, in ascending or descending order
NEW MESSAGE (July 13, 2022)
Hi there,
Can you get me a dataset that includes the 5 days with the highest transactions counts? Any similarities between them?
Thanks!
section03_DataFrames.ipynb
Reorder columns with the .reindex() method when sorting won’t suffice
NEW MESSAGE (July 15, 2022)
Hi again,
Just some quick work, but can you send me the transaction data with the columns renamed?
Thanks!
section03_DataFrames.ipynb
You can create columns with arithmetic by assigning them Series operations
• Simply specify the new column name and assign the operation of interest
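For example (the column names and the 2,500 target are assumptions for illustration):

    # each new column is a Series operation on existing columns
    df["pct_to_target"] = df["transactions"] / 2500
    df["transactions_thousands"] = df["transactions"] / 1000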
NEW MESSAGE (July 18, 2022) – section03_DataFrames.ipynb
NumPy’s select() function lets you create columns based on multiple conditions
• This is more flexible than NumPy’s where() function or Pandas’ .where() method
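A sketch of np.select() with assumed thresholds and column names:

    import numpy as np

    conditions = [df["transactions"] > 3000,
                  df["transactions"] > 1500]
    choices = ["high", "medium"]

    # the first matching condition wins; 'low' when none match
    df["traffic_tier"] = np.select(conditions, choices, default="low")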
NEW MESSAGE (July 20, 2022) – section03_DataFrames.ipynb
The .assign() method creates multiple columns at once and returns a DataFrame
• This can be chained together with other data processing methods
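A sketch of .assign() chained with another method (names assumed):

    df = df.assign(
        pct_to_target=df["transactions"] / 2500,
        met_target=lambda d: d["pct_to_target"] >= 1,  # lambdas can see new columns
    ).query("met_target")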
NEW MESSAGE (July 23, 2022)
Hi there,
Drop the columns that have been created so far (keep only date, store_number, and transaction count), and recreate them using the assign method.
Then sum the seasonal bonus owed once again to make sure the numbers are correct.
Thanks!
section03_DataFrames.ipynb
Pandas data types mostly expand on their base Python and NumPy equivalents
The Pandas categorical data type stores text data with repeated values efficiently
• Pandas maps each unique category to an integer to save space
• As a rule of thumb, only consider this data type when the number of unique categories is less than half the number of rows
• The categorical data type has some quirks: certain data manipulation operations will force it back into an object data type, but that's not something we'll cover in depth in this course
Like with Series, you can convert data types in a DataFrame by using the .astype() method and specifying the desired data type (if compatible)
• Use memory_usage="deep" with the .info() method to get total memory usage along with the column data types
• Text data, usually including dates, is read in as an object data type by default

Integers and floats are cast as 64-bit by default to handle any possible value, but you can downcast numeric data to a smaller bit size to save space if possible:
• 8-bit: -128 to 127
• 16-bit: -32,768 to 32,767
• 32-bit: -2,147,483,648 to 2,147,483,647
• 64-bit: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Use the categorical data type if you have string columns where the number of unique values is less than half of the total number of rows (see the sketch below).
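A sketch of the memory-saving ideas above (column names assumed):

    df.info(memory_usage="deep")  # true memory footprint, including object columns

    df["transactions"] = df["transactions"].astype("int32")  # downcast numeric data
    df["city"] = df["city"].astype("category")               # few unique strings

    # pd.to_numeric can choose the smallest safe integer type for you
    df["store_nbr"] = pd.to_numeric(df["store_nbr"], downcast="integer")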
NEW MESSAGE (July 25, 2022)
Hi,
They have a very, very old computer, so do you think you can reduce the memory usage to below 5MB, without losing any of the information?
Thanks!
section03_DataFrames.ipynb
You can easily filter, sort, and modify DataFrames with methods & functions
• DataFrame rows & columns can be sorted by index or values, and filtered using multiple conditions
• Columns can be created with arithmetic or complex logic, and multiple columns can be created with .assign()
In this section we’ll cover aggregating & reshaping DataFrames, including grouping
columns, performing aggregation calculations, and pivoting & unpivoting data
You can aggregate a DataFrame column by using aggregation methods (like Series!)
But what if you want multiple aggregate statistics, or summarized statistics by groups?
To group data, use the .groupby() method and specify a column to group by
• The grouped column becomes the index by default
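For example (column names assumed):

    # average transactions per store; store_nbr becomes the index
    df.groupby("store_nbr")["transactions"].mean()

    # keep the grouped column as a regular column instead
    df.groupby("store_nbr", as_index=False)["transactions"].mean()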
NEW MESSAGE (August 3, 2022) – section04_aggregations.ipynb
You can group by multiple columns by passing the list of columns into .groupby()
• This creates a multi-index object with an index for each column the data was grouped by
• Specify as_index=False to prevent the grouped columns from becoming indices
NEW MESSAGE (August 4, 2022) – section04_aggregations.ipynb
The .loc[] accessor lets you access multi-index DataFrames in different ways, and you can restructure the index itself:
• Reset the index (.reset_index()) – moves the index levels back to DataFrame columns
• Swap the index levels (.swaplevel()) – changes the hierarchy for the index levels
• Drop an index level (.droplevel()) – drops an index level from the DataFrame entirely (be careful! you may lose important information)

PRO TIP: In most cases it's best to reset the index and avoid multi-index DataFrames – they're not very intuitive!
NEW MESSAGE (August 5, 2022) – section04_aggregations.ipynb
The .agg() method lets you perform multiple aggregations on a “groupby” object
You can name aggregated columns upon creation to avoid multi-index columns
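A sketch of named aggregations (column names assumed):

    df.groupby("store_nbr").agg(
        total_txns=("transactions", "sum"),  # named aggregations produce clean,
        avg_txns=("transactions", "mean"),   # single-level column names
        days=("transactions", "count"),
    )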
NEW MESSAGE (August 6, 2022)
From: Chandler Capital (Accounting)
Subject: Bonus Rate and Bonus Payable
Hey again,
I'm performing some further analysis on our bonuses. Can you create a table that has the average number of days each store hit the target?
Calculate the total bonuses payable to each store and sort the DataFrame from highest bonus owed to lowest.
Then do the same for day of week and month.
Thanks!
section04_aggregations.ipynb
(Results preview: By Store, By Weekday, and By Month tables. NOTE: Only the top 5 rows for each DataFrame are included here)
NEW MESSAGE (August 6, 2022) – section04_aggregations.ipynb
With .agg(), you can use a dictionary to apply specific functions to specific columns

If the column argument isn't specified in a pivot table, it will return a table that's identical to one grouped by the index columns
When adding a heatmap to a pivot table (via a background gradient), axis=None applies a red-yellow-green heatmap to the whole table, while axis=1 applies one to each row
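A sketch of a pivot table with a row-wise heatmap (column names assumed; the gradient uses the Styler's background_gradient):

    pivot = df.pivot_table(
        index="store_nbr",
        columns="day_of_week",
        values="transactions",
        aggfunc="sum",
    )

    pivot.style.background_gradient(cmap="RdYlGn", axis=1)  # heatmap per row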
The .melt() method will unpivot a DataFrame, or convert columns into rows
Use the “id_vars” argument to specify the column to unpivot the DataFrame by
You can also select the columns to melt and name the “variable” & “value” columns
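Continuing the sketch above, melting the pivot back to one row per store and weekday:

    pivot.reset_index().melt(
        id_vars="store_nbr",
        var_name="day_of_week",
        value_name="total_transactions",
    )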
NEW MESSAGE (August 7, 2022)
Hey again,
Can you create a pivot table that has the sum of bonus payable by day of week? Make sure to filter out any rows that had 0 bonus payable first and add a heatmap across the rows.
Then unpivot the table so we have one row for each store and day of the week with the corresponding total owed.
Thanks
section04_aggregations.ipynb
Use the .agg() method to specify multiple aggregation functions when grouping
• Named aggregations allow you to set intuitive column names and prevent multi-index columns
The .pivot_table() and .melt() methods let you pivot and unpivot DataFrames
• Pandas pivot tables work just like Excel, and make data “wide” by converting unique row values into columns
• With .melt(), you can make “wide” tables “long” in order to analyze the data traditionally
In this section we’ll cover basic data visualization in Pandas, and use the plot method to
create & customize line charts, bar charts, pie charts, scatterplots, and histograms
Pandas uses the Matplotlib API to create charts and visualize data
• This is an integration with the main Matplotlib library
This course only covers Pandas’ limited data visualization capabilities, but other full Python
libraries like Matplotlib, Seaborn, and Plotly offer more chart types, options and capabilities
You can change the x-axis by setting a different index or using the "x" argument
• Setting 'date' as the index before plotting and passing x="date" in the .plot() parameters both work; with a datetime index, the date components are split automatically!

You can select series to plot with the .loc[] accessor or using the "y" argument
• Filtering the 'grocery' column before plotting and passing y="grocery" both plot the 'grocery' values along the y-axis
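A sketch of both approaches (the file and column names are assumptions; plotting requires Matplotlib to be installed):

    import pandas as pd

    df = pd.read_csv("retail_sales.csv")  # assumed to contain date & grocery columns

    df.set_index("date")["grocery"].plot()  # the index becomes the x-axis
    df.plot(x="date", y="grocery")          # or pass x & y explicitly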
NEW MESSAGE (August 14, 2022)
Hi there,
I don't have a great grasp on the trends in oil prices, there are way too many values to understand from table data.
Can you plot the oil prices with a simple line chart?
Thanks!
section05_data_viz.ipynb
You can modify the chart formatting by using .plot() method arguments
• title = “title” – title to use for the chart
• xlabel = “title” – name to use for the x-axis
• ylabel = “title” – name to use for the y-axis
• color = “color” or “#hexcode” – color(s) to use for the data series
• cmap = “color palette” – preset color palette to apply to the chart
• style = “symbol” – style for the line (dashed, dotted, etc.)
• legend = True/False – adds or removes the legend from the chart
• rot = degrees – degrees to rotate the x-axis labels for readability
• figsize = (width, height) – size for the chart in inches
• grid = True/False – adds gridlines to the chart when True
• subplots = True/False – creates a separate chart for each series when True
• sharex = True/False – each series in the subplot shares the x-axis when True
• sharey = True/False – each series in the subplot shares the y-axis when True
• layout = (rows, columns) – the rows and columns to break the subplots into
For a full list of arguments, visit: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
CHART TITLES
You can add a custom chart title and axis labels to increase understanding
You can modify the series colors by using common color names or hex codes
• Hexadecimal codes work too! If you have more than one series, you can pass the colors in a list
You can also modify the entire color palette for the series in the chart
For more on color palettes, visit: https://matplotlib.org/3.5.0/tutorials/colors/colormaps.html
LINE STYLE
Line charts are solid by default, but you can change the line style if needed:
• "-" – solid (default)
• "--" – dashed
• "-." – dash-dot
• ":" – dotted
The “legend” .plot() argument lets you add or remove the legend
• This can be useful in some scenarios, but in others you’ll want to reposition the legend
You can reposition the legend by chaining the .legend() method to .plot() and
specifying its location with the “loc” or “bbox_to_anchor” arguments
Legend locations: best (default), upper right, upper left, upper center, lower right, lower left, lower center, center right, center left, center
• "bbox_to_anchor" lets you specify the exact coordinates for the legend
Matplotlib & Seaborn have premade style templates that can be applied to charts
• Once a style is set, it will automatically be applied to all charts
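A sketch pulling the formatting ideas together (the 'darkgrid' template comes from Seaborn; the Series values are placeholders):

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    sns.set_style("darkgrid")  # applies to all charts from here on

    oil = pd.Series([93.1, 92.9, 93.4], name="Oil Price")
    oil.plot(
        title="Oil Price Over Time",
        xlabel="Date",
        ylabel="Price (USD)",
        color="black",
        figsize=(8, 4),
    )
    plt.show()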
NEW MESSAGE (August 15, 2022)
Our company tends to use "darkgrid" for the style, but I'd also like a chart title, axis labels, the line to be in black, and to rename the price column to a more intuitive name.
Thanks!
section05_data_viz.ipynb
You can leverage subplots to create a separate chart for each series

You can adjust the size of the plot or subplots (in inches) using "figsize"
• The default size is 6.4 x 4.8 inches; e.g. figsize=(8, 8) resizes the chart to 8 x 8 inches
NEW MESSAGE (August 16, 2022) – section05_data_viz.ipynb
You can change chart types with the "kind" argument or the attribute for each chart
• You can select the chart type within the .plot() parameters (e.g. kind="bar"), or chain the chart type as an attribute to .plot (e.g. .plot.bar()) – the attribute form is the preferred approach

COMMON CHART TYPES
For more chart types, visit: https://pandas.pydata.org/docs/user_guide/visualization.html
LINE CHARTS
Line charts are used for showing trends and changes over time

BAR CHARTS
Bar charts are used for making comparisons with categorical data
Plotting multiple series with a bar plot will create a grouped bar chart
Specify “stacked=True” when plotting multiple series to create a stacked bar chart
• This still lets you compare the categories, but also shows the composition of each category
Calculate percent of total values by index to create a 100% stacked bar chart
• This emphasizes the difference in composition between categories (instead of absolute values)
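A sketch of the three bar chart variations (values are placeholders):

    import pandas as pd

    sales = pd.DataFrame({"grocery": [100, 120], "electronics": [80, 95]},
                         index=["store 1", "store 2"])

    sales.plot.bar()              # grouped bars
    sales.plot.bar(stacked=True)  # stacked bars
    # 100% stacked: divide each row by its row total first
    sales.div(sales.sum(axis=1), axis=0).plot.bar(stacked=True)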
NEW MESSAGE (August 17, 2022) – section05_data_viz.ipynb
Pie charts are used for showing composition with categorical data. Best practices:
• Keep the number of categories <= 5
• Sort the slices
• Start the first slice at 0°
Scatterplots are used for showing the relationship between numerical series
NEW MESSAGE (August 18, 2022) – section05_data_viz.ipynb
NEW MESSAGE (August 20, 2022) – section05_data_viz.ipynb
There is much more to data visualization in Python, and most libraries are built
to be compatible with Pandas Series and DataFrames
For more chart types, customization, and interactivity, consider these libraries:
Matplotlib is what we’ve used to visualize data so far, but there are more chart
types and customization options available (which can make it challenging to work with)
Plotly is used for creating beautiful, interactive visuals, and has a sister app (Dash)
that allows you to easily create dashboards as web applications
Folium is great for creating geospatial maps, especially since it contains features that aren't available in the other libraries
NEW MESSAGE (August 23, 2022)
From: Ross Retail (Head of Analytics)
Subject: Maven Acquisition Target Data
Hi there,
This is going to be a deep dive analysis presented to senior management at our company. Maven MegaMart is planning to acquire another retailer to expand our market share. As part of the due diligence process, they've sent us over several tables relating to their customers and sales.
I've taken a quick look, but given your great work so far I want you to lead on this. There are some more detailed questions in the attached notebook.
Thanks!
section06_midcourse_project.ipynb

Key Objectives:
1. Read in data from multiple csv files
2. Explore the data (millions of rows!)
3. Create new columns to aid in analysis
4. Filter, sort, and aggregate the data to pinpoint and summarize important information
5. Build plots to communicate key insights
In this section we’ll cover time series in Pandas, which leverages the datetime data type to
extract date components, group by dates, and perform calculations like moving averages
Base Python’s datetime data type lets you work with time series data
• Datetimes include both a date and time portion by default
NumPy’s datetime64 data type lets you work with datetimes in DataFrames
• Pandas will treat dates as “object” datatypes until converted to “datetime64”
The .astype() method can convert strings to datetimes, but has some limitations:
1. Any values that Pandas can’t correctly identify as dates will return an error
2. Ignoring the errors will return the Series as an “object” data type, not datetime64
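Given these limitations, pd.to_datetime() (referenced in the section recap) is usually the safer conversion path; a minimal sketch:

    import pandas as pd

    s = pd.Series(["2022-07-14", "not a date"])

    pd.to_datetime(s, errors="coerce")  # bad values become NaT instead of erroring
    pd.to_datetime(pd.Series(["14/07/2022"]), format="%d/%m/%Y")  # explicit format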
You can use these datetime codes to format dates or extract date components:
• %D → 07/14/2022 – zero-padded date (based on OS settings)
• %T → 16:36:30 – zero-padded time (24-hour format)
For a full list of codes, visit: https://strftime.org/
FORMATTING DATETIMES
Use datetime codes with the .strftime() method to customize the datetime format
The .dt accessor lets you use .strftime() on datetimes in Pandas DataFrames
For a full list of codes, visit: https://strftime.org/
EXTRACTING DATETIME COMPONENTS
While you can also use the .strftime() method to extract datetime components, it's preferred to use the dedicated .dt accessors
For a full list of datetime accessors, visit the official documentation: https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.date.html
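A sketch of the .dt accessors next to .strftime() (dates assumed):

    import pandas as pd

    dates = pd.to_datetime(pd.Series(["2022-07-14", "2022-08-01"]))

    dates.dt.year               # 2022, 2022
    dates.dt.month              # 7, 8
    dates.dt.dayofweek          # 3, 0 -- Monday is 0
    dates.dt.strftime("%Y-%m")  # '2022-07', '2022-08' as formatted strings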
NEW MESSAGE (August 29, 2022)
Hi,
Can you create columns for year, month, and day of week?
Thanks
section07_time_series.ipynb
Time deltas represent the amount of time, or difference, between two datetimes
• A time delta is returned when subtracting two datetime values

The to_timedelta() function lets you use frequencies to return timedelta values:
• "W" – week
• "H" – hour
• "T" – minute
• "S" – second
You can use time delta arithmetic to offset dates by a specified period of time
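For example:

    import pandas as pd

    d1 = pd.Timestamp("2022-08-29")
    d2 = pd.Timestamp("2022-08-22")

    d1 - d2                            # Timedelta('7 days 00:00:00')
    d1 + pd.to_timedelta(1, unit="W")  # offset a date by one week
    d1 + pd.Timedelta(hours=12)        # or build the offset directly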
NEW MESSAGE (August 29, 2022) – section07_time_series.ipynb
Use datetimes as the index to allow for intuitive slicing of your DataFrame
• Like with integers, make sure your time index is sorted or you will get very odd results!

Having a time series index provides methods for fixing missing data beyond .fillna()
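A sketch of datetime slicing and the fill methods previewed in the message below (assumes a DataFrame with a 'price' column and dates in a 'date' column):

    df = df.set_index("date").sort_index()  # sort it, or slicing gets odd!

    df.loc["2014"]                     # the whole year
    df.loc["2014-12"]                  # a single month
    df.loc["2014-12-01":"2014-12-15"]  # a date range (inclusive)

    df["price"].ffill()        # carry the last known value forward
    df["price"].bfill()        # or pull the next known value back
    df["price"].interpolate()  # or draw a line between known points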
NEW MESSAGE (August 30, 2022)
From: Rachel Revenue (Sr. Financial Analyst)
Subject: Missing Oil Prices
Hey rockstar,
Take a look at the mean value for the oil price using forward fill, backfill, and interpolation. Are they very different?
Then, plot the series with forward fill for the year 2014, the month of December 2014, and the days from December 1st to December 15th, 2014.
section07_time_series.ipynb
(Results preview: mean with missing values 67.71, forward fill 67.67, backfill 67.67, interpolation 67.66)
You can shift a Series by a specified number of rows using the .shift() method
• This is helpful when working with time series to compare values against previous periods
% Growth = (Current Period / Previous Period) − 1
The .diff() method calculates the difference between the values in a Series and
those same values shifted a specified number of periods
• This is useful in measuring absolute changes over time
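A sketch of shifting, differencing, and the growth formula above:

    import pandas as pd

    s = pd.Series([100, 110, 121])

    s.shift(1)          # previous period's value aligned to each row
    s.diff(1)           # absolute change: NaN, 10, 11
    s / s.shift(1) - 1  # % growth: NaN, 0.10, 0.10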
NEW MESSAGE (August 31, 2022)
Hello,
I'm looking into a few different year over year trends for store number 47.
Then, can you plot year over year growth percent in average monthly transactions?
Thanks
section07_time_series.ipynb
Resampling is a special form of time series aggregation that modifies date indices
and fills any gaps to create continuous date values
• Frequencies are used to define the granularity (you can use years, quarters, and months now!)
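For example, assuming a 'price' column with a datetime index:

    # 'M' = month-end frequency, 'Y' = year-end frequency
    df["price"].resample("M").mean().plot()
    df["price"].resample("Y").mean().plot()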
NEW MESSAGE (August 31, 2022)
Hey again,
Really quick, can you also get me charts of the average oil price by month and year!
Thank you!!!!
section07_time_series.ipynb
NEW MESSAGE (August 31, 2022)
Hey,
Can you plot the 90-day moving average for transactions for store 47?
I want to present them with less noise than the daily figures.
Thanks!
section07_time_series.ipynb
The datetime64 data type lets you work with time series in Pandas
• Conversion can be deceptively complicated, so use the .to_datetime() method to manage errors, or explicitly
state the datetime format for Pandas to interpret it correctly
Use datetime codes & accessors to format dates and extract date components
• There are dozens of options, but you only need to memorize the common parts and formats for most business
analysis scenarios – you can always reference the documentation if needed!
Leverage datetime indices to slice time series data intuitively & efficiently
• This also provides alternative methods for fixing missing data, like forward-filling, back-filling, and interpolation
Shifting & aggregation methods let you perform time intelligence calculations
• Shifting and differencing let you calculate period-over-period changes, grouping and resampling let you perform
unique time series summaries, while rolling aggregations let you calculate things like moving averages
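As a final sketch tying the rolling idea to the store 47 request above (column names assumed):

    store_47 = df.query("store_nbr == 47").set_index("date")["transactions"]

    # a 90-day moving average smooths out the daily noise
    store_47.rolling(90).mean().plot(title="90-Day Moving Average")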
In this section we’ll cover importing & exporting data in Pandas, including reading in data
from flat files and SQL tables, applying processing steps on import, and writing back out
The read_csv() function only needs a file path to read in a file, but it also has many
capabilities for preprocessing the data using other arguments:
• file = “path/name.csv” – file path & name to read in (can also be a URL)
• sep = “/” – character used to separate each column of values (default is comma)
• header = 0 – row number to use as the column names (default is “infer”)
• names = [“date”, “sales”] – list of column names to override the existing ones
• index_col = “date” – column name to be used as the DataFrame index
• usecols = [“date”, “sales”] – list of columns to keep in the DataFrame
• dtype = {“date”: “datetime64”, “sales” : “Int32”} – dictionary with column names and data types
• parse_dates = True – converts date strings into datetimes when True
• infer_datetime_format = True – makes parsing dates more efficient
• na_values = [“-”, “#N/A!”] – strings to recognize as NaN values
• nrows = 100 – number of rows to read in
• skiprows = [0, 2] – line numbers to skip (accepts lambda functions)
• converters = {“sales”: lambda x: f"${x}"} – dictionary with column names and functions to apply
For a full list of arguments, visit: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
COLUMN NAMES
Pandas will try to infer the column names in your file by default, but there are
several options to override this behavior:
• Specify header=None to keep all the rows in the file as data, and use integers as column names
• Specify header=0 and use the names argument to pass a list of desired column names
You can set the index for the DataFrame with the “index_col” argument
• Pass a list of column names to create a multi-index DataFrame (not recommended!)
• Specify parse_dates=True to convert index date column to a datetime data type
You can select the columns to read in with the "usecols" argument
• This can save a lot of processing time and memory
You can select the rows to read in from the top with the “nrows” argument, and
specify any rows to skip with “skiprows”
You can specify strings (or other values) to treat as missing values with the
“na_values” argument
• They are replaced with NumPy NaN values
Dates are read in as object data types by default, but you can parse dates with the
“parse_dates” argument to convert them to datetime64
• Specifying infer_datetime_format=True will speed up the date parsing
You can set the data type for each column with the “dtype” argument by passing in
a dictionary with column names as keys and desired data types as values
• Get your data into its most efficient format from the start!
You can apply functions to columns of data by using the “converters” argument
• Pass a dictionary with the column names as keys and the functions as values
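A sketch combining several of these arguments (the file and columns follow the course's transactions data; the exact names are assumptions):

    import pandas as pd

    df = pd.read_csv(
        "transactions.csv",
        usecols=["date", "store_nbr", "transactions"],
        parse_dates=["date"],
        dtype={"store_nbr": "int8"},
        na_values=["-", "#N/A!"],
        nrows=1000,
    )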
NEW MESSAGE (September 2, 2022)
Hey there!
Can you streamline the code and send it over for analysis?
Thanks!
section08_import_export.ipynb
The read_csv() function can also read in .txt files, and other types of flat files
• Simply use the "sep" argument to specify the delimiter
• You can also read in .tsv (tab separated values) files and URLs pointing to text files

The read_excel() function reads in data from Excel workbooks; use the "sheet_name" argument to select a worksheet (sheet numbers are 0-indexed!)

You can use Pandas' concat() function to append data from multiple sheets
• We'll cover combining DataFrames in the next section, but this is a sneak peek!

The .to_csv() and .to_excel() methods let you export DataFrames to flat files
PRO TIP: Use pd.ExcelWriter to write multiple DataFrames to separate tabs in a single workbook (see the sketch below)
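A sketch of exporting, including the multi-tab workbook idea (pd.ExcelWriter requires an engine like openpyxl; the yearly split mirrors the message below):

    import pandas as pd

    df = pd.read_csv("transactions.csv", parse_dates=["date"])
    df_2016 = df[df["date"].dt.year == 2016]
    df_2017 = df[df["date"].dt.year == 2017]

    df_2016.to_csv("transactions_2016.csv", index=False)

    with pd.ExcelWriter("transactions.xlsx") as writer:
        df_2016.to_excel(writer, sheet_name="2016", index=False)
        df_2017.to_excel(writer, sheet_name="2017", index=False)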
NEW MESSAGE (September 2, 2022)
Hey again!
If you don't have Excel, then a csv file for each year is ok too.
Thanks!
section08_import_export.ipynb
The read_sql() function lets you create a DataFrame from a SQL query

The .to_sql() method lets you create a SQL table from your DataFrame
• This lets you clean data using Pandas before storing it in a SQL database
• You will likely need permission from your database administrator (DBA) to do this
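A minimal sketch with SQLAlchemy (the connection string and table names are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("sqlite:///maven.db")  # swap in your own database

    df = pd.read_sql("SELECT * FROM transactions WHERE store_nbr = 47", engine)
    df.to_sql("transactions_clean", engine, if_exists="replace", index=False)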
Other formats Pandas can read & write include:
• JSON (read_json / to_json) – short for JavaScript Object Notation, a common format returned by APIs (similar to a nested dictionary in Python)
• Feather (read_feather / to_feather) – a relatively new file format designed to read, write, and store DataFrames as efficiently as possible
• HTML (read_html / to_html) – reads and writes data in webpage formats, which makes it nice for scraping tables on sites like Wikipedia
• Python dictionary (pd.DataFrame() / to_dict()) – you can convert many Python data types to DataFrames, and convert them back to Python dictionaries with to_dict()
Pandas lets you easily read in & write to flat files like CSV and Excel
• Just make sure to specify the correct column delimiter for .tsv and .txt files, and the desired Excel worksheet
In this section we’ll cover combining multiple DataFrames, including joining data from
related fields to add new columns, and appending data with the same fields to add new rows
There are many scenarios where working with multiple DataFrames is necessary:
• Relational Databases save a lot of space by not repeating redundant data (normalization)
• Data may come from multiple sources (like complementing company numbers with external data)
• Multiple files can be used to store the same data split by different time periods or groups
NEW MESSAGE (September 5, 2022) – section09_joining_dfs.ipynb
JOINING DATAFRAMES
left_df.merge(right_df, ...) – joins the "right" DataFrame to the "left" one
For a full list of arguments, visit: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

EXAMPLE: Adding the "transactions" by date and store number to the family-level sales table
There are 4 primary types of joins: Inner, Left, Right, and Outer
• Inner and Left joins are the most common, while Right is rarely used
[Figure: Venn diagrams and sample tables illustrating the join types, including an inner join of the left and right tables]
A left join will preserve all rows in the left table, and simply
includes NaN values in “transactions” when no match is found
PRO TIP: If you’re performing a join between two tables for the
first time, start with a left join to grasp the potential data loss from
an inner join and decide which works best for your analysis
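A sketch of the inner vs. left behavior described above (tiny tables for illustration):

    import pandas as pd

    sales = pd.DataFrame({"date": ["2022-01-01", "2022-01-02"],
                          "store_nbr": [1, 2],
                          "sales": [500, 750]})
    txns = pd.DataFrame({"date": ["2022-01-01"],
                         "store_nbr": [1],
                         "transactions": [2111]})

    sales.merge(txns, how="inner", on=["date", "store_nbr"])  # matching rows only
    sales.merge(txns, how="left", on=["date", "store_nbr"])   # NaN where no match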
NEW MESSAGE (September 10, 2022)
Hey there!
Once you have that, plot total average sales by city, the sum of sales by "type" over time, and a stacked bar chart with sales by month, with "type" as the "stacks".
Thanks!
section07_time_series.ipynb
The .join() method can join two DataFrames based on their index
PRO TIP: You will see examples of joining tables with the join
method, but it’s rare to have perfect alignment of DataFrame
indices in practice; since merge() can join on indices as well as
columns, it is the only method you need to know
NEW MESSAGE (September 20, 2022)
From: Ross Retail (Head of Analytics)
Subject: Maven Acquisition Target Data
Hey again,
We're getting closer to a final decision on the acquisition. Management was impressed by your findings, and we need to discuss a promotion for you soon.
They did have a few more questions we need to answer before they feel comfortable moving forward.
Those questions are in the attached notebook.
section10_final_project.ipynb

Key Objectives:
1. Read in data from multiple csv files
2. Explore the data (millions of rows!)
3. Join multiple DataFrames
4. Create new columns to aid in analysis
5. Filter, sort, and aggregate the data to pinpoint and summarize important information
6. Work with datetime fields to analyze time series
7. Build plots to communicate key insights
8. Optimize the import workflow