Q-Step WS 06112019 Data Analysis and Visualisation With Python

This document provides an overview of NumPy and Pandas, two popular Python libraries for data analysis and visualization. It discusses how NumPy introduces the ndarray for efficient storage and manipulation of multi-dimensional arrays. It also explains how Pandas builds upon NumPy by introducing data frames for working with tabular data. Key concepts covered include creating and manipulating arrays and data frames, indexing, filtering, and reading/writing data to files.

Data Analysis and Visualisation with Python

Q-Step Workshop – 06/11/2019

Lewys Brace
l.brace@Exeter.ac.uk
Numerical Python (NumPy)
• NumPy is the most foundational package for numerical computing in Python.
• If you are going to work on data analysis or machine learning projects, then
having a solid understanding of NumPy is nearly mandatory.
• Indeed, many other libraries, such as pandas and scikit-learn, use NumPy’s
array objects as the lingua franca for data exchange.
• One reason NumPy is so important for numerical computation is that it is
designed for efficiency with large arrays of data.
The reasons for this include:
- It stores data internally in a contiguous block of memory,
independent of other built-in Python objects.
- It performs complex computations on entire arrays without the
need for Python for loops.
What you’ll find in NumPy
• ndarray: an efficient multidimensional array providing fast array-orientated
arithmetic operations and flexible broadcasting capabilities.
• Mathematical functions for fast operations on entire arrays of data without
having to write loops.
• Tools for reading/writing array data to disk and working with memory-
mapped files.
• Linear algebra, random number generation, and Fourier transform
capabilities.
• A C API for connecting NumPy with libraries written in C, C++, and
FORTRAN. This is one reason why Python is a popular language for wrapping
legacy codebases.
The NumPy ndarray: A multi-dimensional array object
• The NumPy ndarray object is a fast and flexible container for large
data sets in Python.
• NumPy arrays are a bit like Python lists, but are still a very different
beast at the same time.
• Arrays enable you to store multiple items of the same data type. It is
the facilities around the array object that make NumPy so convenient
for performing math and data manipulations.
Ndarray vs. lists
• By now, you are familiar with Python lists and how incredibly useful
they are.
• So, you may be asking yourself:

“I can store numbers and other objects in a Python list and do all sorts
of computations and manipulations through list comprehensions, for-
loops etc. What do I need a NumPy array for?”

• There are very significant advantages of using NumPy arrays over lists.
Creating a NumPy array
• To understand these advantages, let's create an array.
• One of the most common, of the many, ways to create a NumPy array
is to create one from a list by passing it to the np.array() function.
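A minimal sketch of this step follows. The names `list1` and `arr1d` are illustrative and not taken from the slides.

```python
import numpy as np

# Build a one-dimensional array from an ordinary Python list.
list1 = [0, 1, 2, 3, 4]
arr1d = np.array(list1)

print(type(arr1d))  # <class 'numpy.ndarray'>
print(arr1d)        # [0 1 2 3 4]
```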

Differences between lists and ndarrays
• The key difference between an array and a list is that arrays are
designed to handle vectorised operations while Python lists are not.
• That means, if you apply a function, it is performed on every item in
the array, rather than on the whole array object.
• Let’s suppose you want to add the number 2 to every item in a list.
The intuitive way to do this is to write something like list + 2.
• That is not possible with a list, where it raises a TypeError, but you can
do it with an array: the addition is applied to every element.
• It should be noted here that, once a NumPy array is created, you
cannot increase its size in place.
• To do so, you will have to create a new array.
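The points above can be sketched as follows. The variable names are illustrative, and `np.append` is used here only as one way of producing a new, larger array.

```python
import numpy as np

values = [0, 1, 2, 3, 4]

# Adding a number to a list is not an element-wise operation.
try:
    values + 2
except TypeError as err:
    print("list + 2 failed:", err)

# The same expression on an array is applied to every element.
arr = np.array(values)
shifted = arr + 2
print(shifted)  # [2 3 4 5 6]

# np.append does not grow `arr` in place; it returns a new, larger array.
longer = np.append(arr, 5)
print(arr.size, longer.size)  # 5 6
```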
Create a 2D array from a list of lists
• You can pass a list of lists to create a matrix-like 2D array.
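As a small illustration, with made-up values:

```python
import numpy as np

# Each inner list becomes one row of the 2D array.
list2 = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
arr2d = np.array(list2)

print(arr2d)
print(arr2d.shape)  # (3, 3)
```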

The dtype argument
• You can specify the data-type by setting the dtype argument.
• Some of the most commonly used NumPy dtypes are: float, int, bool,
str, and object.
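For instance, a sketch that asks for floating-point storage when the array is created (the values are made up):

```python
import numpy as np

# The integers in the list are stored as floats because of dtype="float".
arr_float = np.array([[0, 1, 2], [3, 4, 5]], dtype="float")

print(arr_float)
print(arr_float.dtype)  # float64
```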

The astype method
• You can also convert an array to a different data-type using the astype method.

• Remember that, unlike lists, all items in an array have to be of the same
type.
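A short sketch of such conversions, using made-up values. Note that converting floats to integers truncates rather than rounds.

```python
import numpy as np

arr_float = np.array([0.0, 1.9, 2.5])

# astype returns a new array; the original is left unchanged.
arr_int = arr_float.astype("int")  # fractional parts are dropped
arr_str = arr_int.astype("str")

print(arr_int)  # [0 1 2]
print(arr_str)  # ['0' '1' '2']
```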
dtype=‘object’
• However, if you are uncertain about what data type your array will
hold, or if you want to hold characters and numbers in the same array,
you can set the dtype as 'object'.
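A minimal sketch with made-up values, showing that each element keeps its own Python type:

```python
import numpy as np

# With dtype="object" the array stores references to arbitrary Python objects.
arr_obj = np.array([1, "a", 2.5], dtype="object")

print(arr_obj.dtype)                        # object
print([type(x).__name__ for x in arr_obj])  # ['int', 'str', 'float']
```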

The tolist() method
• You can always convert an array into a list using the tolist() method.
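For example, with a small made-up array:

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

# tolist() returns nested Python lists holding plain Python numbers.
as_list = arr.tolist()

print(as_list)        # [[1, 2], [3, 4]]
print(type(as_list))  # <class 'list'>
```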

Inspecting a NumPy array
• A range of attributes built into NumPy arrays allow you to inspect
different aspects of an array, such as its shape, data type, size, and
number of dimensions.
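A sketch of some commonly used inspection attributes, on a made-up 3 x 4 array:

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]], dtype="float")

print(arr.shape)  # (3, 4)  rows and columns
print(arr.dtype)  # float64 data type of the items
print(arr.size)   # 12      total number of items
print(arr.ndim)   # 2       number of dimensions
```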

Extracting specific items from an array
• You can extract portions of the array using indices, much like when
you’re working with lists.
• Unlike lists, however, arrays can optionally accept as many indices
in the square brackets as there are dimensions.
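To illustrate, a sketch on a made-up 2D array, using one index or slice per dimension:

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]])

# Row index first, then column index.
item = arr[1, 2]
print(item)  # 5

# First two rows and first two columns.
corner = arr[:2, :2]
print(corner)
```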

Boolean indexing
• A boolean index array is of the same shape as the array-to-be-filtered,
but it only contains True and False values.
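A minimal sketch with made-up values. The threshold of 4 is arbitrary.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]])

# The comparison produces a boolean array with the same shape as `arr`.
mask = arr > 4
print(mask.shape)  # (3, 4)

# Indexing with the mask keeps only the items where it is True.
print(arr[mask])   # [5 6 5 6 7 8]
```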

Pandas
• Pandas, like NumPy, is one of the most popular Python libraries for
data analysis.
• It is a high-level abstraction over the lower-level NumPy, whose core is
written in C.
• Pandas provides high-performance, easy-to-use data structures and
data analysis tools.
• There are two main structures used by pandas: data frames and
series.
Indices in a pandas series
• A pandas series is similar to a list, but differs in the fact that a series associates a label
with each element. This makes it look like a dictionary.
• If an index is not explicitly provided by the user, pandas creates a RangeIndex ranging
from 0 to N-1.
• Each series object also has a data type.

• As you may suspect by this point, a series has ways to extract all of
the values in the series, as well as individual elements by index.
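A sketch of a series with the default index, using made-up values and an illustrative name, `series1`:

```python
import pandas as pd

# No index is given, so pandas creates a RangeIndex from 0 to N-1.
series1 = pd.Series([1, 2, 3, 4])

print(series1.index)   # RangeIndex(start=0, stop=4, step=1)
print(series1.dtype)   # the data type of the series
print(series1.values)  # all values, as a NumPy array
print(series1[2])      # the element with index label 2
```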


• You can also provide an index manually.

• It is easy to retrieve several elements of a series by their indices or
make group assignments.
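The manual index and the group operations described above might look like this. The labels and values are made up.

```python
import pandas as pd

# Provide the index labels explicitly.
s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
print(s["c"])  # 3

# Retrieve several elements at once by passing a list of labels.
subset = s[["a", "b"]]
print(subset.tolist())  # [1, 2]

# Group assignment: set several elements in one statement.
s[["a", "b"]] = 0
print(s.tolist())  # [0, 0, 3, 4]
```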

Filtering and maths operations
• Filtering and maths operations are easy with Pandas as well.
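For instance, a sketch with made-up values:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Filtering: keep only the elements greater than 2.
large = s[s > 2]
print(large.tolist())  # [3, 4]

# Maths: the operation is applied to every element.
doubled = s * 2
print(doubled.tolist())  # [2, 4, 6, 8]
```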

Pandas data frame
• Simplistically, a data frame is a table, with rows and columns.
• Each column in a data frame is a series object.
• Rows consist of elements inside series.

Case ID   Variable one   Variable two   Variable three

1         123            ABC            10
2         456            DEF            20
3         789            XYZ            30
Creating a Pandas data frame
• Pandas data frames can be constructed using Python dictionaries.
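A minimal sketch, reusing the values from the example table above. The column names are illustrative.

```python
import pandas as pd

# Each dictionary key becomes a column name; each list becomes that column.
df = pd.DataFrame({
    "variable_one": [123, 456, 789],
    "variable_two": ["ABC", "DEF", "XYZ"],
    "variable_three": [10, 20, 30],
})

print(df)
print(df.shape)  # (3, 3)
```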
• You can also create a data frame from a list.
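One way to do this, sketched with the same illustrative values, is to pass a list of rows together with the column names:

```python
import pandas as pd

# Each inner list is one row of the data frame.
rows = [[123, "ABC", 10], [456, "DEF", 20], [789, "XYZ", 30]]
df = pd.DataFrame(rows, columns=["variable_one", "variable_two", "variable_three"])

print(df)
```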

• You can ascertain the type of a column with the type() function.

• A Pandas data frame object has two indices: a column index and a row
index.
• Again, if you do not provide one, Pandas will create a RangeIndex
from 0 to N-1.
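The column type and the two indices can be inspected as sketched below, on an illustrative data frame:

```python
import pandas as pd

df = pd.DataFrame({
    "variable_one": [123, 456, 789],
    "variable_two": ["ABC", "DEF", "XYZ"],
})

# Selecting a single column gives back a pandas Series.
column = df["variable_one"]
print(type(column))

# The column index and the (default) row index.
print(df.columns)
print(df.index)  # RangeIndex(start=0, stop=3, step=1)
```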
• There are numerous ways to provide row indices explicitly.
• For example, you could provide an index when creating a data frame:


• Or you can set the index later, on an existing data frame.
• Here, I also named the index ‘country code’.
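Both approaches are sketched below on made-up data. The labels and the index name "Case ID" are assumptions for illustration; the slide used the name 'country code' for its own data.

```python
import pandas as pd

data = {"variable_one": [123, 456, 789], "variable_three": [10, 20, 30]}

# Option 1: provide the row index when creating the data frame.
df_a = pd.DataFrame(data, index=["a", "b", "c"])

# Option 2: assign the index afterwards, and give it a name.
df_b = pd.DataFrame(data)
df_b.index = ["a", "b", "c"]
df_b.index.name = "Case ID"

print(df_a)
print(df_b)
```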
• Row access using the index can be performed in several ways.
• First, you could use .loc[] and provide an index label.

• Second, you could use .iloc[] and provide an integer position.
• Particular rows and columns can be selected this way.

• You can give .loc[] two arguments, an index list and a column list;
slicing is supported as well.
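The row and column access described above can be sketched as follows, on an illustrative data frame with made-up labels. Note that `.loc` and `.iloc` are used with square brackets, and that label slices include both end points.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "variable_one": [123, 456, 789],
        "variable_two": ["ABC", "DEF", "XYZ"],
        "variable_three": [10, 20, 30],
    },
    index=["a", "b", "c"],
)

# One row, by index label and by integer position.
by_label = df.loc["b"]
by_position = df.iloc[1]

# Rows and columns together: a list of labels for each axis.
subset = df.loc[["a", "b"], ["variable_one", "variable_three"]]

# Label-based slicing on both axes (end points are included).
sliced = df.loc["a":"b", "variable_one":"variable_two"]

print(by_label)
print(subset)
print(sliced)
```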

Filtering
• Filtering is performed using so-called Boolean arrays.
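A minimal sketch on made-up data, with an arbitrary threshold:

```python
import pandas as pd

df = pd.DataFrame({
    "variable_one": [123, 456, 789],
    "variable_three": [10, 20, 30],
})

# The comparison gives a Boolean series, one True/False per row.
mask = df["variable_three"] > 10

# Indexing with it keeps only the rows where the mask is True.
filtered = df[mask]
print(filtered)
```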
Deleting columns
• You can delete a column using the drop() function.
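For example, on an illustrative data frame. By default `drop()` returns a new data frame and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "variable_one": [123, 456, 789],
    "variable_two": ["ABC", "DEF", "XYZ"],
    "variable_three": [10, 20, 30],
})

# Remove one column; `df` itself still has all three.
smaller = df.drop(columns=["variable_two"])

print(smaller.columns.tolist())  # ['variable_one', 'variable_three']
print(df.columns.tolist())
```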
Reading from and writing to a file
• Pandas supports many popular file formats including CSV, XML, HTML,
Excel, SQL, JSON, etc.
• Out of all of these, CSV is the file format that you will work with the
most.
• You can read in the data from a CSV file using the read_csv() function.

• Similarly, you can write a data frame to a csv file with the to_csv()
function.
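A self-contained sketch of a round trip is shown below. To avoid depending on a file on disk, it reads from and writes to in-memory text buffers; in practice you would pass a file path to both functions. The CSV contents are made up.

```python
import io

import pandas as pd

csv_text = "case_id,variable_one,variable_two\n1,123,ABC\n2,456,DEF\n"

# read_csv accepts a path or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# to_csv likewise; index=False leaves out the row index column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reading the written text back gives the same data frame.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip.equals(df))  # True
```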
• Pandas has the capacity to do much more than what we have covered
here, such as grouping data and even data visualisation.
• However, as with NumPy, we don’t have enough time to cover every
aspect of pandas here.
Exploratory data analysis (EDA)
Exploring your data is a crucial step in data analysis. It involves:
• Organising the data set
• Plotting aspects of the data set
• Maybe producing some numerical summaries; central tendency and
spread, etc.

“Exploratory data analysis can never be the whole story, but nothing
else can serve as the foundation stone.”
- John Tukey.
Download the data
• Download the Pokemon dataset from:

https://github.com/LewBrace/da_and_vis_python

• Unzip the folder, and save the data file in a location you’ll remember.
Reading in the data
• First we import the Python packages we are going to use.
• Then we use Pandas to load in the dataset as a data frame.

NOTE: The index_col argument states that we'll treat the first
column of the dataset as the ID column.
NOTE: The encoding argument allows us to bypass an input error created
by special characters in the data set.
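The two arguments can be sketched as below. This does not load the workshop's data file: it uses a tiny in-memory stand-in with illustrative rows, and the "latin-1" encoding is an assumption chosen for the example rather than the value used in the workshop.

```python
import io

import pandas as pd

# A stand-in for the file on disk: bytes encoded as Latin-1, including a
# special character (the accented letters in the first name).
raw = "#,Name,Attack\n1,Flabébé,38\n2,Bulbasaur,49\n".encode("latin-1")

# index_col=0 uses the first column as the row index (the ID column);
# encoding tells pandas how to decode the bytes.
df = pd.read_csv(io.BytesIO(raw), index_col=0, encoding="latin-1")

print(df)
print(df.index.name)  # #
```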
Examine the data set
• We could spend time staring at these
numbers, but that is unlikely to offer
us any form of insight.
• We could begin by conducting all of
our statistical tests.
• However, a good field commander
never goes into battle without first
doing a reconnaissance of the terrain…
• This is exactly what EDA is for…
Plotting a histogram in Python
Bins
• You may have noticed the two histograms we’ve seen so far look different,
despite using the exact same data.
• This is because they have different bin values.
• The left graph used the default bins generated by plt.hist(), while the one on the
right used bins that I specified.
• There are a couple of ways to manipulate bins in matplotlib.
• Here, I specified where the edges of the bars of the histogram are; the
bin edges.
• You could also specify the number of bins, and Matplotlib will automatically
generate a number of evenly spaced bins.
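Both ways of setting bins are sketched below with Matplotlib. The values are made up rather than taken from the workshop data, and the "Agg" backend is selected only so that the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt

values = [12, 25, 31, 38, 44, 47, 52, 58, 63, 71, 77, 95]  # made-up data

fig, (left, right) = plt.subplots(1, 2)

# Give only the number of bins: Matplotlib spaces them evenly.
counts_n, edges_n, _ = left.hist(values, bins=5)

# Give the bin edges explicitly.
counts_e, edges_e, _ = right.hist(values, bins=[0, 25, 50, 75, 100])

print(len(edges_n))  # 6 edges for 5 bins
print(counts_e)      # how many values fall between each pair of edges
```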
Seaborn
• Matplotlib is a powerful, but sometimes unwieldy, Python library.
• Seaborn provides a high-level interface to Matplotlib and makes it
easier to produce graphs like the one on the right.
• Some IDEs incorporate elements of this “under the hood” nowadays.
Benefits of Seaborn
• Seaborn makes it easy to:
- Use default themes that are aesthetically pleasing.
- Set custom colour palettes.
- Make attractive statistical plots.
- Display distributions easily and flexibly.
- Visualise information from matrices and DataFrames.
• The last three points have led to Seaborn becoming the exploratory
data analysis tool of choice for many Python users.
Plotting with Seaborn
• One of Seaborn's greatest strengths is its diversity of plotting
functions.
• Most plots can be created with one line of code.
• For example….
Histograms
• Allow you to plot the distributions of numeric variables.
Other types of graphs: Creating a scatter plot
Annotations on the code shown:
- Seaborn “linear model plot” function for creating a scatter graph
- Name of variable we want on the x-axis
- Name of variable we want on the y-axis
- Name of our dataframe fed to the “data=” argument
• Rather than a dedicated scatter plot function, we used Seaborn's function
for fitting and plotting a regression line; hence lmplot(). (Recent Seaborn
releases also provide scatterplot().)
• Seaborn makes it easy to alter plots.
• To remove the regression line, we use the fit_reg=False argument.
The hue argument
• Another useful feature in Seaborn is the hue argument, which enables
us to use a variable to colour code our data points.
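A sketch of a scatter plot drawn with lmplot(), with the regression line removed and points coloured by a variable. The data frame, its values, and the choice of "Attack", "Defense", and "Stage" as columns are assumptions for illustration, not the workshop's data.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import pandas as pd
import seaborn as sns

# Made-up values standing in for the real dataset.
df = pd.DataFrame({
    "Attack": [49, 62, 82, 52, 64, 84],
    "Defense": [49, 63, 83, 43, 58, 78],
    "Stage": [1, 2, 3, 1, 2, 3],
})

# fit_reg=False drops the regression line; hue colours points by Stage.
grid = sns.lmplot(x="Attack", y="Defense", data=df, fit_reg=False, hue="Stage")

ax = grid.axes.flat[0]
print(ax.get_xlabel(), ax.get_ylabel())
```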
Factor plots
• Make it easy to separate plots by categorical classes.

Colour by stage.
Separate by stage.
Generate using a swarmplot.
Rotate axis on x-ticks by 45 degrees.
A box plot
• The total, stage, and legendary entries are not combat stats so we should
remove them.
• Pandas makes this easy to do: we just use its .drop() method to create a
new dataframe that doesn’t include the variables we don’t want.
Seaborn’s theme
• Seaborn has a number of themes you can use to alter the appearance
of plots.
• For example, we can use “whitegrid” to add grid lines to our boxplot.
Violin plots
• Violin plots are useful alternatives to box plots.
• They show the distribution of a variable through the thickness of the violin.
• Here, we visualise the distribution of attack by Pokémon's primary type:
• Dragon types tend to have higher Attack stats than Ghost types, but they also have greater
variance. But there is something not right here….
• The colours!
Seaborn’s colour palettes
• Seaborn allows us to easily set custom colour palettes by providing it
with an ordered list of colour hex values.
• We first create our colours list.
• Then we just use the palette= argument and feed in our colours list.
• Because of the limited number of observations, we could also use a
swarm plot.
• Here, each data point is an observation, but data points are grouped
together by the variable listed on the x-axis.
Overlapping plots
• Both of these show similar information, so it might be useful to
overlap them.
Set size of print canvas.

Remove bars from inside the violins

Make bars black and slightly transparent

Give the graph a title

Data wrangling with Pandas
• What if we wanted to create such a plot that included all of the other
stats as well?
• In our current dataframe, all of the variables are in different columns:
• If we want to visualise all stats, then we’ll have to “melt” the
dataframe.
We use the .drop() function again to re-
create the dataframe without these three
variables.
The dataframe we want to melt.

The variables to keep, all others will be

melted.
A name for the new, melted, variable.
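The melt step annotated above can be sketched with pd.melt(). The data frame below is a made-up stand-in with only two stat columns, and the choice of "Name", "Type 1", and "Type 2" as the variables to keep is an assumption for illustration.

```python
import pandas as pd

# Made-up stand-in for the stats data frame.
stats_df = pd.DataFrame({
    "Name": ["A", "B"],
    "Type 1": ["Grass", "Fire"],
    "Type 2": ["Poison", None],
    "HP": [45, 39],
    "Attack": [49, 52],
})

melted_df = pd.melt(
    stats_df,                              # the dataframe we want to melt
    id_vars=["Name", "Type 1", "Type 2"],  # variables to keep as they are
    var_name="Stat",                       # name for the new, melted variable
)

print(melted_df)
print(melted_df.shape)  # (4, 5): 2 rows x 2 melted stat columns
```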

• All 6 of the stat columns have been "melted" into one, and
the new Stat column indicates the original stat (HP, Attack,
Defense, Sp. Attack, Sp. Defense, or Speed).
• It's hard to see here, but each Pokemon now has 6 rows of
data; hence the melted_df has 6 times as many rows.
• This graph could be made to look nicer with a few tweaks.

Enlarge the plot.

Separate points by hue.

Use our special Pokemon colour palette.
Adjust the y-axis.
Move the legend box outside of
the graph and place it to the right of
it.
Plotting all data: Empirical cumulative distribution functions (ECDFs)
• An alternative way of visualising a
distribution of a variable in a large dataset
is to use an ECDF.
• Here we have an ECDF that shows the
percentages of different attack strengths of
Pokemon.
• An x-value of an ECDF is the quantity you
are measuring; i.e. attack strength.
• The y-value is the fraction of data points
that have a value less than or equal to the
corresponding x-value. For example:
- 75% of Pokemon have an attack level of 90 or less.
- 20% of Pokemon have an attack level of 50 or less.
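The x- and y-values of an ECDF can be computed as sketched below. The helper name `ecdf` and the attack values are made up for illustration; the resulting x and y could then be drawn with, for example, plt.plot(x, y, marker=".", linestyle="none").

```python
import numpy as np


def ecdf(data):
    """Return sorted values and the fraction of points at or below each."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


attack = np.array([60, 30, 110, 45, 90, 50, 75, 80])  # made-up values
x, y = ecdf(attack)

# Fraction of points with an attack level of 50 or less: 3 out of 8.
print(y[x <= 50].max())  # 0.375
```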
Plotting an ECDF
• You can also plot multiple ECDFs
on the same plot.
• As an example, here we have an
ECDF for Pokemon attack, speed,
and defence levels.
• We can see here that defence
levels tend to be a little less than
the other two.
The usefulness of ECDFs
• It is often quite useful to plot the ECDF first as part of your workflow.
• It shows all the data and gives a complete picture as to how the data
are distributed.
Heatmaps
• Useful for visualising matrix-like data.
• Here, we’ll plot the correlation of the stats_df variables
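A sketch of this step: compute the correlation matrix with pandas and pass it to Seaborn's heatmap(). The `stats_df` below is a made-up stand-in with three columns, not the workshop's data frame.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import pandas as pd
import seaborn as sns

# Made-up values standing in for the real stats.
stats_df = pd.DataFrame({
    "HP": [45, 60, 80, 39, 58, 78],
    "Attack": [49, 62, 82, 52, 64, 84],
    "Defense": [49, 63, 83, 43, 58, 78],
})

# Pairwise correlations between the columns: a square, matrix-like table.
corr = stats_df.corr()

# Draw the matrix as a grid of coloured cells.
ax = sns.heatmap(corr)
print(corr.shape)  # (3, 3)
```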
Bar plot
• Visualises the distributions of categorical variables.

Rotates the x-ticks 45 degrees

Joint Distribution Plot
• Joint distribution plots combine information from scatter plots and
histograms to give you detailed information for bi-variate distributions.
Any questions?
