Lab Course - II (Foundations of Data Science)
Semester-V
Work Book For
Lab Course - II (Foundations of Data Science)
Name
Academic Year
Objectives –
1) Students should carry this book during practical sessions of Computer Science.
2) Students should maintain a separate journal for the source code and outputs.
3) Students should read the topics mentioned in the Reading section of this book before coming for the practical.
4) Students should solve all exercises selected by the practical in-charge as part of the journal activity.
Assessment scale: Complete – 4 marks; Well-done – 5 marks.
Instructions to the practical in-charge
1. Explain the assignment and related concepts in around ten minutes, using a whiteboard if required or by demonstrating the software.
Name of Student:
Total Marks: / 20
Converted Marks: / 10
Signature of Incharge
Date:
Reading
You should read the following topics before starting this exercise:
Typing and executing a Python script.
Introductory concepts of Data Science and Data Science Lifecycle.
Ready Reference
Installing Python
There are two approaches to installing Python:
• You can download Python directly from its project site and install individual components and
libraries (https://www.python.org/downloads/)
• Alternately, you can download and install a package like Anaconda, which comes with pre-installed
libraries. (https://www.anaconda.com/products/individual)
Choosing a development environment
Once you have installed Python, there are various options for choosing a development environment. Here are the four most common options:
1. Terminal / Shell based
2. IDLE
3. iPython notebook (Jupyter notebook)
4. Google Colaboratory
1. Terminal/Shell
On Linux, open the terminal by searching for it in the dashboard, by pressing Ctrl + Alt + T, or by right-clicking on the desktop and choosing Open Terminal. In the terminal, type python.
A Python prompt (>>>) appears. At the prompt, type any Python command and press Enter to execute it.
Executing a Python script:
Python programs are plain text files; they can be written with something as simple as a basic text editor. Type the script and save it as a text file with the extension .py.
To run a Python script (program), navigate the terminal to the directory where the script is located using the cd command, then type python program_name.py in the terminal to execute the script, as in the example below.
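For example, a minimal script (the file name hello.py is just an illustration):
# hello.py - a minimal Python script
print("Hello, Data Science!")
Running it from the terminal:
$ cd /path/to/script
$ python hello.py
Hello, Data Science!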
2. IDLE
IDLE stands for Integrated Development and Learning Environment. It is a simple IDE developed primarily for beginners. It offers a variety of features, such as a Python shell with syntax highlighting, a multi-window text editor, code autocompletion, intelligent indenting, and program animation and stepping for debugging.
(https://docs.python.org/3/library/idle.html)
3. Jupyter notebook
The Jupyter Notebook is an open source web application that you can use to create and share documents that
contain live code, equations, visualizations, and text. It needs to be installed separately. You can use a handy
tool that comes with Python called pip to install Jupyter Notebook like this:
$ pip install jupyter
The other most popular distribution of Python is Anaconda. Anaconda has its own installer tool called
conda that is used for installing packages. However, Anaconda comes with many scientific libraries
preinstalled, including the Jupyter Notebook (https://jupyter.org/)
4. Google Colaboratory
Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows anybody to write and
execute arbitrary python code through the browser, and is especially well suited to data science, data
analysis and machine learning. (https://colab.research.google.com/notebooks/intro.ipynb)
DATA SCIENCE
Data science is the field of study that combines domain expertise, programming skills, and knowledge of
mathematics and statistics to extract meaningful insights from data.
Data is the foundation of data science. Data comes from various sources, in various forms and sizes.
A dataset is a complete collection of all observations arranged in some order.
DATA SCIENCE TASKS
1. Data Exploration – finding out more about the data to understand the nature of data that we have to
work with. We need to understand the characteristics, format, and quality of data and find
correlations, general trends, and outliers. The basic Steps in Data Exploration are:
a. Import the library
b. Load the data
c. What does the data look like?
d. Perform exploratory data analysis
e. Visualize the data
2. Data Munging – cleaning the data and playing with it to make it better suited for statistical modeling
3. Predictive Modeling – running the actual data analysis and machine learning algorithms
For the above process, we need to get acquainted with some useful Python libraries.
PYTHON LIBRARIES FOR DATA SCIENCE
Several libraries have been developed for data science and machine learning. The following libraries are needed for most data science tasks:
• Statsmodels for statistical modeling. Statsmodels is part of the Python scientific stack, oriented
towards data science, data analysis and statistics. It allows users to explore data, estimate statistical
models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests,
plotting functions, and result statistics are available for different types of data and each estimator.
Python Libraries for Data Visualization
• Matplotlib for plotting a vast variety of graphs: bar charts, pie charts, histograms, scatter plots, line plots, heat maps, and more. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, etc., and can be used to embed plots into applications. The pyplot module also provides a MATLAB-like interface that is just as versatile and useful.
• Seaborn for statistical data visualization. Seaborn is a Python data visualization library that is based
on Matplotlib and closely integrated with the numpy and pandas data structures. It is used for making
attractive and informative statistical graphics in Python.
• Plotly Plotly is a free open-source graphing library that can be used to form data visualizations.
Plotly (plotly.py) is built on top of the Plotly JavaScript library (plotly.js) and can be used to create
web-based data visualizations that can be displayed in Jupyter notebooks or web applications using
Dash or saved as individual HTML files.
• Bokeh for creating interactive plots, dashboards and data applications on modern web-browsers. It
empowers the user to generate elegant and concise graphics in the style of D3.js. Moreover, it has the
capability of high-performance interactivity over very large or streaming datasets.
Python libraries for Data extraction
• BeautifulSoup is a parsing library in Python that enables web scraping from HTML and XML
documents.
• Scrapy for web crawling. It is a very useful framework for getting specific patterns of data. It has the
capability to start at a website home url and then dig through web-pages within the website to gather
information. It gives you all the tools you need to efficiently extract data from websites, process
them as you want, and store them in your preferred structure and format.
Python Libraries for Machine learning
• Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot
of efficient tools for machine learning and statistical modeling including classification, regression,
clustering and dimensionality reduction.
• TensorFlow Developed by Google Brain Team, TensorFlow is an open-source library used for deep
learning applications.
• Keras It is an open-source library that provides an interface for the TensorFlow library and enables
fast experimentation with deep neural networks.
Installing a library
Python comes with a package manager for Python packages called pip, which can be used to install a library. pip is a recursive acronym for "Pip Installs Packages"; it is also expanded as "Preferred Installer Program".
pip install package-name
Importing a library
The first step in using a package is to import it into the programming environment. There are several ways of doing so in Python:
i) import package-name
ii) import package-name as alias
iii) from package-name import *
In (i), the specified package-name is imported into the environment. Subsequently, we have to use the whole package-name each time we access any function or method.
Example:
import numpy
a = numpy.array([2, 3, 4,5])
In (ii), we have defined an alias for the package-name. Instead of using the whole package-name, the short alias can be used.
import numpy as np
a = np.array([2, 3, 4,5])
In (iii), we have imported the entire namespace of the package, i.e., all methods and operations can be used directly without referring to the package-name.
from numpy import *
a = array([2, 3, 4,5])
Importing the dataset
We can find various dataset sources which are freely available for public use. A few popular websites through which we can access datasets are:
1. Kaggle : https://www.kaggle.com/datasets.
2. UCI Machine learning repository: https://archive.ics.uci.edu/ml/index.php.
3. Datasets at AWS resources: https://registry.opendata.aws/.
4. Google’s dataset search engine https://datasetsearch.research.google.com/
5. Government datasets: Various countries publish data collected by their government departments for public use. The Open Government Data Platform, India publishes several datasets for general use at https://data.gov.in/
To perform operations on a dataset, it is stored as a dataframe in Python. A dataframe is a 2-dimensional
labeled data structure with columns of potentially different types. The Pandas library is most useful for
working with dataframes.
There are multiple ways to create a dataframe: from lists, dictionaries, Series, NumPy ndarrays, or another dataframe. To carry out operations on a dataframe, the following pandas functions are commonly required:
Function: read_csv()
Description: Reads a comma-separated values (CSV) file into a Pandas DataFrame. Related functions: read_json(), read_html(), read_excel().
Example: import pandas as pd
df = pd.read_csv('filename.csv')

Function: head(), tail()
Description: head(n) returns the first n rows of the dataset; by default, df.head() returns the first 5 rows of the DataFrame. tail() gives rows from the bottom of the dataframe.
Example: df.head(6)

Function: describe()
Description: Generates descriptive statistics of the data in a Pandas DataFrame, giving a quick overview of the data.
Example: df.describe()

Function: info()
Description: Displays information about the data, such as the column names, the number of non-null values in each column (feature), the type of each column, memory usage, etc.
Example: df.info()

Function: dtypes
Description: Shows the data type of each column.
Example: df.dtypes

Function: shape, size
Description: shape returns the number of rows and columns in the dataframe; size returns the size of the dataframe, i.e. the number of rows multiplied by the number of columns.
Example: df.shape
df.size

Function: value_counts()
Description: Identifies the distinct categories in a feature, as well as the count of values per category.
Example: df['column'].value_counts()

Function: sample()
Description: Selects values randomly from a Series or DataFrame. Useful when we want to select a random sample from a distribution.
Example: df.sample(10)

Function: drop_duplicates()
Description: Returns a Pandas DataFrame with duplicate rows removed; inplace=True makes sure the changes are applied to the original dataset.
Example: df.drop_duplicates(inplace=True)

Function: sort_values()
Description: Sorts a Pandas DataFrame by the values of a column (or list of columns), in ascending or descending order; with inplace=True, the change is made directly in the original DataFrame.
Example: df.sort_values(by='column', inplace=True)

Function: loc[:]
Description: Accesses a row or group of rows (a slice of the dataset). The index starts from 0 in Python, and loc[:] is inclusive on both ends.
Example: df.loc[:] #All rows
df.loc[0] #First row
df.loc[1:4] #Rows 1,2,3 and 4
df.loc[3:] #All rows from row 3
df.loc[1:3, ['column1','column2']] #Rows 1,2,3 with specific columns

Function: drop()
Description: Drops a particular column from the dataframe; inplace=True makes the change in the dataframe.
Example: df.drop(columns=['column'], inplace=True)

Function: query()
Description: Filters a dataframe based on a condition, or applies a mask to get certain values; queries the columns with a Boolean expression.
Example: df.query('year > 2019')

Function: insert()
Description: Inserts a new column into a dataframe at the given position.
Example: df.insert(pos, 'columnname', values)

Function: isnull(), notnull()
Description: isnull() detects missing values and returns a boolean same-sized object indicating which values are NA; notnull() detects non-missing values.
Example: df.isnull()

Function: dropna(), fillna()
Description: dropna() removes any rows of the DataFrame that contain a NaN value. In many cases you will want to replace missing values instead of dropping them completely; the fillna() method is designed for this.
Example: df.dropna()
df.fillna("No value available")
Self Activity
I. To create a dataset
The following dataset of cars (with columns company, model, and year) can be created in several ways. Type and execute the code in the following activities.
Activity 1: Create empty dataframe and add records
#Import the library
import pandas as pd
#Create an empty data frame with column names
df = pd.DataFrame(columns = ['company','model','year'])
#Add records
df.loc[0] = ['Tata', 'Nexon', 2017]
df.loc[1] = ['MG', 'Astor', 2021]
df.loc[2] = ['KIA', 'Seltos', 2019]
df.loc[3] = ['Hyundai', 'Creta', 2015]
#Print the dataframe
df
import numpy as np
# Pass a 2D numpy array - each row is the corresponding row in the dataframe
data = np.array([['Tata','Nexon',2017],
['MG','Astor',2021],
['KIA','Seltos', 2019],
['Hyundai','Creta',2015]])
# pass column names in the columns parameter of the constructor
df = pd.DataFrame(data, columns = ['company', 'model','year'])
df
#Company list (assumed from Activity 1; the original line was lost at a page break)
company = ['Tata', 'MG', 'KIA', 'Hyundai']
#Model list
model = ['Nexon','Astor','Seltos','Creta']
#Year list
year = [2017,2021,2019,2015]
# and merge them by using zip().
data = list(zip(company, model,year))
# Create pandas Dataframe using data
df = pd.DataFrame(data,columns = ['company','model','year'])
# Print dataframe
df
Activity 8: Adding rows and columns with Invalid, duplicate or missing data
#Add rows to the dataset with empty or invalid values.
df.loc[4] = ['Honda', 'Jazz', None]
df.loc[5]= [None, None, None]
df.loc[6]=['Toyota', None , 2018]
df.loc[7]=[ 'Tata','Nexon',2017]
Note: a new column can be added with specific values. For example, if we want the new column to have
model name + year name, the command will be:
df['new']=df['model']+' '+df['year'].astype(str) #Convert year to string
Download the Iris dataset from https://www.kaggle.com/uciml/iris
If you are using Google Colab, run the following code to upload the dataset:
from google.colab import files
iris_uploaded = files.upload()
This will prompt you for selecting the csv file to be uploaded. To load the file, run the following code:
import io
df = pd.read_csv(io.BytesIO(iris_uploaded['Iris.csv']))
df
Activity 14: Visualize the dataset
The matplotlib.pyplot module is most commonly used in Python for plotting graphs, in both 2D and 3D formats. It provides features such as legends, labels, grids, control of graph shape, and many more. Seaborn is another library that provides different and attractive graphs and plots. Type and execute the following commands:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
To create a scatter plot of sepal length vs petal length:
We can plot with matplotlib or seaborn. For matplotlib, we can either call the .plot.scatter() method in pandas or use plt.scatter(). For seaborn, we use the sns.scatterplot() function.
1. df.plot(kind ="scatter", x='SepalLengthCm', y='PetalLengthCm')
2. df.plot.scatter(x='SepalLengthCm', y='PetalLengthCm')
3. plt.scatter(df['SepalLengthCm'],df['PetalLengthCm'])
To plot a comparative graph of sepal length vs sepal width for all species using a scatter plot (keyword arguments are used, as recent versions of seaborn require them):
plt.figure(figsize=(17,9))
plt.title('Comparison between various species based on sepal length and width')
sns.scatterplot(data=df, x='SepalLengthCm', y='SepalWidthCm', hue='Species', s=50)
Set A
1. Write a Python program to create a dataframe containing columns name, age and percentage. Add 10
rows to the dataframe. View the dataframe.
2. Write a Python program to print the shape, number of rows-columns, data types, feature names and
the description of the data
3. Write a Python program to view basic statistical details of the data.
4. Write a Python program to add 5 rows with duplicate values and missing values. Add a column 'remarks' with empty values. Display the data.
5. Write a Python program to get the number of observations, missing values and duplicate values.
6. Write a Python program to drop ‘remarks’ column from the dataframe. Also drop all null and empty
values. Print the modified data.
7. Write a Python program to generate a line plot of name vs percentage
8. Write a Python program to generate a scatter plot of name vs percentage
Set B
1. Download the heights and weights dataset and load the dataset from the given csv file into a dataframe. Print the first 10 rows, the last 10 rows, and 20 random rows. (https://www.kaggle.com/burnoutminer/heights-and-weights-dataset)
2. Write a Python program to find the shape, size, datatypes of the dataframe object.
3. Write a Python program to view basic statistical details of the data.
4. Write a Python program to get the number of observations, missing values and nan values.
5. Write a Python program to add a column "BMI" to the dataframe, calculated as BMI = weight / height².
6. Write a Python program to find the maximum and minimum BMI.
7. Write a Python program to generate a scatter plot of height vs weight.
Assignment Evaluation
Assignment 2: Statistical Data Analysis
No. of sessions: 01
Objectives
• To build an experimental foundation for data analysis using the statistical methods learned in theory.
• To learn statistical libraries in Python and use them on actual data.
• To write basic Python scripts using various functions defined in these libraries.
• To understand the importance of statistical data analysis in various fields.
Reading
You should read the following topics before starting this exercise:
Why we do data analysis, the difference between inferential statistics and descriptive statistics, features of the Python programming language, basics of Python, basic scripting in Python, types of attributes.
Ready Reference
Understanding Descriptive Statistics
Descriptive statistics is about describing and summarizing data. It uses two main approaches:
1. The quantitative approach describes and summarizes data numerically (e.g., mean, median, variance).
2. The visual approach illustrates data with charts, plots, histograms, and other graphs.
Proximity Measures
Proximity measures are mathematical techniques used to check the similarity (or, conversely, the dissimilarity) between objects. These measures are usually based on distance measures, although they can also be based on probabilities.
Distance measures play a vital role in machine learning and other data science algorithms: k-Nearest Neighbors, Learning Vector Quantization (LVQ), Self-Organizing Maps (SOM), and k-Means Clustering all use them while computing. Four types of distances are mostly used in the field of machine learning, viz. Hamming distance, Euclidean distance, Manhattan distance, and Minkowski distance, as sketched below.
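As an illustration, a minimal sketch of these distances using NumPy and SciPy (the arrays are sample values, not part of this workbook):
import numpy as np
from scipy.spatial import distance

p = np.array([1, 2, 3])
q = np.array([4, 6, 3])
# Euclidean distance: square root of the sum of squared differences
print(np.linalg.norm(p - q))                         # 5.0
# Manhattan distance: sum of absolute differences
print(np.sum(np.abs(p - q)))                         # 7
# Minkowski distance of order r = 3
print(distance.minkowski(p, q, 3))                   # ~4.498
# Hamming distance: fraction of positions that differ
print(distance.hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # 0.5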
We may have numeric, ordinal, or categorical attributes, and hence different types of objects. The steps taken to compute a proximity measure differ accordingly.
e.g., Proximity measures for ordinal attributes
An ordinal attribute is an attribute whose possible values have a meaningful order or ranking among them, but where the size of the gap between two successive values is unknown. Consider a movie rating (1-10) or a satisfaction rating at a shopping mall (1-5). To compute proximity, we first replace each value by its rank r in the ordered scale of M values, normalize the ranks to [0, 1] using the formula z = (r − 1) / (M − 1), and then find the dissimilarity matrix for the normalized values, as sketched below. Such measures are helpful/required in clustering, outlier analysis, and nearest-neighbour classification.
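A minimal sketch of this normalization and the resulting dissimilarity matrix (the ratings are sample values):
import numpy as np

# ordinal ratings on a 1-5 satisfaction scale (sample data)
ratings = np.array([1, 3, 5, 4])
M = 5
z = (ratings - 1) / (M - 1)               # normalize ranks to [0, 1]
# dissimilarity between objects i and j is |z[i] - z[j]|
diss = np.abs(z[:, None] - z[None, :])
print(diss)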
Outliers
An outlier is a data point that differs significantly from the majority of the data taken from a sample or population. There are many possible causes of outliers; common ones include natural variation in the data, changes in the behavior of the observed system, and errors in observation, recording, or data entry.
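One common way to flag outliers is the 1.5 × IQR rule; a minimal sketch on sample data:
import numpy as np

data = np.array([13, 52, 44, 32, 30, 0, 36, 45, 120])   # sample values
q3, q1 = np.percentile(data, [75, 25])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)    # [  0 120]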
Self Activity
Choosing Python Statistics Libraries
Python's statistics module is a built-in library for descriptive statistics. You can use it if your datasets are not too large or if you can't rely on importing other libraries.
NumPy is a third-party library for numerical computing, optimized for working with single-
and multi-dimensional arrays. Its primary type is the array type called ndarray. This library
contains many routines for statistical analysis.
SciPy is a third-party library for scientific computing based on NumPy. It offers additional functionality compared to NumPy, including scipy.stats for statistical analysis. It is an open-source library, useful for solving mathematical, scientific, engineering, and technical problems, and it allows us to manipulate the data and visualize the results using Python commands.
Pandas is a third-party library for numerical computing based on NumPy. It excels in handling
labeled one-dimensional (1D) data with Series objects and two-dimensional (2D) data with
DataFrame objects. The describe() function is used to view basic statistical details like percentile, mean, std, etc. of a dataframe or a series of numeric values. describe() takes an argument 'include' that specifies which columns should be considered for summarizing; its value can be 'all' or a specific type.
df.head() is a method that displays the first 5 rows of the dataframe; similarly, tail() displays the last 5 rows.
Matplotlib is a third-party library for data visualization. It works well in combination with
NumPy, SciPy, and Pandas.
Pandas' describe() function provides basic details about a dataframe, but when we are interested in more detail for EDA (exploratory data analysis), pandas_profiling is useful. It reports types, unique values, missing values, basic statistics, etc. With pandas profiling, the results can be exported as formatted output such as HTML.
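A minimal sketch (assuming pandas_profiling is installed and df is an existing dataframe):
from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="EDA Report")
profile.to_file("report.html")    # formatted HTML output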
Self-Activities
Example 1:
import numpy as np
demo = np.array([[30,75,70],[80,90,20],[50,95,60]])
print(demo)
print('\n')
print(np.mean(demo))
print('\n')
print(np.median(demo))
print('\n')
Output
[[30 75 70]
[80 90 20]
[50 95 60]]
63.333333333333336
70.0
Example 2:
import pandas as pd
import numpy as np
d = {'Name':pd.Series(['Ram','Sham','Meena','Seeta','Geeta','Rakesh','Madhav']),
'Age':pd.Series([25,26,25,23,30,29,23]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df = pd.DataFrame(d)
print(df.sum())
Output
Name RamShamMeenaSeetaGeetaRakeshMadhav
Age 181
Rating 25.61
dtype: object
Example 3:
import pandas as pd
import numpy as np
md = {'Name':pd.Series(['Ram','Sham','Meena','Seeta','Geeta','Rakesh','Madhav']),
'Age':pd.Series([25,26,25,23,30,29,23]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df = pd.DataFrame(md)
print(df.describe())
Output
Age Rating
count 7.000000 7.000000
mean 25.857143 3.658571
std 2.734262 0.698628
min 23.000000 2.560000
25% 24.000000 3.220000
50% 25.000000 3.800000
75% 27.500000 4.105000
max 30.000000 4.600000
Example 4:
import numpy as np
data = np.array([13,52,44,32,30, 0, 36, 45])
print("Standard Deviation of sample is % s "% (np.std(data)))
Output
Standard Deviation of sample is 16.263455967290593
Example 5:
import pandas as pd
import scipy.stats as s
score={'Virat': [92,97,85,74,71,55,85,63,42,32,71,55], 'Rohit' :
[89,87,67,55,47,72,76,79,44,92,99,47]}
df=pd.DataFrame(score)
print(df)
print("\nArithmetic Mean Values")
print("Score 1", s.tmean(df["Virat"]).round(2))
print("Score 2", s.tmean(df["Rohit"]).round(2))
Output
Virat Rohit
0 92 89
1 97 87
2 85 67
3 74 55
4 71 47
5 55 72
6 85 76
7 63 79
8 42 44
9 32 92
10 71 99
11 55 47

Arithmetic Mean Values
Score 1 68.5
Score 2 71.17
Example 6:
import pandas as pd
from scipy.stats import iqr
import matplotlib.pyplot as plt
d={'Name': ['Ravi', 'Veena', 'Meena', 'Rupali'], 'subject': [92,97,85,74]}
df = pd.DataFrame(d)
df['Percentile_rank']=df.subject.rank(pct=True)
print("\n Values of Percentile Rank in the Distribution")
print(df)
print("\n Measures of Dispersion and Position in the Distribution")
r=max(df["subject"]) - min(df["subject"])
print("\n Value of Range in the Distribution = ", r)
s=round(df["subject"].std(),3)
print("Value of Standard Deviation in the Distribution = ", s)
v=round(df["subject"].var(),3)
print("Value of Variance in the Distribution = ", v)
Output
Values of Percentile Rank in the Distribution
Name subject Percentile_rank
0 Ravi 92 0.75
1 Veena 97 1.00
2 Meena 85 0.50
3 Rupali 74 0.25

Measures of Dispersion and Position in the Distribution

Value of Range in the Distribution = 23
Value of Standard Deviation in the Distribution = 9.967
Value of Variance in the Distribution = 99.333
Example 7
import numpy as np
mydata = np.array([24, 29, 20, 22, 24, 26, 27, 30, 20, 31, 26, 38, 44, 47])
q3, q1 = np.percentile(mydata, [75 ,25])
iqrvalue = q3 - q1
print(iqrvalue)
Output
6.75
Lab Assignments
SET A:
1. Write a Python program to find the maximum and minimum value of a given flattened
array.
Expected Output:
Original flattened array:
[[0 1]
[2 3]]
Maximum value of the above flattened array:
3
Minimum value of the above flattened array:
0
2. Write a Python program to compute the Euclidean distance between two data points in a dataset. [Hint: use the linalg.norm function from NumPy]
3. Create one dataframe of data values. Find out mean, range and IQR for this data.
4. Write a Python program to compute the sum of Manhattan distances between all pairs of points.
5. Write a NumPy program to compute the histogram of nums against the bins.
Sample Output:
nums: [0.5 0.7 1. 1.2 1.3 2.1]
bins: [0 1 2 3]
Result: (array([2, 3, 1], dtype=int64), array([0, 1, 2, 3]))
6. Create a dataframe for students' information such as name, graduation percentage and age. Display the average age of students and the average graduation percentage. Also, describe all basic statistics of the data. (Hint: use describe()).
SET B:
1. Download iris dataset file. Read this csv file using read_csv() function. Take samples
from entire dataset. Display maximum and minimum values of all numeric attributes.
2. Continue with above dataset, find number of records for each distinct value of class
attribute. Consider entire dataset and not the samples.
3. Display the column-wise mean and median for the iris dataset. (Hint: use the mean() and median() functions of the pandas dataframe.)
SET C:
1. Write a Python program to find the Minkowski distance between two points.
2. Write a Python NumPy program to compute the weighted average along the specified
axis of a given flattened array.
From Wikipedia: The weighted arithmetic mean is similar to an ordinary arithmetic
mean (the most common type of average), except that instead of each of the data points
contributing equally to the final average, some data points contribute more than others.
The notion of weighted mean plays a role in descriptive statistics and also occurs in a
more general form in several other areas of mathematics.
Sample output:
Original flattened array:
[[0 1 2]
[3 4 5]
[6 7 8]]
Weighted average along the specified axis of the above flattened array:
[1.2 4.2 7.2]
3. Write a NumPy program to compute cross-correlation of two given arrays.
Sample Output:
Original array1:
[0 1 3]
Original array2:
[2 4 5]
Cross-correlation of the said arrays:
[[2.33333333 2.16666667]
[2.16666667 2.33333333]]
4. Download any dataset from UCI (do not repeat it from set B). Read this csv file using
read_csv() function. Describe the dataset using appropriate function. Display mean
value of numeric attribute. Check any data values are missing or not.
5. Download nursery dataset from UCI. Split dataset on any one categorical attribute.
Compare the means of each split. (Use groupby)
6. Create one dataframe with 5 subjects and marks of 10 students for each subject. Find
arithmetic mean, geometric mean, and harmonic mean.
7. Download any csv file of your choice and display details about data using pandas
profiling. Show stats in HTML form.
Assignment Evaluation
Assignment 3 : Data Preprocessing
Number of Slots = 1
Objectives
Reading
You should read the following topics before starting this exercise:
Importing libraries and datasets, handling dataframes, basic concepts of data preprocessing, handling missing data. The data munging process includes operations such as data cleaning, data transformation, data reduction, and data discretization.
Ready Reference
Data Munging
Data munging involves four main operations: Data Cleaning, Data Transformation, Data Reduction, and Data Discretization.
Step 1: Importing the libraries
1. NumPy: a library that allows us to work with arrays; as most machine learning models work on arrays, NumPy makes this easier.
2. Matplotlib: this library helps in plotting graphs and charts, which are very useful for showing the results of your model.
3. Pandas: pandas allows us to import our dataset and also creates a matrix of features containing the dependent and independent variables.
Syntax:
import numpy as np   # used for handling numbers
import pandas as pd   # used for handling the dataset
from sklearn.impute import SimpleImputer   # used for handling missing data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder   # used for encoding categorical data
from sklearn.model_selection import train_test_split   # used for splitting data (the imported name here is an assumption; it was lost in extraction)
Step 2: Importing the dataset
Syntax:
data = pd.read_csv('../input/Data_1.csv')
DataFrame
A DataFrame is a 2-dimensional data structure, like a 2-dimensional array or a table with rows and columns.
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)   # build the dataframe from the dictionary
print(df)
.loc[ ] is a label-based data selection method, which means that we have to pass the name of the row or column we want to select. This method includes the last element of the range passed to it, and it can also be used to select data according to conditions. (A .loc[] example follows the .iloc[] examples below.)
.iloc[] is purely integer-location based indexing for selection by position (from 0 to length-1 of the axis), but it may also be used with a boolean array.
For example:
>>>lst = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
… {'a': 10, 'b': 20, 'c': 30, 'd': 40},
… {'a': 100, 'b': 200, 'c': 300, 'd': 400 }]
>>>d1 = pd.DataFrame(lst)
>>>d1
a b c d
0 1 2 3 4
1 10 20 30 40
2 100 200 300 400
• Type 1 (with integer inputs)
>>> d1.iloc[[0]]
   a  b  c  d
0  1  2  3  4
>>>d1.iloc[[0, 1]]
a b c d
0 1 2 3 4
1 10 20 30 40
>>>d1.iloc[:3]
a b c d
0 1 2 3 4
1 10 20 30 40
2 100 200 300 400
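For comparison, the label-based .loc[] on the same dataframe; note that .loc includes both endpoints of a slice:
>>> d1.loc[0:1]        # rows with labels 0 and 1 (both included)
    a   b   c   d
0   1   2   3   4
1  10  20  30  40
>>> d1.loc[:, ['a', 'c']]   # all rows, selected columns only
     a    c
0    1    3
1   10   30
2  100  300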
Describing Dataset:
describe() is used to view some basic statistical details like percentile, mean, std etc. of a data frame
or a series of numeric values.
Dataset shape:
To get the shape of DataFrame, use DataFrame.shape. The shape property returns a tuple representing the
dimensionality of the DataFrame. The format of shape would be (rows, columns).
Handling Missing Data
Some values in the data may not be filled in, for various reasons, and hence are considered missing. In some cases, missing data arises because data was never gathered in the first place for some entities. The data analyst needs to take an appropriate decision for handling such data. Common approaches are:
1. Ignore the Tuple.
2. Fill in the Missing Value Manually.
3. Use a Global Constant to Fill in the Missing Value.
4. Use a Measure of Central Tendency for the Attribute (e.g., the Mean or Median) to Fill in
the Missing Value.
5. Use the Attribute Mean or Median for all Samples belonging to the same Class as the given
Tuple.
6. Use the most probable value to fill in the missing value.
# import the pandas library
import pandas as pd
import numpy as np
#Creating a DataFrame with Missing Values
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e',
'f','h'], columns=['C01', 'C02', 'C03'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print("\n Reindexed Data Values")
print(" ")
print(df)
#Method 1 - Filling Every Missing Values with 0
print("\n\n Every Missing Value Replaced with '0':")
print(" ")
print(df.fillna(0))
#Method 2 - Dropping Rows Having Missing Values
print("\n\n Dropping Rows with Missing Values:")
print(" ")
print(df.dropna())
#Method 3 - Replacing Missing Values with the Median Value
median = df['C01'].median()
df['C01'].fillna(median, inplace=True)
print("\n\n Missing Values for Column 1 Replaced with Median Value:")
print(" ")
print(df)
Data transformation is the process of converting raw data into a format or structure that would be
more suitable for data analysis.
1. Rescaling:
Rescaling means transforming the data so that it fits within a specific scale, like 0-100 or 0-1.
Rescaling of data allows scaling all data values to lie between a specified minimum and
maximum value (say, between 0 and 1).
2. Normalizing:
The measurement unit used can affect the data analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results. Normalizing gives all attributes an equal weight.
3. Binarizing:
Binarizing is the process of converting data to either 0 or 1 based on a threshold value.
4. Standardizing Data:
Standardization is another scaling technique, where the values are centered around the mean with a unit standard deviation.
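The code that produced the output below does not appear in this copy; a minimal sketch using scikit-learn's preprocessing module (assuming the input matrices shown in the output) would be:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Binarizer, scale

X = np.array([[1, 12, 22], [3, 2, 34], [7, 7, -11], [4, 1, 9]])
# Rescaling between 0 and 1
print(np.round(MinMaxScaler().fit_transform(X), 2))
# Binarizing with a threshold (values above the threshold become 1)
print(Binarizer(threshold=5).fit_transform(X))
# Standardizing to zero mean and unit standard deviation
Y = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
print(np.round(scale(Y), 2))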
C01 C02 C03
0 1 12 22
1 3 2 34
2 7 7 -11
3 4 1 9
Data Scaled Between 0 to 1
Min Max Scaled Data
[[0. 1. 0.73 ]
[0.33 0.09 1. ]
[1. 0.55 0. ]
[0.5 0. 0.44 ]]
Binarized Data (threshold = 5)
[[0 1 1]
[0 0 1]
[1 1 0]
[0 0 1]]
Standardizing Data
Original Data
[[ 1. -1. 2.]
[ 2. 0. 0.]
[ 0. 1. -1.]]
Initial Mean : 0.44
Initial Standard Deviation : 1.07
Standardized Data
[[ 0. -1.22 1.34]
[ 1.22 0. -0.27]
[-1.22 1.22 -1.07]]
Scaled Mean : 0.0
Scaled Standard Deviation : 1.0
5. Label Encoding:
The label encoding process is used to convert textual labels into numeric form in order to
prepare it to be used in a machine-readable form. In label encoding, categorical data is
converted to numerical data, and the values are assigned labels (such as 1, 2, and 3).
6. One Hot Encoding:
One hot encoding refers to splitting a column which contains categorical data into many columns, depending on the number of categories present in that column. Each new column contains "1" or "0" according to which category the row belongs to. This creates a binary column for each category and returns a sparse matrix or dense array.
During encoding, we transform text data into numeric data. Encoding categorical data involves
changing data that fall into categories to numeric data.
The Data_1.csv dataset has two categorical variables: the Country variable and the Feedback variable. These two variables are categorical because they contain categories; Country contains three categories (India, USA and Brazil) and Feedback contains categories such as Yes and No. The following example converts these categorical variables into numerical ones.
LabelEncoder encodes labels by assigning them numbers. Thus, if the feature is country with values such as ['India', 'Brazil', 'USA'], LabelEncoder may encode the country label as [0, 1, 2]. We will apply LabelEncoder to the Country column (column 0) and the Feedback column (column 3).
As you can see, the country labels are replaced by numbers [0,1,2] where ‘Brazil’ is assigned 0,
‘India’ is 1 and ‘USA’ is 2. Similarly, we will encode the Feedback column.
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df['Feedback'] = labelencoder.fit_transform(df['Feedback'])
We could have encoded both columns together in the following way:
cols = ['Country', 'Feedback']
df[cols] = df[cols].apply(LabelEncoder().fit_transform)
As you can see, the countries are assigned the values 0, 1 and 2. This introduces an order among the values although there is no relationship or order between the countries, which may cause problems in processing. An order among encoded values is acceptable only when the underlying labels are themselves ordered (for example, feedback labels such as 'Poor', 'Fair', 'Good' and 'Excellent').
To avoid this problem, we use one hot encoding. Here, the country labels will be encoded as 3 binary columns (additional columns will be introduced in the dataframe), as in the sketch below.
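A minimal sketch of this step with scikit-learn's OneHotEncoder (assuming the dataframe df from above; get_feature_names_out requires scikit-learn 1.0 or newer):
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

enc = OneHotEncoder()
enc_arr = enc.fit_transform(df[['Country']]).toarray()    # one column per category
enc_df = pd.DataFrame(enc_arr, columns=enc.get_feature_names_out(['Country']))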
After executing the above code, the Country variable is encoded in 3 binary columns. The leftmost bit represents Brazil, the 2nd bit represents India and the last bit represents USA.
The two dataframes can be merged into a single one
df = df.join(enc_df)
The last three additional columns are the encoded country values. The ‘Country’ column can be
dropped from the dataframe for further processing.
Data Discretization
Data discretization converts continuous values into a finite set of intervals (bins). For example, the Price column of the dataset below can be binned into ranges:
Item Price
0 Shirt 1250
1 Sweater 1160
2 BodyWarmer 2842
3 Baby_Napkin 1661
There are two common binning strategies (a pandas sketch follows the examples below):
1. Equal Frequency Binning: bins contain an equal number of values (equal depth).
2. Equal Width Binning: bins have equal width; for n bins, the boundaries are [min + w], [min + 2w], ..., [min + (n-1)w], where w = (max – min) / (number of bins).
Equal Frequency:
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]
Equal Width (3 bins, w = (215 – 5) / 3 = 70, so the boundaries are 75 and 145):
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
[5, 10, 11, 13, 15, 35, 50, 55, 72]
[92]
[204, 215]
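A sketch of both strategies with pandas, using pd.cut() for equal width and pd.qcut() for equal frequency:
import pandas as pd

values = pd.Series([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])
# Equal width: 3 bins, each of width (215 - 5) / 3 = 70
print(pd.cut(values, bins=3).value_counts().sort_index())
# Equal frequency: 3 bins with the same number of values each
print(pd.qcut(values, q=3).value_counts().sort_index())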
Set A
2. Handling missing values: a) Replace missing values in the salary and age columns with the mean of the respective column.
3. Data.csv has two categorical columns (the country column and the purchased column).
a. Apply one hot encoding on the country column.
b. Apply label encoding on the purchased column.
Set B
4. Normalize the data (rescale each observation to a length of 1, i.e. a unit norm). For this, use the Normalizer class.
5. Binarize the data using the Binarizer class (using a binary threshold, it is possible to transform the data by marking the values above the threshold 1 and those equal to or below it 0).
Set C
The dataset consists of student details such as Student_id, Age, Grade, Employed, and marks. Write a program in Python to perform the following tasks:
1) Write Python code to import the required libraries and load the dataset into a pandas dataframe.
2) Display the first five rows of the dataframe.
3) Discretize the marks column into five discrete buckets, with the labels populated accordingly with five values: Poor, Below_average, Average, Above_average, and Excellent. Perform the bucketing using the cut() function on the marks column and display the top 10 rows.
Assignment Evaluation
Assignment 4: Data Visualization
Number of Slots = 2
Objectives
• To learn different statistical methods for Data visualization.
• To learn about packages matplotlib and seaborn.
• To learn functionalities and usages of Seaborn.
• Apply data visualization tools on various data sets.
Reading
You should read the following topics before starting this exercise:
Introduction to Exploratory Data Analysis, data visualization and visual encoding, data visualization libraries, basic data visualization tools (histograms, bar charts/graphs, scatter plots, line charts, area plots, pie charts, donut charts), specialized data visualization tools (boxplots, bubble plots, heat maps, dendrograms, Venn diagrams, treemaps, 3D scatter plots), advanced data visualization tools (word clouds, visualization of geospatial data), data visualization types.
Ready Reference
Python provides the Data Scientists with various packages both for data processing and visualization. Data
visualization is concerned with the presentation of data in a pictorial or graphical form. It is one of the most
important tasks in data analysis, since it enables us to see analytical results, detect outliers, and make decisions
for model building. There are various packages available in Python for visualization purpose. Some of them are:
matplotlib, seaborn, bokeh, plotly, ggplot, pygal, gleam, leather etc.
Steps Involved in our Visualization
1. Importing packages
2. Importing and Cleaning Data
3. Creating Visualizations
Step-1: Importing Packages
Every process in Python, not only data visualization, starts by importing the required packages. Our primary packages include Pandas for data processing, Matplotlib for visuals, Seaborn for advanced visuals, and NumPy for scientific calculations.
Step-2: Importing and Cleaning Data
This is an important step, as clean data is essential for a good visualization. Once we have imported and cleaned our dataset, we are set to do our visualizations using the cleaned data.
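A minimal sketch of steps 1 and 2 (the file name iris.csv is an assumption; substitute your downloaded copy):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('iris.csv')           # import the dataset
df = df.drop_duplicates().dropna()     # basic cleaning: drop duplicate and missing rows
print(df.head())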
Step-3: Creating Visualizations
In this step we create different types of Visualizations right from basic charts to advanced charts.
There are many Python libraries for visualization, of which matplotlib, seaborn, bokeh, and ggplot are among the most popular. However, in this assignment we mainly focus on the matplotlib library. Matplotlib produces publication-quality figures in a variety of formats and interactive environments across Python platforms. Another advantage is that Pandas comes equipped with useful wrappers around several matplotlib plotting routines, allowing quick and handy plotting of Series and DataFrame objects. The matplotlib library supports many more plot types that are useful for data visualization. However, our goal is to provide the basic knowledge that will help you understand and use the library for visualizing data in the most common situations.
Self Activity
Installing Matplotlib
There are multiple ways to install the matplotlib library. The easiest way is to download the Anaconda package: matplotlib is installed by default with Anaconda and does not require any additional steps.
• Download the Anaconda package from the official site of Anaconda.
• Alternatively, to install matplotlib on its own, go to the Anaconda prompt (or a terminal) and run: pip install matplotlib
• Verify that matplotlib is properly installed using the following commands in a Jupyter notebook:
import matplotlib
matplotlib.__version__
'3.2.2'
import matplotlib.pyplot as plt
import numpy as np
# sample data (assumed; the original definition was lost at a page break)
x = np.arange(1, 11)
y = x * 2
# add markers to the plot; a marker has different elements, i.e., style, color, size etc.
plt.plot(x, y, marker='*', c='g', markersize=3)
• Colors
▪ b – blue
▪ g – green
▪ r – red
▪ c – cyan
▪ m – magenta
▪ y – yellow
▪ k – black
▪ w – white
▪ Hexadecimal and RGB formats can also be used.
• Line Styles
▪ '-' : solid line
▪ '--' : dashed line
▪ '-.' : dash-dot line
▪ ':' : dotted line
• Marker Styles
▪ . – point marker
▪ , – pixel marker
▪ v – triangle down marker
▪ ^ – triangle up marker
▪ < – triangle left marker
▪ > – triangle right marker
▪ 1 – tripod down marker
▪ 2 – tripod up marker
▪ 3 – tripod left marker
▪ 4 – tripod right marker
▪ s – square marker
▪ p – pentagon marker
▪ * – star marker
• Other configurations
▪ color or c
▪ linestyle
▪ linewidth
▪ marker
▪ markeredgewidth
▪ markeredgecolor
▪ markerfacecolor
▪ markersize
# To show multiple plots in separate figures instead of a single figure, call plt.show() before the next plot statement, as shown below
x = np.linspace(0, 100, 50)
plt.plot(x, 'r', label='simple x')
plt.legend(loc='upper right')
plt.show()
plt.plot(x*x, 'g', label='two times x')
plt.legend(loc='upper right')
plt.show()
Create Subplots
There could be some situations where we should show multiple plots in a single figure to show
the complete storyline while presenting to stakeholders. This can be achieved with the use of
subplot in matplotlib library. For example, a retail store has 6 stores and the manager would
like to see the daily sales of all the 6 stores in a single window to compare. This can be
visualised using subplots by representing the charts in rows and columns.
# subplots are used to create multiple plots in a single figure
# let’s create a single subplot first following by adding more
subplots
x = np.random.rand(50)
y = np.sin(x*2)
#need to create an empty figure with an axis as below, figure
and axis are two separate objects in matplotlib
fig, ax = plt.subplots()
#add the charts to the plot
ax.plot(y)
# create four subplots arranged in a single row (layout assumed; the original line was lost at a page break)
fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, figsize=(16, 4))
# add the data to the plots
ax1.plot(x, x**2)
ax2.plot(x, x**3)
ax3.plot(x, np.sin(x**2))
ax4.plot(x, np.cos(x**2))
# add title
fig.suptitle('Horizontal plots')
# let's create a figure object
# change the size of the figure with figsize=(a, b), where 'a' is the width and 'b' is the height in inches
# create a figure object and name it fig
fig = plt.figure(figsize=(4,3))
# create a sample data
X = np.array([1,2,3,4,5,6,8,9,10])
Y = X**2
# plot the figure
plt.plot(X,Y)
Figure Object
Matplotlib is an object-oriented library and has objects, classes and methods. A figure is an instance of the Figure class: a container for showing the plots, instantiated by calling the figure() function.
plt.figure() is used to create the empty figure object in matplotlib. figure() also accepts additional parameters such as figsize, dpi, facecolor, and edgecolor.
Axes Object
Axes is the region of the chart that holds the data. We can add axes to the figure using the add_axes() method. This method requires four parameters, i.e., left, bottom, width, and height.
Other methods commonly used on the axes object include set_title(), set_xlabel(), set_ylabel(), and legend(), as in the sketch below.
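A small sketch of adding an axes object to a figure:
fig = plt.figure(figsize=(4, 3))
# left, bottom, width, height - as fractions of the figure size
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
ax.plot([1, 2, 3, 4], [1, 4, 9, 16])
ax.set_title('Axes added with add_axes()')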
Different Types of Matplotlib Plots
Matplotlib has a wide variety of plot formats, few of them include bar chart, line chart, pie
chart, scatter chart, bubble chart, waterfall chart, circular area chart, stacked bar chart etc.,
#let's do some customizations
#width – the bar width; the default value is 0.8
#color – the bar color
#bottom – value from where the y-axis starts in the chart, i.e., the lowest value shown on the y-axis
#align – moves the position of the x-label; has two options, 'edge' or 'center'
#edgecolor – used to color the borders of the bars
#linewidth – used to adjust the width of the line around the bar
#tick_label – to set customized labels for the x-axis
# sample data (assumed; the original definition was lost at a page break)
subject = ['Maths', 'Physics', 'Chemistry', 'Biology', 'English']
marks = [85, 72, 78, 64, 90]
plt.bar(subject, marks, color='g', width=0.5, bottom=10, align='center', edgecolor='r', linewidth=2, tick_label=subject)
# error bars can be added to represent error values referring to an array value
# here we use the standard deviation of the marks as the error bars
plt.bar(subject, marks, color='g', yerr=np.std(marks))
Pie Chart:
Pie charts display the proportion of each value against the total sum of values. This chart
requires a single series to display. The values on the pie chart show the percentage
contribution in terms of a pie called Wedge/Widget. The angle of the wedge/widget is
calculated based on the proportion of values.
Function:
• The function used for pie chart is ‘plt.pie()’
• To draw a pie chart, we need only one list of values, each wedge is calculated as
proportion converted into angle.
Customizations:
The plt.pie() function has the following specific arguments that can be used for configuring the plot:
• labels – used to show the wedge categories
• explode – used to pull out a wedge slice
• autopct – used to show the % contribution of each wedge
• set_aspect – used to set the aspect ratio of the plot ('equal' draws the pie as a circle)
• shadow – to show a shadow for the slices
• colors – to set custom colors for the wedges
• startangle – to set the start angle of the wedges
# Let's create a simple pie plot
Tickets_Closed = [10, 20, 8, 35, 30, 25]
Agents = ['Raj', 'Ramesh', 'Krishna', 'Arun', 'Virag', 'Mahesh']
plt.pie(Tickets_Closed, labels = Agents)
#Let’s add additional parameters to pie plot
#explode – to move one of the wedges of the plot
#autopct – to add the contribution %
explode = [0.2,0.1,0,0.1,0,0]
plt.pie(Tickets_Closed,labels=Agents, explode=explode,
autopct='%1.1f%%' )
Scatter Plot
A scatter plot is used to visualize the relationship between two columns/series of data. The chart needs two variables: one variable gives the X-position of each point and the second gives the Y-position. A scatter plot helps in understanding whether any relationship exists between the two columns and whether that relationship is positive or negative.
Function:
• The function used for the scatter plot is ‘plt.scatter()’
Customizations:
The plt.scatter() function has the following specific arguments that can be used for configuring the plot:
• s – to manage the size of the points
• color – to set the color of the points
• marker – type of marker
• alpha – transparency of the points
• norm – to normalize the data (scaling between 0 and 1)
# generate the data with random numbers
x = np.random.randn(1000)
y = np.random.randn(1000)
plt.scatter(x,y)
Histogram
Histogram is used to understand the distribution of the data. It is an estimate of the
probability distribution of continuous data. It is similar to bar graph as discussed above but
this is used to represent the distribution of a continuous variable whereas bar graph is used
for discrete variable. Every distribution is characterised by four different elements
including
• Center of the distribution
• Spread of the distribution
• Shape of the distribution
• Peak of the distribution
Histogram requires two elements x-axis shown using bins and y-axis shown with the
frequency of the values in each of the bins form the data set. Every bin has a range with
minimum and maximum values.
Function:
• The function used for a histogram is plt.hist()
Customizations:
plt.hist() function has the following specific arguments that can be used for configuring
the plot.
• bins – number of bins
• color
• edgecolor
• alpha – transparency of the color
• normed (deprecated in newer matplotlib; use density instead)
• xlim – to set the x-limits
• ylim – to set the y-limits
• xticks, yticks
• facecolor, edgecolor, density
# let’s generate random numbers and use the random numbers to
generate histogram
data = np.random.randn(1000)
plt.hist(data)
# lets create multiple histograms in a single plot
# Create random data
hist1 = np.random.normal(25,10,1000)
hist2 = np.random.normal(200,5,1000)
#plot the histogram
plt.hist(hist1,facecolor = 'yellow',alpha = 0.5, edgecolor
='b',bins=50)
plt.hist(hist2,facecolor = 'orange',alpha = 0.8, edgecolor
='b',bins=30)
Saving Plot
A plot can be saved as an image using the savefig() function in matplotlib. The plot can be saved in multiple formats like .png, .jpeg, .pdf and many other supported formats.
# let's create a figure and save it as an image
items = [5, 10, 20, 25, 30, 40]
x = np.arange(6)
fig = plt.figure()
ax = plt.subplot(111)
ax.plot(x, items, label='items')
plt.title('Saving as Image')
ax.legend()
fig.savefig('saveimage.png')
#To display the image again, use the following package and
commands
import matplotlib.image as mpimg
image = mpimg.imread("saveimage.png")
plt.imshow(image)
plt.show()
Box Plot
It displays the distribution of data based on the five-number summary, dividing the dataset into quartiles and presenting the values minimum, maximum, median, first (lower) quartile and third (upper) quartile in the plotted graph itself. It helps in identifying outliers and how spread out the data is from the center.
In Python, a boxplot can be created using the boxplot() function of the matplotlib library. Consider the following data elements: 1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5
Here, the minimum is 1, the maximum is 11.5, the lower quartile is 2, the median is 7, and the upper quartile is 9. The code to plot the box plot would be:
#Let us create a Simple boxplot
import matplotlib.pyplot as plt
data=[1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]
plt.boxplot(data, vert=False)
plt.show()
The resulting horizontal box plot shows the box extending from the lower quartile (2) to the upper quartile (9), the median line at 7, and whiskers reaching 1 and 11.5.
Lab Assignments
Students must use the Iris flower data set for the Lab Assignments.
The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British
statistician and biologist Ronald Fisher in his 1936 paper
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica
and Iris versicolor). Four features were measured from each sample: the length and the width
of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher
developed a linear discriminant model to distinguish the species from each other.
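To get started, a minimal sketch for loading the dataset (the file name Iris.csv matches the Kaggle download; adjust the path as needed):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('Iris.csv')
print(df['Species'].value_counts())    # frequency of the three species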
Set A
1. Generate a random array of 50 integers and display them using a line chart, scatter
plot, histogram and box plot. Apply appropriate color, labels and styling options.
2. Add two outliers to the above data and display the box plot.
3. Create two lists, one representing subject names and the other representing marks
obtained in those subjects. Display the data in a pie chart and bar chart.
4. Write a Python program to create a Bar plot to get the frequency of the three species
of the Iris data.
5. Write a Python program to create a Pie plot to get the frequency of the three species of
the Iris data.
6. Write a Python program to create a histogram of the three species of the Iris data.
Set B
1. Write a Python program to create a graph to find relationship between the petal length
and petal width.
2. Write a Python program to draw scatter plots to compare two features of the iris
dataset.
3. Write a Python program to create box plots to see how each feature i.e. Sepal Length,
Sepal Width, Petal Length, Petal Width are distributed across the three species.
Set C
1. Write a Python program to create a pairplot of the iris data set and check which flower
species seems to be the most separable.
2. Write a Python program to generate a box plot to show the Interquartile range and
outliers for the three species for each feature.
3. Write a Python program to create a joint plot using "kde" to describe individual distributions on the same plot between sepal length and sepal width. Note: the kernel density estimation (kde) procedure visualizes a bivariate distribution; in seaborn, this kind of plot is shown with a contour plot and is available as a style in jointplot().
Assignment Evaluation