DSBDA Lab Manual
Experiment No. 1
------------------------------------------------------------------------------------------------------------
Aim:
Data Wrangling, I
Perform the following operations using Python on any open source dataset (e.g., data.csv)
1) Import all the required Python Libraries.
2) Locate an open source dataset from the web (e.g., https://www.kaggle.com). Provide
a clear description of the data and its source (i.e., the URL of the website).
3) Load the dataset into a pandas DataFrame.
4) Data Preprocessing: check for missing values in the data using the pandas isnull() and
describe() functions to get some initial statistics. Provide variable descriptions, types
of variables, etc. Check the dimensions of the data frame.
5) Data Formatting and Data Normalization: Summarize the types of variables by
checking the data types (i.e., character, numeric, integer, factor, and logical) of the
variables in the data set. If variables are not in the correct data type, apply proper
type conversions.
6) Turn categorical variables into quantitative variables in Python.
------------------------------------------------------------------------------------------------------------
Requirement:
• Anaconda Installer
• Windows 10 OS
• Jupyter Notebook
Theory:
Data wrangling is the process of cleaning and unifying messy and complex data sets
for easy access and analysis.
With the amount of data and data sources rapidly growing and expanding, it is getting
increasingly essential for large amounts of available data to be organized for analysis.
This process typically includes manually converting and mapping data from one raw form
into another format to allow for more convenient consumption and organization of the
data.
Goals of data wrangling:
• Provide accurate, actionable data to business analysts in a timely manner.
• Reduce the time spent collecting and organizing unruly data before it can be utilized.
• Enable data scientists and analysts to focus on the analysis of data, rather than the
wrangling.
Key steps in data wrangling:
• Data Acquisition: Identify and obtain access to the data within your sources.
• Joining Data: Combine the edited data for further use and analysis.
• Data Cleansing: Redesign the data into a usable and functional format and cor-
rect/remove any bad data.
Libraries Used:
Pandas: Pandas is a Python library used for working with data sets. It has functions for
analyzing, cleaning, exploring and manipulating data.
Numpy: NumPy is a Python library used for working with arrays. It also has functions for
working in the domains of linear algebra, Fourier transforms, and matrices.
Conclusion:
In this experiment we studied data wrangling and performed data preprocessing, data
formatting, data normalization and categorical-to-quantitative conversion on an open
source dataset using pandas.
Practical 1
Data Wrangling I
Perform the following operations using Python on any open source
dataset (e.g., data.csv)
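The notebook for this practical appears only as screenshots in the source; a minimal
sketch of the listed steps could look like the following (data.csv and the column name
'category' are placeholders to be replaced with the actual dataset and columns):

import pandas as pd

# 1) and 3): import libraries and load the dataset into a pandas DataFrame
df = pd.read_csv('data.csv')

# 4) data preprocessing: dimensions, missing values, initial statistics
print(df.shape)           # dimensions of the data frame
print(df.isnull().sum())  # missing values per column
print(df.describe())      # initial statistics for numeric variables

# 5) data formatting: check data types and convert where needed
print(df.dtypes)
# example conversion (column name is a placeholder):
# df['category'] = df['category'].astype('category')

# 6) turn categorical variables into quantitative variables
df_encoded = pd.get_dummies(df)   # one-hot encoding of all object columns
print(df_encoded.head())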
Experiment No. 2
--------------------------------------------------------------------------------------------------------
Aim:
Data Wrangling II
Create an “Academic performance” dataset of students and perform the following
operations using Python.
1) Scan all variables for missing values and inconsistencies. If there are missing
values and/or inconsistencies, use any of the suitable techniques to deal with
them.
2) Scan all numeric variables for outliers. If there are outliers, use any of the
suitable techniques to deal with them.
3) Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for
better understanding of the variable, to convert a non-linear relation into a
linear one, or to decrease the skewness and convert the distribution into a
normal distribution.
--------------------------------------------------------------------------------------------------------
Requirement:
• Anaconda Installer
• Windows 10 OS
• Jupyter Notebook
Theory:
Data Wrangling:
Data wrangling can be defined as the process of cleaning, organizing, and
transforming raw data into the desired format for analysts to use for prompt decision-
making. Also known as data cleaning or data munging, data wrangling enables
businesses to tackle more complex data in less time, produce more accurate results,
and make better decisions. The exact methods vary from project to project depending
upon your data and the goal you are trying to achieve. More and more organizations
are increasingly relying on data wrangling tools to make data ready for downstream
analytics.
Benefits of Data Wrangling:
• It helps to quickly build data flows within an intuitive user interface and
easily schedule and automate the data-flow process.
• Helps users to process very large volumes of data easily and to share data-flow
techniques with ease.
ETL stands for Extract, Transform and Load. ETL is a middleware process that
involves mining or extracting data from various sources, joining the data,
transforming data as per business rules, and subsequently loading data to the target
systems. ETL is generally used for loading processed data to flat files or relational
database tables.
Though Data Wrangling and ETL look similar, there are key differences
between data wrangling and ETL processes that set them apart.
• Data Structure – Data wrangling involves varied and complex data sets,
while ETL involves structured or semi-structured relational data sets.
• Use Case – Data wrangling is normally used for exploratory data analysis,
but ETL is used for gathering, transforming, and loading data for reporting.
Conclusion:
In this experiment we studied data wrangling, handled missing values and outliers, and
applied data transformations on the Academic performance dataset.
In [2]:
import os
os.getcwd()
Out[2]: 'C:\\Users\\rohit\\Downloads'
In [3]:
import pandas as pd
In [4]:
df = pd.read_csv('student2.csv')
In [5]:
df
In [6]:
df.shape
Out[6]: (10, 5)
In [7]:
df.head()
In [8]:
df.tail()
In [9]:
df.count()
Out[9]: roll 10
name 10
class 9
marks 8
age 9
dtype: int64
In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 roll 10 non-null int64
1 name 10 non-null object
2 class 9 non-null object
3 marks 8 non-null float64
4 age 9 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 528.0+ bytes
In [11]:
df.isnull()
In [12]:
df.isnull().sum()
Out[12]: roll 0
name 0
class 1
marks 2
age 1
dtype: int64
In [13]:
print(True+True)
2
In [14]:
df.dropna()
#drop all rows that have any missing value
In [15]:
df.fillna(0)
#identify missing values using domain knowledge and fill based on it
In [16]:
#only using class column
df['class'].fillna('TE')
Out[16]: 0 TE
1 TE
2 BE
3 TE
4 TE
5 BE
6 BE
7 BE
8 TE
9 BE
Name: class, dtype: object
In [24]:
df['marks'].fillna(df['marks'].mean())
Out[24]: 0 56.770
1 59.770
2 76.880
3 69.660
4 63.280
5 49.550
6 63.915
7 63.915
8 56.750
9 78.660
Name: marks, dtype: float64
In [17]:
df['age'].fillna(df['age'].median())
Out[17]: 0 22.0
1 21.0
2 19.0
3 20.0
4 20.0
5 20.0
6 19.0
7 23.0
8 20.0
9 21.0
Name: age, dtype: float64
In [20]:
df['class'].value_counts()
Out[20]: BE 5
TE 4
Name: class, dtype: int64
In [21]:
df['class'].fillna(df['class'].mode()[0])
Out[21]: 0 TE
1 TE
2 BE
3 TE
4 BE
5 BE
6 BE
7 BE
8 TE
9 BE
Name: class, dtype: object
In [22]:
df.fillna(method='backfill')
In [23]:
df.fillna(method='pad')
In [25]:
df.describe()
In [82]:
import numpy as np
x = np.array([5,4,3,2,7,8,98,28])
In [83]:
np.mean(x)
Out[83]: 19.375
In [84]:
np.median(x)
Out[84]: 6.0
In [85]:
import matplotlib.pyplot as plt
In [86]:
plt.boxplot(x);
In [87]:
df.plot.box()
Out[87]: <AxesSubplot:>
In [88]:
df.loc[6,'marks']
Out[88]: 98.45
In [89]:
#inject an outlier value into 'marks' for demonstration
df.loc[6,'marks']=98.45
In [90]:
df.plot.box()
Out[90]: <AxesSubplot:>
In [91]:
df.loc[6,'marks']
Out[91]: 98.45
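The aim also asks for a technique to deal with such outliers; a minimal IQR-based
sketch, assuming the numeric column 'marks', that caps values outside the box-plot
whiskers:

q1 = df['marks'].quantile(0.25)
q3 = df['marks'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
# cap (winsorize) values outside the IQR fences
df['marks'] = df['marks'].clip(lower, upper)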
In [92]:
df.plot.hist()
Out[92]: <AxesSubplot:ylabel='Frequency'>
In [93]:
df['age'].plot.hist()
Out[93]: <AxesSubplot:ylabel='Frequency'>
In [97]:
x = df[['age','marks']]
In [98]:
x.describe()
In [103]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)
In [106]:
pd.DataFrame(x_scaled).describe()
Out[106]:  0  1
In [107…
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
In [110]:
pd.DataFrame(x_scaled).describe()
Out[110]:  0  1
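Aim item 3 also calls for a transformation that reduces skewness; a minimal sketch,
assuming a right-skewed numeric column such as 'marks', using a log transform:

import numpy as np
# log1p = log(1 + x); handles zeros safely and pulls a right-skewed
# distribution closer to normal
df['marks_log'] = np.log1p(df['marks'])
print(df['marks'].skew(), df['marks_log'].skew())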
Experiment No. 3
------------------------------------------------------------------------------------------------------------
Aim:
1) Provide summary statistics for a dataset with numeric variables grouped by one
of the qualitative variables.
2) Write a Python program to display some basic statistical details like percentile,
mean, standard deviation, etc.
------------------------------------------------------------------------------------------------------------
Requirement:
• Anaconda Installer
• Windows 10 OS
• Jupyter Notebook
Theory:
Step 1: Provide summary statistics such as mean, median, mode, standard deviation.
(1) Mean:
The "average" value is termed the mean of the dataset.
(2) Median:
The middle value of the sorted dataset is known as the median.
(3) Mode:
The mode refers to the most frequently occurring value(s) in the dataset.
e.g. Consider the weight (in kg) of 5 children as 36, 40, 32, 42, 30. Let's compute the
mean, median and mode:
(1) Mean = (36+40+32+42+30) / 5
= 36 kg
(2) Median: sorting gives 30, 32, 36, 40, 42, so the median is 36 kg.
(3) Mode: every value occurs exactly once, so this sample has no unique mode.
df = pd.DataFrame(data)   # data is a dict mapping column names to values
mean = df['score'].mean()
print(mean)
mode = df.mode()
print(mode)
df = pd.DataFrame([[10,20,30,40],[7,14,21,28],[55,15,8,12],[15,14,1,8],[7,1,1,8],[5,4,9,2]],
                  columns=['Apple','Orange','Banana','Pear'],
                  index=['Basket1','Basket2','Basket3','Basket4','Basket5','Basket6'])
minimum = df.min()
maximum = df.max()
print(minimum)
print(maximum)
Standard Deviation:
The standard deviation measures how spread out the values are around the mean; it is
the square root of the average squared deviation from the mean.
Key Takeaways:
• Mean, median and mode describe the centre of a distribution, while the standard
deviation describes its spread.
• Grouping by a qualitative variable (e.g., with pandas groupby) lets these statistics
be compared across categories.
Libraries Used:
1. Pandas: Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring and manipulating
data.
2. NumPy: NumPy is a library consisting of multidimensional array objects and a
collection of routines for processing those arrays. Using NumPy,
mathematical and logical operations on arrays can be performed.
3. Seaborn: Seaborn is a data visualization library built on top of
matplotlib and closely integrated with pandas data structures in
Python.
Conclusion:
From this experiment we learnt how to calculate the mean, median, mode and standard
deviation, and how to produce grouped summary statistics.
Practical 3
In [2]:
import pandas as pd
In [5]:
df= pd.read_csv('student.csv')
In [6]:
df
Out[6]:
roll name class marks age
0 1 anil TE 56.77 22
1 2 amit TE 59.77 21
2 3 aniket BE 76.88 19
3 4 ajinkya TE 69.66 20
4 5 asha TE 63.28 20
5 6 ayesha BE 49.55 20
6 7 amar BE 65.34 19
7 8 amita BE 68.33 23
8 9 amol TE 56.75 20
9 10 anmol BE 78.66 21
In [7]:
df.mean()
#mean
In [8]:
df.median()
#median
In [9]:
#standard deviation
df.std()
In [10]:
df.min()
#min
Out[10]: roll 1
name ajinkya
class BE
marks 49.55
age 19
dtype: object
In [17]:
df.max()
#max
Out[17]: roll 10
name ayesha
class TE
marks 78.66
age 23
dtype: object
In [12]:
import numpy as np
In [15]:
np.std(df['marks'])
Out[15]: 8.734696846485285
In [18]:
gr1 = df.groupby('class')
In [21]:
te = gr1.get_group('TE')
In [22]:
te.min()
Out[22]: roll 1
name ajinkya
class TE
marks 56.75
age 20
dtype: object
In [23]:
te.max()
Out[23]: roll 9
name asha
class TE
marks 69.66
age 22
dtype: object
In [24]:
gr2 = df.groupby('age')
In [25]:
gr2.groups
Out[25]: {19: [2, 6], 20: [3, 4, 5, 8], 21: [1, 9], 22: [0], 23: [7]}
In [26]:
tw = gr2.get_group(20)
In [27]:
tw
roll name class marks age
3 4 ajinkya TE 69.66 20
4 5 asha TE 63.28 20
5 6 ayesha BE 49.55 20
8 9 amol TE 56.75 20
In [28]:
import seaborn as sns
In [29]:
df = sns.load_dataset('iris')
In [30]:
df
In [33]:
#plotly.express was not imported, so this cell raises a NameError:
df = px.data.iris()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-33-dcabb90b9f03> in <module>
----> 1 df = px.data.iris()
NameError: name 'px' is not defined
In [34]:
#https://mitu.co.in
df=pd.read_csv('iris.csv')
df
df.describe()
In [35]:
gr = df.groupby('species')
In [37]:
se = gr.get_group('setosa')
ve = gr.get_group('versicolor')
vi = gr.get_group('virginica')
In [38]:
se.shape
Out[38]: (50, 5)
In [39]:
ve.shape
Out[39]: (50, 5)
In [40]:
vi.shape
Out[40]: (50, 5)
In [42]:
se.describe()
In [43]:
ve.describe()
In [44]:
vi.describe()
Experiment No. 4
------------------------------------------------------------------------------------------------------------
Aim: Create a Linear Regression model using Python to predict home prices using the
Boston Housing dataset.
------------------------------------------------------------------------------------------------------------
Requirement:
• Anaconda Installer
• Windows 10 OS
• Jupyter Notebook
Theory:
The term regression is used when you try to find the relationship between
variables. In Machine Learning and in statistical modeling, that relationship is used to
predict the outcome of events.
Assumptions:
Given below are the basic assumptions that a linear regression model makes regarding
a dataset on which it is applied:
• Linear relationship: The relationship between the response and feature variables should be
linear. The linearity assumption can be tested using scatter plots: variables whose points
fall roughly along a straight line are linearly related and will give better predictions with
linear regression than variables showing a clearly non-linear pattern.
Applications:
1. Trend lines: A trend line represents the variation in quantitative data with the
passage of time (like GDP, oil prices, etc.). These trends usually follow a linear
relationship. Hence, linear regression can be applied to predict future values.
However, this method suffers from a lack of scientific validity in cases where
other potential changes can affect the data.
2. Economics: Linear regression is the predominant empirical tool in economics.
For example, it is used to predict consumer spending, fixed investment
spending, inventory investment, purchases of a country’s exports, spending
on imports, the demand to hold liquid assets, labor demand, and labor supply.
3. Finance: The capital asset pricing model uses linear regression to analyze and
quantify the systematic risks of an investment.
4. Biology: Linear regression is used to model causal relationships between
parameters in biological systems.
Dataset used:
• In this experiment we are going to use the Boston Housing dataset, which
contains information about various houses in Boston through different
parameters.
• There are a total of 506 samples and 14 features (columns) in this dataset.
• Our objective is to predict house prices using these features with
the help of linear regression.
Libraries Used:
1. Pandas: Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring and manipulating
data.
2. Sklearn: It provides a selection of efficient tools for machine learning
and statistical modeling, including classification, regression, clustering
and dimensionality reduction, via a consistent interface in Python.
Conclusion:
In this experiment we have studied linear regression and performed house
price prediction using the Boston Housing dataset.
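The notebook for this experiment appears only as screenshots in the source; a minimal
sketch of the workflow could look like the following (sklearn.datasets.load_boston was
removed in scikit-learn 1.2, so this assumes a local boston.csv with a target column
MEDV; both names are placeholders):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv('boston.csv')        # placeholder file name
x = df.drop('MEDV', axis=1)           # 13 feature columns
y = df['MEDV']                        # median home value (the price target)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0, test_size=0.25)

model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

print(mean_squared_error(y_test, y_pred))  # lower is better
print(r2_score(y_test, y_pred))            # closer to 1 is better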
Experiment No. 5
------------------------------------------------------------------------------------------------------------
Aim: Implement logistic regression using Python to perform classification on the
Social_Network_Ads.csv dataset.
------------------------------------------------------------------------------------------------------------
Requirement:
• Anaconda Installer
• Windows 10 OS
• Jupyter Notebook
Theory:
What is Logistic Regression?
• Logistic regression is a classification algorithm for predicting a binary outcome.
Consider, for example, deciding whether an item is an animal or not: several aspects of
each item or data point would need to be looked at when considering how to classify it.
Aspects, or features, may include color, size, weight, shape, height, volume
or number of limbs.
• In this way, knowing that an orange’s shape was a circle may help the algorithm
to conclude that the orange was not an animal. Similarly, knowing that the orange
had zero limbs would help as well.
• Logistic regression requires that the dependent variable, in this case whether the
item was an animal or not, be categorical. The outcome is either animal or not an
animal—there is no range in between.
• A problem that has a continuous outcome, such as predicting the grade of a
student or the fuel tank range of a car, is not a good candidate to use logistic
regression. Other options like linear regression may be more appropriate.
Logistic regression comes in three main types, which differ in execution and theory.
Binary logistic regression deals with two possible values, essentially: yes or no.
Multinomial logistic regression deals with three or more values. And ordinal logistic
regression deals with three or more classes in a predetermined order.
Binary logistic regression was mentioned earlier in the case of classifying an object as
an animal or not an animal—it’s an either/or solution. There are just two possible
outcome answers. This concept is typically represented as a 0 or a 1 in coding.
Examples include:
• Whether or not to lend to a bank customer (outcomes are yes or no).
• Assessing cancer risk (outcomes are high or low).
• Will a team win tomorrow’s game (outcomes are yes or no).
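For reference, binary logistic regression models the probability of the positive class
by passing a linear combination of the features through the sigmoid (logistic) function:

p(y = 1 | x) = 1 / (1 + e^-(b0 + b1*x1 + ... + bn*xn))

A threshold (commonly 0.5) then turns this probability into the 0 or 1 prediction.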
Multinomial logistic regression is a model where there are multiple classes that an item
can be classified as. There is a set of three or more predefined classes set up prior to
running the model.
Examples include:
• Classifying texts into what language they come from.
• Predicting whether a student will go to college, trade school or into the workforce.
• Does your cat prefer wet food, dry food or human food?
Ordinal logistic regression is also a model where there are multiple classes that an item
can be classified as; however, in this case an ordering of classes is required. Classes do
not need to be proportionate. The distance between each class can vary. Examples
include:
• Ranking restaurants on a scale of 0 to 5 stars.
• Predicting the podium results of an Olympic event.
• Assessing a choice of candidates, specifically in places that institute ranked-choice
voting.
Logistic regression assumptions:
• Remove highly correlated inputs.
• Consider removing outliers from your training set, as extreme values can distort the
model's calculations.
• Logistic regression does not favor sparse data (data consisting of a lot of zero values).
• Logistic regression is a classification model, unlike linear regression.
Libraries Used:
1. Pandas: Pandas is a Python library used for working with data sets. It has
functions for analyzing, cleaning, exploring and manipulating data.
Conclusion:
In this experiment we have studied the logistic regression model. We have
performed classification on the Social_Network_Ads dataset using various libraries of
Python.
In [1]:
import pandas as pd
In [3]:
df = pd.read_csv('Social_Network_Ads.csv')
In [4]:
df
In [44]:
#input data
x=df[['Age','EstimatedSalary']]
#output data
y=df['Purchased']
In [82]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)
In [62]:
#cross validation
from sklearn.model_selection import train_test_split
In [83]:
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, random_state=0, test_size=0.25)
In [84]:
x_train
Out[84]: array([[0.83333333, 0.65925926],
       [0.5       , 0.2       ],
       [0.47619048, 0.34074074],
       ...,
       [0.42857143, 0.27407407],
       [0.21428571, 0.28888889],
       [0.19047619, 0.76296296]])
In [85]:
y_train
Out[85]: 250 0
63 1
312 0
159 1
283 1
..
323 1
192 0
117 0
47 0
172 0
Name: Purchased, Length: 300, dtype: int64
In [86]:
from sklearn.linear_model import LogisticRegression
In [87]:
import seaborn as sns
sns.countplot(x=y)
In [88]:
y.value_counts()
Out[88]: 0 257
1 143
Name: Purchased, dtype: int64
In [89]:
#create the object
classifier = LogisticRegression()
In [90]:
classifier.fit(x_train,y_train)
Out[90]: LogisticRegression()
In [91]:
#prediction
y_pred = classifier.predict(x_test)
In [92]:
y_train.shape
Out[92]: (300,)
In [93]:
x_train.shape
Out[93]: (300, 2)
In [94]:
y_pred
Out[94]: array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1], dtype=int64)
In [75]:
y_test
Out[75]: 132 0
309 0
341 0
196 0
246 0
..
146 1
135 0
390 1
264 1
364 1
Name: Purchased, Length: 100, dtype: int64
In [76]:
import matplotlib.pyplot as plt
In [77]:
plt.xlabel('Age')
plt.ylabel('Salary')
plt.scatter(x['Age'],x['EstimatedSalary'],c=y)
In [78]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)
In [79]:
pd.DataFrame(x_scaled).describe()
Out[79]: 0 1
In [80]:
plt.xlabel('Age')
plt.ylabel('Salary')
plt.scatter(x_scaled[:,0],x_scaled[:,1],c=y)
In [95]:
from sklearn.metrics import confusion_matrix
In [96]:
confusion_matrix(y_test,y_pred)
In [97]:
y_test.value_counts()
Out[97]: 0 68
1 32
Name: Purchased, dtype: int64
In [98]:
from sklearn.metrics import plot_confusion_matrix
In [103]:
plot_confusion_matrix(classifier,x_test,y_test)
In [106]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)
Out[106]: 0.89
In [107]:
from sklearn.metrics import classification_report
In [110]:
print(classification_report(y_test,y_pred))
In [111]:
#new samples: [age, estimated salary]
new1=[[26,34000]]
new2=[[57,138000]]
In [113]:
classifier.predict(scaler.transform(new1))
In [112]:
classifier.predict(scaler.transform(new2))
Experiment No. 6
-------------------------------------------------------------------------------------------------------------
Aim: Implement the simple Naive Bayes classification algorithm using Python on the
iris.csv dataset.
-------------------------------------------------------------------------------------------------------------
Requirement:
• Anaconda Installer
• Windows 10 OS
• Jupyter Notebook
Theory:
Bayes' Theorem:
• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used
to determine the probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
P(A|B) = ( P(B|A) · P(A) ) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event
B.
P(B|A) is Likelihood probability: Probability of the evidence given that the
probability of a hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
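As a quick worked illustration of the rule (with made-up numbers): suppose 1% of emails
are spam, P(A) = 0.01; the word "offer" appears in 90% of spam, P(B|A) = 0.9; and 10% of
all emails contain "offer", P(B) = 0.1. Then
P(spam | "offer") = (0.9 × 0.01) / 0.1 = 0.09.
Naive Bayes applies this rule to each class, "naively" treating the features as
independent, and predicts the class with the highest posterior probability.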
Advantages:
• You can use it to solve multi-class prediction problems, as it's quite useful
with them.
• The Naive Bayes classifier performs better than many other models when less
training data is available.
• If you have categorical input variables, the Naive Bayes algorithm performs
especially well.
Limitations:
• If a categorical variable has a category in the test data set that was not
present in the training data set, the Naive Bayes model will assign it zero
probability and won't be able to make any predictions in this regard (the
"zero frequency" problem).
• This algorithm is also notorious as a lousy estimator, so its raw probability
outputs should not be taken too seriously.
• It assumes that all the features are independent. While it might sound great
in theory, truly independent features are rarely found in real life, which can
hurt predictions.
Applications:
• This algorithm is popular for multi-class predictions such as text classification.
• Email services (like Gmail) use this algorithm to figure out whether an email
is spam or not.
• Collaborative Filtering and the Naive Bayes algorithm work together to build
recommendation systems.
Libraries Used:
1. Pandas: Pandas is a Python library used for working with data sets. It
has functions for analyzing, cleaning, exploring and manipulating
data.
2. Sklearn: It provides a selection of efficient tools for machine learning
and statistical modeling including classification, regression, clustering
and dimensionality reduction via a consistence interface in Python.
Conclusion:
In this experiment we have studied the Naive Bayes algorithm and
implemented it on the iris dataset for classification.
In [1]:
import pandas as pd
In [2]:
df = pd.read_csv('iris.csv')
In [3]:
df.shape
Out[3]: (150, 5)
In [4]:
#input data
x=df.drop('species',axis=1)
#output data
y=df['species']
In [5]:
y.value_counts()
Out[5]: versicolor 50
setosa 50
virginica 50
Name: species, dtype: int64
In [8]:
#cross validation
from sklearn.model_selection import train_test_split
In [15]:
x_train ,x_test,y_train,y_test=train_test_split(x,y,random_state=0,test_size=0.25)
In [17]:
x_train.shape
Out[17]: (112, 4)
In [18]:
x_test.shape
Out[18]: (38, 4)
In [19]:
#import the class
from sklearn.naive_bayes import GaussianNB
In [20]:
#create the object
clf= GaussianNB()
In [22]:
#train the algorithm
clf.fit(x_train,y_train)
Out[22]: GaussianNB()
In [23]:
y_pred=clf.predict(x_test)
In [24]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import plot_confusion_matrix
In [25]:
confusion_matrix(y_test,y_pred)
In [30]:
plot_confusion_matrix(clf,x_test,y_test)
In [31]:
accuracy_score(y_test,y_pred)
Out[31]: 1.0
In [34]:
clf.predict_proba(x_test)
In [35]:
newl=[[4.5,2.9,3.1,0.4]]
clf.predict(newl)[0]
Out[35]: 'versicolor'
In [36]:
newl=[[5.5,3.1,1.0,0.8]]
clf.predict(newl)[0]
Out[36]: 'setosa'
In [37]:
newl=[[6.5,3.3,4.9,1.8]]
clf.predict(newl)[0]
Out[37]: 'virginica'
In [39]:
print(classification_report(y_test,y_pred))
accuracy 1.00 38
macro avg 1.00 1.00 1.00 38
weighted avg 1.00 1.00 1.00 38
Experiment No. 7
------------------------------------------------------------------------------------------------------------
Aim: Extract a sample document and apply the following document preprocessing
methods: tokenization, POS tagging, stop words removal, stemming and
lemmatization.
------------------------------------------------------------------------------------------------------------
Requirement:
• Anaconda Installer
• Windows 10 OS
• Linux
• Jupyter Notebook
Theory:
• One aspect of preprocessing is that the data set should be formatted in such a way
that more than one Machine Learning or Deep Learning algorithm can be executed on
the same data set, and the best of them chosen.
-> "likes"
-> "liked"
-> "likely"
-> "liking"
Errors in Stemming:
There are mainly two errors in stemming: over-stemming, where two words that should
not share a root are stemmed to the same stem, and under-stemming, where two words
that should share a root are not.
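The notebook for this experiment appears only as screenshots in the source; a minimal
NLTK-based sketch of the five preprocessing steps could look like this (the sample
sentence is made up; the nltk.download calls are needed once per machine):

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# one-time resource downloads
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')

text = "The children are playing happily in the gardens."

# 1) tokenization
tokens = word_tokenize(text)

# 2) POS tagging
print(nltk.pos_tag(tokens))

# 3) stop words removal
stop_words = set(stopwords.words('english'))
filtered = [w for w in tokens if w.lower() not in stop_words]

# 4) stemming
ps = PorterStemmer()
print([ps.stem(w) for w in filtered])

# 5) lemmatization
lem = WordNetLemmatizer()
print([lem.lemmatize(w) for w in filtered])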
Experiment No. 8
------------------------------------------------------------------------------------------------------------
Aim: Use the inbuilt dataset "titanic". The dataset contains 891 rows with
information about the passengers who boarded the unfortunate Titanic ship. Use
the seaborn library to see if we can find any patterns in the data.
------------------------------------------------------------------------------------------------------------
Requirement:
• Anaconda Installer
• Windows 10 OS
• Linux
• Jupyter Notebook
Theory:
• Better analysis
• Quick action
• Identifying patterns
• Finding errors
• Understanding the story
• Exploring business insights
• Grasping the Latest Trends
Better analysis:
Data visualization helps business stakeholders analyze reports regarding
sales, marketing strategies, and product interest. Based on the analysis, they can
focus on the areas that require attention to increase profits, which in turn makes
the business more productive.
Quick action:
As mentioned previously, the human brain grasps visuals more easily than
table reports. Data visualizations allow decision makers to be notified quickly of
new data insights and take necessary actions for business growth.
Identifying patterns:
Large amounts of complicated data can provide many opportunities for
insights when we visualize them. Visualization allows business users to recognize
relationships between the data, providing greater meaning to it. Exploring these
patterns helps users focus on specific areas that require attention in the data, so
that they can identify the significance of those areas to drive their business
forward.
Finding errors:
Visualizing your data helps quickly identify any errors in the data. If the data
tends to suggest the wrong actions, visualizations help identify erroneous data
sooner so that it can be removed from analysis.
Libraries Used:
1. Seaborn: Seaborn is a data visualization library built on top of matplotlib and
closely integrated with pandas data structures in Python.
Conclusion:
In this experiment we have studied what data visualization is and how it
helps us in different ways. We have also used the inbuilt dataset, i.e., Titanic,
from the seaborn library to perform data visualization on it.
In [2]:
import seaborn as sns
df= sns.load_dataset('titanic')
In [3]:
df
Out[3]: survived pclass sex age sibsp parch fare embarked class who adult_male
... ... ... ... ... ... ... ... ... ... ... ...
In [4]:
df=df[['survived','class','sex','age','fare']]
In [5]:
df
In [6]:
sns.jointplot(x='age',y='fare',data=df)
In [7]:
sns.jointplot(x='age',y='fare',data=df,hue='survived')
In [8]:
sns.jointplot(x='age',y='fare',data=df,hue='class')
In [9]:
sns.pairplot(df,hue='sex')
In [10]:
sns.countplot(x=df['sex'])
In [11]:
sns.countplot(x=df['class'])
In [12]:
sns.barplot(x='sex',y='survived',data=df)
In [13]:
sns.barplot(x='sex',y='survived',hue='class',data=df)
In [14]:
sns.histplot(df['fare'])
In [15]:
sns.kdeplot(df['fare'])
Experiment No. 9
------------------------------------------------------------------------------------------------------------
Aim: Use the inbuilt dataset "titanic" as used in the previous experiment. Plot a box
plot for the distribution of age with respect to each gender, along with information
about whether they survived or not. (Column names: 'sex' and 'age')
------------------------------------------------------------------------------------------------------------
Requirement:
• Anaconda Installer
• Windows 10 OS
• Linux
• Jupyter Notebook
Theory:
Implementation:
• Boxplot is available in the Seaborn library.
• Here x is the categorical (grouping) variable and y is the numeric variable whose
distribution is plotted. Such box plots come under univariate analysis, which means
that we are exploring the distribution of only one numeric variable at a time.
• Here we are trying to check the impact of a feature named "axil_nodes" on the class
named "Survival status", and not the relation between any two independent features.
The code snippet is as follows:
sns.boxplot(x='SurvStat',y='axil_nodes',data=hb)
Libraries Used:
1. Seaborn: Seaborn is a data visualization library built on top of matplotlib and
closely integrated with pandas data structures in Python.
In [2]:
import seaborn as sns
In [3]:
df = sns.load_dataset('titanic')
In [6]:
df= df[['sex','age','survived']]
In [7]:
df
Out[7]:
sex age survived
0 male 22.0 0
1 female 38.0 1
2 female 26.0 1
3 female 35.0 1
4 male 35.0 0
In [8]:
sns.boxplot(x='sex',y='age',data=df)
In [9]:
sns.boxplot(x='sex',y='age',hue='survived', data=df)
Out[9]: <AxesSubplot:xlabel='sex', ylabel='age'>
Experiment No. 10
------------------------------------------------------------------------------------------------------------
Aim:
Download the iris flower dataset or any other dataset into a DataFrame. Scan the
dataset and give the inference as:
1) List down the features and their types
2) Create a histogram for each feature in the dataset
3) Create a box plot for each feature in the dataset
4) Compare distributions and identify outliers.
------------------------------------------------------------------------------------------------------------
Requirement:
• Anaconda Installer
• Windows 10 OS
• Linux
• Jupyter Notebook
Theory:
Categorical Data:
For visualization, the main difference is that ordinal data suggests a particular
display order. Purely categorical data can come in a range of formats.
Most plotting and modeling functions will convert character vectors to factors with
levels ordered alphabetically. Some standard R functions for working with factors
include factor(), levels() and as.factor().
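The pandas analogue of an R factor is the categorical dtype; a minimal sketch of
converting and ordering a column (the values below are made up):

import pandas as pd

s = pd.Series(['low', 'high', 'medium', 'low'])
# an ordered categorical behaves like an ordered R factor
cat = s.astype(pd.CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True))
print(cat.cat.codes)   # integer code for each value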
Conclusion:
In this experiment we have studied data visualization and performed it on the iris
dataset in different ways.
In [2]:
import seaborn as sns
In [3]:
df =sns.load_dataset('iris')
In [4]:
df
In [6]:
#list down the features and their types available in the dataset
df.columns
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
In [8]:
df.dtypes
Out[8]: sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
species object
dtype: object
In [10]:
sns.pairplot(df,hue='species')
In [11]:
sns.pairplot(df)
In [12]:
sns.pairplot(df,hue='species',diag_kind='hist')
In [13]:
sns.histplot(df['sepal_length'],kde=True)
In [14]:
sns.histplot(df['sepal_width'],kde=True)
In [16]:
sns.kdeplot(df['sepal_width'])
In [17]:
sns.histplot(df['petal_length'],kde=True)
In [18]:
sns.histplot(df['petal_width'],kde=True)
In [19]:
sns.kdeplot(df['petal_width'])
In [20]:
sns.boxplot(x=df['sepal_length']);
In [21]:
sns.boxplot(x=df['sepal_width']);
In [22]:
sns.boxplot(x=df['petal_length']);
In [23]:
sns.boxplot(x=df['petal_width']);
Experiment No. 11
------------------------------------------------------------------------------------------------------------
Aim:
Use the following dataset and classify tweets into positive and negative tweets.
https://www.kaggle.com/ruchi798/data-science-tweet
------------------------------------------------------------------------------------------------------------
Requirement:
• Anaconda Installer
• Windows 10 OS
• Jupyter Notebook
Theory:
What is Sentiment Analysis?
Sentiment Analysis is the process of ‘computationally’ determining whether a
piece of writing is positive, negative or neutral. It’s also known as opinion mining,
deriving the opinion or attitude of a speaker.
Why Sentiment Analysis?
• Business: In marketing field companies use it to develop their strategies, to
understand customers’ feelings towards products or brand, how people respond
to their campaigns or product launches and why consumers don’t buy some
products.
• Politics: In political field, it is used to keep track of political view, to detect
consistency and inconsistency between statements and actions at the
government level. It can be used to predict election results as well!
• Public Actions: Sentiment analysis also is used to monitor and analyse social
phenomena, for the spotting of potentially dangerous situations and determining
the general mood of the blogosphere.
Libraries Used:
1. Pandas: Pandas is an open source Python package that is most widely used for
data science/data analysis and machine learning tasks. It is built on top of another
package named Numpy, which provides support for multi-dimensional arrays.
2. String: The string module contains a number of functions to process standard
Python strings.
3. Sklearn: It provides a selection of efficient tools for machine learning and statistical
modeling including classification, regression, clustering and dimensionality reduction via
a consistence interface in Python.
Conclusion:
Hence, we successfully implemented sentiment analysis using Python.
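The notebook for this experiment appears only as screenshots in the source; a minimal
bag-of-words sketch could look like the following (the file name tweets.csv and the
column names 'text' and 'label' are placeholders that must be matched to the downloaded
dataset):

import pandas as pd
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

df = pd.read_csv('tweets.csv')            # placeholder file name

# simple cleaning: lowercase and strip punctuation
df['text'] = df['text'].str.lower().str.translate(
    str.maketrans('', '', string.punctuation))

x = df['text']     # tweet text (placeholder column name)
y = df['label']    # 0 = negative, 1 = positive (placeholder column name)

vec = CountVectorizer(stop_words='english')
x_vec = vec.fit_transform(x)

x_train, x_test, y_train, y_test = train_test_split(x_vec, y, random_state=0, test_size=0.25)

clf = MultinomialNB()
clf.fit(x_train, y_train)
print(accuracy_score(y_test, clf.predict(x_test)))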
Mini Project
------------------------------------------------------------------------------------------------------------
Aim: Perform exploratory data analysis on the covid_vaccine_statewise.csv dataset using Python.
------------------------------------------------------------------------------------------------------------
Theory:
About this dataset:
Coronaviruses are a large family of viruses which may cause illness in animals or
humans. In humans, several coronaviruses are known to cause respiratory infections
ranging from the common cold to more severe diseases such as Middle East
Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). The
most recently discovered coronavirus causes coronavirus disease COVID-19 (World
Health Organization). The number of new cases is increasing day by day around the
world. This dataset has information from the states and union territories of India at a
daily level.
State level data comes from the Ministry of Health & Family Welfare; testing data and
vaccination data come from covid19india. Huge thanks to them for their efforts! Update
on April 20, 2021: thanks to the team at ISIBang, I was able to get the historical data for
the periods that I had missed and updated the csv file.
Libraries Used:
1. Pandas: Pandas is an open source Python package that is most widely used for
data science/data analysis and machine learning tasks. It is built on top of another
package named Numpy, which provides support for multi-dimensional arrays.
Conclusion:
Hence, we successfully implemented a mini project using the
covid_vaccine_statewise.csv dataset.
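The analysis notebook appears only as screenshots in the source; a minimal exploratory
sketch could look like this (the column names 'State' and 'Total Doses Administered'
are assumptions about the Kaggle file and should be checked against df.columns first):

import pandas as pd

df = pd.read_csv('covid_vaccine_statewise.csv')

print(df.shape)
print(df.columns)          # verify the actual column names first
print(df.isnull().sum())   # missing values per column
print(df.describe())

# example aggregation: latest cumulative doses per state
# (column names below are assumptions)
totals = df.groupby('State')['Total Doses Administered'].max()
print(totals.sort_values(ascending=False).head())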