Fds Lab Manual
Fds Lab Manual
Aim
To Download, install and explore the features of Numpy, scipy, Jupyter, Statsmodel and
Pandas Package.
Procedure
For problem solvers, installing and using the Anaconda distribution of Python. This
section details the installation of the Anaconda distribution of Python on Windows 10. I think
the Anaconda distribution of Python is the best option for problem solvers who want to use
Python. Anaconda is free (although the download is large which can take time) and can be
installed on school or work computers where you don't have administrator access or the
ability to install new programs. Anaconda comes bundled with about 600 packages pre-
installed including NumPy, Matplotlib and SymPy.
Follow the steps below to install the Anaconda distribution of Python on Windows.
Steps:
1. Visit Anaconda.com/downloads
2. Select Windows
3. Download the .exe installer
4. Open and run the .exe installer
5. Open the Anaconda Prompt and run some Python code
1
2. Select Windows
3. Download
Download the most recent Python 3 release. At the time of writing, the most recent
release was the Python 3.6 Version. Python 2.7 is legacy Python. For problem solvers, select
the Python 3.6 version. If you are unsure if your computer is running a 64-bit or 32-bit
version of Windows, select 64-bit as 64-bit Windows is most common.
You may be prompted to enter your email. You can still download Anaconda if you click [No
Thanks] and don't enter your Work Email address.
2
The download is quite large (over 500 MB) so it may take a while to for Anaconda to
download.
Once the download completes, open and run the .exe installer
At the beginning of the install, you need to click Next to confirm the installation.
3
Then agree to the license.
At the Advanced Installation Options screen, I recommend that you do not check "Add
Anaconda to my PATH environment variable"
After the installation of Anaconda is complete, you can go to the Windows start menu and
select the Anaconda Prompt.
4
This opens the Anaconda Prompt. Anaconda is the Python distribution and the Anaconda
Prompt is a command line shell (a program where you type in commands instead of using a
mouse). The black screen and text that makes up the Anaconda Prompt doesn't look like
much, but it is really helpful for problem solvers using Python.At the Anaconda prompt,
type python and hit [Enter]. The python command starts the Python interpreter, also called
the Python REPL (for Read Evaluate Print Loop).
> python
Note the Python version. You should see something like Python 3.6.1. With the interpreter
running, you will see a set of greater-than symbols >>> before the cursor.
Now you can type Python commands. Try typing import this. You should see the Zen of
Python by Tim Peters
5
To close the Python interpreter, type exit() at the prompt >>>. Note the double parenthesis at
the end of the exit() command. The () is needed to stop the Python interpreter and get back
out to the Anaconda Prompt.
To close the Anaconda Prompt, you can either close the window with the mouse, or
type exit, no parenthesis necessary.
When you want to use the Python interpreter again, just click the Windows Start button and
select the Anaconda Prompt and type python.
• Click the link below to download an environment file. This file contains a list of
common packages and libraries for doing data science in Python. Remember where
you save the file environment.yml. You'll need that path shortly. You don't need to
open that file right now.
o Windows
o OSX
• Once the download finishes, open the command line by doing the following:
o Windows - Hit "Start" and then type "Command Prompt" and use that
terminal.
o OSX - Type Cmd+Space and then enter Terminal in the search box to open
the terminal.
6
• Run the following commands, which will install the package and put you in the
tutorial environment.
o conda env create -f <PATH_TO_ENVIRONMENT.YML> - You'll need to
replace <PATH_TO_ENVIRONMENT.YML> with the actual path where the
file was downloaded. For OSX, that's often
(/Users/<USERNAME>/Downloads/environment.yml). For Windows, it is
usually C:/Users/<USERNAME>/Downloads/environment.yml. You'll have
to replace <USERNAME> with your username on your machine.
• That will download all a set of packages that are commonly used for data science in
Python. When it finishes, you can activate the environment with the following
command:
o Windows - activate tutorial
o OSX - source activate tutorial
In this step, we'll make sure everything is working by running the Jupyter
Notebook. Jupyter Notebook is a tool for doing interactive data science work in your
browser. * In your command prompt with the tutorial environment activated (Note: you'll
be able to tell because your command prompt will say (tutorial) at the start of it.) * Type the
following command jupyter notebook . * A browser window will open, showing the Jupyer
environment. By default, you will be in a file browser view. * In the file browser, find where
you have a Jupyter notebook. If you don't have materials for a course or tutorial that you have
downloaded, you can download this fun Jupyter notebook and then open it in the file
browser. * Click on one of the notebook (*.ipynb) files to get started!
• Hit Ctrl+c to stop the Jupyter notebook server running on your machine. (Make sure
to use Ctrl+s in the notebook to save it first!)
7
Feature of python package:
Pandas is a free Python software library for data analysis and data handling. It was
created as a community library project and initially released around 2008. Pandas provides
various high-performance and easy-to-use data structures and operations for manipulating
data in the form of numerical tables and time series. Pandas also has multiple tools for
reading and writing data between in-memory data structures and different file formats. In
short, it is perfect for quick and easy data manipulation, data aggregation, reading, and
writing the data as well as data visualization. Pandas can also take in data from different
types of files such as CSV, excel etc.or a SQL database and create a Python object known as
a data frame. A data frame contains rows and columns and it can be used for data
manipulation with operations such as join, merge, groupby, concatenate etc.
2. NumPy
NumPy is a free Python software library for numerical computing on data that can be
in the form of large arrays and multi-dimensional matrices. These multidimensional matrices
are the main objects in NumPy where their dimensions are called axes and the number of
axes is called a rank. NumPy also provides various tools to work with these arrays and high-
level mathematical functions to manipulate this data with linear algebra, Fourier transforms,
random number crunchings, etc. Some of the basic array operations that can be performed
using NumPy include adding, slicing, multiplying, flattening, reshaping, and indexing the
arrays. Other advanced functions include stacking the arrays, splitting them into sections,
broadcasting arrays, etc.
3. SciPy
SciPy is a free software library for scientific computing and technical computing on
the data. It was created as a community library project and initially released around 2001.
SciPy library is built on the NumPy array object and it is part of the NumPy stack which also
includes other scientific computing libraries and tools such as Matplotlib, SymPy, pandas
8
etc. This NumPy stack has users which also use comparable applications such as GNU
Octave, MATLAB, GNU Octave, Scilab, etc. SciPy allows for various scientific computing
tasks that handle data optimization, data integration, data interpolation, and data modification
using linear algebra, Fourier transforms, random number generation, special functions, etc.
Just like NumPy, the multidimensional matrices are the main objects in SciPy, which are
provided by the NumPy module itself.
4. Scikit-learn
Scikit-learn is a free software library for Machine Learning coding primarily in the
Python programming language. It was initially developed as a Google Summer of Code
project by David Cournapeau and originally released in June 2007. Scikit-learn is built on top
of other Python libraries like NumPy, SciPy, Matplotlib, Pandas, etc. and so it provides full
interoperability with these libraries. While Scikit-learn is written mainly in Python, it has
also used Cython to write some core algorithms in order to improve performance. You can
implement various Supervised and Unsupervised Machine learning models on Scikit-learn
like Classification, Regression, Support Vector Machines, Random Forests, Nearest
Neighbors, Naive Bayes, Decision Trees, Clustering, etc. with Scikit-learn.
5. TensorFlow
TensorFlow is a free end-to-end open-source platform that has a wide variety of tools,
libraries, and resources for Artificial Intelligence. It was developed by the Google Brain team
and initially released on November 9, 2015. You can easily build and train Machine
Learning models with high-level API’s such as Keras using TensorFlow. It also provides
multiple levels of abstraction so you can choose the option you need for your model.
TensorFlow also allows you to deploy Machine Learning models anywhere such as the cloud,
browser, or your own device. You should use TensorFlow Extended (TFX) if you want the
full experience, TensorFlow Lite if you want usage on mobile devices, and TensorFlow.js if
you want to train and deploy models in JavaScript environments. TensorFlow is available
for Python and C APIs and also for C++, Java, JavaScript, Go, Swift, etc. but without an API
9
backward compatibility guarantee. Third-party packages are also available for MATLAB, C#,
Julia, Scala, R, Rust, etc.
6. Keras
Matplotlib is a data visualization library and 2-D plotting library of Python It was
initially released in 2003 and it is the most popular and widely-used plotting library in the
Python community. It comes with an interactive environment across multiple platforms.
Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter
notebook, web application servers etc. It can be used to embed plots into applications using
various GUI toolkits like Tkinter, GTK+, wxPython, Qt, etc. So you can use Matplotlib
to create plots, bar charts, pie charts, histograms, scatterplots, error charts, power spectra,
stemplots, and whatever other visualization charts you want! The Pyplot module also
provides a MATLAB-like interface that is just as versatile and useful as MATLAB while
being totally free and open source.
10
2. Seaborn
Seaborn is a Python data visualization library that is based on Matplotlib and closely
integrated with the numpy and pandas data structures. Seaborn has various dataset-oriented
plotting functions that operate on data frames and arrays that have whole datasets within
them. Then it internally performs the necessary statistical aggregation and mapping functions
to create informative plots that the user desires. It is a high-level interface for creating
beautiful and informative statistical graphics that are integral to exploring and understanding
data. The Seaborn data graphics can include bar charts, pie charts, histograms, scatterplots,
error charts, etc. Seaborn also has various tools for choosing color palettes that can reveal
patterns in the data.
3. Plotly
Plotly is a free open-source graphing library that can be used to form data
visualizations. Plotly (plotly.py) is built on top of the Plotly JavaScript library (plotly.js) and
can be used to create web-based data visualizations that can be displayed in Jupyter
notebooks or web applications using Dash or saved as individual HTML files. Plotly provides
more than 40 unique chart types like scatter plots, histograms, line charts, bar charts, pie
charts, error bars, box plots, multiple axes, sparklines, dendrograms, 3-D charts, etc. Plotly
also provides contour plots, which are not that common in other data visualization libraries.
In addition to all this, Plotly can be used offline with no internet connection.
4. GGplot
11
create highly customised graphics in ggplot. Ggplot is also deeply connected with pandas so
it is best to keep the data in DataFrames.
Results
Thus the Anaconda Navigator has successfully downloaded, installed and explored the
features of Numpy, scipy, Jupyter, Statsmodel and Pandas Package.
12
EX NO: 2
AIM
Write a Python program to demonstrate basic array characteristics.
ALGORITHM
Step 1: Start
Step 2: Import numpy module
Step 3: Print the basic characteristics of array
Step 4: Stop
Program:
import numpy as np
arr=np.array([[1,2,3],[4,2,5]]) print
("Array is of type:",type (arr)) print ("no.of
dimensions:",arr.ndim) print ("Shape of
array:",arr.shape) print ("Size of
array:",arr.size) print ("Array stores
elements of type:",arr.dtype)
output:
Array is of type: <class
'numpy.ndarray'> no.of
dimensions: 2 Shape of
array: (2, 3)
Size of array: 6
Array stores elements of type: int32
RESULT
Thus the python program working with NumPy array has been implemented and executed
successfully.
EX NO 3
AIM
Write a program to create a dataframe using a list of elements.
ALGORITHM
Step1: Start
Step2: import numpy and pandas module
Step3: Create a dataframe using list of elements
Step4: Print the output
Step5: Stop
Program:
import pandas as pd
lst=['A','B','C','D','E','F','G']
df=pd.DataFram
e(lst) print(df)
Output:
0
0 A
1 B
2 C
3 D
4 E
5 F 6 G
RESULT
Thus the python program for dataframe using list of elements
has been implemented and executed successfully
Ex: 4
Reading data from text file, Excel and the web and exploring various
Command for doing descriptive analysis on the iris dataset
Aim
To Read data from text file, Excel and the web and exploring various Command for
doing descriptive analysis on the iris dataset.
Procedure:
Exploratory Data Analysis (EDA) is a technique to analyze data using some visual
Techniques. With this technique, we can get detailed information about the statistical summary of
the data. We will also be able to deal with the duplicates values, outliers, and also see some trends or
patterns present in the dataset.
Now let’s see a brief about the Iris dataset.
Iris Dataset
Iris Dataset is considered as the Hello World for data science. It contains five columns
namely – Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a
flowering plant, the researchers have measured various features of the different iris flowers and
recorded them digitally.
You can download the Iris.csv file from the above link. Now we will use the Pandas library
to load this CSV file, and we will convert it into the dataframe. read_csv() method is used to read
CSV files.
Program:
import pandas as pd
df=pd.read_csv('iris_csv.csv') df.head()
Output:
sepalwidth petallength petalwidth class
sepallength
Program:
df.shape output:
(150, 5)
Program:
df.info(
output
:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to
149 Data columns (total 5
columns):
# Column Non-Null Count Dtype
df.describe()
outpu
t:
sepallength sepalwidth petallength petalwidth
Program:
df.isnull().sum(
)
output:
sepallength 0
sepalwidth 0
petallength 0
petalwidth 0
class 0
dtype: int64
program:
data=df.drop_duplicates(subset="class",) data
outpu
t:
sepallength sepalwidth petallength petalwidth class
Program:
df.value_counts("class")
output:
class
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: count, dtype: int64
Program:
sns.countplot(x='class',data=df,) plt.show()
output:
Program:
output:
Program:
output:
Program:
output:
<seaborn.axisgrid.PairGrid at 0x239cd47ad90>
Program:
output:
Program:
output:
Program:
data.corr(method.pearson)
output:
program:
output:
program:
# importing packages
import seaborn as sns
import
matplotlib.pyplot as
plt def graph(y):
sns.boxplot(x="class", y="class",
data=df) plt.figure(figsize=(10,10))
# Adding the subplot at the
specified # grid position
plt.subplot(221)
graph('sepallen
gth')
plt.subplot(222
)
graph('sepalwi
dth')
plt.subplot(223
)
graph('petallen
gth')
plt.subplot(224
)
graph('petalwidth') plt.show()
output:
Results:
Thus the text file, Excel and the web and exploring various Command for doing
descriptive analysis on the iris dataset has successfully completed.
EX NO:5
Use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the following:
a. Univariate analysis: frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis
b. Bivariate analysis: Linear and logistic regression modelling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets
a.Univariate analysis: frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis
Aim
To perform Univariate analysis: frequency, Mean, Median, Mode, Variance,
Standard Deviation, Skewness and Kurtosis
Program :
import numpy as np
import pandas as pd
import matplotlib.pyplot as
pltimport statsmodels.api as
sm
df =
pd.read_csv('D:\Studentdetails.csv')
def UVA_numeric(data):
var_group =
data.columnssize =
len(var_group)
mini = data[i].min()
plt.subplot(1,size,j+1)
sns.distplot(data[i],hist=True,
kde=True)
sns.lineplot(points, [0,0], color = 'black', label = "std_dev")
plt.ylabel('density')
n,2)))UVA_numeric(df)
OUTPUT:
Aim
Linear Regression
Code:
# Import the
librariesimport
pandas as pd
import numpy as
np
import matplotlib.pyplot as plt
from sklearn import datasets
diabetes =
datasets.load_diabetes()diabetes
OUTPUT:
{'data': array([[ 0.03807591, 0.05068012, 0.06169621, ..., -0.00259226,
0.01990749, -0.01764613],
[-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
-0.06833155, -0.09220405],
[ 0.08529891, 0.05068012, 0.04445121, ..., -0.00259226,
0.00286131, -0.02593034],
...,
[ 0.04170844, 0.05068012, -0.01590626, ..., -0.01107952,
-0.04688253, 0.01549073],
[-0.04547248, -0.04464164, 0.03906215, ..., 0.02655962,
0.04452873, -0.02593034],
[-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
-0.00422151, 0.00306441]]),
'target': array([151., 75., 141., 206., 135., 97., 138., 63., 110., 310., 101.,
69., 179., 185., 118., 171., 166., 144., 97., 168., 68., 49.,
68., 245., 184., 202., 137., 85., 131., 283., 129., 59., 341.,
87., 65., 102., 265., 276., 252., 90., 100., 55., 61., 92.,
259., 53., 190., 142., 75., 142., 155., 225., 59., 104., 182.,
128., 52., 37., 170., 170., 61., 144., 52., 128., 71., 163.,
150., 97., 160., 178., 48., 270., 202., 111., 85., 42., 170.,
200., 252., 113., 143., 51., 52., 210., 65., 141., 55., 134.,
42., 111., 98., 164., 48., 96., 90., 162., 150., 279., 92.,
83., 128., 102., 302., 198., 95., 53., 134., 144., 232., 81.,
104., 59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
173., 180., 84., 121., 161., 99., 109., 115., 268., 274., 158.,
107., 83., 103., 272., 85., 280., 336., 281., 118., 317., 235.,
60., 174., 259., 178., 128., 96., 126., 288., 88., 292., 71.,
197., 186., 25., 84., 96., 195., 53., 217., 172., 131., 214.,
59., 70., 220., 268., 152., 47., 74., 295., 101., 151., 127.,
237., 225., 81., 151., 107., 64., 138., 185., 265., 101., 137.,
143., 141., 79., 292., 178., 91., 116., 86., 122., 72., 129.,
142., 90., 158., 39., 196., 222., 277., 99., 196., 202., 155.,
77., 191., 70., 73., 49., 65., 263., 248., 296., 214., 185.,
78., 93., 252., 150., 77., 208., 77., 108., 160., 53., 220.,
154., 259., 90., 246., 124., 67., 72., 257., 262., 275., 177.,
71., 47., 187., 125., 78., 51., 258., 215., 303., 243., 91.,
150., 310., 153., 346., 63., 89., 50., 39., 103., 308., 116.,
145., 74., 45., 115., 264., 87., 202., 127., 182., 241., 66.,
94., 283., 64., 102., 200., 265., 94., 230., 181., 156., 233.,
60., 219., 80., 68., 332., 248., 84., 200., 55., 85., 89.,
31., 129., 83., 275., 65., 198., 236., 253., 124., 44., 172.,
114., 142., 109., 180., 144., 163., 147., 97., 220., 190., 109.,
191., 122., 230., 242., 248., 249., 192., 131., 237., 78., 135.,
244., 199., 270., 164., 72., 96., 306., 91., 214., 95., 216.,
263., 178., 113., 200., 139., 139., 88., 148., 88., 243., 71.,
77., 109., 272., 60., 54., 221., 90., 311., 281., 182., 321.,
58., 262., 206., 233., 242., 123., 167., 63., 197., 71., 168.,
140., 217., 121., 235., 245., 40., 52., 104., 132., 88., 69.,
219., 72., 201., 110., 51., 277., 63., 118., 69., 273., 258.,
43., 198., 242., 232., 175., 93., 168., 275., 293., 281., 72.,
140., 189., 181., 209., 136., 261., 113., 131., 174., 257., 55.,
84., 42., 146., 212., 233., 91., 111., 152., 120., 67., 310.,
94., 183., 66., 173., 72., 49., 64., 48., 178., 104., 132.,
220., 57.]),
'frame': None,
'DESCR': '.. _diabetes_dataset:\n\nDiabetes dataset\n \n\nTen baseline variables, age, sex,
body
mass index, average blood\npressure, and six blood serum measurements were obtained
for each of n =\n442 diabetes patients, as well as the response of interest, a\nquantitative
measure of disease progression one year after baseline.\n\n**Data Set
Characteristics:**\n\n :Number of Instances: 442\n\n :Number of Attributes: First 10
columns are numeric predictive values\n\n :Target: Column 11 is a quantitative measure
of disease progression one year after baseline\n\n :Attribute Information:\n - age age
in years\n - sex\n - bmi body mass index\n - bp
average blood pressure\n - s1 tc, total serum cholesterol\n - s2
ldl, low-density lipoproteins\n - s3 hdl, high-density
lipoproteins\n - s4 tch, total cholesterol / HDL\n
- s5 ltg, possibly log of serum triglycerides level\n - s6 glu, blood sugar
level\n\nNote: Each of these 10 feature variables have been mean centered and scaled by
the standard deviation times the square root of
`n_samples` (i.e. the sum of squares of each column totals 1).\n\nSource
URL:\nhttps://www4.stat.ncsu.edu/~boos/var.select/diabetes.html\n\nFor more
information see:\nBradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani
(2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-
499.\n(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)\n',
'feature_names': ['age',
'sex',
'bmi',
'bp',
's1',
's2',
's3',
's4',
's5',
's6'],
'data_filename':
'diabetes_data_raw.csv.gz',
'target_filename':
'diabetes_target.csv.gz',
'data_module':
'sklearn.datasets.data'}
Code:
print(diabetes.DESCR)
OUTPUT:
.. _diabetes_dataset:
Diabetes dataset
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, total serum cholesterol
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, total cholesterol / HDL
- s5 ltg, possibly log of serum triglycerides level
- s6 glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled
by the standard deviation times the square root of `n_samples` (i.e. the
sum of squares of each column totals 1).
Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
CODE:
# columns
diabetes.feature_names
OUTPUT:
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
CODE:
# Now we will split the data into the independent and independent variable
X = diabetes.data
Y = diabetes.target
X.shape, Y.shape
OUTPUT:
((442, 10), (442,))
CODE:
Y
OUTPUT:
array([151., 75., 141., 206., 135., 97., 138., 63., 110., 310., 101.,
69., 179., 185., 118., 171., 166., 144., 97., 168., 68., 49.,
68., 245., 184., 202., 137., 85., 131., 283., 129., 59., 341.,
87., 65., 102., 265., 276., 252., 90., 100., 55., 61., 92.,
259., 53., 190., 142., 75., 142., 155., 225., 59., 104., 182.,
128., 52., 37., 170., 170., 61., 144., 52., 128., 71., 163.,
150., 97., 160., 178., 48., 270., 202., 111., 85., 42., 170.,
200., 252., 113., 143., 51., 52., 210., 65., 141., 55., 134.,
42., 111., 98., 164., 48., 96., 90., 162., 150., 279., 92.,
83., 128., 102., 302., 198., 95., 53., 134., 144., 232., 81.,
104., 59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
173., 180., 84., 121., 161., 99., 109., 115., 268., 274., 158.,
107., 83., 103., 272., 85., 280., 336., 281., 118., 317., 235.,
60., 174., 259., 178., 128., 96., 126., 288., 88., 292., 71.,
197., 186., 25., 84., 96., 195., 53., 217., 172., 131., 214.,
59., 70., 220., 268., 152., 47., 74., 295., 101., 151., 127.,
237., 225., 81., 151., 107., 64., 138., 185., 265., 101., 137.,
143., 141., 79., 292., 178., 91., 116., 86., 122., 72., 129.,
142., 90., 158., 39., 196., 222., 277., 99., 196., 202., 155.,
77., 191., 70., 73., 49., 65., 263., 248., 296., 214., 185.,
78., 93., 252., 150., 77., 208., 77., 108., 160., 53., 220.,
154., 259., 90., 246., 124., 67., 72., 257., 262., 275., 177.,
71., 47., 187., 125., 78., 51., 258., 215., 303., 243., 91.,
150., 310., 153., 346., 63., 89., 50., 39., 103., 308., 116.,
145., 74., 45., 115., 264., 87., 202., 127., 182., 241., 66.,
94., 283., 64., 102., 200., 265., 94., 230., 181., 156., 233.,
60., 219., 80., 68., 332., 248., 84., 200., 55., 85., 89.,
31., 129., 83., 275., 65., 198., 236., 253., 124., 44., 172.,
114., 142., 109., 180., 144., 163., 147., 97., 220., 190., 109.,
191., 122., 230., 242., 248., 249., 192., 131., 237., 78., 135.,
244., 199., 270., 164., 72., 96., 306., 91., 214., 95., 216.,
263., 178., 113., 200., 139., 139., 88., 148., 88., 243., 71.,
77., 109., 272., 60., 54., 221., 90., 311., 281., 182., 321.,
58., 262., 206., 233., 242., 123., 167., 63., 197., 71., 168.,
140., 217., 121., 235., 245., 40., 52., 104., 132., 88., 69.,
219., 72., 201., 110., 51., 277., 63., 118., 69., 273., 258.,
43., 198., 242., 232., 175., 93., 168., 275., 293., 281., 72.,
140., 189., 181., 209., 136., 261., 113., 131., 174., 257., 55.,
84., 42., 146., 212., 233., 91., 111., 152., 120., 67., 310.,
94., 183., 66., 173., 72., 49., 64., 48., 178., 104., 132.,
220., 57.])
CODE:
# We will split the data into training and testing data
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X,Y,test_size=0.3,random_state=99)
train_x.shape, train_y.shape
OUTPUT:
((309, 10), (309,))
CODE:
# Linear Regression
le = LinearRegression()
le.fit(train_x,train_y)
y_pred = le.predict(test_x)
y_pred
OUTPUT:
# Linear Regression
from sklearn.linear_model import LinearRegression
le = LinearRegression()
le.fit(train_x,train_y)
y_pred = le.predict(test_x)
y_pred
CODE:
# we will check the accuracy
print('coefficient', le.coef_)
print('intercept', le.intercept_)
OUTPUT:
coefficient [ 40.66179183 -313.2934701 517.18345014 386.06561935 -
604.64289258
275.31448303 3.92015562 172.39101348 661.95332633 62.26068823]
intercept 155.59120125597175
CODE:
from sklearn.metrics import mean_squared_error, r2_score
# mean_squared_error
mean_squared_error(test_y,y_pred)
# r2 score
r2_score(test_y,y_pred)
OUTPUT:
0.4545709909725646
CODE:
df.head()
OUTPUT:
CODE:
df.info()
OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
CODE:
df.describe()
OUTPUT:
CODE:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
OUTPUT:
<Axes: >
CODE:
sns.countplot(x='Outcome',data=df)
OUTPUT:
<Axes: xlabel='Outcome', ylabel='count'>
CODE:
sns.distplot(df['Age'].dropna(),kde=True)
OUTPUT:
C:\Users\mca1\AppData\Local\Temp\ipykernel_9620\436136364.py:1:
UserWarning:
sns.distplot(df['Age'].dropna(),kde=True)
<Axes: xlabel='Age', ylabel='Density'>
CODE:
df.corr()
OUTPUT:
CODE:
sns.heatmap(df.corr())
OUTPUT:
<Axes: >
CODE:
x = df.drop('Outcome',axis=1)
y = df['Outcome']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=101)
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(x_train,y_train)
OUTPUT:
C:\Users\mca1\anaconda3\Lib\site-
packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning: lbfgs
failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-
regression
n_iter_i = _check_optimize_result(
LogisticRegression
LogisticRegression()
CODE:
predictions = logmodel.predict(x_test)
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))
OUTPUT:
precision recall f1-score support
CODE:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,predictions)
OUTPUT:
array([[133, 17],
[ 30, 51]], dtype=int64)
Aim
CODE:
Import numpy as np
fromsklearn.linear_model import LinearRegression
x = [[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]]
y = [4, 5, 20, 14, 32, 22, 38, 43]
x,y=np.array(x),np.array(y)
model=LinearRegression().fit(x,y)
r_sq = model.score(x,y)
print('coefficient of determination:', r_sq)
print('intercept:', model.intercept_)
print('slope:', model.coef_)
y_pred = model.predict(x)
print('predicted response:', y_pred)
OUTPUT:
coefficient of determination: 0.8615939258756776
intercept: 5.52257927519819
slope: [0.44706965 0.25502548]
predicted response: [ 5.77760476 8.012953 12.73867497 17.9744479 23.97529728 29.4660957
38.78227633 41.27265006]
d. Comparison of dataset
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# disable warnings
import warnings
warnings.filterwarnings('ignore')
data=pd.read_csv('diabetes.csv')
data1=pd.read_csv('diabetes.csv')
data=pd.read_csv('diabetes.csv')
data1=pd.read_csv('diabetes.csv')
print(data.columns)
print(data1.columns)
OUTPUT:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')
PROGRAM:
corrmat = data.corr()
f, ax = plt.subplots(figsize=(12, 9))
corrmat = data1.corr()
f, ax = plt.subplots(figsize=(12, 9))
OUTPUT:
code:
sns.set()
cols=['Pregnancies','Glucose','BloodPressure','Insulin','BMI','DiabetesPedigreeFunction','Age',
'Outcome']
sns.pairplot(data[cols], size = 2.5)
plt.show();
sns.set()
cols=['Pregnancies','Glucose','BloodPressure','Insulin','BMI','DiabetesPedigreeFunction','Age',
'Outcome']
plt.show()
Output:
Results
a. Normal curves
Program:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
df = pd.read_csv('Marks.csv')
def UVA_numeric(data):
var_group = data.columns
size = len(var_group)
plt.figure(figsize = (7*size,3), dpi = 400)
#looping for each variable
for j,i in enumerate(var_group):
# calculating descriptives of variable
mini = data[i].min()
maxi = data[i].max()
ran = data[i].max()-data[i].min()
mean = data[i].mean()
median = data[i].median()
st_dev = data[i].std()
skew = data[i].skew()
kurt = data[i].kurtosis()
# calculating points of standard deviation
points = mean-st_dev, mean+st_dev
#Plotting the variable with every information
plt.subplot(1,size,j+1)
sns.distplot(data[i],hist=True, kde=True)
sns.lineplot(points, [0,0], color = 'black', label = "std_dev")
sns.scatterplot([mini,maxi], [0,0], color = 'orange', label = "min/max")
sns.scatterplot([mean], [0], color = 'red', label = "mean")
sns.scatterplot([median], [0], color = 'blue', label = "median")
plt.xlabel('{}'.format(i), fontsize = 20)
plt.ylabel('density')
plt.title('std_dev = {}; kurtosis = {};\nskew = {}; range = {}\nmean = {}; median =
{}'.format((round(points[0],2),round(points[1],2)),
round(kurt,2),
round(skew,2),
(round(mini,2),round(maxi,2),round(ran,2)),
round(mean,2),
round(median,2)))
UVA_numeric(df)
Output:
PROGRAM:
import numpy as np
import seaborn as sb
data = np.arange(1,10,0.01)
sb.set_style('whitegrid'),
sb.lineplot(pdf)
plt.xlabel('Heights')
plt.ylabel('Probability Density'
OUTPUT:
... ... ... ... ... ... ... ... ... ...
PROGRAM:
list(con.columns)
OUTPUT:
['cement',
'slag',
'ash',
'water',
'superplastic',
'coarseagg',
'fineagg',
'age',
'strength']
In [ ]:
PROGRAM:
con.rename(columns={'Fly ash': 'FlyAsh', 'Coarse Aggr.': "CoarseAgg" ,'Fine Aggr.': 'FineAgg', 'Age': 'Age',
'Compressive Strength (28-day)(Mpa)': 'Strength'}, inplace=True)
con.head()
OUTPUT:
PROGRAM:
con['age'] = con['age'].astype('category')
con.describe(include='category')
OUTPUT:
age
count 1030
unique 14
top 28
freq 42
PROGRAM:
OUTPUT:
PROGRAM:
ax.set_xlabel("Fly ash")
]
OUTPUT:
PROGRAM:
OUTPUT:
PROGRAM:
OUTPUT:
PearsonRResult(statistic=0.9999999999999997, pvalue=1.123412337643488e-23)
PROGRAM:
Codecormat = con.corr()
round(Codecormat,2)
OUTPUT:
d.Histogram
Code:
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_excel('https://github.com/datagy/Intro-to-Python/raw/master/sportsdata.xls',
usecols=['Age'])
print(df.describe())
plt.hist(df['Age'])
OUTPUT:
Code:
# importing mplot3d toolkits
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
# syntax for 3-D projection
ax = plt.axes(projection ='3d')
# defining axes
z = np.linspace(0, 1, 100)
x = z * np.sin(25 * z)
y = z * np.cos(25 * z)
c=x+y
ax.scatter(x, y, z, c = c)
# syntax for plotting
ax.set_title('3d Scatter plot geeks for geeks')
plt.show()
OUTPUT:
Code:
# importing libraries
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
# defining surface and axes
x = np.outer(np.linspace(-2, 2, 10), np.ones(10))
y = x.copy().T
z = np.cos(x ** 2 + y ** 3)
fig = plt.figure()
# syntax for 3-D plotting
ax = plt.axes(projection ='3d')
# syntax for plotting
ax.plot_surface(x, y, z, cmap ='viridis', edgecolor ='green')
ax.set_title('Surface plot geeks for geeks')
plt.show()
Output:
Results
Thus the various plotting curve has been successfully completed
EX. 7
Apply and explore various plotting functions
Aim
To Apply and explore various plotting functions
Procedures
Geographic data
One common type of visualization in data science is that of geographic data. Matplotlib's main tool
for this type of visualization is the Basemap toolkit, which is one of several Matplotlib toolkits
which lives under the mpl_toolkits namespace. Admittedly, Basemap feels a bit clunky to use, and
often even simple visualizations take much longer to render than you might hope. More modern
solutions such as leaflet or the Google Maps API may be a better choice for more intensive map
visualizations. Still, Basemap is a useful tool for Python users to have in their virtual toolbelts. In
this section, we'll show several examples of the type of map visualization that is possible with this
toolkit.
Installation of Basemap is straightforward; if you're using conda you can type this and the package
will be downloaded:
Program:
## Package Plan ##
basemap-data-hires conda-forge/noarch::basemap-data-hires-
1.3.2pyhd8ed1ab_3
Note: you may need to restart the kernel to use updated packages
Program:
Clipping input data to the valid range for imshow with RGB
data ([0..1] for floats or [0..255] for integers).
<matplotlib.image.AxesImage at 0x18859f04cd0>
Program:
fig=plt.figure(figsize=(8,8))
m=Basemap(projection='lcc',resolution=None,wid
th=8E6, height=8E6,lat_0=45,lon_0=-100,)
m.etopo(scale=0.5,alpha=0.5)
# Map (long,lat) to (x,y) for plotting
x,y=m(- 122.3,47.6)
plt.plot(x,y,'ok',markersize=5)
plt.text(x,y,'seattle',fontsize=12)
Output:
Clipping input data to the valid range for imshow with RGB
data ([0..1] for floats or [0..255] for integers).
Text(2347268.222744085, 4518079.266407731, 'seattle')
Program:
from itertools import chain
def draw_map(m,scale=0.2):
#draw a shaded-relief
image
m.shadedrelief(scale=scale)
#lats and longs are retrurned as dictonary
lats=m.drawparallels(np.linespace(-90,90,13))
lons=m.drawmeridians(np.linespace(-180,180,13))
#key
s contain the plt.Line2D instances
lat_lines=chain(*(tup[1][0]for tup in
lats.items())) lon_lines=chain(*(tup[1][0]for tip
in lons.items()))
all_lines=chain(lat_lines,lon_lines)
#cycle through these lines and set the desired
style for line in all_lines:
line.set(linestyle='-',alpha=0.3,color=w)
fig=plt.figure(figsize=(8,6),edgecolor='w')
m=Basemap(projection='cyl',resolution=None,llcrnrlat=90,urcrnrlat=90,ll
crnr lon=-180,urcrnrlon=180,) draw_map(m)
output:
AttributeError Traceback (most recent call
last) Cell In[9], line 3
1 fig=plt.figure(figsize=(8,6),edgecolor='w')
2 m=Basemap(projection='cyl',resolution=None,llcrnrla
t=- 90,urcrnrlat=90,llcrnrlon=-180,urcrnrlon=180,)
----> 3 draw_map(m)
'linespace'
Program:
fig=plt.figure(figsize=(8,6),edgecolor='w')
m=Basemap(projection='moll',resolution=None,lat_0=0,lon_0=0,)
m.plot("area",y=True) draw_map(m)
output:
'linespace'
Program:
fig=plt.figure(figsize=(8,8))
m=Basemap(projection='lcc',resolution=None,lon_0=0,lat_0=50,lat_1=45,lat_2=
55,width=1.6E7,height=1.2E7) draw_map(m)
output:
AttributeError Traceback (most recent call last)
Cell In[12], line 3
1 fig=plt.figure(figsize=(8,8))
2
m=Basemap(projection='lcc',resolution=None,lon_0=0,lat_0=50,lat_1=45,lat_2=
55,width=1.6E7,height=1.2E7)
----> 3 draw_map(m)
m=Basemap(projection='gnom',lat_0=57.3,lon_0=6.2,width=90000,height=120000,
resolution=res,ax=ax[i])
m.fillcontinents(color="#FFDDCC",lake_color="#DDEEFF")
m.drawmapboundary(fill_color="#DDEEFF")
m.drawcoastlines()
m.plot("area",y=True)
ax[i].set_title("resolution='{0}'".format(res))
output:
code:
import pandas as pd
cities = pd.read_csv('data/california_cities.csv')
# Extract the data we're interested in
lat = cities['latd'].values
lon = cities['longd'].values
population = cities['population_total'].values
area = cities['area_total_km2'].values
# 1. Draw the map background
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution='h', lat_0=37.5, lon_0=-19, width=1E6, height=1.2E6)
m.shadedrelief()
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')
# 2. scatter city data, with color reflecting population
# and size reflecting area
m.scatter(lon, lat, latlon=True, c=np.log10(population), s=area, cmap='Reds', alpha=0.5)
# 3. create colorbar and legend
plt.colorbar(label=r'$\log_{10}({\rm population})$')
plt.clim(3, 7)
# make legend with dummy points
for a in [100, 300, 500]:
plt.scatter([], [], c='k', alpha=0.5, s=a, label=str(a) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False,labelspacing=1, loc='lower left')
output:
Result:
AIM
To write a python program frequency of last character appearing in all names associated with
males and females respectively and compares them.
ALORITHM
STEP 1: Start
STEP 2: import all necessary libraries
STEP 3: Display the frequency of each items in the list
STEP 4: plot
STEP 5: Stop
PROGRAM:
import nltk
nltk.download('names')
from nltk.corpus import names
nt=[(fid.split('.')[0],name[-1])
for fid in names.fileids()
for name in names.words(fid)]
cfd2=nltk.ConditionalFreqDist(nt)
cfd2['female']['a']
cfd2['male']['a']
cfd2['female']>cfd2['male']
cfd2.tabulate(saples=['a','e'])
cfd2.plot()
OUTPUT:
a b c d e f g h i j k l m n o p r s t u v w x y z
female 1 1773 9 0 39 1432 2 10 105 317 1 3 179 13 386 33 2 47 93 68 6 2 5 10
461 4
male 0 29 21 25 228 468 25 32 93 50 3 69 187 70 478 165 18 190 230 164 12 16 17
10 332 11
RESULT
Thus the python program frequency of last character appearing in all names
associated with males and females respectively and compares them has been
implemented and executed successfully
Ex 9:
AIM
To write a python program determine the frequency of words, of a particular genre, in brown
corpus.
ALORITHM
STEP 1: Start
STEP 2: import all necessary libraries
STEP 3: Display the frequency of each items in
the list
PROGRAM:
import nltk
nltk.download('brown')
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist([(genre, word) for genre in brown.categories() for word in
brown.words(categories=genre)])
cfd
cfd.conditions()
cfd.tabulate(conditions=['government', 'humor', 'reviews'], samples=['leadership',
'workship', 'hardship'])
cfd.plot(conditions=['government', 'humor', 'reviews'], samples=['leadership', 'workship',
'hardship'])
cfd.tabulate(conditions=['government', 'humor', 'reviews'], samples=['leadership',
'workship', 'hardship'], cumulative=True)
news_fd = cfd['news']
news_fd.most_common(3)
news_fd['the'] # corrected variable name from new_fd to news_fd
OUTPUT:
RESULT
Thus the python program frequency of words, of a particular genre, in brown
corpus has been implemented and executed successfully.