0% found this document useful (0 votes)
38 views72 pages

Fds Lab Manual

The document provides a comprehensive guide on downloading, installing, and exploring Python packages such as Numpy, Scipy, Jupyter, Statsmodel, and Pandas for data science. It includes detailed installation steps for Anaconda on Windows, instructions for setting up a data science environment, and descriptions of various Python libraries for data processing, modeling, and visualization. Additionally, it presents example programs demonstrating the use of Numpy arrays and Pandas data frames.

Uploaded by

akshaya31032007
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
38 views72 pages

Fds Lab Manual

The document provides a comprehensive guide on downloading, installing, and exploring Python packages such as Numpy, Scipy, Jupyter, Statsmodel, and Pandas for data science. It includes detailed installation steps for Anaconda on Windows, instructions for setting up a data science environment, and descriptions of various Python libraries for data processing, modeling, and visualization. Additionally, it presents example programs demonstrating the use of Numpy arrays and Pandas data frames.

Uploaded by

akshaya31032007
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 72

EX 1:

Download, install and explore the features of Numpy, scipy, Jupyter,


Statsmodel and Pandas Package

Aim
To Download, install and explore the features of Numpy, scipy, Jupyter, Statsmodel and
Pandas Package.

Procedure

Setting up your machine for data science in Python

Download and Install Anaconda

Installing Anaconda on Windows

For problem solvers, installing and using the Anaconda distribution of Python. This
section details the installation of the Anaconda distribution of Python on Windows 10. I think
the Anaconda distribution of Python is the best option for problem solvers who want to use
Python. Anaconda is free (although the download is large which can take time) and can be
installed on school or work computers where you don't have administrator access or the
ability to install new programs. Anaconda comes bundled with about 600 packages pre-
installed including NumPy, Matplotlib and SymPy.

Follow the steps below to install the Anaconda distribution of Python on Windows.

Steps:

1. Visit Anaconda.com/downloads
2. Select Windows
3. Download the .exe installer
4. Open and run the .exe installer
5. Open the Anaconda Prompt and run some Python code

1. Visit the Anaconda downloads page

Go to the following link: Anaconda.com/downloads

The Anaconda Downloads Page will look something like this:

1
2. Select Windows

Select Windows where the three operating systems are listed.

3. Download

Download the most recent Python 3 release. At the time of writing, the most recent
release was the Python 3.6 Version. Python 2.7 is legacy Python. For problem solvers, select
the Python 3.6 version. If you are unsure if your computer is running a 64-bit or 32-bit
version of Windows, select 64-bit as 64-bit Windows is most common.

You may be prompted to enter your email. You can still download Anaconda if you click [No
Thanks] and don't enter your Work Email address.

2
The download is quite large (over 500 MB) so it may take a while to for Anaconda to
download.

4. Open and run the installer

Once the download completes, open and run the .exe installer

At the beginning of the install, you need to click Next to confirm the installation.

3
Then agree to the license.

At the Advanced Installation Options screen, I recommend that you do not check "Add
Anaconda to my PATH environment variable"

5. Open the Anaconda Prompt from the Windows start menu

After the installation of Anaconda is complete, you can go to the Windows start menu and
select the Anaconda Prompt.

4
This opens the Anaconda Prompt. Anaconda is the Python distribution and the Anaconda
Prompt is a command line shell (a program where you type in commands instead of using a
mouse). The black screen and text that makes up the Anaconda Prompt doesn't look like
much, but it is really helpful for problem solvers using Python.At the Anaconda prompt,
type python and hit [Enter]. The python command starts the Python interpreter, also called
the Python REPL (for Read Evaluate Print Loop).

> python

Note the Python version. You should see something like Python 3.6.1. With the interpreter
running, you will see a set of greater-than symbols >>> before the cursor.

Now you can type Python commands. Try typing import this. You should see the Zen of
Python by Tim Peters

5
To close the Python interpreter, type exit() at the prompt >>>. Note the double parenthesis at
the end of the exit() command. The () is needed to stop the Python interpreter and get back
out to the Anaconda Prompt.

To close the Anaconda Prompt, you can either close the window with the mouse, or
type exit, no parenthesis necessary.

When you want to use the Python interpreter again, just click the Windows Start button and
select the Anaconda Prompt and type python.

2. Download and install common packages for data science in Python

• Click the link below to download an environment file. This file contains a list of
common packages and libraries for doing data science in Python. Remember where
you save the file environment.yml. You'll need that path shortly. You don't need to
open that file right now.
o Windows
o OSX
• Once the download finishes, open the command line by doing the following:
o Windows - Hit "Start" and then type "Command Prompt" and use that
terminal.
o OSX - Type Cmd+Space and then enter Terminal in the search box to open
the terminal.

6
• Run the following commands, which will install the package and put you in the
tutorial environment.
o conda env create -f <PATH_TO_ENVIRONMENT.YML> - You'll need to
replace <PATH_TO_ENVIRONMENT.YML> with the actual path where the
file was downloaded. For OSX, that's often
(/Users/<USERNAME>/Downloads/environment.yml). For Windows, it is
usually C:/Users/<USERNAME>/Downloads/environment.yml. You'll have
to replace <USERNAME> with your username on your machine.

• That will download all a set of packages that are commonly used for data science in
Python. When it finishes, you can activate the environment with the following
command:
o Windows - activate tutorial
o OSX - source activate tutorial

3. Run Jupyter notebook!

In this step, we'll make sure everything is working by running the Jupyter
Notebook. Jupyter Notebook is a tool for doing interactive data science work in your
browser. * In your command prompt with the tutorial environment activated (Note: you'll
be able to tell because your command prompt will say (tutorial) at the start of it.) * Type the
following command jupyter notebook . * A browser window will open, showing the Jupyer
environment. By default, you will be in a file browser view. * In the file browser, find where
you have a Jupyter notebook. If you don't have materials for a course or tutorial that you have
downloaded, you can download this fun Jupyter notebook and then open it in the file
browser. * Click on one of the notebook (*.ipynb) files to get started!

4. To stop Jupyter notebook:

• Hit Ctrl+c to stop the Jupyter notebook server running on your machine. (Make sure
to use Ctrl+s in the notebook to save it first!)

7
Feature of python package:

Python Libraries for Data Processing and Modeling


1. Pandas

Pandas is a free Python software library for data analysis and data handling. It was
created as a community library project and initially released around 2008. Pandas provides
various high-performance and easy-to-use data structures and operations for manipulating
data in the form of numerical tables and time series. Pandas also has multiple tools for
reading and writing data between in-memory data structures and different file formats. In
short, it is perfect for quick and easy data manipulation, data aggregation, reading, and
writing the data as well as data visualization. Pandas can also take in data from different
types of files such as CSV, excel etc.or a SQL database and create a Python object known as
a data frame. A data frame contains rows and columns and it can be used for data
manipulation with operations such as join, merge, groupby, concatenate etc.

2. NumPy

NumPy is a free Python software library for numerical computing on data that can be
in the form of large arrays and multi-dimensional matrices. These multidimensional matrices
are the main objects in NumPy where their dimensions are called axes and the number of
axes is called a rank. NumPy also provides various tools to work with these arrays and high-
level mathematical functions to manipulate this data with linear algebra, Fourier transforms,
random number crunchings, etc. Some of the basic array operations that can be performed
using NumPy include adding, slicing, multiplying, flattening, reshaping, and indexing the
arrays. Other advanced functions include stacking the arrays, splitting them into sections,
broadcasting arrays, etc.

3. SciPy

SciPy is a free software library for scientific computing and technical computing on
the data. It was created as a community library project and initially released around 2001.
SciPy library is built on the NumPy array object and it is part of the NumPy stack which also
includes other scientific computing libraries and tools such as Matplotlib, SymPy, pandas

8
etc. This NumPy stack has users which also use comparable applications such as GNU
Octave, MATLAB, GNU Octave, Scilab, etc. SciPy allows for various scientific computing
tasks that handle data optimization, data integration, data interpolation, and data modification
using linear algebra, Fourier transforms, random number generation, special functions, etc.
Just like NumPy, the multidimensional matrices are the main objects in SciPy, which are
provided by the NumPy module itself.

4. Scikit-learn

Scikit-learn is a free software library for Machine Learning coding primarily in the
Python programming language. It was initially developed as a Google Summer of Code
project by David Cournapeau and originally released in June 2007. Scikit-learn is built on top
of other Python libraries like NumPy, SciPy, Matplotlib, Pandas, etc. and so it provides full
interoperability with these libraries. While Scikit-learn is written mainly in Python, it has
also used Cython to write some core algorithms in order to improve performance. You can
implement various Supervised and Unsupervised Machine learning models on Scikit-learn
like Classification, Regression, Support Vector Machines, Random Forests, Nearest
Neighbors, Naive Bayes, Decision Trees, Clustering, etc. with Scikit-learn.

5. TensorFlow

TensorFlow is a free end-to-end open-source platform that has a wide variety of tools,
libraries, and resources for Artificial Intelligence. It was developed by the Google Brain team
and initially released on November 9, 2015. You can easily build and train Machine
Learning models with high-level API’s such as Keras using TensorFlow. It also provides
multiple levels of abstraction so you can choose the option you need for your model.
TensorFlow also allows you to deploy Machine Learning models anywhere such as the cloud,
browser, or your own device. You should use TensorFlow Extended (TFX) if you want the
full experience, TensorFlow Lite if you want usage on mobile devices, and TensorFlow.js if
you want to train and deploy models in JavaScript environments. TensorFlow is available
for Python and C APIs and also for C++, Java, JavaScript, Go, Swift, etc. but without an API

9
backward compatibility guarantee. Third-party packages are also available for MATLAB, C#,
Julia, Scala, R, Rust, etc.

6. Keras

Keras is a free and open-source neural-network library written in Python. It was


primarily created by François Chollet, a Google engineer, and initially released on 27 March
2015. Keras was created to be user friendly, extensible, and modular while being supportive
of experimentation in deep neural networks. Hence, it can be run on top of other libraries and
languages like TensorFlow, Theano, Microsoft Cognitive Toolkit, R, etc. Keras has multiple
tools that make it easier to work with different types of image and textual data for coding in
deep neural networks. It also has various implementations of the building blocks for neural
networks such as layers, optimizers, activation functions, objectives, etc. You can perform
various actions using Keras such as creating custom function layers, writing functions with
repeating code blocks that are multiple layers deep, etc.

Python Libraries for Data Visualization


1. Matplotlib

Matplotlib is a data visualization library and 2-D plotting library of Python It was
initially released in 2003 and it is the most popular and widely-used plotting library in the
Python community. It comes with an interactive environment across multiple platforms.
Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter
notebook, web application servers etc. It can be used to embed plots into applications using
various GUI toolkits like Tkinter, GTK+, wxPython, Qt, etc. So you can use Matplotlib
to create plots, bar charts, pie charts, histograms, scatterplots, error charts, power spectra,
stemplots, and whatever other visualization charts you want! The Pyplot module also
provides a MATLAB-like interface that is just as versatile and useful as MATLAB while
being totally free and open source.

10
2. Seaborn

Seaborn is a Python data visualization library that is based on Matplotlib and closely
integrated with the numpy and pandas data structures. Seaborn has various dataset-oriented
plotting functions that operate on data frames and arrays that have whole datasets within
them. Then it internally performs the necessary statistical aggregation and mapping functions
to create informative plots that the user desires. It is a high-level interface for creating
beautiful and informative statistical graphics that are integral to exploring and understanding
data. The Seaborn data graphics can include bar charts, pie charts, histograms, scatterplots,
error charts, etc. Seaborn also has various tools for choosing color palettes that can reveal
patterns in the data.

3. Plotly

Plotly is a free open-source graphing library that can be used to form data
visualizations. Plotly (plotly.py) is built on top of the Plotly JavaScript library (plotly.js) and
can be used to create web-based data visualizations that can be displayed in Jupyter
notebooks or web applications using Dash or saved as individual HTML files. Plotly provides
more than 40 unique chart types like scatter plots, histograms, line charts, bar charts, pie
charts, error bars, box plots, multiple axes, sparklines, dendrograms, 3-D charts, etc. Plotly
also provides contour plots, which are not that common in other data visualization libraries.
In addition to all this, Plotly can be used offline with no internet connection.

4. GGplot

Ggplot is a Python data visualization library that is based on the implementation of


ggplot2 which is created for the programming language R. Ggplot can create data
visualizations such as bar charts, pie charts, histograms, scatterplots, error charts, etc. using
high-level API. It also allows you to add different types of data visualization components or
layers in a single visualization. Once ggplot has been told which variables to map to which
aesthetics in the plot, it does the rest of the work so that the user can focus on interpreting the
visualizations and take less time in creating them. But this also means that it is not possible to

11
create highly customised graphics in ggplot. Ggplot is also deeply connected with pandas so
it is best to keep the data in DataFrames.

Results

Thus the Anaconda Navigator has successfully downloaded, installed and explored the
features of Numpy, scipy, Jupyter, Statsmodel and Pandas Package.

12
EX NO: 2

Working with Numpy array

AIM
Write a Python program to demonstrate basic array characteristics.
ALGORITHM

Step 1: Start
Step 2: Import numpy module
Step 3: Print the basic characteristics of array
Step 4: Stop

Program:

import numpy as np
arr=np.array([[1,2,3],[4,2,5]]) print
("Array is of type:",type (arr)) print ("no.of
dimensions:",arr.ndim) print ("Shape of
array:",arr.shape) print ("Size of
array:",arr.size) print ("Array stores
elements of type:",arr.dtype)
output:
Array is of type: <class
'numpy.ndarray'> no.of
dimensions: 2 Shape of
array: (2, 3)
Size of array: 6
Array stores elements of type: int32

RESULT
Thus the python program working with NumPy array has been implemented and executed
successfully.
EX NO 3

Working with Pandas Data frame

AIM
Write a program to create a dataframe using a list of elements.

ALGORITHM
Step1: Start
Step2: import numpy and pandas module
Step3: Create a dataframe using list of elements
Step4: Print the output
Step5: Stop

Program:

import pandas as pd
lst=['A','B','C','D','E','F','G']
df=pd.DataFram

e(lst) print(df)
Output:
0
0 A
1 B
2 C
3 D
4 E
5 F 6 G

RESULT
Thus the python program for dataframe using list of elements
has been implemented and executed successfully
Ex: 4
Reading data from text file, Excel and the web and exploring various
Command for doing descriptive analysis on the iris dataset

Aim
To Read data from text file, Excel and the web and exploring various Command for
doing descriptive analysis on the iris dataset.

Procedure:
Exploratory Data Analysis (EDA) is a technique to analyze data using some visual
Techniques. With this technique, we can get detailed information about the statistical summary of
the data. We will also be able to deal with the duplicates values, outliers, and also see some trends or
patterns present in the dataset.
Now let’s see a brief about the Iris dataset.
Iris Dataset

Iris Dataset is considered as the Hello World for data science. It contains five columns
namely – Petal Length, Petal Width, Sepal Length, Sepal Width, and Species Type. Iris is a
flowering plant, the researchers have measured various features of the different iris flowers and
recorded them digitally.

Note: This dataset can be downloaded from https://datahub.io/machine-learning/iris

You can download the Iris.csv file from the above link. Now we will use the Pandas library
to load this CSV file, and we will convert it into the dataframe. read_csv() method is used to read
CSV files.
Program:
import pandas as pd
df=pd.read_csv('iris_csv.csv') df.head()

Output:
sepalwidth petallength petalwidth class
sepallength

0 5.1 3.5 1.4 0.2 Iris-setosa

1 4.9 3.0 1.4 0.2 Iris-setosa

2 4.7 3.2 1.3 0.2 Iris-setosa

3 4.6 3.1 1.5 0.2 Iris-setosa

4 5.0 3.6 1.4 0.2 Iris-setosa

Program:
df.shape output:
(150, 5)

Program:
df.info(

output

:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to
149 Data columns (total 5
columns):
# Column Non-Null Count Dtype

0 sepallength 150 non-null float64


1 sepalwidth 150 non-null float64
2 petallength 150 non-null float64
3 petalwidth 150 non-null float64 4 class
150 non-
null object dtypes: float64(4), object(1) memory usage:
6.0+ KB
program:

df.describe()

outpu
t:
sepallength sepalwidth petallength petalwidth

Count 150.000 150.0000 150.0000


150.0000 000 00 00
00

Mean 5.843333 3.05400 3.758667 1.198667


0

Std 0.828066 0.43359 1.764420 0.763161


4

Min 4.300000 2.00000 1.000000 0.100000


0

25% 5.100000 2.80000 1.600000 0.300000


0

50% 5.800000 3.00000 4.350000 1.300000


0

75% 6.400000 3.30000 5.100000 1.800000


0
Max 7.900000 4.40000 6.900000 2.500000
0

Program:

df.isnull().sum(
)
output:
sepallength 0
sepalwidth 0
petallength 0
petalwidth 0
class 0
dtype: int64
program:

data=df.drop_duplicates(subset="class",) data

outpu
t:
sepallength sepalwidth petallength petalwidth class

0 5.1 3.5 1. 0.2 Iris-setosa


4

50 7.0 3.2 4. 1.4 Iris-


7 versicolor

100 6.3 3.3 6. 2.5 Iris-


0 virginica

Program:
df.value_counts("class")

output:
class
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
Name: count, dtype: int64

Program:

sns.countplot(x='class',data=df,) plt.show()

output:
Program:

import seaborn as sns

import matplotlib.pyplot as plt


sns.scatterplot(x='sepallength',y='sepalwidth',hue='cl
ass',data=df,) plt.legend(bbox_to_anchor=(1,1),loc=2)
plt.show()

output:
Program:

import seaborn as sns import matplotlib.pyplot as plt


sns.scatterplot(x='petallength',y='petalwidth',hue='cl
ass',data=df,) plt.legend(bbox_to_anchor=(1,1),loc=2)
plt.show()

output:

Program:

import seaborn as sns import


matplotlib.pyplot as plt
sns.pairplot(df.drop([],axis=1),hue='class',height=2)
C:\Users\ELCOT\anaconda3\Lib\site-
packages\seaborn\axisgrid.py:118: UserWar ning: The figure
layout has changed to tight self._figure.tight_layout(*args,
**kwargs)

output:
<seaborn.axisgrid.PairGrid at 0x239cd47ad90>
Program:

# importing packages import seaborn as


sns import matplotlib.pyplot as plt fig, axes
= plt.subplots(2, 2, figsize=(10,10))
axes[0,0].set_title("sepallength")
axes[0,0].hist(df['sepallength'], bins=7)
axes[0,1].set_title("sepalwidth")
axes[0,1].hist(df['sepalwidth'], bins=5);
axes[1,0].set_title("petallength")
axes[1,0].hist(df['petallength'], bins=6);
axes[1,1].set_title("petalwidth")
axes[1,1].hist(df['petalwidth'], bins=6);

output:
Program:

# importing packages import seaborn as


sns import matplotlib.pyplot as plt plot
= sns.FacetGrid(df, hue="class")
plot.map(sns.distplot,
"sepallength").add_legend() plot =
sns.FacetGrid(df, hue="class")
plot.map(sns.distplot,
"sepalwidth").add_legend() plot =
sns.FacetGrid(df, hue="class")
plot.map(sns.distplot,
"petallength").add_legend() plot =
sns.FacetGrid(df, hue="class")
plot.map(sns.distplot,
"petalwidth").add_legend() plt.show()

output:
Program:

data.corr(method.pearson)

output:
program:

import seaborn as sns import matplotlib.pyplot as plt


sns.heatmap(df.corr(method='pearson').drop([''],
axis=1).drop([''], axis=0), annot = True); plt.show(

output:

program:

# importing packages
import seaborn as sns
import
matplotlib.pyplot as
plt def graph(y):
sns.boxplot(x="class", y="class",
data=df) plt.figure(figsize=(10,10))
# Adding the subplot at the
specified # grid position
plt.subplot(221)
graph('sepallen
gth')
plt.subplot(222
)
graph('sepalwi
dth')
plt.subplot(223
)
graph('petallen
gth')
plt.subplot(224
)
graph('petalwidth') plt.show()
output:

Results:
Thus the text file, Excel and the web and exploring various Command for doing
descriptive analysis on the iris dataset has successfully completed.
EX NO:5
Use the diabetes data set from UCI and Pima Indians Diabetes data set for
performing the following:
a. Univariate analysis: frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis
b. Bivariate analysis: Linear and logistic regression modelling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets
a.Univariate analysis: frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis
Aim
To perform Univariate analysis: frequency, Mean, Median, Mode, Variance,
Standard Deviation, Skewness and Kurtosis

Program :

import numpy as np

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as

pltimport statsmodels.api as

sm

df =

pd.read_csv('D:\Studentdetails.csv')

def UVA_numeric(data):

var_group =

data.columnssize =

len(var_group)

plt.figure(figsize = (7*size,3), dpi =

400)for j,i in enumerate(var_group):

mini = data[i].min()

plt.subplot(1,size,j+1)

sns.distplot(data[i],hist=True,

kde=True)
sns.lineplot(points, [0,0], color = 'black', label = "std_dev")

sns.scatterplot([mini,maxi], [0,0], color = 'orange', label =

"min/max")sns.scatterplot([mean], [0], color = 'red', label =

"mean") sns.scatterplot([median], [0], color = 'blue', label =

"median") plt.xlabel('{}'.format(i), fontsize = 20)

plt.ylabel('density')

plt.title('std_dev = {}; kurtosis = {};\nskew = {}; range = {}\nmean = {}; median =


{}'.format((round(points[0],2),round(points[1],2)),round(kurt,2),round(skew,2),(round(mini,
2
),round(maxi,2),round(ran,2)),round(mean,2),round(media

n,2)))UVA_numeric(df)
OUTPUT:

b. Bivariate analysis: Linear and logistic regression modeling

Aim

To perform Bivariate analysis: Linear and logistic regression modeling

Linear Regression

Code:

# Import the
librariesimport
pandas as pd
import numpy as
np
import matplotlib.pyplot as plt
from sklearn import datasets
diabetes =
datasets.load_diabetes()diabetes
OUTPUT:
{'data': array([[ 0.03807591, 0.05068012, 0.06169621, ..., -0.00259226,
0.01990749, -0.01764613],
[-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
-0.06833155, -0.09220405],
[ 0.08529891, 0.05068012, 0.04445121, ..., -0.00259226,
0.00286131, -0.02593034],
...,
[ 0.04170844, 0.05068012, -0.01590626, ..., -0.01107952,
-0.04688253, 0.01549073],
[-0.04547248, -0.04464164, 0.03906215, ..., 0.02655962,
0.04452873, -0.02593034],
[-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
-0.00422151, 0.00306441]]),
'target': array([151., 75., 141., 206., 135., 97., 138., 63., 110., 310., 101.,
69., 179., 185., 118., 171., 166., 144., 97., 168., 68., 49.,
68., 245., 184., 202., 137., 85., 131., 283., 129., 59., 341.,
87., 65., 102., 265., 276., 252., 90., 100., 55., 61., 92.,
259., 53., 190., 142., 75., 142., 155., 225., 59., 104., 182.,
128., 52., 37., 170., 170., 61., 144., 52., 128., 71., 163.,
150., 97., 160., 178., 48., 270., 202., 111., 85., 42., 170.,
200., 252., 113., 143., 51., 52., 210., 65., 141., 55., 134.,
42., 111., 98., 164., 48., 96., 90., 162., 150., 279., 92.,
83., 128., 102., 302., 198., 95., 53., 134., 144., 232., 81.,
104., 59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
173., 180., 84., 121., 161., 99., 109., 115., 268., 274., 158.,
107., 83., 103., 272., 85., 280., 336., 281., 118., 317., 235.,
60., 174., 259., 178., 128., 96., 126., 288., 88., 292., 71.,
197., 186., 25., 84., 96., 195., 53., 217., 172., 131., 214.,
59., 70., 220., 268., 152., 47., 74., 295., 101., 151., 127.,
237., 225., 81., 151., 107., 64., 138., 185., 265., 101., 137.,
143., 141., 79., 292., 178., 91., 116., 86., 122., 72., 129.,
142., 90., 158., 39., 196., 222., 277., 99., 196., 202., 155.,
77., 191., 70., 73., 49., 65., 263., 248., 296., 214., 185.,
78., 93., 252., 150., 77., 208., 77., 108., 160., 53., 220.,
154., 259., 90., 246., 124., 67., 72., 257., 262., 275., 177.,
71., 47., 187., 125., 78., 51., 258., 215., 303., 243., 91.,
150., 310., 153., 346., 63., 89., 50., 39., 103., 308., 116.,
145., 74., 45., 115., 264., 87., 202., 127., 182., 241., 66.,
94., 283., 64., 102., 200., 265., 94., 230., 181., 156., 233.,
60., 219., 80., 68., 332., 248., 84., 200., 55., 85., 89.,
31., 129., 83., 275., 65., 198., 236., 253., 124., 44., 172.,
114., 142., 109., 180., 144., 163., 147., 97., 220., 190., 109.,
191., 122., 230., 242., 248., 249., 192., 131., 237., 78., 135.,
244., 199., 270., 164., 72., 96., 306., 91., 214., 95., 216.,
263., 178., 113., 200., 139., 139., 88., 148., 88., 243., 71.,
77., 109., 272., 60., 54., 221., 90., 311., 281., 182., 321.,
58., 262., 206., 233., 242., 123., 167., 63., 197., 71., 168.,
140., 217., 121., 235., 245., 40., 52., 104., 132., 88., 69.,
219., 72., 201., 110., 51., 277., 63., 118., 69., 273., 258.,
43., 198., 242., 232., 175., 93., 168., 275., 293., 281., 72.,
140., 189., 181., 209., 136., 261., 113., 131., 174., 257., 55.,
84., 42., 146., 212., 233., 91., 111., 152., 120., 67., 310.,
94., 183., 66., 173., 72., 49., 64., 48., 178., 104., 132.,
220., 57.]),
'frame': None,
'DESCR': '.. _diabetes_dataset:\n\nDiabetes dataset\n \n\nTen baseline variables, age, sex,
body
mass index, average blood\npressure, and six blood serum measurements were obtained
for each of n =\n442 diabetes patients, as well as the response of interest, a\nquantitative
measure of disease progression one year after baseline.\n\n**Data Set
Characteristics:**\n\n :Number of Instances: 442\n\n :Number of Attributes: First 10
columns are numeric predictive values\n\n :Target: Column 11 is a quantitative measure
of disease progression one year after baseline\n\n :Attribute Information:\n - age age
in years\n - sex\n - bmi body mass index\n - bp
average blood pressure\n - s1 tc, total serum cholesterol\n - s2
ldl, low-density lipoproteins\n - s3 hdl, high-density
lipoproteins\n - s4 tch, total cholesterol / HDL\n
- s5 ltg, possibly log of serum triglycerides level\n - s6 glu, blood sugar
level\n\nNote: Each of these 10 feature variables have been mean centered and scaled by
the standard deviation times the square root of
`n_samples` (i.e. the sum of squares of each column totals 1).\n\nSource
URL:\nhttps://www4.stat.ncsu.edu/~boos/var.select/diabetes.html\n\nFor more
information see:\nBradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani
(2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-
499.\n(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)\n',
'feature_names': ['age',
'sex',
'bmi',
'bp',
's1',
's2',
's3',
's4',
's5',
's6'],
'data_filename':
'diabetes_data_raw.csv.gz',
'target_filename':
'diabetes_target.csv.gz',
'data_module':
'sklearn.datasets.data'}
Code:

print(diabetes.DESCR)

OUTPUT:
.. _diabetes_dataset:

Diabetes dataset

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

:Number of Instances: 442

:Number of Attributes: First 10 columns are numeric predictive values

:Target: Column 11 is a quantitative measure of disease progression one


year after baseline

:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, total serum cholesterol
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, total cholesterol / HDL
- s5 ltg, possibly log of serum triglycerides level
- s6 glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled
by the standard deviation times the square root of `n_samples` (i.e. the
sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:


Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004)
"Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

CODE:
# columns

diabetes.feature_names
OUTPUT:
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
CODE:
# Now we will split the data into the independent and independent variable
X = diabetes.data
Y = diabetes.target
X.shape, Y.shape

OUTPUT:
((442, 10), (442,))

CODE:
Y
OUTPUT:
array([151., 75., 141., 206., 135., 97., 138., 63., 110., 310., 101.,
69., 179., 185., 118., 171., 166., 144., 97., 168., 68., 49.,
68., 245., 184., 202., 137., 85., 131., 283., 129., 59., 341.,
87., 65., 102., 265., 276., 252., 90., 100., 55., 61., 92.,
259., 53., 190., 142., 75., 142., 155., 225., 59., 104., 182.,
128., 52., 37., 170., 170., 61., 144., 52., 128., 71., 163.,
150., 97., 160., 178., 48., 270., 202., 111., 85., 42., 170.,
200., 252., 113., 143., 51., 52., 210., 65., 141., 55., 134.,
42., 111., 98., 164., 48., 96., 90., 162., 150., 279., 92.,
83., 128., 102., 302., 198., 95., 53., 134., 144., 232., 81.,
104., 59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
173., 180., 84., 121., 161., 99., 109., 115., 268., 274., 158.,
107., 83., 103., 272., 85., 280., 336., 281., 118., 317., 235.,
60., 174., 259., 178., 128., 96., 126., 288., 88., 292., 71.,
197., 186., 25., 84., 96., 195., 53., 217., 172., 131., 214.,
59., 70., 220., 268., 152., 47., 74., 295., 101., 151., 127.,
237., 225., 81., 151., 107., 64., 138., 185., 265., 101., 137.,
143., 141., 79., 292., 178., 91., 116., 86., 122., 72., 129.,
142., 90., 158., 39., 196., 222., 277., 99., 196., 202., 155.,
77., 191., 70., 73., 49., 65., 263., 248., 296., 214., 185.,
78., 93., 252., 150., 77., 208., 77., 108., 160., 53., 220.,
154., 259., 90., 246., 124., 67., 72., 257., 262., 275., 177.,
71., 47., 187., 125., 78., 51., 258., 215., 303., 243., 91.,
150., 310., 153., 346., 63., 89., 50., 39., 103., 308., 116.,
145., 74., 45., 115., 264., 87., 202., 127., 182., 241., 66.,
94., 283., 64., 102., 200., 265., 94., 230., 181., 156., 233.,
60., 219., 80., 68., 332., 248., 84., 200., 55., 85., 89.,
31., 129., 83., 275., 65., 198., 236., 253., 124., 44., 172.,
114., 142., 109., 180., 144., 163., 147., 97., 220., 190., 109.,
191., 122., 230., 242., 248., 249., 192., 131., 237., 78., 135.,
244., 199., 270., 164., 72., 96., 306., 91., 214., 95., 216.,
263., 178., 113., 200., 139., 139., 88., 148., 88., 243., 71.,
77., 109., 272., 60., 54., 221., 90., 311., 281., 182., 321.,
58., 262., 206., 233., 242., 123., 167., 63., 197., 71., 168.,
140., 217., 121., 235., 245., 40., 52., 104., 132., 88., 69.,
219., 72., 201., 110., 51., 277., 63., 118., 69., 273., 258.,
43., 198., 242., 232., 175., 93., 168., 275., 293., 281., 72.,
140., 189., 181., 209., 136., 261., 113., 131., 174., 257., 55.,
84., 42., 146., 212., 233., 91., 111., 152., 120., 67., 310.,
94., 183., 66., 173., 72., 49., 64., 48., 178., 104., 132.,
220., 57.])

CODE:
# We will split the data into training and testing data
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X,Y,test_size=0.3,random_state=99)
train_x.shape, train_y.shape
OUTPUT:
((309, 10), (309,))

CODE:

# Linear Regression

from sklearn.linear_model import LinearRegression

le = LinearRegression()

le.fit(train_x,train_y)

y_pred = le.predict(test_x)

y_pred
OUTPUT:
# Linear Regression
from sklearn.linear_model import LinearRegression
le = LinearRegression()
le.fit(train_x,train_y)
y_pred = le.predict(test_x)
y_pred

array([ 77.9991034 , 170.44712136, 109.03660582, 223.84307065,


87.38430375, 211.46934588, 223.65994161, 52.81888351,
149.39008619, 294.9893952 , 127.72956377, 182.90399415,
102.6375881 , 144.69460471, 171.5213581 , 266.17922455,
201.8877034 , 166.18209687, 103.67082991, 169.01866852,
187.13785171, 130.10161695, 151.54411583, 156.45841795,
121.85578431, 304.14149903, 126.55107098, 158.76430506,
249.42202516, 154.22310276, 180.85528758, 180.06909853,
182.96467155, 200.00132316, 73.87031841, 146.19357531,
165.52522033, 160.93215348, 247.99028596, 210.37177183,
85.69905264, 211.07539533, 188.10590426, 119.60434815,
151.80766971, 188.31163328, 185.69251949, 168.92581539,
291.55993431, 248.60092291, 170.17035216, 208.5515447 ,
59.08071813, 195.30432554, 190.19923551, 149.97489689,
114.4835119 , 244.83078249, 254.54782428, 138.88949628,
301.05425333, 57.71483254, 162.93256009, 187.59937115,
274.76599026, 158.21099102, 181.84517605, 151.29485495,
199.55472957, 197.6359619 , 83.37352464, 127.99031348,
179.13616685, 161.69195267, 106.02168539, 213.44405941,
99.63818719, 116.80068856, 176.2415266 , 190.88134043,
167.39910679, 185.39753187, 120.19155677, 94.95663036,
158.22520798, 189.80858086, 166.40135891, 235.30476439,
230.3206865 , 236.6916711 , 176.02450278, 135.15234363,
152.82007506, 124.46215244, 213.574207 , 42.64060377,
200.72397971, 64.05606538, 160.31380961, 61.10040585,
161.21565469, 169.64372088, 198.39611428, 63.92041427,
129.65961388, 171.88537096, 175.28400448, 126.28643791,
163.35396162, 92.43346316, 156.32693509, 174.65381761,
229.30521464, 128.22372219, 124.33529553, 64.52215982,
217.54827171, 230.22439161, 190.07750782, 161.96910026,
123.80090397, 161.67791209, 132.53128138, 116.96403575,
266.68910399, 275.9219297 , 86.84652207, 60.94725563,
202.34939075, 144.66352278, 82.4008924 , 185.44354389,
101.56321284])
CODE:
result = pd.DataFrame({'Actual': test_y, 'Predict' :
y_pred})result
OUTPUT:

CODE:
# we will check the accuracy
print('coefficient', le.coef_)
print('intercept', le.intercept_)

OUTPUT:
coefficient [ 40.66179183 -313.2934701 517.18345014 386.06561935 -
604.64289258
275.31448303 3.92015562 172.39101348 661.95332633 62.26068823]
intercept 155.59120125597175

CODE:
from sklearn.metrics import mean_squared_error, r2_score
# mean_squared_error
mean_squared_error(test_y,y_pred)
# r2 score
r2_score(test_y,y_pred)
OUTPUT:
0.4545709909725646

Logistic Regression Modelling

CODE:

import numpy as np # linear algebra

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns

import matplotlib.pyplot as plt


df = pd.read_csv('diabetes.csv')

df.head()

OUTPUT:

CODE:
df.info()

OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype

0 Pregnancies 768 non-null int64


1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

CODE:
df.describe()

OUTPUT:

CODE:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

OUTPUT:
<Axes: >

CODE:
sns.countplot(x='Outcome',data=df)

OUTPUT:
<Axes: xlabel='Outcome', ylabel='count'>

CODE:

sns.distplot(df['Age'].dropna(),kde=True)

OUTPUT:
C:\Users\mca1\AppData\Local\Temp\ipykernel_9620\436136364.py:1:
UserWarning:

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function


with
similar flexibility) or `histplot` (an axes-level function for histograms).
For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

sns.distplot(df['Age'].dropna(),kde=True)
<Axes: xlabel='Age', ylabel='Density'>

CODE:
df.corr()

OUTPUT:

CODE:
sns.heatmap(df.corr())
OUTPUT:
<Axes: >

CODE:
x = df.drop('Outcome',axis=1)
y = df['Outcome']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=101)
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(x_train,y_train)

OUTPUT:
C:\Users\mca1\anaconda3\Lib\site-
packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning: lbfgs
failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-
regression
n_iter_i = _check_optimize_result(
LogisticRegression
LogisticRegression()

CODE:
predictions = logmodel.predict(x_test)
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))
OUTPUT:
precision recall f1-score support

0 0.82 0.89 0.85 150


1 0.75 0.63 0.68 81

accuracy 0.80 231


macro avg 0.78 0.76 0.77 231
weighted avg 0.79 0.80 0.79 231

CODE:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,predictions)

OUTPUT:
array([[133, 17],
[ 30, 51]], dtype=int64)

c. Multiple Regression analysis

Aim

To perform Multiple Regression analysis.

CODE:

Import numpy as np
fromsklearn.linear_model import LinearRegression
x = [[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]]
y = [4, 5, 20, 14, 32, 22, 38, 43]
x,y=np.array(x),np.array(y)
model=LinearRegression().fit(x,y)
r_sq = model.score(x,y)
print('coefficient of determination:', r_sq)
print('intercept:', model.intercept_)
print('slope:', model.coef_)
y_pred = model.predict(x)
print('predicted response:', y_pred)

OUTPUT:
coefficient of determination: 0.8615939258756776
intercept: 5.52257927519819
slope: [0.44706965 0.25502548]
predicted response: [ 5.77760476 8.012953 12.73867497 17.9744479 23.97529728 29.4660957
38.78227633 41.27265006]
d. Comparison of dataset

PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# disable warnings
import warnings
warnings.filterwarnings('ignore')
data=pd.read_csv('diabetes.csv')
data1=pd.read_csv('diabetes.csv')
data=pd.read_csv('diabetes.csv')
data1=pd.read_csv('diabetes.csv')
print(data.columns)
print(data1.columns)

OUTPUT:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')

PROGRAM:

import seaborn as sns

corrmat = data.corr()

f, ax = plt.subplots(figsize=(12, 9))

sns.heatmap(corrmat, cbar=True, annot=True, square=True, vmax=.8);

import seaborn as sns

corrmat = data1.corr()
f, ax = plt.subplots(figsize=(12, 9))

sns.heatmap(corrmat, cbar=True, annot=True, square=True, vmax=.8);

OUTPUT:

code:

sns.set()

cols=['Pregnancies','Glucose','BloodPressure','Insulin','BMI','DiabetesPedigreeFunction','Age',
'Outcome']
sns.pairplot(data[cols], size = 2.5)

plt.show();

sns.set()

cols=['Pregnancies','Glucose','BloodPressure','Insulin','BMI','DiabetesPedigreeFunction','Age',
'Outcome']

sns.pairplot(data1[cols], size = 2.5)

plt.show()

Output:

Results

Thus the program has been successfully completed.


EX.6
Apply and explore various plotting functions
a. Normal curves
b. Density and contour plot
c. Correlation and Scatter plots
d. Histogram
e. Three dimensional Plotting
Aim:
To Apply and explore various plotting functions a. Normal curves, b. Density and contour plot, c.
Correlation and Scatter plots, d. Histogram, e. Three dimensional Plotting,

a. Normal curves
Program:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
df = pd.read_csv('Marks.csv')
def UVA_numeric(data):
var_group = data.columns
size = len(var_group)
plt.figure(figsize = (7*size,3), dpi = 400)
#looping for each variable
for j,i in enumerate(var_group):
# calculating descriptives of variable
mini = data[i].min()
maxi = data[i].max()
ran = data[i].max()-data[i].min()
mean = data[i].mean()
median = data[i].median()
st_dev = data[i].std()
skew = data[i].skew()
kurt = data[i].kurtosis()
# calculating points of standard deviation
points = mean-st_dev, mean+st_dev
#Plotting the variable with every information
plt.subplot(1,size,j+1)
sns.distplot(data[i],hist=True, kde=True)
sns.lineplot(points, [0,0], color = 'black', label = "std_dev")
sns.scatterplot([mini,maxi], [0,0], color = 'orange', label = "min/max")
sns.scatterplot([mean], [0], color = 'red', label = "mean")
sns.scatterplot([median], [0], color = 'blue', label = "median")
plt.xlabel('{}'.format(i), fontsize = 20)
plt.ylabel('density')
plt.title('std_dev = {}; kurtosis = {};\nskew = {}; range = {}\nmean = {}; median =
{}'.format((round(points[0],2),round(points[1],2)),
round(kurt,2),
round(skew,2),
(round(mini,2),round(maxi,2),round(ran,2)),
round(mean,2),
round(median,2)))
UVA_numeric(df)

Output:

b.Density and contour plot

PROGRAM:

from scipy.stats import norm

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sb

data = np.arange(1,10,0.01)

pdf = norm.pdf(data , loc = 5.3 , scale = 1 )

sb.set_style('whitegrid'),

sb.lineplot(pdf)
plt.xlabel('Heights')

plt.ylabel('Probability Density'

OUTPUT:

c. Correlation and scatter plot


code:
PROGRAM:
import pandas as pd
con = pd.read_csv('ConcreteStrength.csv')
con
OUTPUT:

slag ash water superplastic coarseagg fineagg age strength


cement

0 540.0 0.0 0.0 162.0 2.5 1040.0 676.0 28 79.99

1 540.0 0.0 0.0 162.0 2.5 1055.0 676.0 28 61.89

2 332.5 142.5 0.0 228.0 0.0 932.0 594.0 270 40.27

3 332.5 142.5 0.0 228.0 0.0 932.0 594.0 365 41.05

4 198.6 132.4 0.0 192.0 0.0 978.4 825.5 360 44.30

... ... ... ... ... ... ... ... ... ...

1025 276.4 116.0 90.3 179.6 8.9 870.1 768.3 28 44.28

1026 322.2 0.0 115.6 196.0 10.4 817.9 813.4 28 31.18

1027 148.5 139.4 108.6 192.7 6.1 892.4 780.0 28 23.70


cement slag ash water superplastic coarseagg fineagg age strength

1028 159.1 186.7 0.0 175.6 11.3 989.6 788.9 28 32.77

1029 260.9 100.5 78.3 200.6 8.6 864.5 761.5 28 32.40

PROGRAM:

list(con.columns)

OUTPUT:

['cement',
'slag',
'ash',
'water',
'superplastic',
'coarseagg',
'fineagg',
'age',
'strength']
In [ ]:

PROGRAM:

con.rename(columns={'Fly ash': 'FlyAsh', 'Coarse Aggr.': "CoarseAgg" ,'Fine Aggr.': 'FineAgg', 'Age': 'Age',
'Compressive Strength (28-day)(Mpa)': 'Strength'}, inplace=True)

con.head()

OUTPUT:

cement slag ash water superplastic coarseagg fineagg age strength

0 540.0 0.0 0.0 162.0 2.5 1040.0 676.0 28 79.99

1 540.0 0.0 0.0 162.0 2.5 1055.0 676.0 28 61.89

2 332.5 142.5 0.0 228.0 0.0 932.0 594.0 270 40.27

3 332.5 142.5 0.0 228.0 0.0 932.0 594.0 365 41.05

4 198.6 132.4 0.0 192.0 0.0 978.4 825.5 360 44.30

PROGRAM:
con['age'] = con['age'].astype('category')
con.describe(include='category')

OUTPUT:

age

count 1030

unique 14

top 28

freq 42

PROGRAM:

import seaborn as sns

sns.scatterplot(x="FlyAsh", y="Strength", data=con);

OUTPUT:

PROGRAM:

ax = sns.scatterplot(x="FlyAsh", y="Strength", data=con)

ax.set_title("Concrete Strength vs. Fly ash")

ax.set_xlabel("Fly ash")

]
OUTPUT:

PROGRAM:

sns.lmplot(x="FlyAsh", y="Strength", data=con);

OUTPUT:

PROGRAM:

from scipy import stats


stats.pearsonr(con['Strength'], con['FlyAsh'])

OUTPUT:

PearsonRResult(statistic=0.9999999999999997, pvalue=1.123412337643488e-23)

PROGRAM:

Codecormat = con.corr()

round(Codecormat,2)
OUTPUT:

Fly Ash Strength

Fly Ash 1.0 1.0

Strength 1.0 1.0

d.Histogram

Code:

import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_excel('https://github.com/datagy/Intro-to-Python/raw/master/sportsdata.xls',
usecols=['Age'])
print(df.describe())
plt.hist(df['Age'])

OUTPUT:

e. Three dimensional plotting

Code:
# importing mplot3d toolkits
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
# syntax for 3-D projection
ax = plt.axes(projection ='3d')
# defining axes
z = np.linspace(0, 1, 100)
x = z * np.sin(25 * z)
y = z * np.cos(25 * z)
c=x+y
ax.scatter(x, y, z, c = c)
# syntax for plotting
ax.set_title('3d Scatter plot geeks for geeks')
plt.show()

OUTPUT:

Code:
# importing libraries
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
# defining surface and axes
x = np.outer(np.linspace(-2, 2, 10), np.ones(10))
y = x.copy().T
z = np.cos(x ** 2 + y ** 3)
fig = plt.figure()
# syntax for 3-D plotting
ax = plt.axes(projection ='3d')
# syntax for plotting
ax.plot_surface(x, y, z, cmap ='viridis', edgecolor ='green')
ax.set_title('Surface plot geeks for geeks')
plt.show()
Output:

Results
Thus the various plotting curve has been successfully completed
EX. 7
Apply and explore various plotting functions

Aim
To Apply and explore various plotting functions

Procedures
Geographic data
One common type of visualization in data science is that of geographic data. Matplotlib's main tool
for this type of visualization is the Basemap toolkit, which is one of several Matplotlib toolkits
which lives under the mpl_toolkits namespace. Admittedly, Basemap feels a bit clunky to use, and
often even simple visualizations take much longer to render than you might hope. More modern
solutions such as leaflet or the Google Maps API may be a better choice for more intensive map
visualizations. Still, Basemap is a useful tool for Python users to have in their virtual toolbelts. In
this section, we'll show several examples of the type of map visualization that is possible with this
toolkit.

Installation of Basemap is straightforward; if you're using conda you can type this and the package
will be downloaded:

$ conda install basemap

We add just a single new import to our standard boilerplate:

Program:

conda install -c conda-forge basemap-data-hires


output:

Collecting package metadata (current_repodata.json): ...working... done


Solving environment: ...working...
Warning: 2 possible package resolutions (only showing differing packages):
- anaconda/win-64::openssl-3.0.12-h2bbff1b_0
- defaults/win-64::openssl-3.0.12-h2bbff1b_0done

## Package Plan ##

environment location: C:\Users\mca1\anaconda3


added / updated
specs: - basemap-
data-hires

The following packages will be downloaded:


package |
build |

basemap-data-hires-1.3.2 | pyhd8ed1ab_3 80.7 MB


condaforge
- Total: 80.7
MB

The following NEW packages will be INSTALLED:

basemap-data-hires conda-forge/noarch::basemap-data-hires-
1.3.2pyhd8ed1ab_3

Downloading and Extracting Packages: ...working. ...... done


Preparing transaction: ...working. ...... done
Verifying transaction: ...working. ...... done
Executing transaction: ...working. ...... done

Note: you may need to restart the kernel to use updated packages

Program:

%matplotlib inline import


numpy as np import
matplotlib.pyplot as plt from
mpl_toolkits.basemap import
Basemap
plt.figure(figsize=(8,8))
m=Basemap(projection='ortho',resolution=None,lat_0=50,lon_0=-100)
m.plot("area",y=True)
m.bluemarble(scale=0.5)
Output:

Clipping input data to the valid range for imshow with RGB
data ([0..1] for floats or [0..255] for integers).
<matplotlib.image.AxesImage at 0x18859f04cd0>

Program:
fig=plt.figure(figsize=(8,8))
m=Basemap(projection='lcc',resolution=None,wid
th=8E6, height=8E6,lat_0=45,lon_0=-100,)
m.etopo(scale=0.5,alpha=0.5)
# Map (long,lat) to (x,y) for plotting
x,y=m(- 122.3,47.6)
plt.plot(x,y,'ok',markersize=5)
plt.text(x,y,'seattle',fontsize=12)
Output:

Clipping input data to the valid range for imshow with RGB
data ([0..1] for floats or [0..255] for integers).
Text(2347268.222744085, 4518079.266407731, 'seattle')
Program:
from itertools import chain
def draw_map(m,scale=0.2):
#draw a shaded-relief
image
m.shadedrelief(scale=scale)
#lats and longs are retrurned as dictonary
lats=m.drawparallels(np.linespace(-90,90,13))
lons=m.drawmeridians(np.linespace(-180,180,13))
#key
s contain the plt.Line2D instances
lat_lines=chain(*(tup[1][0]for tup in
lats.items())) lon_lines=chain(*(tup[1][0]for tip
in lons.items()))
all_lines=chain(lat_lines,lon_lines)
#cycle through these lines and set the desired
style for line in all_lines:
line.set(linestyle='-',alpha=0.3,color=w)
fig=plt.figure(figsize=(8,6),edgecolor='w')
m=Basemap(projection='cyl',resolution=None,llcrnrlat=90,urcrnrlat=90,ll
crnr lon=-180,urcrnrlon=180,) draw_map(m)
output:
AttributeError Traceback (most recent call
last) Cell In[9], line 3
1 fig=plt.figure(figsize=(8,6),edgecolor='w')
2 m=Basemap(projection='cyl',resolution=None,llcrnrla
t=- 90,urcrnrlat=90,llcrnrlon=-180,urcrnrlon=180,)
----> 3 draw_map(m)

Cell In[8], line 6, in draw_map(m, scale)


4 m.shadedrelief(scale=scale)
5 #lats and longs are retrurned as dictonary
----> 6 lats=m.drawparallels(np.linespace(-90,90,13))
7 lons=m.drawmeridians(np.linespace(-180,180,13))
8 #keys contain the plt.Line2D instances

File ~\anaconda3\Lib\site-packages\numpy\ init .py:320, in


getattr (attr)
317 from .testing import Tester
318 return Tester
--> 320 raise AttributeError("module {!r} has no attribute "
321 "{!r}".format( name , attr))

AttributeError: module 'numpy' has no attribute

'linespace'

Program:
fig=plt.figure(figsize=(8,6),edgecolor='w')
m=Basemap(projection='moll',resolution=None,lat_0=0,lon_0=0,)
m.plot("area",y=True) draw_map(m)

output:

AttributeError Traceback (most recent call


last) Cell In[10], line 4
2 m=Basemap(projection='moll',resolution=None,lat_0=0,lon_0=0,)
3 m.plot("area",y=True)
----> 4 draw_map(m)
Cell In[8], line 4, in draw_map(m, scale)
2 def draw_map(m,scale=0.2):
3 #draw a shaded-relief image
----> 4 m.shadedrelief(scale=scale)
5 #lats and longs are retrurned as dictonary
6 lats=m.drawparallels(np.linespace(-90,90,13))

File ~\anaconda3\Lib\site-packages\mpl_toolkits\basemap\ init


.py:4014, in Basemap.shadedrelief(self, ax, scale, **kwargs)
4012 return
self.warpimage(image='shadedrelief',ax=ax,scale=scale,**k
wargs) 4013 else: -> 4014 return
self.warpimage(image='shadedrelief',scale=scale,**kwar
gs)

File ~\anaconda3\Lib\site-packages\mpl_toolkits\basemap\ init


.py:4187, in Basemap.warpimage(self, image, scale, **kwargs)
4185 lonright =
lon_0+180. 4186 lonleft
= lon_0-180.
-> 4187 x1 = np.array(ny*[0.5*(self.xmax +
self.xmin)],np.float) 4188 y1 = np.linspace(self.ymin,
self.ymax, ny)
4189 lons1, lats1 = self(x1,y1,inverse=True)

File ~\anaconda3\Lib\site-packages\numpy\ init .py:305, in


getattr (attr)
300 warnings.warn(
301 f"In the future `np.{attr}` will be defined as the "
302 "corresponding NumPy scalar.", FutureWarning, stacklevel=2)
304 if attr in former_attrs :
--> 305 raise AttributeError( former_attrs [attr])
307 # Importing Tester requires importing all of UnitTest which
is not a 308 # cheap import Since it is mainly used in test
suits, we lazy import it 309 # here to save on the order of 10 ms
of import time for most users 310 # 311 # The previous way
Tester was imported also had a side effect of adding
312 # the full `numpy.testing` namespace
313 if attr == 'testing':

AttributeError: module 'numpy' has no attribute 'float'.


`np.float` was a deprecated alias for the builtin `float`. To avoid
this error in existing code, use `float` by itself. Doing this will
not modify any behavior and is safe. If you specifically wanted the
numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more
details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
Program:
fig=plt.figure(figsize=(8,8))
m=Basemap(projection='ortho',resolution=None,lat_0=50,lon_0=0)
m.plot("area",y=True) draw_map(m)
output:

AttributeError Traceback (most recent call


last) Cell In[11], line 4
2 m=Basemap(projection='ortho',resolution=None,lat_0=50,lon_0=0)
3 m.plot("area",y=True)
----> 4 draw_map(m)

Cell In[8], line 6, in draw_map(m, scale)


4 m.shadedrelief(scale=scale)
5 #lats and longs are retrurned as dictonary
----> 6 lats=m.drawparallels(np.linespace(-90,90,13))
7 lons=m.drawmeridians(np.linespace(-180,180,13))
8 #keys contain the plt.Line2D instances

File ~\anaconda3\Lib\site-packages\numpy\ init .py:320, in


getattr (attr)
317 from .testing import Tester
318 return Tester
--> 320 raise AttributeError("module {!r} has no attribute "
321 "{!r}".format( name , attr))

AttributeError: module 'numpy' has no attribute

'linespace'
Program:
fig=plt.figure(figsize=(8,8))
m=Basemap(projection='lcc',resolution=None,lon_0=0,lat_0=50,lat_1=45,lat_2=
55,width=1.6E7,height=1.2E7) draw_map(m)

output:
AttributeError Traceback (most recent call last)
Cell In[12], line 3
1 fig=plt.figure(figsize=(8,8))
2
m=Basemap(projection='lcc',resolution=None,lon_0=0,lat_0=50,lat_1=45,lat_2=
55,width=1.6E7,height=1.2E7)
----> 3 draw_map(m)

Cell In[8], line 6, in draw_map(m, scale)


4 m.shadedrelief(scale=scale)
5 #lats and longs are retrurned as dictonary
----> 6 lats=m.drawparallels(np.linespace(-90,90,13))
7 lons=m.drawmeridians(np.linespace(-180,180,13))
8 #keys contain the plt.Line2D instances

File ~\anaconda3\Lib\site-packages\numpy\ init .py:320, in


getattr (attr)
317 from .testing import Tester
318 return Tester
--> 320 raise AttributeError("module {!r} has no attribute "
321 "{!r}".format( name , attr))

AttributeError: module 'numpy' has no attribute 'linespace'


Program:
fig,ax=plt.subplots(1,2,figsize=(12,8)) for
i, res in enumerate(['l','h']):

m=Basemap(projection='gnom',lat_0=57.3,lon_0=6.2,width=90000,height=120000,
resolution=res,ax=ax[i])
m.fillcontinents(color="#FFDDCC",lake_color="#DDEEFF")
m.drawmapboundary(fill_color="#DDEEFF")
m.drawcoastlines()
m.plot("area",y=True)
ax[i].set_title("resolution='{0}'".format(res))
output:
code:

import pandas as pd
cities = pd.read_csv('data/california_cities.csv')
# Extract the data we're interested in
lat = cities['latd'].values
lon = cities['longd'].values
population = cities['population_total'].values
area = cities['area_total_km2'].values
# 1. Draw the map background
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution='h', lat_0=37.5, lon_0=-19, width=1E6, height=1.2E6)
m.shadedrelief()
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')
# 2. scatter city data, with color reflecting population
# and size reflecting area
m.scatter(lon, lat, latlon=True, c=np.log10(population), s=area, cmap='Reds', alpha=0.5)
# 3. create colorbar and legend
plt.colorbar(label=r'$\log_{10}({\rm population})$')
plt.clim(3, 7)
# make legend with dummy points
for a in [100, 300, 500]:
plt.scatter([], [], c='k', alpha=0.5, s=a, label=str(a) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False,labelspacing=1, loc='lower left')
output:

Result:

This the program has been successfully completed.


EX NO:8
Program frequency of last character appearing in all names associated with
males and females respectively and compares them.

AIM

To write a python program frequency of last character appearing in all names associated with
males and females respectively and compares them.

ALORITHM
STEP 1: Start
STEP 2: import all necessary libraries
STEP 3: Display the frequency of each items in the list

STEP 4: plot
STEP 5: Stop

PROGRAM:

import nltk
nltk.download('names')
from nltk.corpus import names
nt=[(fid.split('.')[0],name[-1])
for fid in names.fileids()
for name in names.words(fid)]
cfd2=nltk.ConditionalFreqDist(nt)
cfd2['female']['a']
cfd2['male']['a']
cfd2['female']>cfd2['male']
cfd2.tabulate(saples=['a','e'])
cfd2.plot()
OUTPUT:

a b c d e f g h i j k l m n o p r s t u v w x y z
female 1 1773 9 0 39 1432 2 10 105 317 1 3 179 13 386 33 2 47 93 68 6 2 5 10
461 4
male 0 29 21 25 228 468 25 32 93 50 3 69 187 70 478 165 18 190 230 164 12 16 17
10 332 11

<Axes: xlabel='Samples', ylabel='Counts'>

RESULT
Thus the python program frequency of last character appearing in all names
associated with males and females respectively and compares them has been
implemented and executed successfully
Ex 9:

Frequency of words, of a particular genre, in brown corpus.

AIM
To write a python program determine the frequency of words, of a particular genre, in brown
corpus.
ALORITHM
STEP 1: Start
STEP 2: import all necessary libraries
STEP 3: Display the frequency of each items in
the list

STEP 4:setting cumulative argument value to


True.
STEP 5: Stop

PROGRAM:
import nltk
nltk.download('brown')
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist([(genre, word) for genre in brown.categories() for word in
brown.words(categories=genre)])
cfd
cfd.conditions()
cfd.tabulate(conditions=['government', 'humor', 'reviews'], samples=['leadership',
'workship', 'hardship'])
cfd.plot(conditions=['government', 'humor', 'reviews'], samples=['leadership', 'workship',
'hardship'])
cfd.tabulate(conditions=['government', 'humor', 'reviews'], samples=['leadership',
'workship', 'hardship'], cumulative=True)
news_fd = cfd['news']
news_fd.most_common(3)
news_fd['the'] # corrected variable name from new_fd to news_fd
OUTPUT:

RESULT
Thus the python program frequency of words, of a particular genre, in brown
corpus has been implemented and executed successfully.

You might also like