MDA File
LAB MANUAL
Submitted by:
Akshat Gupta
2K21/SE/18
Submitted to:
Div Chaudhary
Abhishek Yadav
Department of Software Engineering
November, 2023
INDEX
EXPERIMENT-1
Objective:
Introduction to Python.
Introduction:
Python is a high-level, interpreted, and general-purpose programming language. It was created in the late 1980s
by Guido van Rossum. Python is dynamically typed, meaning that you don't have to specify the data type of a
variable when declaring it. Python's syntax is designed to be easy to read and write, making it a great choice for
beginners.
• Variables - They are containers used for storing data values. Variables can store data of different types, and different types can do different things. There are three numeric types in Python - int, float, complex. There may be times when you want to specify a type on a variable, which can be done with casting.
• Comments – They are used to explain code and are not part of the actual executable code.
• OOP - Python is an object-oriented language, and as such it uses classes to define data types, including
its primitive types. Almost everything in Python is an object, with its properties and methods. A Class
is like an object constructor, or a "blueprint" for creating objects.
• String - They are surrounded by either single quotation marks, or double quotation marks. You can
display a string literal with the print() function.
• Boolean - They represent one of two values, true or false.
• Operators - They are used to perform operations on variables and values.
• Lists – They are used to store multiple items in a single variable and created using square brackets.
• Tuples - They are used to store multiple items in a single variable. A tuple is a collection which is
ordered and unchangeable and is written with round brackets.
• Sets - They are used to store multiple items in a single variable. A set is a collection which is
unordered, unchangeable, and unindexed.
• Dictionaries - They are used to store data values in key:value pairs. A dictionary is a collection which
is ordered, changeable and does not allow duplicates.
• Loops - With the while loop we can execute a set of statements as long as a condition is true. A for
loop is used for iterating over a sequence (that is either a list, a tuple, a dictionary, a set, or a string).
• Function - It is a block of code which only runs when it is called. You can pass data, known as
parameters, into a function. A function can return data as a result.
• JSON - Python has a built-in package called json, which can be used to work with JSON data.
• File handling - It is an important part of any web application. Python has several functions for
creating, reading, updating, and deleting files.
• NumPy - It is a Python library. NumPy is used for working with arrays.
• Pandas - It is a Python library. Pandas is used to analyze data.
• SciPy - It is a scientific computation library that uses NumPy underneath. SciPy stands for Scientific
Python.
• Matplotlib - It is a low-level graph plotting library in Python that serves as a visualization utility.
• Scikit-learn (Sklearn) - It is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction via a consistent interface in Python.
• PyTorch - It is used to process tensors. Tensors are multidimensional arrays like n-dimensional NumPy arrays.
a. Numbers:
x = 5        # int
y = 5.5      # float
a = float(x) # convert from int to float
b = int(y)   # convert from float to int
print(a)
print(b)
print(type(a))
print(type(b))
b. String:
a = "Hello world"
print(a[1])
for s in "python":
print(s)
print(len(“python”)
b = "Hello world"
print(b[2:5])
print(b[:5])
print(b[2:])
print(b[-5:-2])
a = "Hello, World!"
print(a.upper())
print(a.lower())
print(a.replace("H", "J"))
print(a.split(","))
a = "Hello"
b = "World"
c = a + b
print(c)
c. Boolean:
print(10 > 9)
print(10 == 9)
print(10 < 9)
d. List:
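A short illustrative sketch of list operations, matching the description above (the values are arbitrary):
fruits = ["apple", "banana", "cherry"]
print(fruits[1])
fruits.append("orange")   # lists are changeable: items can be added
print(fruits)
print(len(fruits))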
e. Tuple:
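A short illustrative sketch of a tuple, matching the description above (the values are arbitrary):
thistuple = ("apple", "banana", "cherry")
print(thistuple[0])
print(len(thistuple))
# thistuple[0] = "kiwi" would raise an error, since tuples are unchangeable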
f. Set:
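A short illustrative sketch of a set, matching the description above (the values are arbitrary):
thisset = {"apple", "banana", "cherry"}
thisset.add("orange")     # new items can be added, but existing items cannot be changed
print(thisset)            # the order of items is not guaranteed
print("apple" in thisset)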
g. Dictionary:
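A short illustrative sketch of a dictionary, matching the description above (the values are arbitrary):
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}
print(thisdict["model"])
thisdict["year"] = 2020   # dictionaries are changeable
print(thisdict)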
h. Loops:
i = 1
while i < 5:
    print(i)
    i += 1
for s in "python":
    print(s)
i. Function:
def my_function(fname):
    print(fname + " name")
my_function("a")
my_function("b")
my_function("c")
j. Class:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
p1 = Person("John", 36)
print(p1.name)
print(p1.age)
k. JSON:
import json
x = '{ "name":"John", "age":30, "city":"New York"}'
y = json.loads(x)
print(y["age"])
2. File Handling:
import os
f = open("demofile3.txt", "w")
f.write("Woops! I have deleted the content!")
f.close()
f = open("demofile3.txt", "r")
print(f.read())
os.remove("demofile3.txt")
3. NumPy:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
print(arr[0])
print(arr[1:5])
x = arr.copy()
print(x)
print(arr.shape)
for x in arr:
    print(x)
y = np.where(arr == 4)
print(y)
print(np.sort(arr))
x = [True, False, True, False, True]
newarr = arr[x]
print(newarr)
4. Pandas:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)
print(myvar.head(1))
5. SciPy:
from scipy import constants
print(constants.pi)
print(constants.gram)
print(constants.degree)
print(constants.hour)
6. Matplotlib:
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()
xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])
plt.plot(xpoints, ypoints)
plt.show()
7. PyTorch:
import torch
a = torch.tensor(2)
b = torch.tensor(1)
print(a, b)
print(a+b)
print(b-a)
print(a*b)
print(a/b)
a = torch.zeros((3, 3))
print(a)
print(a.shape)
8. SciKit-Learn:
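A minimal classification sketch on the built-in Iris dataset (the choice of classifier and its parameters are illustrative assumptions):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
model = KNeighborsClassifier(n_neighbors=3)   # illustrative classifier and parameter
model.fit(X_train, y_train)
print(model.score(X_test, y_test))            # accuracy on the held-out test split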
Results:
Implemented basic Python data types, functions and libraries.
Learnings:
Getting familiarized with the basics of Python and the different Python libraries used in machine learning, along with file handling concepts.
EXPERIMENT-2
Objective:
Collect a dataset from historical repositories and open-source projects using various tools.
Introduction:
The aim is to compile a comprehensive dataset for crop yield in different Indian States. The various factors contributing towards yield may be state, rainfall, usage of fertilizer, season, amount produced, land size, etc. Since the exact dataset matching our requirements is not available anywhere on the internet, we will have to collect/mine the dataset from different repositories/sources.
Data Sources:
1. https://data.gov.in/catalog/district-wise-season-wise-crop-production-statistics-0
2. https://www.fao.org/faostat/en/#data
3. https://data.gov.in/catalog/rainfall-india
4. https://environicsindia.in/
5. https://www.imdpune.gov.in/library/public/e-book110.pdf
From these files, we can compile the dataset only for 1997-2020 and only for the Indian States and UTs excluding Chandigarh, Dadra and Nagar Haveli, Daman & Diu, Ladakh and Lakshadweep.
Results:
Downloaded different files from different credible and official data repositories.
Learnings:
Getting familiarized with data repositories and learning how to search for credible data on the internet effectively.
EXPERIMENT-3
Objective:
Implement data preprocessing techniques for appropriate dataset.
Introduction:
This dataset encompasses agricultural data for multiple crops cultivated across various states in India from the
year 1997 till 2020. The dataset provides crucial features related to crop yield prediction, including crop types,
crop years, cropping seasons, states, areas under cultivation, production quantities, annual rainfall, fertilizer
usage, pesticide usage, and calculated yields. The dataset is focused on predicting crop yields based on several
agronomic factors, such as weather conditions, fertilizer and pesticide usage, and other relevant variables. The
dataset is presented in tabular form, with each row representing data for a specific crop and its corresponding
features. It has 19698 rows and 10 columns (9 features and 1 label).
Columns Description:
1. Crop: The name of the crop cultivated.
2. Crop_Year: The year in which the crop was grown.
3. Season: The specific cropping season (e.g., Kharif, Rabi, Whole Year).
4. State: The Indian state where the crop was cultivated.
5. Area: The total land area (in hectares) under cultivation for the specific crop.
6. Production: The quantity of crop production (in metric tons).
7. Annual_Rainfall: The annual rainfall received in the crop-growing region (in mm).
8. Fertilizer: The total amount of fertilizer used for the crop (in kilograms).
9. Pesticide: The total amount of pesticide used for the crop (in kilograms).
10. Yield: The calculated crop yield (production per unit area).
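The preprocessing below assumes the raw production and rainfall tables have already been loaded into pandas DataFrames; a minimal setup sketch (the file names are hypothetical placeholders):
import pandas as pd
# Hypothetical file names; the raw exports from the data sources above are assumed
df_yield = pd.read_csv('/content/crop_production_raw.csv')
df_rain = pd.read_csv('/content/rainfall_raw.csv')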
df_yield.columns = df_yield.columns.str.strip()
df_yield = df_yield.drop('District', axis=1)
df_yield = df_yield.groupby(['Crop', 'Crop_Year', 'Season', 'State']).agg({'Area': 'sum', 'Production': 'sum', 'Yield': 'mean'})
print(df_yield)
df_yield.to_csv('/content/ds_new.csv', index=True)
df_new = pd.read_csv('/content/ds_new.csv')
print(df_new)
print(df_new.dtypes)
df_new['Crop_Year'] = df_new['Crop_Year'].astype('int')
print(df_new.dtypes)
mean_rainfall_by_state = df_rain.groupby('State')['Annual_Rainfall'].mean()
df_rain['Annual_Rainfall'].fillna(df_rain['State'].map(mean_rainfall_by_state), inplace=True)
print(df_rain)
df_new = df_new.merge(df_rain, left_on=['State', 'Crop_Year'], right_on=['State', 'Year'], how='inner')
df_new.drop(columns=['Year'], inplace=True)
print(df_new)
df_fertilizer = pd.read_csv('/content/fertilizers.csv')
print(df_fertilizer)
df_fertilizer = df_fertilizer.groupby('Year')['Value'].sum().reset_index()
print(df_fertilizer)
df_new = df_new.merge(df_fertilizer, left_on='Crop_Year', right_on='Year', how='left')
df_new['Fertilizer'] = df_new['Value'] * df_new['Area']
df_new.drop(columns=['Year'], inplace=True)
df_new.drop(columns=['Value'], inplace=True)
print(df_new)
df_pesticides = pd.read_csv('/content/pesticides.csv')
print(df_pesticides)
df_new = df_new.merge(df_pesticides, left_on='Crop_Year', right_on='Year', how='left')
df_new['Pesticide'] = df_new['Value'] * df_new['Area']
df_new.drop(columns=['Year'], inplace=True)
df_new.drop(columns=['Value'], inplace=True)
print(df_new)
df_new = df_new[['Crop', 'Crop_Year', 'Season', 'State', 'Area', 'Production', 'Annual_Rainfall', 'Fertilizer', 'Pesticide', 'Yield']]
print(df_new)
df_new.to_csv('/content/ds.csv', index=True)
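The encoded frame used below can be produced, for example, by one-hot encoding the categorical columns; a sketch assuming pd.get_dummies is used (the exact encoding step is not shown in the text):
# Assumed one-hot encoding of the categorical columns
df_encoded = pd.get_dummies(df_new, columns=['Crop', 'Season', 'State'])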
df_encoded.to_csv('/content/ds_encoded.csv', index=True)
Results:
Compiled a comprehensive, normalized and preprocessed dataset for crop yield with 97 features and 1 label.
Learnings:
Getting familiarized with various data preprocessing techniques in Excel and Python.
EXPERIMENT-4
Objective:
Describe data using various methods.
Introduction:
This dataset encompasses agricultural data for multiple crops cultivated across various states in India from the
year 1997 till 2020. The dataset provides crucial features related to crop yield prediction, including crop types,
crop years, cropping seasons, states, areas under cultivation, production quantities, annual rainfall, fertilizer
usage, pesticide usage, and calculated yields. The dataset is focused on predicting crop yields based on several
agronomic factors, such as weather conditions, fertilizer and pesticide usage, and other relevant variables. The
dataset is presented in tabular form, with each row representing data for a specific crop and its corresponding
features. It has 19698 rows and 10 columns (9 features and 1 label).
For describing the dataset, we will use various built-in pandas library functions and different strategies.
print(df_encoded.describe())
print(df_encoded.info())
Results:
Described the dataset using pandas library functions.
Learnings:
Getting familiarized with the basic functions of the pandas library for describing data.
EXPERIMENT-5
Objective:
Write a program to implement chi-square test for appropriate dataset.
Introduction:
A chi-squared test (symbolically represented as χ2) is a statistical test based on the analysis of observations of a random set of variables. Usually, it is a comparison of two statistical data sets. The test was introduced by Karl Pearson in 1900 for categorical data analysis and distribution, so it is also referred to as Pearson's chi-squared test.
The chi-square test is used to estimate how likely the observations that were made would be, under the assumption that the null hypothesis is true.
A hypothesis is a consideration that a given condition or statement might be true, which we can test afterwards. Chi-squared tests are usually constructed from a sum of squared errors, or from the sample variance.
The Null Hypothesis is the assumption that the event will not occur. A null hypothesis has no bearing on the
study's outcome unless it is rejected. H0 is the symbol for it, and it is pronounced H-naught.
The Alternate Hypothesis is the logical opposite of the null hypothesis. The acceptance of the alternative hypothesis follows the rejection of the null hypothesis. H1 is the symbol for it.
P stands for probability here. In statistics, the chi-square test is used to calculate the p-value. The different values of p indicate different hypothesis interpretations, as given below:
• P ≤ 0.05: the null hypothesis is rejected
• P > 0.05: the null hypothesis is not rejected (accepted)
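For example, the decision rule against the 5% significance level can be sketched as follows (the p-value shown is a hypothetical illustration):
alpha = 0.05       # significance level
p_value = 0.03     # hypothetical p-value obtained from a chi-square test
if p_value <= alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")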
Dataset: The Iris dataset is one of the most well-known and commonly used datasets in machine learning and
statistics. It is included in the scikit-learn library, which is a popular machine learning library in Python. The
Iris dataset is often used for testing and experimenting with various machine learning algorithms, particularly in
the field of classification.
The Iris dataset contains a set of 150 records of iris flowers. Each record has four features:
• Sepal length in centimeters
• Sepal width in centimeters
• Petal length in centimeters
• Petal width in centimeters
The dataset is labeled with the corresponding species of iris flowers, which include three classes:
• Iris-setosa
• Iris-versicolor
• Iris-virginica
Code:
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
# chi-square statistic of each feature against the class labels
chi2_statistics, p_values = chi2(X, y)
print("Chi-square statistics:", chi2_statistics)
print("p-values:", p_values)
Output:
Results:
Implemented the chi-squared test on the Iris dataset available in the sklearn library.
Learnings:
Getting familiarized with the chi-squared test and its implementation in Python using the sklearn library.
EXPERIMENT-6
Objective:
Write a program to implement various correlation analysis techniques:
· Spearman's rank correlation coefficient
· Pearson correlation coefficient
· Kendall rank correlation coefficient
Introduction:
Spearman’s rank correlation measures the strength and direction of association between two ranked variables.
It basically gives the measure of monotonicity of the relation between two variables i.e. how well the
relationship between two variables could be represented using a monotonic function.
The formula for Spearman's rank correlation coefficient is:
ρ = 1 − (6 Σ di²) / (n(n² − 1))
where,
ρ = Spearman's rank correlation coefficient
di = difference between the two ranks of each observation
n = number of observations
The Spearman Rank Correlation can take a value from +1 to -1 where,
• A value of +1 means a perfect association of rank
• A value of 0 means that there is no association between ranks
• A value of -1 means a perfect negative association of rank
Pearson correlation coefficient, also known as Pearson's r, measures the strength of a relationship between
two variables and their association with one another. Pearson's Correlation Coefficient is named after Karl
Pearson.
The formula for Pearson's correlation coefficient is:
r = Σ(xi − x̄)(yi − ȳ) / √(Σ(xi − x̄)² · Σ(yi − ȳ)²)
where,
r = coefficient of correlation
x̄ = mean of the x-variable
ȳ = mean of the y-variable
xi, yi = individual sample values of variables x and y
Kendall's rank correlation coefficient, often referred to as Kendall's tau, is a statistic that measures the
strength of dependence between two variables. It is particularly useful when dealing with ordinal data, which is
data that can be ranked but may not have a clear numerical difference between each ranking. Kendall's tau
assesses the correspondence between the ranks of the paired samples.
The value of Kendall's tau ranges between -1 and 1, where,
• 1 indicates a perfect agreement in rankings.
• -1 indicates a perfect disagreement or inverse agreement in rankings.
• 0 indicates no association between the rankings.
The formula for Kendall's rank correlation coefficient is:
τ = (number of concordant pairs − number of discordant pairs) / (n(n − 1) / 2)
where,
• Concordant Pair - A pair of observations (x1, y1) and (x2, y2) that follows the property
o x1 > x2 and y1 > y2 or
o x1 < x2 and y1 < y2
• Discordant Pair: A pair of observations (x1, y1) and (x2, y2) that follows the property
o x1 > x2 and y1 < y2 or
o x1 < x2 and y1 > y2
• n: Total number of samples
• The pair for which x1 = x2 and y1 = y2 are not classified as concordant or discordant and are ignored.
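As a small illustration of concordant and discordant pairs, Kendall's tau can be computed directly with SciPy (the ranked data below is made up for demonstration):
from scipy.stats import kendalltau
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 4, 5]   # one swapped pair, so most pairs are concordant
tau, p = kendalltau(x, y)
print(tau)            # close to +1 (0.8): strong agreement in rankings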
Dataset: The Iris dataset is one of the most well-known and commonly used datasets in machine learning and
statistics. It is included in the scikit-learn library, which is a popular machine learning library in Python. The
Iris dataset is often used for testing and experimenting with various machine learning algorithms, particularly in
the field of classification.
The Iris dataset contains a set of 150 records of iris flowers. Each record has four features:
• Sepal length in centimeters
• Sepal width in centimeters
• Petal length in centimeters
• Petal width in centimeters
The dataset is labeled with the corresponding species of iris flowers, which include three classes:
• Iris-setosa
• Iris-versicolor
• Iris-virginica
Code:
from sklearn.datasets import load_iris
from scipy.stats import spearmanr, pearsonr, kendalltau
iris = load_iris()
X = iris.data
y = iris.target
for i in range(X.shape[1]):
    feature_name = iris.feature_names[i]
    X_feature = X[:, i]
    spearman_corr, _ = spearmanr(X_feature, y)
    pearson_corr, _ = pearsonr(X_feature, y)
    kendall_corr, _ = kendalltau(X_feature, y)
    print(f"Feature: {feature_name}")
    print(f"Spearman's Rank Correlation Coefficient: {spearman_corr}")
    print(f"Pearson Correlation Coefficient: {pearson_corr}")
    print(f"Kendall Rank Correlation Coefficient: {kendall_corr}")
    print("------------------------")
Output:
Results:
Implemented Spearman's rank correlation coefficient, Pearson correlation coefficient and Kendall rank correlation coefficient on the Iris dataset available in the sklearn library.
Learnings:
Getting familiarized with Spearman's rank correlation coefficient, Pearson correlation coefficient and Kendall rank correlation coefficient, and their implementation in Python using the scipy library.
EXPERIMENT-7
Objective:
Write a program to implement t-test (one sample t-test and two sample t-test) for appropriate dataset.
Introduction:
The t-test is a statistical test used to determine whether there is a significant difference between the means of
two groups. It is widely used in hypothesis testing when the data is approximately normally distributed and the
variances of the two populations are assumed to be equal or approximately equal.
There are two main types of t-tests:
1. One-Sample t-Test: The one-sample t-test is used to determine whether the mean of a single sample
significantly differs from a known or hypothesized population mean. It is commonly used to compare
the sample mean with a hypothesized value.
Formula:
t = (X̄ − μ) / (S / √n)
where,
X̄ is the sample mean
μ is the hypothesized population mean
S is the standard deviation of the sample
n is the number of observations in the sample
2. Two-Sample t-Test: The two-sample t-test is used to determine whether the means of two independent
samples are significantly different from each other. It is commonly used to compare the means of two
groups or populations.
Formula:
t = ((X̄1 − X̄2) − (μ1 − μ2)) / √(s1²/n1 + s2²/n2)
where,
X̄1 is mean of first sample
X̄2 is mean of second sample
μ1 is the mean of first population
μ2 is the mean of second population
s1 is the standard deviation of first sample
s2 is the standard deviation of second sample
n1 is the size of the first sample
n2 is the size of the second sample
Dataset: The Iris dataset is one of the most well-known and commonly used datasets in machine learning and
statistics. It is included in the scikit-learn library, which is a popular machine learning library in Python. The
Iris dataset is often used for testing and experimenting with various machine learning algorithms, particularly in
the field of classification.
The Iris dataset contains a set of 150 records of iris flowers. Each record has four features:
• Sepal length in centimeters
• Sepal width in centimeters
• Petal length in centimeters
• Petal width in centimeters
The dataset is labeled with the corresponding species of iris flowers, which include three classes:
• Iris-setosa
• Iris-versicolor
• Iris-virginica
Code:
from scipy import stats
from sklearn.datasets import load_iris
iris = load_iris()
data = iris.data
setosa_data = data[iris.target == 0, 0]
versicolor_data = data[iris.target == 1, 0]
two_sample = stats.ttest_ind(setosa_data, versicolor_data)
print("\nTwo-sample t-test:")
print("t-statistic:", two_sample.statistic)
print("p-value:", two_sample.pvalue)
Output:
Results:
Implemented the one-sample and two-sample t-tests on the Iris dataset available in the sklearn library.
Learnings:
Getting familiarized with the one-sample and two-sample t-tests and their implementation in Python using the scipy library.
EXPERIMENT-8
Objective:
Write a program to implement Friedman test for appropriate dataset.
Introduction:
The Friedman test is a non-parametric statistical test used to determine whether there are statistically significant differences between multiple treatments or groups in a repeated measures design. It is the non-parametric equivalent of the one-way analysis of variance (ANOVA) with repeated measures and is used when the assumptions of normality and homogeneity of variances are not met.
Assumptions:
• The observations should be independent within and between groups.
• The dependent variable should be measured on an ordinal or continuous scale.
• The data should be measured on the same subjects under different conditions or treatments.
Hypotheses:
Null Hypothesis (H0): The population means of the groups are equal.
Alternative Hypothesis (H1): At least one population mean is different from the others.
Test Statistic: The Friedman test calculates a chi-squared (χ2) statistic that is based on the ranks of the
observations within each group. The test statistic is computed as follows:
χ² = (12 / (N k (k + 1))) Σ Rj² − 3N(k + 1)
where:
N is the number of subjects (blocks of observations),
k is the number of treatments or groups,
Rj is the sum of ranks for treatment j.
Dataset: The Iris dataset is one of the most well-known and commonly used datasets in machine learning and
statistics. It is included in the scikit-learn library, which is a popular machine learning library in Python. The
Iris dataset is often used for testing and experimenting with various machine learning algorithms, particularly in
the field of classification.
The Iris dataset contains a set of 150 records of iris flowers. Each record has four features:
• Sepal length in centimeters
• Sepal width in centimeters
• Petal length in centimeters
• Petal width in centimeters
The dataset is labeled with the corresponding species of iris flowers, which include three classes:
• Iris-setosa
• Iris-versicolor
• Iris-virginica
Code:
from scipy.stats import friedmanchisquare
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
friedman_test_statistic, p_value = friedmanchisquare(X[:, 0], X[:, 1], X[:, 2], X[:, 3])
print("Friedman test statistic:", friedman_test_statistic)
print("p-value:", p_value)
Output:
Results:
Implemented the Friedman test on the Iris dataset available in the sklearn library.
Learnings:
Getting familiarized with the Friedman test and its implementation in Python using the scipy library.
EXPERIMENT-9
Objective:
Write a program to implement Wilcoxon signed rank test for appropriate dataset.
Introduction:
The Wilcoxon signed-rank test is a non-parametric statistical test used to assess whether two related paired
samples come from the same distribution. It is suitable for situations where the data does not meet the
assumptions of normality or when the sample size is small. The test is particularly valuable for analyzing data
with an ordinal or continuous scale.
Assumptions:
• The differences between the paired samples are independent and come from the same population.
• The data should be measured on at least an ordinal scale.
Test Procedure:
The Wilcoxon signed-rank test involves the following steps:
1. Compute the differences between the paired samples.
2. Discard the differences that are equal to zero.
3. Rank the absolute values of the remaining differences.
4. Calculate the test statistic (T) as the sum of the ranks of the positive differences, or the sum of the absolute ranks if the signs are disregarded.
5. Compare the test statistic to the critical value from the standard normal distribution, or use the p-value to determine statistical significance.
Dataset: The Iris dataset is one of the most well-known and commonly used datasets in machine learning and
statistics. It is included in the scikit-learn library, which is a popular machine learning library in Python. The
Iris dataset is often used for testing and experimenting with various machine learning algorithms, particularly in
the field of classification.
The Iris dataset contains a set of 150 records of iris flowers. Each record has four features:
• Sepal length in centimeters
• Sepal width in centimeters
• Petal length in centimeters
• Petal width in centimeters
The dataset is labeled with the corresponding species of iris flowers, which include three classes:
• Iris-setosa
• Iris-versicolor
• Iris-virginica
Code:
from scipy.stats import wilcoxon
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
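A minimal way to complete the listing, assuming the test is applied to one pair of related measurements (pairing sepal length with sepal width is an illustrative choice, not specified in the text):
# Paired comparison of two features measured on the same flowers (illustrative pairing)
stat, p_value = wilcoxon(X[:, 0], X[:, 1])
print("Wilcoxon test statistic:", stat)
print("p-value:", p_value)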
Output:
Results:
Implemented the Wilcoxon signed-rank test on the Iris dataset available in the sklearn library.
Learnings:
Getting familiarized with the Wilcoxon signed-rank test and its implementation in Python using the scipy library.
EXPERIMENT-10
Objective:
Write a program to implement the Kruskal-Wallis test for an appropriate dataset.
Introduction:
The Kruskal-Wallis test is a non-parametric statistical test used to determine whether there are significant
differences between two or more independent groups in terms of a continuous or ordinal dependent variable. It
is used when the data does not meet the assumptions of normality required by parametric tests like the analysis
of variance (ANOVA). The test is often used as an extension of the Mann-Whitney U test for comparing more
than two groups.
Assumptions:
• The observations are independent within and between groups.
• The dependent variable is measured on an ordinal or continuous scale.
• The distributions of the groups are similar.
Test Procedure:
1. Rank all the data from lowest to highest, combining all groups.
2. Calculate the sum of ranks for each group.
3. Calculate the test statistic (H) using the formula:
H = (12 / (N(N + 1))) Σ (Ri² / ni) − 3(N + 1)
where:
N is the total number of observations,
Ri is the sum of ranks for the ith group,
ni is the number of observations in the ith group.
4. Compare the test statistic to the critical value from the chi-squared distribution with k − 1 degrees of freedom, where k is the number of groups.
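For step 4, the critical value can be looked up from the chi-squared distribution, for example with SciPy (k = 4 groups and a 5% significance level are illustrative choices):
from scipy.stats import chi2
k = 4                                      # illustrative number of groups
critical_value = chi2.ppf(0.95, df=k - 1)  # 95th percentile of chi-squared with k-1 df
print(critical_value)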
Dataset: The Iris dataset is one of the most well-known and commonly used datasets in machine learning and
statistics. It is included in the scikit-learn library, which is a popular machine learning library in Python. The
Iris dataset is often used for testing and experimenting with various machine learning algorithms, particularly in
the field of classification.
The Iris dataset contains a set of 150 records of iris flowers. Each record has four features:
• Sepal length in centimeters
• Sepal width in centimeters
• Petal length in centimeters
• Petal width in centimeters
The dataset is labeled with the corresponding species of iris flowers, which include three classes:
• Iris-setosa
• Iris-versicolor
• Iris-virginica
Code:
from scipy.stats import kruskal
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
kruskal_stat, p_value = kruskal(X[:, 0], X[:, 1], X[:, 2], X[:, 3])
print("Kruskal-Wallis statistic:", kruskal_stat)
print("p-value:", p_value)
Output:
Results:
Implemented the Kruskal-Wallis test on the Iris dataset available in the sklearn library.
Learnings:
Getting familiarized with the Kruskal-Wallis test and its implementation in Python using the scipy library.
EXPERIMENT-11
Objective:
Write a program to implement Mann-Whitney U test for iris dataset.
Introduction:
The Mann-Whitney U test, also known as the Wilcoxon rank-sum test, is a non-parametric statistical test used
to compare two independent samples to assess whether they originate from the same population. It is suitable
for situations where the data does not meet the assumptions of normality, and the samples are not paired.
Assumptions:
• The observations from the two groups are independent of each other.
• The dependent variable should be measured on at least an ordinal scale.
• The distributions of the two groups should be similar.
Test Procedure:
1. Combine the data from both groups into a single dataset.
2. Rank all the data from lowest to highest, assigning average ranks to tied values.
3. Calculate the U statistic, which represents the smaller of the two sums of ranks for the two groups.
4. Compare the U statistic to critical values from the Mann-Whitney U distribution or use the p-value to
determine statistical significance.
Dataset: The Iris dataset is one of the most well-known and commonly used datasets in machine learning and
statistics. It is included in the scikit-learn library, which is a popular machine learning library in Python. The
Iris dataset is often used for testing and experimenting with various machine learning algorithms, particularly in
the field of classification.
The Iris dataset contains a set of 150 records of iris flowers. Each record has four features:
• Sepal length in centimeters
• Sepal width in centimeters
• Petal length in centimeters
• Petal width in centimeters
The dataset is labeled with the corresponding species of iris flowers, which include three classes:
• Iris-setosa
• Iris-versicolor
• Iris-virginica
Code:
from scipy.stats import mannwhitneyu
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
for i in range(X.shape[1]):
    feature_data1 = X[y == 0, i]
    feature_data2 = X[y == 1, i]
    u_stat, p_value = mannwhitneyu(feature_data1, feature_data2)
    print(f"Mann-Whitney U Test for Feature {i+1}:")
    print("U Statistic:", u_stat)
    print("P-value:", p_value)
    print("-----------------------")
Output:
Results:
Implemented the Mann-Whitney U test on the Iris dataset available in the sklearn library.
Learnings:
Getting familiarized with the Mann-Whitney U test and its implementation in Python using the scipy library.
EXPERIMENT-12
Objective:
Perform a comparison of the following data analysis tools:
· WEKA
· SAS Enterprise Miner
· SPSS
· MATLAB
· R
Introduction:
1. WEKA:
• Features: Open-source, Java-based, provides a collection of machine learning algorithms for data
analysis, preprocessing tools, and visualization capabilities.
• Ease of Use: User-friendly graphical interface, suitable for beginners and experienced users.
• Learning Curve: Relatively easy to learn, especially for individuals with a basic understanding of data
analysis and machine learning concepts.
• Community Support: Active community support, ample documentation, and online resources available.
• Applications: Suitable for educational purposes, research, and small to medium-scale data analysis tasks.
• Data Types: Suitable for analyzing structured and unstructured data, primarily used for machine learning
tasks.
• Scalability: Limited scalability for handling large datasets and complex analyses compared to some
enterprise-level tools.
• Customization: Provides limited customization options compared to some other tools, primarily designed
for specific data analysis tasks.
3. SPSS:
• Features: Widely used for statistical analysis, data management, and data documentation, includes various
analytical tools, data visualization, and reporting functionalities.
• Ease of Use: Intuitive interface, user-friendly, suitable for both beginners and experienced researchers.
• Learning Curve: Relatively easy to learn, especially for individuals with limited programming experience
or statistical background.
• Community Support: Active user community, extensive documentation, and online resources available.
• Applications: Well-suited for data analysis in social sciences, market research, and business applications.
• Data Types: Designed for analyzing structured data, particularly effective for survey data, social sciences
research, and market research.
• Scalability: Suitable for small to medium-scale datasets, may face limitations when handling very large
datasets or complex analyses.
• Customization: Provides moderate customization options, primarily for statistical analyses and data
visualization tasks.
4. MATLAB:
• Features: Powerful computational and visualization capabilities, extensive toolboxes for data analysis,
simulation, and modeling, ideal for numerical computing and algorithm development.
• Ease of Use: Requires programming knowledge, steeper learning curve, suitable for users with a strong
mathematical and technical background.
• Learning Curve: High learning curve, especially for individuals without prior programming experience.
• Community Support: Active user community, comprehensive documentation, and extensive support
from MathWorks.
• Applications: Suitable for engineering, scientific research, and academic purposes that involve complex
mathematical computations and simulations.
• Data Types: Well-suited for numerical data, engineering data, and scientific data analysis, particularly
effective for mathematical modeling and simulations.
• Scalability: Offers good scalability for complex mathematical computations, simulations, and algorithm
development tasks.
• Customization: Provides extensive customization options and programming capabilities, allowing users to
create custom analytical solutions and algorithms.
5. R:
• Features: Open-source, widely used for statistical computing, data analysis, and graphical representation,
provides a vast collection of packages for various data analysis tasks.
• Ease of Use: Requires programming knowledge, user-friendly interface, particularly for individuals with a
background in programming or statistics.
• Learning Curve: Moderate learning curve, with ample online resources, tutorials, and a supportive user
community.
• Community Support: Active and extensive community support, numerous packages, and libraries
available for diverse data analysis tasks.
• Applications: Ideal for statistical computing, data visualization, and research purposes in various fields,
including academia, business, and data science.
• Data Types: Designed for analyzing diverse data types, including structured and unstructured data, ideal
for statistical analysis and data visualization tasks.
• Scalability: Offers good scalability for handling both small and large datasets, supports parallel processing
for improved performance.
• Customization: Provides extensive customization options through numerous packages and libraries,
allowing users to create tailored data analysis solutions.
Results:
Compared the various data analysis tools across multiple criteria.
Learnings:
Getting familiarized with WEKA, SAS Enterprise Miner, SPSS, MATLAB and R.