MDA File
LAB MANUAL
Submitted by:
Akshat Gupta
2K21/SE/18
Submitted to:
Div Chaudhary
Abhishek Yadav
Department of Software Engineering
November, 2023
INDEX
EXPERIMENT-1
Objective:
Introduction to Python.
Introduction:
Python is a high-level, interpreted, and general-purpose programming language. It was created in the late 1980s
by Guido van Rossum. Python is dynamically typed, meaning that you don't have to specify the data type of a
variable when declaring it. Python's syntax is designed to be easy to read and write, making it a great choice for
beginners.
• Variables - They are containers used for storing data values. Variables can store data of different types, and different types can do different things. There are three numeric types in Python - int, float, complex. There may be times when you want to specify a type on a variable, which can be done with casting.
• Comments – They are used to explain code and are not part of the actual executable code.
• OOP - Python is an object-oriented language, and as such it uses classes to define data types, including
its primitive types. Almost everything in Python is an object, with its properties and methods. A Class
is like an object constructor, or a "blueprint" for creating objects.
• String - They are surrounded by either single quotation marks, or double quotation marks. You can
display a string literal with the print() function.
• Boolean - They represent one of two values, true or false.
• Operators - They are used to perform operations on variables and values.
• Lists – They are used to store multiple items in a single variable and created using square brackets.
• Tuples - They are used to store multiple items in a single variable. A tuple is a collection which is
ordered and unchangeable and is written with round brackets.
• Sets - They are used to store multiple items in a single variable. A set is a collection which is
unordered, unchangeable, and unindexed.
• Dictionaries - They are used to store data values in key:value pairs. A dictionary is a collection which
is ordered, changeable and does not allow duplicates.
• Loops - With the while loop we can execute a set of statements as long as a condition is true. A for
loop is used for iterating over a sequence (that is either a list, a tuple, a dictionary, a set, or a string).
• Function - It is a block of code which only runs when it is called. You can pass data, known as
parameters, into a function. A function can return data as a result.
• JSON - Python has a built-in package called json, which can be used to work with JSON data.
• File handling - It is an important part of any web application. Python has several functions for
creating, reading, updating, and deleting files.
• NumPy - It is a Python library. NumPy is used for working with arrays.
• Pandas - It is a Python library. Pandas is used to analyze data.
• SciPy - It is a scientific computation library that uses NumPy underneath. SciPy stands for Scientific
Python.
• Matplotlib - It is a low-level graph plotting library in Python that serves as a visualization utility.
• Scikit-learn (Sklearn) - It is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction via a consistent interface in Python.
• PyTorch - It is used to process tensors. Tensors are multidimensional arrays like n-dimensional NumPy arrays.
a. Numbers:
x = 5        # int
y = 5.5      # float
a = float(x) # convert from int to float
b = int(y)   # convert from float to int
print(a)
print(b)
print(type(a))
print(type(b))
b. String:
a = "Hello world"
print(a[1])
for s in "python":
print(s)
print(len(“python”)
b = "Hello world"
print(b[2:5])
print(b[:5])
print(b[2:])
print(b[-5:-2])
a = "Hello, World!"
print(a.upper())
print(a.lower())
print(a.replace("H", "J"))
print(a.split(","))
a = "Hello"
b = "World"
c = a + b
print(c)
c. Boolean:
print(10 > 9)
print(10 == 9)
print(10 < 9)
d. List:
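A short illustrative sketch of list operations, matching the description above (the values are arbitrary):
fruits = ["apple", "banana", "cherry"]
print(fruits[1])
fruits.append("orange")   # lists are changeable: items can be added
print(fruits)
print(len(fruits))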
e. Tuple:
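A short illustrative sketch of a tuple, matching the description above (the values are arbitrary):
thistuple = ("apple", "banana", "cherry")
print(thistuple[0])
print(len(thistuple))
# thistuple[0] = "kiwi" would raise an error, since tuples are unchangeable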
f. Set:
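A short illustrative sketch of a set, matching the description above (the values are arbitrary):
thisset = {"apple", "banana", "cherry"}
thisset.add("orange")     # new items can be added, but existing items cannot be changed
print(thisset)            # the order of items is not guaranteed
print("apple" in thisset)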
g. Dictionary:
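A short illustrative sketch of a dictionary, matching the description above (the values are arbitrary):
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}
print(thisdict["model"])
thisdict["year"] = 2020   # dictionaries are changeable
print(thisdict)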
h. Loops:
i = 1
while i < 5:
    print(i)
    i += 1
for s in "python":
    print(s)
i. Function:
def my_function(fname):
    print(fname + " name")
my_function("a")
my_function("b")
my_function("c")
j. Class:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
p1 = Person("John", 36)
print(p1.name)
print(p1.age)
k. JSON:
import json
x = '{ "name":"John", "age":30, "city":"New York"}'
y = json.loads(x)
print(y["age"])
2. File Handling:
import os
f = open("demofile3.txt", "w")
f.write("Woops! I have deleted the content!")
f.close()
f = open("demofile3.txt", "r")
print(f.read())
os.remove("demofile3.txt")
3. NumPy:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
print(arr[0])
print(arr[1:5])
x = arr.copy()
print(x)
print(arr.shape)
for x in arr:
    print(x)
y = np.where(arr == 4)
print(y)
print(np.sort(arr))
x = [True, False, True, False, True]
newarr = arr[x]
print(newarr)
4. Pandas:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)
print(myvar.head(1))
5. SciPy:
from scipy import constants
print(constants.pi)
print(constants.gram)
print(constants.degree)
print(constants.hour)
6. Matplotlib:
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()
xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])
plt.plot(xpoints, ypoints)
plt.show()
7. PyTorch:
import torch
a = torch.tensor(2)
b = torch.tensor(1)
print(a, b)
print(a+b)
print(b-a)
print(a*b)
print(a/b)
a = torch.zeros((3, 3))
print(a)
print(a.shape)
8. SciKit-Learn:
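A minimal classification sketch on the built-in Iris dataset (the choice of classifier and its parameters are illustrative assumptions):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
model = KNeighborsClassifier(n_neighbors=3)   # illustrative classifier and parameter
model.fit(X_train, y_train)
print(model.score(X_test, y_test))            # accuracy on the held-out test split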
Results:
Implemented basic Python data types, functions and libraries.
Learnings:
Getting familiarized with the basics of Python and the different Python libraries used in machine learning, along with file handling concepts.
EXPERIMENT-2
Objective:
Collect a dataset from historical repositories and open-source projects using various tools.
Introduction:
The aim is to compile a comprehensive dataset for crop yield in different Indian States. The various factors contributing towards yield may be state, rainfall, usage of fertilizer, season, amount produced, land size, etc. Since the exact dataset matching our requirements is not available anywhere on the internet, we will have to collect/mine the dataset from different repositories/sources.
Data Sources:
1. https://data.gov.in/catalog/district-wise-season-wise-crop-production-statistics-0
2. https://www.fao.org/faostat/en/#data
3. https://data.gov.in/catalog/rainfall-india
4. https://environicsindia.in/
5. https://www.imdpune.gov.in/library/public/e-book110.pdf
From these files, we can compile the dataset only for 1997-2020 and only for the Indian States and UTs excluding Chandigarh, Dadra and Nagar Haveli, Daman & Diu, Ladakh and Lakshadweep.
Results:
Downloaded different files from different credible and official data repositories.
Learnings:
Getting familiarized with data repositories and learning how to search for credible data on the internet effectively.
EXPERIMENT-3
Objective:
Implement data preprocessing techniques for appropriate dataset.
Introduction:
This dataset encompasses agricultural data for multiple crops cultivated across various states in India from the
year 1997 till 2020. The dataset provides crucial features related to crop yield prediction, including crop types,
crop years, cropping seasons, states, areas under cultivation, production quantities, annual rainfall, fertilizer
usage, pesticide usage, and calculated yields. The dataset is focused on predicting crop yields based on several
agronomic factors, such as weather conditions, fertilizer and pesticide usage, and other relevant variables. The
dataset is presented in tabular form, with each row representing data for a specific crop and its corresponding
features. It has 19698 rows and 10 columns (9 features and 1 label).
Columns Description:
1. Crop: The name of the crop cultivated.
2. Crop_Year: The year in which the crop was grown.
3. Season: The specific cropping season (e.g., Kharif, Rabi, Whole Year).
4. State: The Indian state where the crop was cultivated.
5. Area: The total land area (in hectares) under cultivation for the specific crop.
6. Production: The quantity of crop production (in metric tons).
7. Annual_Rainfall: The annual rainfall received in the crop-growing region (in mm).
8. Fertilizer: The total amount of fertilizer used for the crop (in kilograms).
9. Pesticide: The total amount of pesticide used for the crop (in kilograms).
10. Yield: The calculated crop yield (production per unit area).
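The preprocessing below assumes the raw production and rainfall tables have already been loaded into pandas DataFrames; a minimal setup sketch (the file names are hypothetical placeholders):
import pandas as pd
# Hypothetical file names; the raw exports from the data sources above are assumed
df_yield = pd.read_csv('/content/crop_production_raw.csv')
df_rain = pd.read_csv('/content/rainfall_raw.csv')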
df_yield.columns = df_yield.columns.str.strip()
df_yield = df_yield.drop('District', axis=1)
df_yield = df_yield.groupby(['Crop', 'Crop_Year', 'Season', 'State']).agg({'Area': 'sum', 'Production': 'sum', 'Yield': 'mean'})
print(df_yield)
df_yield.to_csv('/content/ds_new.csv', index=True)
df_new = pd.read_csv('/content/ds_new.csv')
print(df_new)
print(df_new.dtypes)
df_new['Crop_Year'] = df_new['Crop_Year'].astype('int')
print(df_new.dtypes)
mean_rainfall_by_state = df_rain.groupby('State')['Annual_Rainfall'].mean()
df_rain['Annual_Rainfall'].fillna(df_rain['State'].map(mean_rainfall_by_state), inplace=True)
print(df_rain)
df_new = df_new.merge(df_rain, left_on=['State', 'Crop_Year'], right_on=['State', 'Year'], how='inner')
df_new.drop(columns=['Year'], inplace=True)
print(df_new)
df_fertilizer = pd.read_csv('/content/fertilizers.csv')
print(df_fertilizer)
df_fertilizer = df_fertilizer.groupby('Year')['Value'].sum().reset_index()
print(df_fertilizer)
df_new = df_new.merge(df_fertilizer, left_on='Crop_Year', right_on='Year', how='left')
df_new['Fertilizer'] = df_new['Value'] * df_new['Area']
df_new.drop(columns=['Year'], inplace=True)
df_new.drop(columns=['Value'], inplace=True)
print(df_new)
df_pesticides = pd.read_csv('/content/pesticides.csv')
print(df_pesticides)
df_new = df_new.merge(df_pesticides, left_on='Crop_Year', right_on='Year', how='left')
df_new['Pesticide'] = df_new['Value'] * df_new['Area']
df_new.drop(columns=['Year'], inplace=True)
df_new.drop(columns=['Value'], inplace=True)
print(df_new)
df_new = df_new[['Crop', 'Crop_Year', 'Season', 'State', 'Area', 'Production', 'Annual_Rainfall', 'Fertilizer', 'Pesticide', 'Yield']]
print(df_new)
df_new.to_csv('/content/ds.csv', index=True)
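The encoded frame used below can be produced, for example, by one-hot encoding the categorical columns; a sketch assuming pd.get_dummies is used (the exact encoding step is not shown in the text):
# Assumed one-hot encoding of the categorical columns
df_encoded = pd.get_dummies(df_new, columns=['Crop', 'Season', 'State'])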
df_encoded.to_csv('/content/ds_encoded.csv', index=True)
Results:
Compiled a comprehensive, normalized and preprocessed dataset for crop yield with 97 features and 1 label.
Learnings:
Getting familiarized with various data preprocessing techniques in Excel and Python.
EXPERIMENT-4
Objective:
Describe data using various methods.
Introduction:
This dataset encompasses agricultural data for multiple crops cultivated across various states in India from the
year 1997 till 2020. The dataset provides crucial features related to crop yield prediction, including crop types,
crop years, cropping seasons, states, areas under cultivation, production quantities, annual rainfall, fertilizer
usage, pesticide usage, and calculated yields. The dataset is focused on predicting crop yields based on several
agronomic factors, such as weather conditions, fertilizer and pesticide usage, and other relevant variables. The
dataset is presented in tabular form, with each row representing data for a specific crop and its corresponding
features. It has 19698 rows and 10 columns (9 features and 1 label).
For describing the dataset, we will use various built-in pandas library functions and different strategies.
print(df_encoded.describe())
print(df_encoded.info())
Results:
Described the dataset using pandas library functions.
Learnings:
Getting familiarized with the basic functions of the pandas library for describing data.
EXPERIMENT-5
Objective:
Write a program to implement chi-square test for appropriate dataset.
Introduction:
A chi-squared test (symbolically represented as χ2) is a statistical test based on the analysis of observations of a random set of variables. Usually, it is a comparison of two statistical data sets. The test was introduced by Karl Pearson in 1900 for categorical data analysis and distribution, so it is also referred to as Pearson's chi-squared test.
The chi-square test is used to estimate how likely the observations that were made would be, under the assumption that the null hypothesis is true.
A hypothesis is a consideration that a given condition or statement might be true, which we can test afterwards. Chi-squared tests are usually constructed from a sum of squared errors, or from the sample variance.
The Null Hypothesis is the assumption that the event will not occur. A null hypothesis has no bearing on the
study's outcome unless it is rejected. H0 is the symbol for it, and it is pronounced H-naught.
The Alternate Hypothesis is the logical opposite of the null hypothesis. The acceptance of the alternative hypothesis follows the rejection of the null hypothesis. H1 is the symbol for it.
P stands for probability here. In statistics, the chi-square test is used to calculate the p-value. The different values of p indicate different hypothesis interpretations, as given below:
• P ≤ 0.05: the null hypothesis is rejected
• P > 0.05: the null hypothesis is not rejected (accepted)
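For example, the decision rule against the 5% significance level can be sketched as follows (the p-value shown is a hypothetical illustration):
alpha = 0.05       # significance level
p_value = 0.03     # hypothetical p-value obtained from a chi-square test
if p_value <= alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")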
Dataset: The Iris dataset is one of the most well-known and commonly used datasets in machine learning and
statistics. It is included in the scikit-learn library, which is a popular machine learning library in Python. The
Iris dataset is often used for testing and experimenting with various machine learning algorithms, particularly in
the field of classification.
The Iris dataset contains a set of 150 records of iris flowers. Each record has four features:
• Sepal length in centimeters
• Sepal width in centimeters
• Petal length in centimeters
• Petal width in centimeters
The dataset is labeled with the corresponding species of iris flowers, which include three classes:
• Iris-setosa
• Iris-versicolor
• Iris-virginica
Code:
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
# chi-square statistic of each feature against the class labels
chi2_statistics, p_values = chi2(X, y)
print("Chi-square statistics:", chi2_statistics)
print("p-values:", p_values)
Output:
Results:
Implemented the chi-squared test on the Iris dataset available in the sklearn library.
Learnings:
Getting familiarized with the chi-squared test and its implementation in Python using the sklearn library.
EXPERIMENT-6
Objective:
Write a program to implement various correlation analysis techniques:
· Spearman's rank correlation coefficient
· Pearson correlation coefficient
· Kendall rank correlation coefficient
Introduction:
Spearman’s rank correlation measures the strength and direction of association between two ranked variables.
It basically gives the measure of monotonicity of the relation between two variables i.e. how well the
relationship between two variables could be represented using a monotonic function.
The formula for Spearman's rank correlation coefficient is:
ρ = 1 − (6 Σ di²) / (n(n² − 1))
where,
ρ = Spearman's rank correlation coefficient
di = difference between the two ranks of each observation
n = number of observations
The Spearman Rank Correlation can take a value from +1 to -1 where,
• A value of +1 means a perfect association of rank
• A value of 0 means that there is no association between ranks
• A value of -1 means a perfect negative association of rank
Pearson correlation coefficient, also known as Pearson's r, measures the strength of a relationship between
two variables and their association with one another. Pearson's Correlation Coefficient is named after Karl
Pearson.
The formula for Pearson's correlation coefficient is:
r = Σ(xi − x̄)(yi − ȳ) / √(Σ(xi − x̄)² · Σ(yi − ȳ)²)
where,
r = coefficient of correlation
x̄ = mean of the x-variable
ȳ = mean of the y-variable
xi, yi = individual sample values of variables x and y
Kendall's rank correlation coefficient, often referred to as Kendall's tau, is a statistic that measures the
strength of dependence between two variables. It is particularly useful when dealing with ordinal data, which is
data that can be ranked but may not have a clear numerical difference between each ranking. Kendall's tau
assesses the correspondence between the ranks of the paired samples.
The value of Kendall's tau ranges between -1 and 1, where,
• 1 indicates a perfect agreement in rankings.
• -1 indicates a perfect disagreement or inverse agreement in rankings.
• 0 indicates no association between the rankings.
The formula for Kendall's rank correlation coefficient is:
τ = (number of concordant pairs − number of discordant pairs) / (n(n − 1) / 2)
where,
• Concordant Pair - A pair of observations (x1, y1) and (x2, y2) that follows the property
o x1 > x2 and y1 > y2 or
o x1 < x2 and y1 < y2
• Discordant Pair: A pair of observations (x1, y1) and (x2, y2) that follows the property
o x1 > x2 and y1 < y2 or
o x1 < x2 and y1 > y2
• n: Total number of samples
• The pair for which x1 = x2 and y1 = y2 are not classified as concordant or discordant and are ignored.
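As a small illustration of concordant and discordant pairs, Kendall's tau can be computed directly with SciPy (the ranked data below is made up for demonstration):
from scipy.stats import kendalltau
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 4, 5]   # one swapped pair, so most pairs are concordant
tau, p = kendalltau(x, y)
print(tau)            # close to +1 (0.8): strong agreement in rankings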
Dataset: The Iris dataset is one of the most well-known and commonly used datasets in machine learning and
statistics. It is included in the scikit-learn library, which is a popular machine learning library in Python. The
Iris dataset is often used for testing and experimenting with various machine learning algorithms, particularly in
the field of classification.
The Iris dataset contains a set of 150 records of iris flowers. Each record has four features:
• Sepal length in centimeters
• Sepal width in centimeters
• Petal length in centimeters
• Petal width in centimeters
The dataset is labeled with the corresponding species of iris flowers, which include three classes:
• Iris-setosa
• Iris-versicolor
• Iris-virginica
Code:
from sklearn.datasets import load_iris
from scipy.stats import spearmanr, pearsonr, kendalltau
iris = load_iris()
X = iris.data
y = iris.target
for i in range(X.shape[1]):
    feature_name = iris.feature_names[i]
    X_feature = X[:, i]
    spearman_corr, _ = spearmanr(X_feature, y)
    pearson_corr, _ = pearsonr(X_feature, y)
    kendall_corr, _ = kendalltau(X_feature, y)
    print(f"Feature: {feature_name}")
    print(f"Spearman's Rank Correlation Coefficient: {spearman_corr}")
    print(f"Pearson Correlation Coefficient: {pearson_corr}")
    print(f"Kendall Rank Correlation Coefficient: {kendall_corr}")
    print("------------------------")
Output:
Results:
Implemented Spearman's rank correlation coefficient, Pearson correlation coefficient and Kendall rank correlation coefficient on the Iris dataset available in the sklearn library.
Learnings:
Getting familiarized with Spearman's rank correlation coefficient, Pearson correlation coefficient and Kendall rank correlation coefficient, and their implementation in Python using the scipy library.
EXPERIMENT-7
Objective:
Write a program to implement t-test (one sample t-test and two sample t-test) for appropriate dataset.
Introduction:
The t-test is a statistical test used to determine whether there is a significant difference between the means of
two groups. It is widely used in hypothesis testing when the data is approximately normally distributed and the
variances of the two populations are assumed to be equal or approximately equal.
There are two main types of t-tests:
1. One-Sample t-Test: The one-sample t-test is used to determine whether the mean of a single sample
significantly differs from a known or hypothesized population mean. It is commonly used to compare
the sample mean with a hypothesized value.
Formula:
t = (X̄ − μ) / (S / √n)
where,
X̄ is the sample mean
μ is the hypothesized population mean
S is the standard deviation of the sample
n is the number of observations in the sample
2. Two-Sample t-Test: The two-sample t-test is used to determine whether the means of two independent
samples are significantly different from each other. It is commonly used to compare the means of two
groups or populations.
Formula:
t = ((X̄1 − X̄2) − (μ1 − μ2)) / √(s1²/n1 + s2²/n2)
where,
X̄1 is mean of first sample
X̄2 is mean of second sample
μ1 is the mean of first population
μ2 is the mean of second population
s1 is the standard deviation of first sample
s2 is the standard deviation of second sample
n1 is the size of the first sample
n2 is the size of the second sample
Dataset: The Iris dataset is one of the most well-known and commonly used datasets in machine learning and
statistics. It is included in the scikit-learn library, which is a popular machine learning library in Python. The
Iris dataset is often used for testing and experimenting with various machine learning algorithms, particularly in
the field of classification.
The Iris dataset contains a set of 150 records of iris flowers. Each record has four features:
• Sepal length in centimeters
• Sepal width in centimeters
• Petal length in centimeters
• Petal width in centimeters
The dataset is labeled with the corresponding species of iris flowers, which include three classes:
• Iris-setosa
• Iris-versicolor
• Iris-virginica
Code:
from scipy import stats
from sklearn.datasets import load_iris
iris = load_iris()
data = iris.data
setosa_data = data[iris.target == 0, 0]
versicolor_data = data[iris.target == 1, 0]
two_sample = stats.ttest_ind(setosa_data, versicolor_data)
print("\nTwo-sample t-test:")
print("t-statistic:", two_sample.statistic)
print("p-value:", two_sample.pvalue)
Output:
Results:
Implemented the one-sample and two-sample t-tests on the Iris dataset available in the sklearn library.
Learnings:
Getting familiarized with the one-sample and two-sample t-tests and their implementation in Python using the scipy library.
EXPERIMENT-8
Objective:
Write a program to implement Friedman test for appropriate dataset.
Introduction:
The Friedman test is a non-parametric statistical test used to determine whether there are statistically significant differences between multiple treatments or groups in a repeated measures design. It is the non-parametric equivalent of the one-way analysis of variance (ANOVA) with repeated measures and is used when the assumptions of normality and homogeneity of variances are not met.
Assumptions:
• The observations should be independent within and between groups.
• The dependent variable should be measured on an ordinal or continuous scale.
• The data should be measured on the same subjects under different conditions or treatments.
Hypotheses:
Null Hypothesis (H0): The population means of the groups are equal.
Alternative Hypothesis (H1): At least one population mean is different from the others.
Test Statistic: The Friedman test calculates a chi-squared (χ2) statistic that is based on the ranks of the
observations within each group. The test statistic is computed as follows:
χ² = (12 / (N k (k + 1))) Σ Rj² − 3N(k + 1)
where:
N is the number of subjects (blocks of observations),
k is the number of treatments or groups,
Rj is the sum of ranks for treatment j.
Dataset: The Iris dataset is one of the most well-known and commonly used datasets in machine learning and
statistics. It is included in the scikit-learn library, which is a popular machine learning library in Python. The
Iris dataset is often used for testing and experimenting with various machine learning algorithms, particularly in
the field of classification.
The Iris dataset contains a set of 150 records of iris flowers. Each record has four features:
• Sepal length in centimeters
• Sepal width in centimeters
• Petal length in centimeters
• Petal width in centimeters
The dataset is labeled with the corresponding species of iris flowers, which include three classes:
• Iris-setosa
• Iris-versicolor
• Iris-virginica
Code:
from scipy.stats import friedmanchisquare
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
friedman_test_statistic, p_value = friedmanchisquare(X[:, 0], X[:, 1], X[:, 2], X[:, 3])
print("Friedman test statistic:", friedman_test_statistic)
print("p-value:", p_value)
Output:
Results:
Implemented the Friedman test on the Iris dataset available in the sklearn library.
Learnings:
Getting familiarized with the Friedman test and its implementation in Python using the scipy library.
EXPERIMENT-9
Objective:
Write a program to implement Wilcoxon signed rank test for appropriate dataset.
Introduction:
The Wilcoxon signed-rank test is a non-parametric statistical test used to assess whether two related paired
samples come from the same distribution. It is suitable for situations where the data does not meet the
assumptions of normality or when the sample size is small. The test is particularly valuable for analyzing data
with an ordinal or continuous scale.
Assumptions:
• The differences between the paired samples are independent and come from the same population.
• The data should be measured on at least an ordinal scale.
Test Procedure:
The Wilcoxon signed-rank test involves the following steps:
1. Compute the differences between the paired samples.
2. Discard the differences that are equal to zero.
3. Rank the absolute values of the remaining differences.
4. Calculate the test statistic (T) as the sum of the ranks of the positive differences, or the sum of the absolute ranks if the signs are disregarded.
5. Compare the test statistic to the critical value from the standard normal distribution, or use the p-value to determine statistical significance.
Dataset: The Iris dataset is one of the most well-known and commonly used datasets in machine learning and
statistics. It is included in the scikit-learn library, which is a popular machine learning library in Python. The
Iris dataset is often used for testing and experimenting with various machine learning algorithms, particularly in
the field of classification.
The Iris dataset contains a set of 150 records of iris flowers. Each record has four features:
• Sepal length in centimeters
• Sepal width in centimeters
• Petal length in centimeters
• Petal width in centimeters
The dataset is labeled with the corresponding species of iris flowers, which include three classes:
• Iris-setosa
• Iris-versicolor
• Iris-virginica
Code:
from scipy.stats import wilcoxon
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
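A minimal way to complete the listing, assuming the test is applied to one pair of related measurements (pairing sepal length with sepal width is an illustrative choice, not specified in the text):
# Paired comparison of two features measured on the same flowers (illustrative pairing)
stat, p_value = wilcoxon(X[:, 0], X[:, 1])
print("Wilcoxon test statistic:", stat)
print("p-value:", p_value)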
Output:
Results:
Implemented the Wilcoxon signed-rank test on the Iris dataset available in the sklearn library.
Learnings:
Getting familiarized with the Wilcoxon signed-rank test and its implementation in Python using the scipy library.
EXPERIMENT-10
Objective:
Write a program to implement the Kruskal-Wallis test for an appropriate dataset.
Introduction:
The Kruskal-Wallis test is a non-parametric statistical test used to determine whether there are significant
differences between two or more independent groups in terms of a continuous or ordinal dependent variable. It
is used when the data does not meet the assumptions of normality required by parametric tests like the analysis
of variance (ANOVA). The test is often used as an extension of the Mann-Whitney U test for comparing more
than two groups.
Assumptions:
• The observations are independent within and between groups.
• The dependent variable is measured on an ordinal or continuous scale.
• The distributions of the groups are similar.
Test Procedure:
1. Rank all the data from lowest to highest, combining all groups.
2. Calculate the sum of ranks for each group.
3. Calculate the test statistic (H) using the formula:
H = (12 / (N(N + 1))) Σ (Ri² / ni) − 3(N + 1)
where:
N is the total number of observations,
Ri is the sum of ranks for the ith group,
ni is the number of observations in the ith group.
4. Compare the test statistic to the critical value from the chi-squared distribution with k − 1 degrees of freedom, where k is the number of groups.
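For step 4, the critical value can be looked up from the chi-squared distribution, for example with SciPy (k = 4 groups and a 5% significance level are illustrative choices):
from scipy.stats import chi2
k = 4                                      # illustrative number of groups
critical_value = chi2.ppf(0.95, df=k - 1)  # 95th percentile of chi-squared with k-1 df
print(critical_value)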
Dataset: The Iris dataset is one of the most well-known and commonly used datasets in machine learning and
statistics. It is included in the scikit-learn library, which is a popular machine learning library in Python. The
Iris dataset is often used for testing and experimenting with various machine learning algorithms, particularly in
the field of classification.
The Iris dataset contains a set of 150 records of iris flowers. Each record has four features:
• Sepal length in centimeters
• Sepal width in centimeters
• Petal length in centimeters
• Petal width in centimeters
The dataset is labeled with the corresponding species of iris flowers, which include three classes:
• Iris-setosa
• Iris-versicolor
• Iris-virginica
Code:
from scipy.stats import kruskal
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
kruskal_stat, p_value = kruskal(X[:, 0], X[:, 1], X[:, 2], X[:, 3])
print("Kruskal-Wallis statistic:", kruskal_stat)
print("p-value:", p_value)
Output:
Results:
Implemented the Kruskal-Wallis test on the Iris dataset available in the sklearn library.
Learnings:
Getting familiarized with the Kruskal-Wallis test and its implementation in Python using the scipy library.
EXPERIMENT-11
Objective:
Write a program to implement Mann-Whitney U test for iris dataset.
Introduction:
The Mann-Whitney U test, also known as the Wilcoxon rank-sum test, is a non-parametric statistical test used
to compare two independent samples to assess whether they originate from the same population. It is suitable
for situations where the data does not meet the assumptions of normality, and the samples are not paired.
Assumptions:
• The observations from the two groups are independent of each other.
• The dependent variable should be measured on at least an ordinal scale.
• The distributions of the two groups should be similar.
Test Procedure:
1. Combine the data from both groups into a single dataset.
2. Rank all the data from lowest to highest, assigning average ranks to tied values.
3. Calculate the U statistic, which represents the smaller of the two sums of ranks for the two groups.
4. Compare the U statistic to critical values from the Mann-Whitney U distribution or use the p-value to
determine statistical significance.
Dataset: The Iris dataset is one of the most well-known and commonly used datasets in machine learning and
statistics. It is included in the scikit-learn library, which is a popular machine learning library in Python. The
Iris dataset is often used for testing and experimenting with various machine learning algorithms, particularly in
the field of classification.
The Iris dataset contains a set of 150 records of iris flowers. Each record has four features:
• Sepal length in centimeters
• Sepal width in centimeters
• Petal length in centimeters
• Petal width in centimeters
The dataset is labeled with the corresponding species of iris flowers, which include three classes:
• Iris-setosa
• Iris-versicolor
• Iris-virginica
Code:
from scipy.stats import mannwhitneyu
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
for i in range(X.shape[1]):
    feature_data1 = X[y == 0, i]
    feature_data2 = X[y == 1, i]
    u_stat, p_value = mannwhitneyu(feature_data1, feature_data2)
    print(f"Mann-Whitney U Test for Feature {i+1}:")
    print("U Statistic:", u_stat)
    print("P-value:", p_value)
    print("-----------------------")
Output:
Results:
Implemented the Mann-Whitney U test on the Iris dataset available in the sklearn library.
Learnings:
Getting familiarized with the Mann-Whitney U test and its implementation in Python using the scipy library.
EXPERIMENT-12
Objective:
Perform a comparison of the following data analysis tools:
· WEKA
· SAS Enterprise Miner
· SPSS
· MATLAB
· R
Introduction:
1. WEKA:
• Features: Open-source, Java-based, provides a collection of machine learning algorithms for data
analysis, preprocessing tools, and visualization capabilities.
• Ease of Use: User-friendly graphical interface, suitable for beginners and experienced users.
• Learning Curve: Relatively easy to learn, especially for individuals with a basic understanding of data
analysis and machine learning concepts.
• Community Support: Active community support, ample documentation, and online resources available.
• Applications: Suitable for educational purposes, research, and small to medium-scale data analysis tasks.
• Data Types: Suitable for analyzing structured and unstructured data, primarily used for machine learning
tasks.
• Scalability: Limited scalability for handling large datasets and complex analyses compared to some
enterprise-level tools.
• Customization: Provides limited customization options compared to some other tools, primarily designed
for specific data analysis tasks.
3. SPSS:
• Features: Widely used for statistical analysis, data management, and data documentation, includes various
analytical tools, data visualization, and reporting functionalities.
• Ease of Use: Intuitive interface, user-friendly, suitable for both beginners and experienced researchers.
• Learning Curve: Relatively easy to learn, especially for individuals with limited programming experience
or statistical background.
• Community Support: Active user community, extensive documentation, and online resources available.
• Applications: Well-suited for data analysis in social sciences, market research, and business applications.
• Data Types: Designed for analyzing structured data, particularly effective for survey data, social sciences
research, and market research.
• Scalability: Suitable for small to medium-scale datasets, may face limitations when handling very large
datasets or complex analyses.
• Customization: Provides moderate customization options, primarily for statistical analyses and data
visualization tasks.
4. MATLAB:
• Features: Powerful computational and visualization capabilities, extensive toolboxes for data analysis,
simulation, and modeling, ideal for numerical computing and algorithm development.
• Ease of Use: Requires programming knowledge, steeper learning curve, suitable for users with a strong
mathematical and technical background.
• Learning Curve: High learning curve, especially for individuals without prior programming experience.
• Community Support: Active user community, comprehensive documentation, and extensive support
from MathWorks.
• Applications: Suitable for engineering, scientific research, and academic purposes that involve complex
mathematical computations and simulations.
• Data Types: Well-suited for numerical data, engineering data, and scientific data analysis, particularly
effective for mathematical modeling and simulations.
• Scalability: Offers good scalability for complex mathematical computations, simulations, and algorithm
development tasks.
• Customization: Provides extensive customization options and programming capabilities, allowing users to
create custom analytical solutions and algorithms.
5. R:
• Features: Open-source, widely used for statistical computing, data analysis, and graphical representation,
provides a vast collection of packages for various data analysis tasks.
• Ease of Use: Requires programming knowledge, user-friendly interface, particularly for individuals with a
background in programming or statistics.
• Learning Curve: Moderate learning curve, with ample online resources, tutorials, and a supportive user
community.
• Community Support: Active and extensive community support, numerous packages, and libraries
available for diverse data analysis tasks.
• Applications: Ideal for statistical computing, data visualization, and research purposes in various fields,
including academia, business, and data science.
• Data Types: Designed for analyzing diverse data types, including structured and unstructured data, ideal
for statistical analysis and data visualization tasks.
• Scalability: Offers good scalability for handling both small and large datasets, supports parallel processing
for improved performance.
• Customization: Provides extensive customization options through numerous packages and libraries,
allowing users to create tailored data analysis solutions.
Results:
Compared the various data analysis tools across multiple criteria.
Learnings:
Getting familiarized with WEKA, SAS Enterprise Miner, SPSS, MATLAB and R.