Lab Manual

Lab Manual: Data Analysis and Machine Learning Experiments
Experiment 1: Installing and Exploring Libraries
Problem Statement
To explore the features of NumPy, SciPy, Jupyter, Statsmodels, and Pandas packages and read
data from a text file, Excel file, and the web.
Aim
To become familiar with the Python libraries essential for data analysis and scientific computation.
Algorithm
1. Install required libraries using pip install numpy scipy pandas statsmodels jupyter
requests openpyxl (Pandas needs openpyxl to read and write .xlsx files).
2. Explore features of each library with basic and advanced examples.
3. Read data from a text file using Pandas.
4. Read data from an Excel file using Pandas.
5. Fetch and analyze data from a web source using Pandas and Requests.
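Step 5 names both Pandas and Requests. One way the two can be combined is sketched here under stated assumptions: the helper names fetch_csv_text and csv_text_to_frame are hypothetical, the 30-second timeout is an arbitrary choice, and the inline text stands in for a downloaded file so the example also runs without network access.

```python
import io

import pandas as pd


def fetch_csv_text(url, timeout=30):
    """Download a CSV document and return its body as text (hypothetical helper)."""
    import requests  # imported here so the parsing helper below also works offline

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on HTTP errors such as 404
    return response.text


def csv_text_to_frame(text):
    """Parse CSV text that is already in memory into a DataFrame."""
    return pd.read_csv(io.StringIO(text))


# Offline demonstration with inline text; with network access one could call
# csv_text_to_frame(fetch_csv_text(url)) instead.
sample_text = "Name,Age,Salary\nAlice,25,50000\nBob,30,60000\n"
df_demo = csv_text_to_frame(sample_text)
print(df_demo)
```

Separating the download from the parsing makes it possible to check the HTTP status before Pandas sees the data, which pd.read_csv(url) alone does not expose.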
Detailed Sample Program

# Importing libraries
import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp, norm
from scipy.optimize import minimize
import statsmodels.api as sm
import requests
# NumPy Examples
print("NumPy Examples:")
# Create a 1D array and compute basic statistics
arr = np.array([1, 2, 3, 4, 5])
print("Array:", arr)
print("Mean of array:", np.mean(arr))
# np.std defaults to the population standard deviation (ddof=0)
print("Standard Deviation:", np.std(arr))
# Advanced NumPy: Matrix operations
matrix = np.random.rand(3, 3)
print("\nMatrix:")
print(matrix)
print("Matrix Transpose:")
print(matrix.T)
print("Dot Product:")
print(np.dot(matrix, matrix.T))
# SciPy Examples
print("\nSciPy Examples:")
# Perform a t-test
print("T-test Example:")
t_stat, p_value = ttest_1samp(arr, 3)
print("T-statistic:", t_stat, "P-value:", p_value)
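# Optimization Example: minimize f(x) = x^2 + 5, starting from x0 = 0
print("\nOptimization Example:")
# minimize passes x as a 1-D array, so index it to return a scalar
result = minimize(lambda x: x[0]**2 + 5, x0=[0.0])
print("Optimization Result:", result)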
# Pandas Examples
print("\nPandas Examples:")
# Creating a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print("DataFrame:")
print(df)
print("Summary Statistics:")
print(df.describe())
# Reading data from a text file (a small sample file is written first)
print("\nReading from Text File:")
with open('sample.txt', 'w') as f:
    f.write("Name,Age,Salary\nAlice,25,50000\nBob,30,60000\nCharlie,35,70000")
df_text = pd.read_csv('sample.txt')
print(df_text)
# Reading data from an Excel file (requires the openpyxl package)
print("\nReading from Excel File:")
df.to_excel('sample.xlsx', index=False)
df_excel = pd.read_excel('sample.xlsx')
print(df_excel)
# Reading data from the web
print("\nReading from Web:")
url = 'https://people.sc.fsu.edu/~jburkardt/data/csv/hw_200.csv'
df_web = pd.read_csv(url)
print(df_web.head())
# Statsmodels Example
print("\nStatsmodels Example:")
X = sm.add_constant([4, 5, 6])
Y = [1, 2, 3]
model = sm.OLS(Y, X).fit()
print(model.summary())
Enhanced Output
• Extensive examples of NumPy functionalities including matrix operations.
• Statistical analysis using SciPy (e.g., t-tests, optimization).
• Comprehensive use of Pandas for data manipulation and reading files.
• Regression analysis and model summaries with Statsmodels.
Experiment 2: Descriptive Analytics using Kaggle Dataset

Problem Statement
To download a dataset from Kaggle and explore various commands for descriptive analytics.
Aim
To analyze a dataset and compute comprehensive statistical properties using Pandas.
Algorithm
1. Download a dataset from Kaggle.
2. Load the dataset into a Pandas DataFrame.
3. Perform extensive data exploration using commands like .head(), .describe(),
.info(), .value_counts(), .mean(), .median(), .mode(), and .corr().
4. Identify key insights such as correlations, missing values, and distribution patterns.
5. Visualize data using Matplotlib and Seaborn.
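The commands listed in step 3 can be tried on a small in-memory table before a Kaggle file is available. The following is a minimal sketch; the DataFrame name toy, its column names, and its values are made up for illustration and are not taken from any Kaggle dataset.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Pune"],
    "age": [22, 35, 22, 41, 30],
    "income": [30, 52, 31, 75, 48],
})

print(toy.head(3))                     # first three rows
print(toy["city"].value_counts())      # frequency of each category
print(toy["age"].mean())               # arithmetic mean
print(toy["age"].median())             # middle value of the sorted column
print(toy["age"].mode())               # most frequent value(s), returned as a Series
print(toy[["age", "income"]].corr())   # pairwise Pearson correlation
```

Note that .mode() returns a Series rather than a single number, because a column can have several equally frequent values.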

Detailed Sample Program

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load a dataset (e.g., Titanic dataset)
df = pd.read_csv('titanic.csv')
# Dataset Overview (df.info() prints its report directly and returns None)
print("Dataset Info:")
df.info()
# Summary Statistics
print("\nBasic Statistics:")
print(df.describe())
# Check Missing Values
print("\nMissing Values:")
missing = df.isnull().sum()
print(missing)
# Fill Missing Values (assign the result back; in-place fillna on a selected
# column is deprecated in recent Pandas versions)
print("\nHandling Missing Values:")
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Correlation Analysis (numeric columns only; text columns cannot be correlated)
print("\nCorrelation Matrix:")
correlation_matrix = df.corr(numeric_only=True)
print(correlation_matrix)
# Analyzing Specific Columns
print("\nSurvival Count Analysis:")
survived = df['Survived'].value_counts()
print(survived)
# Visualization
print("\nVisualizations:")
# Heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Heatmap of Correlations")
plt.show()
# Histogram
sns.histplot(data=df, x='Age', kde=True, bins=30, color='purple')
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()
# Bar Plot for Survival
sns.countplot(data=df, x='Survived', palette='pastel')
plt.title("Survival Counts")
plt.xlabel("Survived")
plt.ylabel("Count")
plt.show()
Enhanced Output
• Comprehensive dataset insights (e.g., missing values, descriptive statistics).
• In-depth correlation analysis with heatmap visualization.
• Visualizations of age distribution and survival counts.
Experiment 3: T-Hypothesis Testing Using Excel

Problem Statement
To perform t-hypothesis testing on a sample dataset using Excel.
Aim
To understand and conduct t-hypothesis testing using Excel’s built-in data analysis tools.
Algorithm
1. Input sample data into Excel in two columns.
2. On the Data tab, click Data Analysis (enable the Analysis ToolPak add-in first if the button is missing).
3. Choose "t-Test: Two-Sample Assuming Equal Variances".
4. Select data ranges for the two samples.
5. Set a significance level (e.g., 0.05).
6. Review the output to interpret the p-value and t-statistic.
Enhanced Explanation
• Input Example Data:
  Sample A   Sample B
  12         14
  15         18
  14         17
  13         19
• Key Outputs:
  - t-statistic and p-value from Excel output.
  - Decision based on significance level: reject the null hypothesis of equal means when the p-value is below it.
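The arithmetic behind the equal-variance t-test can be checked by hand against Excel's output. The sketch below applies the standard pooled-variance formula to the example data using only the Python standard library. The critical value 2.447 (two-tailed, alpha = 0.05, 6 degrees of freedom) is taken from a standard t-table and is an addition, not part of the original procedure.

```python
import math
import statistics

sample_a = [12, 15, 14, 13]
sample_b = [14, 18, 17, 19]

n_a, n_b = len(sample_a), len(sample_b)
mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
# statistics.variance is the sample variance (divides by n - 1)
var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)

# Pooled variance: both samples are assumed to share one population variance
df = n_a + n_b - 2
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df

# t-statistic for the difference between the two sample means
std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_stat = (mean_a - mean_b) / std_error

# Two-tailed critical value for alpha = 0.05 and 6 degrees of freedom (t-table)
t_critical = 2.447

print(f"t = {t_stat:.4f}, df = {df}")
print("Reject H0" if abs(t_stat) > t_critical else "Fail to reject H0")
```

For this data the absolute t-statistic exceeds the critical value, so the hypothesis of equal means is rejected at the 0.05 level. The "P(T<=t) two-tail" value in Excel's output should lead to the same decision.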
Experiment 4: Visualizing IRIS Dataset

Problem Statement
To download the IRIS dataset and generate a box plot, scatter plot, and histogram.
Aim
To visualize the distribution and relationships between attributes of the IRIS dataset.
Algorithm
1. Download the IRIS dataset from the UCI repository.
2. Load the dataset into a Pandas DataFrame.
3. Generate detailed visualizations using Matplotlib and Seaborn.

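Detailed Sample Program

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load dataset (iris.data has no header row, so column names are supplied)
df = pd.read_csv('iris.data', header=None,
                 names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])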
# Box Plot
sns.boxplot(data=df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])
plt.title("Box Plot of Iris Features")
plt.show()
# Scatter Plot
sns.scatterplot(data=df, x='sepal_length', y='sepal_width', hue='class',
                palette='deep')
plt.title("Scatter Plot of Sepal Length vs Width by Class")
plt.show()
# Histogram
sns.histplot(data=df, x='sepal_length', hue='class', multiple='stack',
             bins=20, palette='muted')
plt.title("Histogram of Sepal Length by Class")
plt.xlabel("Sepal Length")
plt.ylabel("Frequency")
plt.show()
Enhanced Output
• Box plots showing variability across attributes.
• Scatter plot revealing relationships between sepal dimensions across classes.
• Stacked histogram highlighting class-wise distribution of sepal lengths.
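Each box in a box plot summarizes an attribute by its quartiles. A minimal sketch of the numbers behind a single box follows, assuming the common 1.5 × IQR whisker rule (the Matplotlib default). The values are made up for illustration and are not taken from iris.data.

```python
import statistics

# Hypothetical measurements, for illustration only
values = [4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 6.4, 6.9, 9.5]

# Quartiles with linear interpolation between neighbouring data points
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Whisker limits under the 1.5 * IQR rule; points beyond them are drawn as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_limit or v > upper_limit]

print(f"Q1={q1:.3f}, median={median:.3f}, Q3={q3:.3f}, IQR={iqr:.3f}")
print("Outliers:", outliers)
```

Here the single large value falls above the upper limit, so a box plot of this data would draw it as an isolated point rather than extending the whisker to it.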
Experiment 5: Decision Tree for Car Evaluation Dataset

Problem Statement
To train a decision tree classification model and generate a decision tree for the Car Evaluation
dataset.
Aim
To build and visualize a decision tree for the classification of car evaluations.
Algorithm
1. Download the Car Evaluation dataset from the UCI repository.
2. Preprocess the dataset by encoding categorical values.
3. Split the dataset into training and testing sets.
4. Train a decision tree classifier using sklearn.
5. Evaluate and visualize the decision tree.
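Step 2 leaves the encoding scheme open. LabelEncoder numbers labels in sorted (alphabetical) order, so a scale such as low/med/high would be coded high=0, low=1, med=2. An alternative that keeps the natural order is sketched below with Pandas ordered categoricals. The category orders are assumptions based on the attribute values listed in the UCI dataset description, and the three sample rows are made up for illustration.

```python
import pandas as pd

# Assumed category orders, from lowest to highest
orders = {
    "safety": ["low", "med", "high"],
    "lug_boot": ["small", "med", "big"],
}

# Hypothetical rows, for illustration only
sample = pd.DataFrame({
    "safety": ["high", "low", "med"],
    "lug_boot": ["small", "big", "med"],
})

encoded = sample.copy()
for column, categories in orders.items():
    # .codes gives each value's position in the declared order (-1 for unknown values)
    encoded[column] = pd.Categorical(
        sample[column], categories=categories, ordered=True
    ).codes

print(encoded)
```

A decision tree splits on thresholds, so order-preserving codes let a single split separate "low" from "med and above"; whether this improves the model on this dataset is not something the sketch demonstrates.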

Detailed Sample Program

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
# Load dataset
df = pd.read_csv('car.data', header=None)
df.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety',
              'class']
# Encode categorical data (the encoder is refit for every column, so after the
# loop it holds the labels of the last column, 'class')
le = LabelEncoder()
for column in df.columns:
    df[column] = le.fit_transform(df[column])
# Train-test split
X = df.drop(columns=['class'])
y = df['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
# Train model
clf = DecisionTreeClassifier(random_state=42, max_depth=5)
clf.fit(X_train, y_train)
# Evaluate model
accuracy = clf.score(X_test, y_test)
print(f"Model Accuracy: {accuracy:.2f}")
# Classification report
y_pred = clf.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Visualize decision tree (names are passed as plain lists, which all
# scikit-learn versions accept)
plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=list(X.columns), class_names=list(le.classes_),
          filled=True, fontsize=10)
plt.title("Decision Tree Visualization")
plt.show()
Enhanced Output
• Decision tree visualization with labeled nodes and colored classes.
• Detailed classification report and accuracy score.
