Lab Manual: Data Analysis and Machine Learning Experiments

Experiment 1: Installing and Exploring Libraries
Problem Statement

To explore the features of NumPy, SciPy, Jupyter, Statsmodels, and Pandas packages and read
data from a text file, Excel file, and the web.

Aim

To familiarize with Python libraries essential for data analysis and scientific computation.

Algorithm

1. Install the required libraries using pip install numpy scipy pandas statsmodels
jupyter requests openpyxl. Pandas relies on openpyxl to read and write .xlsx files.
2. Explore features of each library with basic and advanced examples.
3. Read data from a text file using Pandas.
4. Read data from an Excel file using Pandas.
5. Fetch and analyze data from a web source using Pandas and Requests.
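
Step 5 names Requests, so one way to combine the two libraries is sketched here: download the CSV text with Requests, then hand it to Pandas through an in-memory buffer. The helper names fetch_csv_text and frame_from_csv_text are hypothetical, and the 10-second timeout is an assumed value, not something the manual prescribes.

```python
import io

import pandas as pd


def fetch_csv_text(url, timeout=10):
    """Download a CSV file and return its body as text (hypothetical helper)."""
    import requests  # imported here so the parsing helper also works offline

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors such as 404
    return response.text


def frame_from_csv_text(text):
    """Parse CSV text into a DataFrame without writing it to disk."""
    return pd.read_csv(io.StringIO(text))


# Offline demonstration with literal CSV text (assumed sample values)
sample = "Name,Age,Salary\nAlice,25,50000\nBob,30,60000\nCharlie,35,70000"
df_demo = frame_from_csv_text(sample)
print(df_demo)
print("Mean age:", df_demo["Age"].mean())
```

With network access, frame_from_csv_text(fetch_csv_text(url)) would replace the literal text. Going through Requests, rather than passing the URL directly to pd.read_csv, gives control over timeouts and HTTP error handling.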

Detailed Sample Program


# Importing libraries
import numpy as np
import pandas as pd
from scipy.stats import ttest_1samp, norm
from scipy.optimize import minimize
import statsmodels.api as sm
import requests

# NumPy Examples
print("NumPy Examples:")
# Create a 1D array and compute basic statistics
arr = np.array([1, 2, 3, 4, 5])
print("Array:", arr)
print("Mean of array:", np.mean(arr))
print("Standard Deviation:", np.std(arr))

# Advanced NumPy: Matrix operations


matrix = np.random.rand(3, 3)
print("\nMatrix:")
print(matrix)
print("Matrix Transpose:")
print(matrix.T)
print("Dot Product:")
print(np.dot(matrix, matrix.T))

# SciPy Examples
print("\nSciPy Examples:")
# Perform a t-test
print("T-test Example:")
t_stat, p_value = ttest_1samp(arr, 3)
print("T-statistic:", t_stat, "P-value:", p_value)

# Optimization Example
print("\nOptimization Example:")
# Minimize f(x) = x^2 + 5, starting away from the minimum at x = 0
result = minimize(lambda x: x[0]**2 + 5, x0=[2.0])
print("Optimization Result:", result)

# Pandas Examples
print("\nPandas Examples:")
# Creating a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print("DataFrame:")
print(df)
print("Summary Statistics:")
print(df.describe())

# Reading data from a text file
print("\nReading from Text File:")
# Write a small comma-separated sample first so the example is self-contained
with open('sample.txt', 'w') as f:
    f.write("Name,Age,Salary\nAlice,25,50000\nBob,30,60000\nCharlie,35,70000")
df_text = pd.read_csv('sample.txt')
print(df_text)

# Reading data from an Excel file
# (writing and reading .xlsx files requires the openpyxl package)
print("\nReading from Excel File:")
df.to_excel('sample.xlsx', index=False)
df_excel = pd.read_excel('sample.xlsx')
print(df_excel)

# Reading data from the web
print("\nReading from Web:")
url = 'https://people.sc.fsu.edu/~jburkardt/data/csv/hw_200.csv'
df_web = pd.read_csv(url)
print(df_web.head())

# Statsmodels Example
print("\nStatsmodels Example:")
X = sm.add_constant([4, 5, 6])
Y = [1, 2, 3]
model = sm.OLS(Y, X).fit()
print(model.summary())

Enhanced Output
• Extensive examples of NumPy functionalities including matrix operations.
• Statistical analysis using SciPy (e.g., t-tests, optimization).
• Comprehensive use of Pandas for data manipulation and reading files.
• Regression analysis and model summaries with Statsmodels.

Experiment 2: Descriptive Analytics using Kaggle Dataset


Problem Statement

To download a dataset from Kaggle and explore various commands for descriptive analytics.

Aim

To analyze a dataset and compute comprehensive statistical properties using Pandas.

Algorithm

1. Download a dataset from Kaggle.
2. Load the dataset into a Pandas DataFrame.
3. Perform extensive data exploration using commands like .head(), .describe(),
.info(), .value_counts(), .mean(), .median(), .mode(), and .corr().
4. Identify key insights such as correlations, missing values, and distribution patterns.
5. Visualize data using Matplotlib and Seaborn.
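
The exploration commands listed in step 3 can be tried on a tiny frame before loading a full Kaggle file. The following is a minimal sketch; the five rows and the name df_small are hypothetical values chosen for illustration, not data from any real dataset.

```python
import pandas as pd

# Hypothetical five-row sample standing in for a Kaggle dataset
df_small = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, None, 26.0],
    "Fare": [7.25, 71.28, 7.93, 8.05, 13.00],
    "Survived": [0, 1, 1, 0, 1],
})

print(df_small.head(3))                      # first three rows
print(df_small["Survived"].value_counts())   # frequency of each outcome
print("Mean age:", df_small["Age"].mean())   # missing values are skipped
print("Median age:", df_small["Age"].median())
print("Mode of age:", df_small["Age"].mode().tolist())
print("Missing per column:")
print(df_small.isnull().sum())
print(df_small.corr())                       # pairwise Pearson correlations
```

Because every column here is numeric, .corr() needs no extra arguments; a frame with text columns would need them excluded first.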

Detailed Sample Program


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load a dataset (e.g., Titanic dataset)


df = pd.read_csv('titanic.csv')

# Dataset Overview
print("Dataset Info:")
df.info()  # info() prints its report itself and returns None

# Summary Statistics
print("\nBasic Statistics:")
print(df.describe())

# Check Missing Values


print("\nMissing Values:")
missing = df.isnull().sum()
print(missing)

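# Fill Missing Values
print("\nHandling Missing Values:")
# Assign the result back; fillna(inplace=True) on a selected column is
# deprecated in recent pandas versions and may not modify df
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Correlation Analysis
print("\nCorrelation Matrix:")
# Restrict to numeric columns; text columns such as Name or Sex cannot be correlated
correlation_matrix = df.corr(numeric_only=True)
print(correlation_matrix)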

# Analyzing Specific Columns


print("\nSurvival Count Analysis:")
survived = df['Survived'].value_counts()
print(survived)

# Visualization
print("\nVisualizations:")
# Heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Heatmap of Correlations")
plt.show()

# Histogram
sns.histplot(data=df, x='Age', kde=True, bins=30, color='purple')
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

# Bar Plot for Survival


sns.countplot(data=df, x='Survived', palette='pastel')
plt.title("Survival Counts")
plt.xlabel("Survived")
plt.ylabel("Count")
plt.show()

Enhanced Output

• Comprehensive dataset insights (e.g., missing values, descriptive statistics).
• In-depth correlation analysis with heatmap visualization.
• Visualizations of age distribution and survival counts.

Experiment 3: T-Hypothesis Testing Using Excel


Problem Statement

To perform t-hypothesis testing on a sample dataset using Excel.

Aim

To understand and conduct t-hypothesis testing using Excel’s built-in data analysis tools.

Algorithm

1. Input sample data into Excel in two columns.
2. Open the Data Analysis ToolPak.
3. Choose "t-Test: Two-Sample Assuming Equal Variances".
4. Select data ranges for the two samples.
5. Set a significance level (e.g., 0.05).
6. Review the output to interpret the p-value and t-statistic.

Enhanced Explanation

• Input Example Data:

  Sample A    Sample B
  12          14
  15          18
  14          17
  13          19

• Key Outputs:
  o t-statistic and p-value from the Excel output.
  o Decision: reject the null hypothesis of equal means if the p-value is below the
    chosen significance level.
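
To cross-check the Excel result, the same equal-variance t-statistic can be computed by hand. The sketch below uses only the Python standard library and the example data above; the function name pooled_t_statistic is hypothetical. It stops at the t-statistic and degrees of freedom, since the p-value needs the t-distribution (available, for instance, through scipy.stats.ttest_ind).

```python
import math
import statistics

sample_a = [12, 15, 14, 13]
sample_b = [14, 18, 17, 19]


def pooled_t_statistic(a, b):
    """Two-sample t-statistic assuming equal variances (hypothetical helper)."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # statistics.variance is the sample variance (divides by n - 1)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    dof = n_a + n_b - 2
    # Pool the two variances, weighting each by its degrees of freedom
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, dof


t_stat, dof = pooled_t_statistic(sample_a, sample_b)
print(f"t = {t_stat:.4f}, degrees of freedom = {dof}")
```

For this data the statistic comes out near -2.78 with 6 degrees of freedom. Its magnitude exceeds the two-tailed critical value of about 2.447 at the 0.05 level, so the Excel output should report a p-value below 0.05.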

Experiment 4: Visualizing IRIS Dataset


Problem Statement

To download the IRIS dataset and generate a box plot, scatter plot, and histogram.

Aim

To visualize the distribution and relationships between attributes of the IRIS dataset.

Algorithm

1. Download the IRIS dataset from the UCI repository.
2. Load the dataset into a Pandas DataFrame.
3. Generate detailed visualizations using Matplotlib and Seaborn.

Detailed Sample Program


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv('iris.data', header=None,
                 names=['sepal_length', 'sepal_width', 'petal_length',
                        'petal_width', 'class'])

# Box Plot
sns.boxplot(data=df[['sepal_length', 'sepal_width', 'petal_length',
                     'petal_width']])
plt.title("Box Plot of Iris Features")
plt.show()

# Scatter Plot
sns.scatterplot(data=df, x='sepal_length', y='sepal_width', hue='class',
                palette='deep')
plt.title("Scatter Plot of Sepal Length vs Width by Class")
plt.show()

# Histogram
sns.histplot(data=df, x='sepal_length', hue='class', multiple='stack',
             bins=20, palette='muted')
plt.title("Histogram of Sepal Length by Class")
plt.xlabel("Sepal Length")
plt.ylabel("Frequency")
plt.show()
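
A box plot draws, for each feature, a box from the first to the third quartile with a line at the median; by default the whiskers reach to the most extreme points within 1.5 times the interquartile range. The sketch below computes those quantities with the standard library. The seven values are hypothetical, chosen only for illustration, and are not taken from the IRIS file.

```python
import statistics

# Hypothetical sepal-length values in cm, chosen only for illustration
values = [4.9, 5.0, 5.4, 5.8, 6.3, 6.7, 7.1]

# method="inclusive" interpolates linearly between the sorted data points
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Conventional whisker limits: points beyond these are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"Outlier fences: {lower_fence:.2f} to {upper_fence:.2f}")
```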

Enhanced Output

• Box plots showing variability across attributes.
• Scatter plot revealing relationships between sepal dimensions across classes.
• Stacked histogram highlighting class-wise distribution of sepal lengths.

Experiment 5: Decision Tree for Car Evaluation Dataset


Problem Statement

To train a decision tree classification model and generate a decision tree for the Car Evaluation
dataset.

Aim

To build and visualize a decision tree for the classification of car evaluations.

Algorithm

1. Download the Car Evaluation dataset from the UCI repository.
2. Preprocess the dataset by encoding categorical values.
3. Split the dataset into training and testing sets.
4. Train a decision tree classifier using sklearn.
5. Evaluate and visualize the decision tree.
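
The encoding in step 2 deserves a closer look, because LabelEncoder assigns integer codes in alphabetical order of the labels, which does not match the natural order of attributes such as safety. The standard-library sketch below contrasts the two codings; the four sample rows and the dictionary names are hypothetical.

```python
# Levels of the 'safety' attribute, listed from worst to best
# (level names as published with the UCI Car Evaluation data)
safety_order = ["low", "med", "high"]

# Alphabetical codes, which is how LabelEncoder assigns integers
alphabetical_codes = {level: i for i, level in enumerate(sorted(safety_order))}

# Explicit ordinal codes that keep low < med < high
ordinal_codes = {level: i for i, level in enumerate(safety_order)}

column = ["low", "high", "med", "high"]  # hypothetical sample rows
print("Alphabetical:", [alphabetical_codes[v] for v in column])
print("Ordinal:     ", [ordinal_codes[v] for v in column])
```

A decision tree splits on thresholds, so it can still separate the levels under either coding, but the ordinal coding lets a single split separate low safety from everything better. In scikit-learn, OrdinalEncoder with an explicit categories argument is one route to the same mapping.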

Detailed Sample Program


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

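# Load dataset
df = pd.read_csv('car.data', header=None)
df.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety',
              'class']

# Encode categorical data
# The same encoder is refit for every column; because 'class' is the last
# column, le.classes_ ends up holding the target class names
le = LabelEncoder()
for column in df.columns:
    df[column] = le.fit_transform(df[column])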

# Train-test split
X = df.drop(columns=['class'])
y = df['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Train model
clf = DecisionTreeClassifier(random_state=42, max_depth=5)
clf.fit(X_train, y_train)

# Evaluate model
accuracy = clf.score(X_test, y_test)
print(f"Model Accuracy: {accuracy:.2f}")

# Classification report
y_pred = clf.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

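# Visualize decision tree
plt.figure(figsize=(20, 10))
# Pass plain lists; some scikit-learn versions reject a pandas Index or NumPy array here
plot_tree(clf, feature_names=list(X.columns), class_names=list(le.classes_),
          filled=True, fontsize=10)
plt.title("Decision Tree Visualization")
plt.show()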

Enhanced Output

• Decision tree visualization with labeled nodes and colored classes.
• Detailed classification report and accuracy score.
