
Mlviva

The document provides a comprehensive guide on setting up Python and essential libraries for data science, including NumPy and pandas. It covers key concepts of machine learning with scikit-learn, including installation, algorithms, and data handling techniques. Additionally, it discusses data visualization methods using Matplotlib and Seaborn, as well as various machine learning models like k-NN, linear regression, decision trees, and K-Means clustering.

Uploaded by

theswaran909
Copyright
© © All Rights Reserved

1. Install and Set Up Python and Essential Libraries (NumPy, pandas)

Q1: Why is Python a preferred language for data science and machine
learning?

Answer:
Python is preferred because:
✔ Easy to Learn: Simple syntax, readable code.
✔ Rich Ecosystem: Libraries like NumPy, pandas, scikit-learn, TensorFlow.
✔ Community Support: Large open-source community.
✔ Versatility: Used in web development, automation, and AI.
✔ Integration: Works well with databases, big data tools (Spark), and visualization
libraries (Matplotlib, Seaborn).

Q2: What is the purpose of pip in Python, and how do you install a library
using it?

Answer:

 pip is Python’s package installer.

 To install a library:

pip install numpy pandas

 To upgrade:
pip install --upgrade pandas

Q3: Explain the difference between NumPy and pandas.

Answer:
NumPy                                                  pandas

Used for numerical computations (arrays, matrices).    Used for structured data (DataFrames, Series).

Optimized for math operations (linear algebra).        Optimized for data manipulation (filtering, grouping).

np.array() is the core data structure.                 pd.DataFrame() and pd.Series() are core structures.

Q4: How do you check the installed version of a Python library like
pandas?

Answer:

import pandas as pd
print(pd.__version__)

or via terminal:

pip show pandas
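
The same check can be done for several libraries at once without importing them. The sketch below is a minimal illustration using the standard-library importlib.metadata module; the helper name installed_version is hypothetical, and note that the distribution name for scikit-learn is "scikit-learn", not "sklearn".

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the version of each library, or say that it is missing.
for name in ("numpy", "pandas", "scikit-learn"):
    print(name, installed_version(name) or "not installed")
```

Because nothing is imported, this also works for packages that are slow to load.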

Q5: What is a virtual environment, and why is it useful?

Answer:

 A virtual environment isolates Python dependencies for different projects.


 Why use it?
✔ Avoids version conflicts between projects.
✔ Ensures reproducibility.
 How to create one?
python -m venv myenv
source myenv/bin/activate # Linux/Mac
myenv\Scripts\activate # Windows
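
To confirm that activation worked, one option is to ask the interpreter itself. This is a small sketch, assuming an environment created with python -m venv as above; the function name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Inside a virtual environment:", in_virtualenv())
print("Interpreter prefix:", sys.prefix)
```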
2. Introduction to Scikit-Learn
Q1: What is scikit-learn, and what are its key features?

Answer:

 Scikit-learn is a Python ML library for:


✔ Classification (k-NN, SVM, Decision Trees).
✔ Regression (Linear, Ridge, Lasso).
✔ Clustering (K-Means, DBSCAN).
✔ Preprocessing (Scaling, Encoding).
 Features:
✔ Open-source, easy-to-use API.
✔ Built on NumPy, SciPy, and Matplotlib.

Q2: Name some common ML algorithms in scikit-learn.

Answer:

 Supervised Learning:

o LinearRegression, LogisticRegression

o DecisionTreeClassifier, RandomForestClassifier

o SVC (support vector machine), KNeighborsClassifier (k-NN)

 Unsupervised Learning:

o KMeans, DBSCAN

o PCA (Dimensionality Reduction)

Q3: How does scikit-learn differ from TensorFlow/PyTorch?

Answer:

Scikit-Learn                                  TensorFlow/PyTorch

Traditional ML (small to medium datasets).    Deep Learning (neural networks).

Simpler API.                                  More complex, GPU-accelerated.

Runs on CPU; no general GPU acceleration.     Optimized for GPU/TPU.

Q4: What are the main modules in scikit-learn?

Answer:

 sklearn.model_selection (train-test split, cross-validation).

 sklearn.preprocessing (scaling, encoding).

 sklearn.metrics (accuracy, MSE, confusion matrix).

 sklearn.linear_model (LinearRegression, LogisticRegression).

Q5: Can scikit-learn be used for deep learning?

Answer:
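❌ Not in practice. Scikit-learn is for traditional ML; it only offers a basic multi-layer perceptron (MLPClassifier, MLPRegressor) without GPU support. For deep learning, use: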
✔ TensorFlow / Keras
✔ PyTorch

3. Install and Set Up Scikit-Learn


Q1: How to install scikit-learn?

Answer:

pip install scikit-learn

or with conda:

conda install scikit-learn

Q2: What are the dependencies of scikit-learn?


Answer:

 NumPy
 SciPy
 Joblib (parallel execution and model saving)
 threadpoolctl
 Matplotlib (optional, only needed for plotting)

Q3: How to verify scikit-learn installation?

Answer:

import sklearn
print(sklearn.__version__)

Q4: What is Jupyter Notebook, and why use it?

Answer:

 Interactive Python environment for data analysis.


 Why use it?
✔ Combines code, visualizations, and text.
✔ Great for exploratory data analysis (EDA).

Q5: Alternative tools alongside scikit-learn?

Answer:

 XGBoost (Gradient Boosting).


 StatsModels (Statistical modeling).
 LightGBM (Optimized decision trees).

4. Load and Explore CSV/Excel Data with Pandas


Q1: How to read a CSV file in pandas?
Answer:

import pandas as pd
data = pd.read_csv("[Link]")

Q2: Common pandas functions for exploration?

Answer:

 data.head() → First 5 rows.

 data.describe() → Statistics (mean, std, min, max).

 data.info() → Data types and non-null counts (which reveal missing values).

Q3: How to check for missing values?

Answer:

data.isnull().sum()

Q4: Difference between loc and iloc?

Answer:

 loc → Label-based indexing (data.loc[0, 'column']).

 iloc → Integer-based indexing (data.iloc[0, 1]).

Q5: How to filter rows in a DataFrame?

Answer:

filtered_data = data[data['column'] > 50]

5. Data Visualization with Matplotlib/Seaborn


Q1: Difference between Matplotlib and Seaborn?
Answer:

 Matplotlib: Basic, customizable plots.


 Seaborn: Built on Matplotlib, prettier statistical plots.

Q2: How to create a scatter plot?

Answer:

import matplotlib.pyplot as plt

plt.scatter(x=data['X'], y=data['Y'])
plt.show()

Q3: Plots for distributions?

Answer:

 Histogram (plt.hist()).
 Boxplot (plt.boxplot()).

Q4: How to customize plot labels?

Answer:

plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Scatter Plot")

Q5: Bar Chart vs. Pie Chart?

Answer:

 Bar Chart: Compare categories (e.g., sales per month).


 Pie Chart: Show proportions (e.g., market share %).


6. Handling Missing Data, Encoding, and Feature Scaling


Q1: Strategies to handle missing data?

Answer:
✔ Deletion: dropna() (if few missing values).
✔ Imputation:

 Mean/Median (SimpleImputer).
 Mode (for categorical data).
 Advanced: KNN imputation.
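
Mean imputation can be sketched in plain Python to show what the strategy does. This is only an illustration with a hypothetical helper impute_mean and made-up values, where None marks a missing entry; in practice SimpleImputer(strategy="mean") or pandas fillna() would normally be used.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries.
ages = [22.0, None, 30.0, 26.0, None]
print(impute_mean(ages))
```

Swapping mean for median from the same module gives median imputation, which is less affected by outliers.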

Q2: What is one-hot encoding?

Answer:

 Converts categorical variables into binary columns.


 Example:
pd.get_dummies(data['category'])

 When to use? When categories are not ordinal (e.g., colors: Red, Green, Blue).
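
To make the "binary columns" idea concrete, here is a hand-rolled sketch that does not depend on pandas. The helper one_hot is hypothetical and the colour list is made-up sample data; it builds one 0/1 column per category.

```python
def one_hot(values):
    """Map each distinct category to a 0/1 indicator column."""
    categories = sorted(set(values))
    return {
        category: [1 if v == category else 0 for v in values]
        for category in categories
    }


colors = ["Red", "Green", "Blue", "Green"]
encoded = one_hot(colors)
for category, column in encoded.items():
    print(category, column)
```

Each row has exactly one 1 across the columns, so no artificial ordering between Red, Green and Blue is introduced.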

Q3: Why is feature scaling important?

Answer:

 Ensures all features contribute equally to distance-based algorithms (k-NN, SVM).


 Prevents dominance of high-magnitude features.
Q4: StandardScaler vs. MinMaxScaler?

Answer:

StandardScaler                MinMaxScaler

Scales to mean=0, std=1.      Scales to range [0, 1].

Good for Gaussian data.       Good for bounded data.
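
The two transformations can be written out by hand to see the arithmetic. This is a sketch with hypothetical helper names standardize and min_max, not the scikit-learn implementation; it uses the population standard deviation, which is also what StandardScaler uses, and it assumes the column is not constant.

```python
from statistics import mean, pstdev


def standardize(values):
    """z = (x - mean) / std, giving mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population std; a constant column would give 0 here
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """x' = (x - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


incomes = [10.0, 20.0, 30.0]  # made-up sample column
print(standardize(incomes))
print(min_max(incomes))
```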

Q5: How to encode categorical variables?

Answer:

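 Label Encoding (Ordinal):


from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Integer codes follow the sorted order of the labels, not a meaningful rank.
# LabelEncoder is intended for target labels; for ordered input features,
# OrdinalEncoder with an explicit category order is the safer choice.
data['category'] = le.fit_transform(data['category'])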

 One-Hot Encoding (Nominal):


pd.get_dummies(data)

7. k-Nearest Neighbors (k-NN) Classifier


Q1: Basic principle of k-NN?

Answer:

 Classifies a data point based on majority vote of its k nearest neighbors.


 Steps:

1. Compute distances (Euclidean/Manhattan).


2. Find k nearest points.
3. Assign majority class.

Q2: How does k affect performance?


Answer:

 Small k:
✔ High variance (overfitting).
✔ Sensitive to noise.
 Large k:
✔ High bias (underfitting).
✔ Smoother decision boundaries.

Q3: Distance metrics in k-NN?

Answer:

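 Euclidean: sqrt(∑(x_i - y_i)²). The usual default (scikit-learn's default is Minkowski with p=2, which is the same distance).


 Manhattan: Sum of absolute differences (∑|x_i - y_i|).
 Minkowski: Generalization of both, (∑|x_i - y_i|^p)^(1/p); p=1 gives Manhattan, p=2 gives Euclidean.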
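
A short sketch shows how one Minkowski function covers both of the other metrics. The function name minkowski and the sample points are illustrative choices.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a, b = (0, 0), (3, 4)
print("Manhattan (p=1):", minkowski(a, b, 1))
print("Euclidean (p=2):", minkowski(a, b, 2))
```

For these points the Manhattan distance is 3 + 4 = 7 and the Euclidean distance is sqrt(9 + 16) = 5.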

Q4: Train-test split in scikit-learn?

Answer:

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
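
What the split does can be sketched in plain Python: shuffle the row indices, then cut off a fraction for testing. The function simple_train_test_split is a hypothetical stand-in, not scikit-learn's implementation; its seed argument plays the role that random_state plays in train_test_split, making the split reproducible.

```python
import random


def simple_train_test_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed and split them into train and test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))  # stand-in for 10 data rows
train, test = simple_train_test_split(rows, test_size=0.2, seed=42)
print(len(train), "training rows,", len(test), "test rows")
```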

Q5: Evaluation metrics for classification?

Answer:

 Accuracy: (TP + TN) / Total.


 Precision: TP / (TP + FP).
 Recall: TP / (TP + FN).
 F1-Score: Harmonic mean of precision/recall.
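
These formulas can be checked by hand from the four confusion-matrix counts. The sketch below uses a hypothetical helper classification_metrics and made-up counts; scikit-learn provides accuracy_score, precision_score, recall_score and f1_score for real predictions.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        # Harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
    }


# Made-up counts: 8 true positives, 2 false positives,
# 4 false negatives, 6 true negatives.
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=6)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy is 14/20 = 0.7, precision is 8/10 = 0.8, recall is 8/12 ≈ 0.667, and F1 is 16/22 ≈ 0.727.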
8. Linear Regression Model
Q1: Mathematical formulation?

Answer:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

 y: Target variable.

 β₀: Intercept.

 β₁...βₙ: Coefficients.

 ε: Error term.

Q2: How to interpret coefficients?

Answer:

 β₁: For 1-unit increase in x₁, y changes by β₁ (holding other features constant).
 Example: If β₁ = 2.5, y increases by 2.5 for every 1-unit increase in x₁.
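
For a single feature, the intercept and coefficient have a closed form, which makes the interpretation easy to verify. This is a sketch with a hypothetical function fit_simple_linear and made-up data generated from y = 2 + 2.5x; LinearRegression handles the general multi-feature case.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


xs = [1, 2, 3, 4]
ys = [4.5, 7.0, 9.5, 12.0]
b0, b1 = fit_simple_linear(xs, ys)
print("intercept:", b0, "slope:", b1)
```

The recovered slope of 2.5 means each 1-unit increase in x raises the prediction by 2.5.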

Q3: fit() vs. predict()?

Answer:

 fit(): Trains the model on data (model.fit(X_train, y_train)).

 predict(): Generates predictions (y_pred = model.predict(X_test)).

Q4: What is Mean Squared Error (MSE)?

Answer:

 Average squared difference between predicted and actual values.


 Formula: MSE = (1/n) ∑(y_i - ŷ_i)², where ŷ_i is the predicted value.
 In scikit-learn:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
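
The "average squared difference" definition is simple enough to compute by hand. The sketch below uses a hypothetical function mse_by_hand and made-up values.

```python
def mse_by_hand(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [3, 5, 7]
predicted = [2, 5, 9]
# Squared errors are 1, 0 and 4, so the mean is 5/3.
print(mse_by_hand(actual, predicted))
```

Squaring makes large errors count more than small ones, which is why MSE is sensitive to outliers.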

Q5: How to check for overfitting?


Answer:

 Low training error (high training R²) but much higher test error (low test R²) → Overfitting.


 Solutions:
✔ Regularization (Ridge/Lasso).
✔ Cross-validation.

9. Decision Tree Classifier


Q1: How does a decision tree make splits?

Answer:

 Splits based on:

o Gini Impurity: Measures misclassification probability.


o Entropy: Measures disorder (lower = better split).

Q2: What is Gini impurity?

Answer:

 Probability of incorrect classification for a randomly chosen element.


 Formula: Gini = 1 - ∑(p_i)² (where p_i is class proportion).
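
Both split criteria can be computed directly from the class labels in a node. This is an illustrative sketch with hypothetical helper names and toy labels; entropy is shown in bits (log base 2).

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Fraction of the node's samples that belong to each class."""
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]


def gini(labels):
    """Gini = 1 - sum(p_i^2)."""
    return 1 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    """Entropy = sum(-p_i * log2(p_i))."""
    return sum(-p * log2(p) for p in class_proportions(labels))


for node in (["a", "a", "a", "a"], ["a", "a", "a", "b"], ["a", "a", "b", "b"]):
    print(node, "gini:", gini(node), "entropy:", round(entropy(node), 3))
```

A pure node scores 0 on both measures, and an evenly mixed two-class node scores the maximum (Gini 0.5, entropy 1 bit).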

Q3: How to visualize a decision tree?

Answer:

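from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# X is assumed to be the DataFrame the model was trained on.
plot_tree(model, feature_names=list(X.columns), filled=True)
plt.show()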

Q4: Pros and cons of decision trees?

Answer:
Pros                     Cons

Easy to interpret.       Prone to overfitting.

No need for scaling.     Unstable (small changes → different tree).

Q5: What is pruning?

Answer:

 Reduces tree size by removing unnecessary splits.


 Methods:

o Pre-pruning (limit max_depth / min_samples_split).


o Post-pruning (cost-complexity pruning, the ccp_alpha parameter in scikit-learn).

10. K-Means Clustering


Q1: Objective of K-Means?

Answer:

 Partition data into k clusters where each point belongs to the nearest centroid.
 Goal: Minimize within-cluster variance.

Q2: How to choose optimal k?

Answer:

 Elbow Method: Plot Inertia (sum of squared distances) vs. k.


 Silhouette Score: Measures cluster separation.
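
Inertia, the quantity plotted in the elbow method, is just the sum of squared distances from each point to its nearest centroid. The sketch below uses a hypothetical function inertia with made-up points and centroids to show why it drops as k grows.

```python
from math import dist


def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(dist(p, c) for c in centroids) ** 2 for p in points)


points = [(1, 1), (1, 2), (8, 8), (9, 8)]
one_cluster = [(4.75, 4.75)]          # mean of all four points
two_clusters = [(1, 1.5), (8.5, 8)]   # means of the two obvious groups

print("k=1 inertia:", inertia(points, one_cluster))
print("k=2 inertia:", inertia(points, two_clusters))
```

The sharp drop from k=1 to k=2 is the "elbow"; adding further clusters to this data would reduce inertia only slightly.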

Q3: Role of centroid?

Answer:
 Mean position of all points in a cluster.
 Updated iteratively to minimize distances.
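
The iterative update can be sketched as two alternating steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. This is a from-scratch toy with hypothetical function names and fixed starting centroids, not the KMeans implementation.

```python
from math import dist


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]


def update(points, assignments, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == j]
        if not members:
            new_centroids.append(old)  # keep an empty cluster's centroid in place
        else:
            new_centroids.append(
                tuple(sum(coord) / len(members) for coord in zip(*members)))
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Alternate assignment and update for a fixed number of iterations."""
    for _ in range(iterations):
        centroids = update(points, assign(points, centroids), centroids)
    return centroids, assign(points, centroids)


points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids, labels = kmeans(points, [(0, 0), (10, 10)])
print(centroids, labels)
```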

Q4: Initialization sensitivity?

Answer:

 Random centroids can lead to suboptimal clusters.


 Solution: Use k-means++ (smart initialization).

Q5: Limitations of K-Means?

Answer:
❌ Assumes spherical clusters.
❌ Sensitive to outliers.
❌ Requires predefined k.

Summary Table: Key Scikit-Learn Methods

Task                 Function

Train-test split     train_test_split()

Feature scaling      StandardScaler(), MinMaxScaler()

k-NN Classifier      KNeighborsClassifier()

Linear Regression    LinearRegression()

Decision Tree        DecisionTreeClassifier()

K-Means              KMeans()
