1. Install and Set Up Python and Essential Libraries (NumPy, pandas)
Q1: Why is Python a preferred language for data science and machine
learning?
Answer:
Python is preferred because:
✔ Easy to Learn: Simple syntax, readable code.
✔ Rich Ecosystem: Libraries like NumPy, pandas, scikit-learn, TensorFlow.
✔ Community Support: Large open-source community.
✔ Versatility: Used in web development, automation, and AI.
✔ Integration: Works well with databases, big data tools (Spark), and visualization
libraries (Matplotlib, Seaborn).
Q2: What is the purpose of pip in Python, and how do you install a library
using it?
Answer:
pip is Python’s package installer.
To install a library:
pip install numpy pandas
To upgrade:
pip install --upgrade pandas
Q3: Explain the difference between NumPy and pandas.
Answer:
| NumPy | pandas |
| --- | --- |
| Used for numerical computations (arrays, matrices). | Used for structured data (DataFrames, Series). |
| Optimized for math operations (linear algebra). | Optimized for data manipulation (filtering, grouping). |
| ndarray is the core data structure. | DataFrame and Series are the core structures. |
Q4: How do you check the installed version of a Python library like
pandas?
Answer:
import pandas as pd
print(pd.__version__)
or via terminal:
pip show pandas
Q5: What is a virtual environment, and why is it useful?
Answer:
A virtual environment isolates Python dependencies for different projects.
Why use it?
✔ Avoids version conflicts between projects.
✔ Ensures reproducibility.
How to create one?
python -m venv myenv
source myenv/bin/activate # Linux/Mac
myenv\Scripts\activate # Windows
2. Introduction to Scikit-Learn
Q1: What is scikit-learn, and what are its key features?
Answer:
Scikit-learn is a Python ML library for:
✔ Classification (k-NN, SVM, Decision Trees).
✔ Regression (Linear, Ridge, Lasso).
✔ Clustering (K-Means, DBSCAN).
✔ Preprocessing (Scaling, Encoding).
Features:
✔ Open-source, easy-to-use API.
✔ Built on NumPy, SciPy, and Matplotlib.
Q2: Name some common ML algorithms in scikit-learn.
Answer:
Supervised Learning:
o LinearRegression, LogisticRegression
o DecisionTreeClassifier, RandomForestClassifier
o SVC (SVM), KNeighborsClassifier (k-NN)
Unsupervised Learning:
o KMeans, DBSCAN
o PCA (Dimensionality Reduction)
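All of these share scikit-learn's uniform fit/predict interface. A minimal sketch, using the bundled Iris dataset purely for illustration:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: fit on features and labels, then predict.
clf = LogisticRegression(max_iter=200)
clf.fit(X, y)
print(clf.predict(X[:3]))

# Unsupervised: fit on features only; cluster labels come from the model.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)
print(km.labels_[:3])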
Q3: How does scikit-learn differ from TensorFlow/PyTorch?
Answer:
| Scikit-Learn | TensorFlow/PyTorch |
| --- | --- |
| Traditional ML (small datasets). | Deep learning (neural networks). |
| Simpler API. | Complex, GPU-accelerated. |
| No GPU support. | Optimized for GPU/TPU. |
Q4: What are the main modules in scikit-learn?
Answer:
sklearn.model_selection (train-test split, cross-validation).
sklearn.preprocessing (scaling, encoding).
sklearn.metrics (accuracy, MSE, confusion matrix).
sklearn.linear_model (LinearRegression, LogisticRegression).
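A short sketch of these modules working together (the Iris dataset is used here just as a stand-in):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit scaling statistics on training data only
X_test = scaler.transform(X_test)        # reuse those statistics on test data

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))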
Q5: Can scikit-learn be used for deep learning?
Answer:
❌ No, scikit-learn is for traditional ML. For deep learning, use:
✔ TensorFlow / Keras
✔ PyTorch
3. Install and Set Up Scikit-Learn
Q1: How to install scikit-learn?
Answer:
pip install scikit-learn
or with conda:
conda install scikit-learn
Q2: What are the dependencies of scikit-learn?
Answer:
NumPy
SciPy
Joblib (for model persistence)
Matplotlib (optional, for plotting utilities)
Q3: How to verify scikit-learn installation?
Answer:
import sklearn
print(sklearn.__version__)
Q4: What is Jupyter Notebook, and why use it?
Answer:
Interactive Python environment for data analysis.
Why use it?
✔ Combines code, visualizations, and text.
✔ Great for exploratory data analysis (EDA).
Q5: Alternative tools alongside scikit-learn?
Answer:
XGBoost (gradient boosting).
statsmodels (statistical modeling).
LightGBM (fast, lightweight gradient boosting).
4. Load and Explore CSV/Excel Data with Pandas
Q1: How to read a CSV file in pandas?
Answer:
import pandas as pd
data = pd.read_csv("data.csv")
Q2: Common pandas functions for exploration?
Answer:
data.head() → First 5 rows.
data.describe() → Statistics (mean, std, min, max).
data.info() → Data types and missing values.
Q3: How to check for missing values?
Answer:
data.isnull().sum()
Q4: Difference between loc and iloc?
Answer:
loc → Label-based indexing (data.loc[0, 'column']).
iloc → Integer-position indexing (data.iloc[0, 1]).
Q5: How to filter rows in a DataFrame?
Answer:
filtered_data = data[data['column'] > 50]
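A minimal sketch tying this section together; the column names and values below are invented for illustration:
import pandas as pd

data = pd.DataFrame({"age": [25, 32, None, 45], "salary": [40, 55, 61, 80]})

print(data.head())                  # first rows
data.info()                         # dtypes and non-null counts (prints directly)
print(data.isnull().sum())          # missing values per column
print(data[data["salary"] > 50])    # boolean filtering
print(data.loc[0, "age"], data.iloc[0, 1])  # label vs. integer-position indexing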
5. Data Visualization with Matplotlib/Seaborn
Q1: Difference between Matplotlib and Seaborn?
Answer:
Matplotlib: Basic, customizable plots.
Seaborn: Built on Matplotlib, prettier statistical plots.
Q2: How to create a scatter plot?
Answer:
import matplotlib.pyplot as plt
plt.scatter(x=data['X'], y=data['Y'])
plt.show()
Q3: Plots for distributions?
Answer:
Histogram (plt.hist() or sns.histplot()).
Boxplot (plt.boxplot() or sns.boxplot()).
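A small sketch of both plot types; the normally distributed values here are synthetic, generated just for illustration:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

y = np.random.normal(loc=50, scale=10, size=200)  # synthetic data

plt.hist(y, bins=20)   # histogram: shape of the distribution
plt.show()

sns.boxplot(y=y)       # boxplot: median, quartiles, outliers
plt.show()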
Q4: How to customize plot labels?
Answer:
[Link]("X-axis")
[Link]("Y-axis")
[Link]("Scatter Plot")
Q5: Bar Chart vs. Pie Chart?
Answer:
Bar Chart: Compare categories (e.g., sales per month).
Pie Chart: Show proportions (e.g., market share %).
6. Handling Missing Data, Encoding, and Feature Scaling
Q1: Strategies to handle missing data?
Answer:
✔ Deletion: dropna() (when only a few values are missing).
✔ Imputation (sketched below):
o Mean/Median (SimpleImputer).
o Mode (for categorical data).
o Advanced: KNN imputation (KNNImputer).
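A minimal imputation sketch; the tiny DataFrame and its values are made up for illustration:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["Pune", "Pune", np.nan]})

num = SimpleImputer(strategy="median")          # numeric column: fill with median
df[["age"]] = num.fit_transform(df[["age"]])

cat = SimpleImputer(strategy="most_frequent")   # categorical column: fill with mode
df[["city"]] = cat.fit_transform(df[["city"]])

print(df)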
Q2: What is one-hot encoding?
Answer:
Converts categorical variables into binary columns.
Example:
pd.get_dummies(data['category'])
When to use? When categories are not ordinal (e.g., colors: Red, Green, Blue).
Q3: Why is feature scaling important?
Answer:
Ensures all features contribute equally to distance-based algorithms (k-NN, SVM).
Prevents dominance of high-magnitude features.
Q4: StandardScaler vs. MinMaxScaler?
Answer:
| StandardScaler | MinMaxScaler |
| --- | --- |
| Scales to mean = 0, std = 1. | Scales to range [0, 1]. |
| Good for Gaussian data. | Good for bounded data. |
Q5: How to encode categorical variables?
Answer:
Label Encoding (Ordinal):
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['category'] = le.fit_transform(data['category'])
One-Hot Encoding (Nominal):
pd.get_dummies(data)
7. k-Nearest Neighbors (k-NN) Classifier
Q1: Basic principle of k-NN?
Answer:
Classifies a data point based on majority vote of its k nearest neighbors.
Steps:
1. Compute distances (Euclidean/Manhattan).
2. Find k nearest points.
3. Assign majority class.
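A minimal k-NN sketch on the Iris dataset; k = 5 is an arbitrary choice here:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 nearest neighbors
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))           # mean accuracy on the test set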
Q2: How does k affect performance?
Answer:
Small k:
✔ High variance (overfitting).
✔ Sensitive to noise.
Large k:
✔ High bias (underfitting).
✔ Smoother decision boundaries.
Q3: Distance metrics in k-NN?
Answer:
Euclidean: Default (sqrt(∑(x_i - y_i)²)).
Manhattan: Sum of absolute differences (∑|x_i - y_i|).
Minkowski: Generalization of both.
Q4: Train-test split in scikit-learn?
Answer:
from sklearn.model_selection import train_test_split
# 80/20 split; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Q5: Evaluation metrics for classification?
Answer:
Accuracy: (TP + TN) / Total.
Precision: TP / (TP + FP).
Recall: TP / (TP + FN).
F1-Score: Harmonic mean of precision/recall.
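A short sketch computing these metrics; y_true and y_pred below are made-up labels:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(accuracy_score(y_true, y_pred))   # (TP + TN) / total
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall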
8. Linear Regression Model
Q1: Mathematical formulation?
Answer:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙ xₙ + ε
y: Target variable.
β₀: Intercept.
β₁...βₙ : Coefficients.
ε: Error term.
Q2: How to interpret coefficients?
Answer:
β₁: For 1-unit increase in x₁, y changes by β₁ (holding other features constant).
Example: If β₁ = 2.5, y increases by 2.5 for every 1-unit increase in x₁.
Q3: fit() vs. predict()?
Answer:
fit(): Trains the model on data (model.fit(X_train, y_train)).
predict(): Generates predictions (y_pred = model.predict(X_test)).
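A minimal end-to-end sketch; the tiny arrays below are synthetic (roughly y = 2x):
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1], [2], [3], [4]])
y_train = np.array([2.1, 4.0, 6.2, 7.9])  # approximately y = 2x

model = LinearRegression()
model.fit(X_train, y_train)               # learn the intercept and coefficient
print(model.intercept_, model.coef_)      # fitted beta_0 and beta_1
print(model.predict(np.array([[5]])))     # prediction for x = 5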
Q4: What is Mean Squared Error (MSE)?
Answer:
Average squared difference between predicted and actual values.
Formula: MSE = (1/n) ∑(yᵢ − ŷᵢ)²
In scikit-learn:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
Q5: How to check for overfitting?
Answer:
High training accuracy, low test accuracy → Overfitting.
Solutions:
✔ Regularization (Ridge/Lasso).
✔ Cross-validation.
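A sketch of both remedies on synthetic data: Ridge adds an L2 penalty, and cross_val_score averages performance over folds.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 2.0]) + rng.normal(scale=0.1, size=100)

model = Ridge(alpha=1.0)                     # regularization strength
scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation (R² by default)
print(scores.mean())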
9. Decision Tree Classifier
Q1: How does a decision tree make splits?
Answer:
Splits based on:
o Gini Impurity: Measures misclassification probability.
o Entropy: Measures disorder; the best split maximizes information gain (entropy reduction).
Q2: What is Gini impurity?
Answer:
Probability of incorrect classification for a randomly chosen element.
Formula: Gini = 1 - ∑(p_i)² (where p_i is class proportion).
Q3: How to visualize a decision tree?
Answer:
from sklearn.tree import plot_tree
plot_tree(model, feature_names=X.columns, filled=True)  # X.columns assumes X is a DataFrame
Q4: Pros and cons of decision trees?
Answer:
| Pros | Cons |
| --- | --- |
| Easy to interpret. | Prone to overfitting. |
| No need for scaling. | Unstable (small changes → different tree). |
Q5: What is pruning?
Answer:
Reduces tree size by removing unnecessary splits.
Methods:
o Pre-pruning: limit growth up front (max_depth, min_samples_split).
o Post-pruning: cost-complexity pruning (ccp_alpha), as sketched below.
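A sketch of both pruning styles on the Iris dataset; the hyperparameter values are arbitrary:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growth early via depth / split-size limits.
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=10).fit(X, y)

# Post-pruning: grow the tree, then prune via cost-complexity alpha.
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

print(pre.get_depth(), post.get_depth())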
10. K-Means Clustering
Q1: Objective of K-Means?
Answer:
Partition data into k clusters where each point belongs to the nearest centroid.
Goal: Minimize within-cluster variance.
Q2: How to choose optimal k?
Answer:
Elbow Method: Plot Inertia (sum of squared distances) vs. k.
Silhouette Score: Measures cluster separation.
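A sketch of both methods on synthetic blob data; the range of k values is arbitrary:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Elbow: inertia drops sharply up to the "right" k, then flattens.
    # Silhouette: higher values mean better-separated clusters.
    print(k, km.inertia_, silhouette_score(X, km.labels_))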
Q3: Role of centroid?
Answer:
Mean position of all points in a cluster.
Updated iteratively to minimize distances.
Q4: Initialization sensitivity?
Answer:
Random centroids can lead to suboptimal clusters.
Solution: Use k-means++ (smarter initialization; the default in scikit-learn's KMeans).
Q5: Limitations of K-Means?
Answer:
❌ Assumes spherical clusters.
❌ Sensitive to outliers.
❌ Requires predefined k.
Summary Table: Key Scikit-Learn Methods
| Task | Function |
| --- | --- |
| Train-test split | train_test_split() |
| Feature scaling | StandardScaler(), MinMaxScaler() |
| k-NN classifier | KNeighborsClassifier() |
| Linear regression | LinearRegression() |
| Decision tree | DecisionTreeClassifier() |
| K-Means | KMeans() |