3. Explain the role of entropy as an information measure in probability theory.

Entropy quantifies the uncertainty in a random variable. Its formula is:

H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)

Low Entropy: Less uncertainty (e.g., a heavily biased coin has lower entropy than a fair coin). Entropy helps measure unpredictability in information systems, aiding in decision-making and efficient data encoding.
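A minimal Python sketch of this calculation, assuming the distribution is supplied as a plain list of probabilities:

import math

def entropy(probs, base=2):
    # Shannon entropy H(X) = -sum p(x) log p(x); zero-probability terms are skipped
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit (maximum uncertainty)
print(entropy([0.9, 0.1]))   # biased coin: ~0.469 bits (lower entropy)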

14. Discuss the concept of feature engineering and its significance in machine learning models.

Feature engineering is the process of selecting, modifying, or creating new features (variables or attributes) from raw data to improve the performance of machine learning models.

Key Concepts in Feature Engineering:

1. Feature Creation: Creating new features based on existing ones. This could involve mathematical transformations, combining multiple features, or extracting meaningful information.

2. Feature Selection: Identifying and retaining only the most relevant features for the model. Irrelevant or redundant features can increase complexity, cause overfitting, and reduce the model's generalizability.

3. Feature Transformation: Changing the scale, distribution, or representation of the data.

4. Handling Missing Data: Dealing with missing values using imputation techniques, such as replacing missing values with the mean, median, or mode, or using more advanced techniques like KNN or regression imputation.

5. Dealing with Categorical Data: Categorical features need to be converted into numerical representations for most machine learning models.

Significance of Feature Engineering: 1. Improved Model Accuracy 2. Dimensionality Reduction 3. Better Generalization 4. Data Representation 5. Domain Knowledge Integration
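As an illustration, a small pandas sketch (hypothetical toy data) of two of the steps above, mean imputation and one-hot encoding:

import pandas as pd

# Hypothetical raw data with a missing value and a categorical column
df = pd.DataFrame({
    "area": [1200, 1500, None, 900],
    "city": ["delhi", "mumbai", "delhi", "pune"],
})

# 4. Handling Missing Data: mean imputation
df["area"] = df["area"].fillna(df["area"].mean())

# 5. Dealing with Categorical Data: one-hot encoding
df = pd.get_dummies(df, columns=["city"])

print(df)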

20. Discuss Non-Parametric Testing. Provide examples where it is preferable to parametric methods.

Definition: Non-parametric tests do not assume specific distributions for the data.

Examples:

1. Mann-Whitney U Test: Compares medians of two groups when normality cannot be assumed. Example: Comparing customer satisfaction ratings (ordinal data) between two branches.

2. Kruskal-Wallis Test: Non-parametric alternative to ANOVA for comparing more than two groups. Example: Analyzing exam scores across schools with non-normal data.

Advantages: Robust to outliers. Applicable to small sample sizes.

Preferable When: Data is ordinal, non-normal, or has unequal variances.
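A minimal scipy.stats sketch of both tests (sample values invented):

from scipy import stats

# Hypothetical satisfaction ratings (ordinal) from two branches
branch_a = [4, 5, 3, 4, 5, 2, 4]
branch_b = [2, 3, 2, 4, 1, 3, 2]
u_stat, p = stats.mannwhitneyu(branch_a, branch_b, alternative="two-sided")
print(f"Mann-Whitney U: U={u_stat}, p={p:.4f}")

# Hypothetical exam scores from three schools (non-normal)
school1, school2, school3 = [55, 60, 58], [70, 72, 68], [65, 90, 40]
h_stat, p = stats.kruskal(school1, school2, school3)
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p:.4f}")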

5. Explain the differences between discrete and continuous random variables. Provide examples of each.

Discrete Random Variable

1. Definition: A random variable that can take on a countable number of distinct values.

2. Values: Usually integers or specific values (e.g., 0, 1, 2, 3...).

3. Examples: Number of students in a class. Rolling a die (values: 1, 2, 3, 4, 5, 6).

4. Probability Distribution: Represented using a probability mass function (PMF), which assigns probabilities to individual outcomes.

Continuous Random Variable

1. Definition: A random variable that can take on any value within a given range (uncountable).

2. Values: Real numbers, often measured rather than counted (e.g., height, weight, time).

3. Examples: The height of people in a group. The time it takes to complete a task (e.g., 2.1, 2.11, 2.111 seconds).

4. Probability Distribution: Represented using a probability density function (PDF). Probabilities are calculated for intervals, not specific values (e.g., P(a \le X \le b)).
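A short scipy.stats sketch contrasting a PMF with a PDF (the die and a Normal(170, 10) height model are illustrative choices):

from scipy import stats

# Discrete: a fair six-sided die; the PMF assigns probability to each outcome
die = stats.randint(low=1, high=7)           # integers 1..6
print(die.pmf(3))                            # P(X = 3) ~ 0.1667

# Continuous: heights modeled as Normal(170, 10); single points carry no probability
height = stats.norm(loc=170, scale=10)
print(height.pdf(170))                       # a density value, not a probability
print(height.cdf(180) - height.cdf(160))     # P(160 <= X <= 180) ~ 0.6827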

15. Explain regression methods (Linear and Logistic) and their applications in predictive modeling.

Linear Regression:

Definition: A statistical method for predicting a continuous dependent variable based on one or more independent variables.

Equation: Y = \beta_0 + \beta_1 X + \epsilon

X: Independent variable (input). \beta_0, \beta_1: Coefficients. \epsilon: Error term.

Application: Predicting house prices based on features like area, location, and number of bedrooms.

Logistic Regression:

Definition: A classification algorithm used to predict probabilities of binary outcomes.

Equation: P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}

Application: Predicting whether a patient has a disease (Yes/No) based on medical parameters.

Comparison: Linear regression is used for continuous outputs. Logistic regression is used for classification tasks.
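A minimal scikit-learn sketch of both methods (toy arrays invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: area (sq ft) -> price, a continuous target
area = np.array([[800], [1000], [1200], [1500]])
price = np.array([60, 75, 90, 112])
lin = LinearRegression().fit(area, price)
print(lin.predict([[1100]]))        # predicted price for a 1100 sq ft house

# Logistic regression: a single medical parameter -> disease yes/no
marker = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
disease = np.array([0, 0, 0, 1, 1, 1])
log = LogisticRegression().fit(marker, disease)
print(log.predict_proba([[3.5]]))   # [P(no disease), P(disease)]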


16. What is the Curse of Dimensionality? How does it affect data models?

Definition: As the number of features (dimensions) increases, the volume of data required to build a reliable model grows exponentially.

Effects: Sparse Data: Data becomes sparse in high-dimensional spaces. Overfitting: Models may memorize noise instead of learning patterns. Increased Complexity: Computation and storage requirements grow.

Example: In a 2D space, 100 data points may suffice to form clusters. In a 100D space, exponentially more points are needed for similar results.

Solution: Dimensionality reduction techniques like PCA or feature selection to retain only relevant features.
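A small numpy sketch of one symptom, distance concentration: as dimensionality grows, the nearest and farthest of 100 random points end up almost equally far from a query point (dimensions chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 100, 1000):
    points = rng.random((100, d))                    # 100 random points in [0,1]^d
    dists = np.linalg.norm(points - points[0], axis=1)[1:]
    print(f"d={d:5d}  max/min distance ratio = {dists.max() / dists.min():.2f}")
# The ratio shrinks toward 1 as d grows: "near" and "far" lose their meaning.
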
8. Define covariance and eigenfeatures. Explain their role in data science.

Covariance: Measures the relationship between two variables:

Cov(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}

Eigenfeatures: Principal components derived from covariance matrices, used in dimensionality reduction techniques like PCA.

Role: Covariance identifies relationships, and eigenfeatures simplify complex datasets while preserving variability.
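A numpy sketch (synthetic correlated data) showing how eigenfeatures fall out of the covariance matrix, as in PCA:

import numpy as np

rng = np.random.default_rng(1)
# Two correlated variables, 200 observations (synthetic data)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)
data = np.column_stack([x, y])

cov = np.cov(data, rowvar=False)               # 2x2 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenfeatures of the covariance
print("covariance matrix:\n", cov)
print("principal direction:", eigvecs[:, -1])  # eigenvector with largest eigenvalue
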
9. What is the Maximum Likelihood Estimation (MLE) principle, and how is it used in Bayesian classification?

MLE estimates parameters by maximizing the likelihood of observed data. Example: For a Gaussian distribution, MLE estimates the mean (\mu) and variance (\sigma^2):

L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}

In Bayesian Classification: MLE estimates the parameters of each class-conditional distribution, which are then used to compute class probabilities for optimal classification.
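For the Gaussian case this likelihood is maximized in closed form by the sample mean and the biased sample variance; a short numpy sketch with made-up observations:

import numpy as np

data = np.array([4.8, 5.1, 5.3, 4.9, 5.6, 5.0])   # hypothetical observations

mu_hat = data.mean()                      # MLE of the mean
var_hat = ((data - mu_hat) ** 2).mean()   # MLE of the variance (divides by n, not n-1)
print(f"mu_hat = {mu_hat:.3f}, sigma^2_hat = {var_hat:.3f}")
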
10. Explain the Central Limit Theorem and its implications in statistical inference.

The CLT states that, for a sufficiently large sample size, the sample mean's distribution approaches a normal distribution, regardless of the population's distribution.

Implications: Enables use of the normal distribution for confidence intervals and hypothesis testing. Simplifies analysis when population data is complex.
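A quick simulation sketch of the theorem: means of repeated samples from a skewed exponential population cluster into an approximately normal shape (parameters arbitrary):

import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)  # heavily skewed population

sample_means = [rng.choice(population, size=50).mean() for _ in range(5_000)]
print(f"population mean = {population.mean():.3f}")
print(f"mean of sample means = {np.mean(sample_means):.3f}")  # close to 2.0
# A histogram of sample_means would show an approximately normal bell curve.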

17. Discuss ANOVA and Chi-Square tests. When are they used in data analysis?

ANOVA (Analysis of Variance): Purpose: Compares means of three or more groups to determine if at least one group mean differs significantly. Assumptions: Data follows a normal distribution. Variances are equal across groups. Example: Comparing test scores of students across three different teaching methods.

Chi-Square Test: Purpose: Tests the association between categorical variables. Formula:

\chi^2 = \sum \frac{(O - E)^2}{E}

O: Observed frequency. E: Expected frequency.

Example: Analyzing the relationship between gender and choice of a product.

Difference: ANOVA deals with continuous data (means). Chi-square focuses on categorical data (frequencies).
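A scipy.stats sketch of both tests on invented data:

from scipy import stats

# ANOVA: test scores under three teaching methods (hypothetical)
method1 = [78, 85, 80, 90, 76]
method2 = [88, 92, 85, 91, 89]
method3 = [70, 75, 72, 68, 74]
f_stat, p = stats.f_oneway(method1, method2, method3)
print(f"ANOVA: F={f_stat:.2f}, p={p:.4f}")

# Chi-square: gender vs product choice contingency table (hypothetical counts)
table = [[30, 20],    # rows = gender, columns = product A / product B
         [15, 35]]
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"Chi-square: chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
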
11. Define hypothesis testing. Explain the Neyman-Pearson framework with an example.

Hypothesis Testing: A method to decide whether to reject a null hypothesis (H_0) based on evidence.

Neyman-Pearson Framework: Balances Type I (\alpha) and Type II (\beta) errors for optimal decision-making.

Example: Testing a new drug's effectiveness:

H_0: No improvement.

H_1: Significant improvement.

The decision depends on statistical thresholds for the two error types.
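As an illustration, a minimal scipy sketch of the drug example, framed here as a two-sample t-test with alpha fixed in advance in the Neyman-Pearson spirit (all scores invented):

from scipy import stats

alpha = 0.05                                 # Type I error rate fixed in advance
placebo = [5.1, 4.8, 5.0, 5.2, 4.9, 5.3]     # symptom scores, hypothetical
drug    = [4.2, 4.0, 4.5, 3.9, 4.3, 4.1]

t_stat, p = stats.ttest_ind(drug, placebo)
print(f"t={t_stat:.2f}, p={p:.4f}")
print("Reject H0" if p < alpha else "Fail to reject H0")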

18. Explain the concept of bias and variance in data models. How do they impact model performance?

Bias: Definition: Error introduced by approximating a real-world problem using a simplified model. Effect: High bias leads to underfitting. Example: Assuming a linear relationship when data follows a polynomial trend.

Variance: Definition: Sensitivity of a model to small changes in the training dataset. Effect: High variance leads to overfitting. Example: A decision tree capturing noise instead of general patterns.

Bias-Variance Trade-off: A good model balances bias and variance to minimize total error.
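A numpy sketch of the trade-off, fitting polynomials of increasing degree to noisy quadratic data (all values synthetic):

import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(-3, 3, 30)
y = x**2 + rng.normal(scale=1.0, size=x.size)   # quadratic trend + noise

x_test = np.linspace(-3, 3, 300)
y_true = x_test**2
for degree in (1, 2, 10):
    coeffs = np.polyfit(x, y, degree)
    err = np.mean((np.polyval(coeffs, x_test) - y_true) ** 2)
    print(f"degree {degree:2d}: test error = {err:.2f}")
# Degree 1 underfits (high bias); degree 10 tends to overfit (high variance);
# degree 2 usually balances the two and gives the lowest error.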


12. What are Z-scores, and how are they used in detecting outliers? Illustrate with an example.

Z-scores measure how far a data point is from the mean in terms of standard deviations:

Z = \frac{X - \mu}{\sigma}

Outliers: Points with |Z| > 3 are considered outliers.

Example: Data: [10, 12, 15, 50]. Mean = 21.75, \sigma \approx 16.41. Z for 50 \approx 1.72 (not an outlier).
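Reproducing the worked example in numpy (the population standard deviation, ddof = 0, is assumed):

import numpy as np

data = np.array([10, 12, 15, 50])
mu, sigma = data.mean(), data.std()   # population std (ddof=0): ~16.41
z = (data - mu) / sigma
print(z)                              # z for 50 is ~1.72
print(data[np.abs(z) > 3])            # empty: no point exceeds |Z| > 3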

19. What is Nearest Neighbor (K-NN) classification, and how does it work?

Definition: A non-parametric method that classifies a data point based on the majority class of its k nearest neighbors.

Algorithm:

1. Choose the number of neighbors (k).

2. Compute distances between the query point and all points in the training set.

3. Select the k closest neighbors.

4. Assign the most frequent class among these neighbors to the query point.

Example: Dataset: [(1,1,'A'), (2,2,'A'), (5,5,'B')]. Query Point: (1.5, 1.5). Nearest Neighbor: (1,1). Predicted Class: 'A'.

Application: Used in recommendation systems and pattern recognition.
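A self-contained sketch of the algorithm on the dataset from the example (k = 1, matching the single nearest neighbor used above):

from collections import Counter
import math

def knn_predict(train, query, k=1):
    # Sort training points by distance to the query, then majority-vote on labels
    nearest = sorted(train, key=lambda p: math.dist(p[:2], query))[:k]
    votes = Counter(label for *_, label in nearest)
    return votes.most_common(1)[0][0]

train = [(1, 1, "A"), (2, 2, "A"), (5, 5, "B")]
print(knn_predict(train, (1.5, 1.5), k=1))   # 'A' -- nearest point is (1, 1)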

13. Explain the difference between simple linear regression and logistic regression.

Simple Linear Regression: Predicts continuous outcomes:

Y = aX + b

Logistic Regression: Predicts probabilities for binary outcomes:

P = \frac{1}{1+e^{-(aX+b)}}

Example: Linear regression predicts sales; logistic regression classifies email as spam or not.
