Data Science Related Interview Questions
DATA SCIENCE:
Q1. What is Data Science? List the differences between supervised
and unsupervised learning.
How is this different from what statisticians have been doing for years?
The answer lies in the difference between explaining and predicting.
Selection bias is a kind of error that occurs when the researcher decides who is
going to be studied. It is usually associated with research where the selection
of participants isn't random. It is sometimes referred to as the selection effect.
It is the distortion of statistical analysis, resulting from the method of
collecting samples. If the selection bias is not taken into account, then some
conclusions of the study may not be accurate.
Low bias machine learning algorithms — Decision Trees, k-NN and SVM
Normally, as you increase the complexity of your model, you will see a
reduction in error due to lower bias in the model. However, this only happens
until a particular point. As you continue to make your model more complex,
you end up over-fitting your model and hence your model will start suffering
from high variance.
1. The k-nearest neighbour algorithm has low bias and high variance, but
the trade-off can be changed by increasing the value of k which
increases the number of neighbours that contribute to the prediction
and in turn increases the bias of the model.
2. The support vector machine algorithm has low bias and high variance, but
the trade-off can be changed by tuning the C parameter, which controls how
many violations of the margin are allowed in the training data: allowing more
violations increases the bias but decreases the variance (see the sketch after this list).
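As a minimal, hedged sketch (synthetic data, not part of the original answer), the effect of these two knobs can be seen with scikit-learn:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Larger k averages over more neighbours: higher bias, lower variance.
for k in (1, 5, 25):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k-NN with k={k}: CV accuracy = {score:.3f}")

# In scikit-learn, a smaller C permits more margin violations (stronger regularization):
# higher bias, lower variance.
for C in (0.01, 1.0, 100.0):
    score = cross_val_score(SVC(kernel="rbf", C=C), X, y, cv=5).mean()
    print(f"SVM with C={C}: CV accuracy = {score:.3f}")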
Confusion Matrix
A data set used for performance evaluation is called a test data set. It should
contain the correct labels as well as the predicted labels.
The predicted labels will be exactly the same as the observed labels if the
performance of a binary classifier is perfect.
In real-world scenarios, the predicted labels usually match only part of the
observed labels.
A binary classifier predicts every data instance of a test data set as either
positive or negative. This produces four outcomes: true positive, false positive,
true negative and false negative.
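A minimal sketch (made-up labels, not from the original) of obtaining these four outcomes with scikit-learn:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # observed (correct) labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # labels predicted by the classifier

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")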
STATISTICS:
Q5. What is the difference between “long” and “wide” format data?
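In short, wide-format data keeps each measured variable in its own column, while long-format data stacks the measurements into rows of (identifier, variable, value). A minimal pandas sketch with hypothetical data:

import pandas as pd

wide = pd.DataFrame({
    "student": ["A", "B"],
    "math": [90, 75],
    "science": [85, 80],
})

# Wide -> long: one row per (student, subject) pair
long_df = wide.melt(id_vars="student", var_name="subject", value_name="score")

# Long -> wide: subjects become columns again
back_to_wide = long_df.pivot(index="student", columns="subject", values="score").reset_index()
print(long_df)
print(back_to_wide)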
Data is usually distributed in different ways, with a bias to the left or to the
right, or it can all be jumbled up.
However, there are chances that data is distributed around a central value
without any bias to the left or right, forming a normal distribution in the
shape of a bell curve.
Covariance: In covariance, two items vary together; it is a measure that
indicates the extent to which two random variables change in tandem. It is a
statistical term; it explains the systematic relation between a pair of random
variables, wherein a change in one variable is reciprocated by a corresponding
change in the other variable.
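A minimal NumPy sketch with hypothetical numbers:

import numpy as np

x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.0, 14.0])

cov_xy = np.cov(x, y)[0, 1]  # off-diagonal entry of the 2x2 covariance matrix
print(cov_xy)  # positive, since the two variables tend to increase together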
An example of this could be identifying the click-through rate for a banner ad.
When you perform a hypothesis test in statistics, a p-value can help you
determine the strength of your results. p-value is a number between 0 and 1.
Based on the value it will denote the strength of the results. The claim which is
on trial is called the Null Hypothesis.
A low p-value (≤ 0.05) indicates evidence against the null hypothesis, which
means we can reject the null hypothesis. A high p-value (> 0.05) indicates
weak evidence against the null hypothesis, which means we fail to reject it.
A p-value right around 0.05 is marginal and could go either way. To put it
another way:
High p-values: your data are likely with a true null. Low p-values: your data
are unlikely with a true null.
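A minimal sketch (simulated data) of how a p-value is obtained in practice, here with a one-sample t-test in SciPy:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=50)

# Null hypothesis: the population mean is 0
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
# p-value <= 0.05 -> reject the null hypothesis; p-value > 0.05 -> fail to reject it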
Assuming the probability of seeing a shooting star in any 15-minute interval is 0.2,
the probability of not seeing one in a given 15-minute interval is 0.8, so the
probability of not seeing any shooting star in the period of one hour
= (0.8) ^ 4 = 0.4096
Any die has six sides, numbered 1-6. There is no way to get seven equal
outcomes from a single roll of a die. If we roll the die twice and
consider the event of two rolls, we now have 36 different outcomes.
Q13. A certain couple tells you that they have two children, at least
one of which is a girl. What is the probability that they have two
girls?
where B = Boy and G = Girl and the first letter denotes the first child.
From the question, we can exclude the first case of BB. Thus, from the
remaining 3 possibilities of BG, GB & GG, we have to find the probability of
the case with two girls. Only one of these three equally likely cases is GG, so the
probability is 1/3.
Q14. A jar has 1000 coins, of which 999 are fair and 1 is double
headed. Pick a coin at random, and toss it 10 times. Given that you
see 10 heads, what is the probability that the next toss of that coin
is also a head?
There are two ways of choosing the coin. One is to pick a fair coin and the
other is to pick the one with two heads.
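The Bayes calculation, sketched in Python with the numbers given in the question:

p_double = 1 / 1000          # prior probability of picking the double-headed coin
p_fair = 999 / 1000          # prior probability of picking a fair coin

p_10_heads_double = 1.0      # the double-headed coin always shows heads
p_10_heads_fair = 0.5 ** 10  # a fair coin shows 10 heads with probability 1/1024

# Posterior probability that the chosen coin is double-headed, given 10 heads
post_double = (p_double * p_10_heads_double) / (
    p_double * p_10_heads_double + p_fair * p_10_heads_fair
)
# Probability that the next toss is also a head
p_next_head = post_double * 1.0 + (1 - post_double) * 0.5
print(round(post_double, 4), round(p_next_head, 4))  # ~0.5062 and ~0.7531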
Sensitivity is nothing but "Predicted true events / Total actual true events". True events
here are the events which were true and which the model also predicted as true.
In statistics and machine learning, one of the most common tasks is to fit
a model to a set of training data, so as to be able to make reliable predictions
on general, unseen data.
To combat overfitting and underfitting, you can resample the data to estimate
the model accuracy (k-fold cross-validation) and hold out a validation
dataset to evaluate the model.
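A minimal sketch (synthetic data) of both ideas with scikit-learn:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)  # 5-fold cross-validation
model.fit(X_train, y_train)
print("CV accuracy:", cv_scores.mean(), "| held-out accuracy:", model.score(X_val, y_val))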
For example, if you are researching whether a lack of exercise leads to weight
gain, a confounding variable would be any other variable that affects both of
these variables, such as the age of the subject.
Q22. What Are the Types of Biases That Can Occur During
Sampling?
Selection bias
Undercoverage bias
Survivorship bias
Survivorship bias is the logical error of focusing on aspects that support surviving
some process and casually overlooking those that did not work because of their lack of
prominence. This can lead to wrong conclusions in numerous ways.
Selection bias occurs when the sample obtained is not representative of the
population intended to be analysed.
DATA ANALYSIS:
Q28. Python or R – Which one would you prefer for text analytics?
Python would be the best option because it has the Pandas library, which
provides easy-to-use data structures and high-performance data analysis
tools.
R is more suitable for machine learning than just text analysis.
Python performs faster for all types of text analytics.
Q29. How does data cleaning play a vital role in the analysis?
Multivariate analysis deals with the study of more than two variables to
understand the effect of variables on the responses.
For eg., A researcher wants to survey the academic performance of high school
students in Japan. He can divide the entire population of Japan into different
clusters (cities). Then the researcher selects a number of clusters depending
on his research through simple or systematic random sampling.
Let's continue our Data Science Interview Questions blog with some more
statistics questions.
Systematic sampling selects elements from an ordered sampling frame at regular
intervals; the list is progressed in a circular manner, so once you reach the end of
the list it is progressed from the top again. The best example of systematic sampling
is the equal probability method.
Let us first understand what false positives and false negatives are.
False Positives are the cases where you wrongly classify a non-event
as an event, a.k.a. a Type I error.
False Negatives are the cases where you wrongly classify an event as a
non-event, a.k.a. a Type II error.
Q36. Can you cite some examples where a false negative is more important
than a false positive?
Example 3: What if you declined to marry a very good person based on your
predictive model, and you happen to meet him/her after a few years and
realize that it was a false negative?
Q37. Can you cite some examples where both false positive and
false negatives are equally important?
In the Banking industry giving loans is the primary source of making money
but at the same time if your repayment rate is not good you will not make any
profit, rather you will risk huge losses.
Banks don't want to lose good customers and at the same point in time, they
don't want to acquire bad customers. In this scenario, both the false positives
and false negatives become very important to measure.
Q38. Can you explain the difference between a Validation Set and a
Test Set?
On the other hand, a Test Set is used for testing or evaluating the
performance of a trained machine learning model.
In simple terms, the differences can be summarized as follows: the training set is
used to fit the parameters (i.e. the weights), the validation set is used to tune
hyperparameters and select the model, and the test set is used to assess the
performance of the model, i.e. to evaluate its predictive power and generalization.
The goal of cross-validation is to hold out part of the training data as a
validation data set to test the model during the training phase, in order to limit
problems like overfitting and get an insight into how the model will generalize to an
independent data set.
MACHINE LEARNING:
Q40. What is Machine Learning?
E.g. If you built a fruit classifier, the labels will be “this is an orange, this is
an apple and this is a banana”, based on showing the classifier examples of
apples, oranges and bananas.
E.g. In the same example, a clustering algorithm will categorize the fruits as "fruits with soft
skin and lots of dimples", "fruits with shiny hard skin" and "elongated yellow
fruits".
The algorithm is 'naive' because it assumes that all features are independent of
one another, an assumption that may or may not turn out to be correct.
In the diagram, we see that the thinner lines mark the distance from the
classifier to the closest data points called the support vectors (darkened data
points). The distance between the two thin lines is called the margin.
1. Linear Kernel
2. Polynomial kernel
3. Radial basis kernel
4. Sigmoid kernel
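A minimal sketch (synthetic data) of trying the four kernels listed above with scikit-learn's SVC:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.score(X, y))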
Entropy
A decision tree is built top-down from a root node and involves partitioning the
data into homogeneous subsets. ID3 uses entropy to check the homogeneity
of a sample. If the sample is completely homogeneous, the entropy is zero, and
if the sample is equally divided, it has an entropy of one.
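A small sketch of the entropy computation ID3 relies on (my own helper, not from the original):

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    probs = [c / total for c in counts.values()]
    return -sum(p * math.log2(p) for p in probs) if len(probs) > 1 else 0.0

print(entropy(["yes"] * 4))                 # completely homogeneous sample -> 0.0
print(entropy(["yes", "yes", "no", "no"]))  # equally divided sample -> 1.0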
Information Gain
For example, if you want to predict whether a particular political leader will
win the election or not. In this case, the outcome of prediction is binary i.e. 0
or 1 (Win/Lose). The predictor variables here would be the amount of money
spent for election campaigning of a particular candidate, the amount of time
spent in campaigning, etc.
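A minimal sketch of such a model (hypothetical feature names, synthetic numbers, not real election data):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Columns: money spent on campaigning, time spent campaigning
X = rng.uniform(0, 100, size=(200, 2))
y = (0.03 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 1, 200) > 4).astype(int)  # 1 = win, 0 = lose

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[60, 40]]))  # [P(lose), P(win)] for a new candidate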
Not all extreme values are outlier values. The most common ways to treat
outlier values are to remove them, cap them, transform the variable, or use
models that are less affected by them.
The extent of the missing values is identified after identifying the variables
with missing values. If any patterns are identified the analyst has to
concentrate on them as it could lead to interesting and meaningful business
insights.
If there are no patterns identified, then the missing values can be substituted
with mean or median values (imputation), or they can simply be ignored.
Another option is assigning a default value, which can be the mean, minimum
or maximum value. Digging into the data is important.
If 80% of the values for a variable are missing then you can answer that you
would be dropping the variable instead of treating the missing values.
This is the widely used approach, but a few data scientists also use hierarchical
clustering first to create dendrograms and identify the distinct groups from
there.
Ensemble learning has many types, but the two most popular ensemble learning
techniques are mentioned below.
Bagging
Boosting
Instead of using k-fold cross-validation, you should be aware of the fact that a
time series is not randomly distributed data; it is inherently ordered
chronologically.
In the case of time series data, you should use techniques like forward chaining,
where you model on past data and then look at the forward-facing data.
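A minimal sketch of forward chaining with scikit-learn's TimeSeriesSplit (toy series):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# Each fold trains on past observations and validates on the observations that follow.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "-> test:", test_idx)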
The dependent variable for a regression analysis might not satisfy one or more
assumptions of an ordinary least squares regression. The residuals could
either curve as the prediction increases or follow the skewed distribution. In
such scenarios, it is necessary to transform the response variable so that the
data meets the required assumptions. A Box-Cox transformation is a statistical
technique to transform non-normal dependent variables into a normal shape.
Most statistical techniques assume normality, so if the given data is not normal,
applying a Box-Cox transformation means that you can run a broader number
of tests.
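A minimal sketch (simulated skewed data) of the transformation with SciPy:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=500)      # Box-Cox requires strictly positive values

transformed, fitted_lambda = stats.boxcox(skewed)  # lambda is estimated by maximum likelihood
print("estimated lambda:", fitted_lambda)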
Q68. If you are having 4GB RAM in your machine and you want to
train your model on 10GB data set. How would you go about this
problem? Have you ever faced this kind of problem in your
machine learning/data science experience so far?
First of all, you have to ask which ML model you want to train.
For Neural networks: Batch size with Numpy array will work.
Steps:
1. Load the whole data set as a NumPy memory-mapped array (np.memmap). A
memory map only describes where the data lives on disk; it doesn't load the
complete data set into memory (see the sketch after this list).
2. You can pass an index to Numpy array to get required data.
3. Use this data to pass to the Neural network.
4. Have a small batch size.
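A minimal sketch of this batching idea with a NumPy memory map (hypothetical file name and shapes; the training call is a placeholder):

import numpy as np

# Create a small placeholder file for illustration; a real data set would be far larger.
placeholder = np.memmap("big_dataset.dat", dtype=np.float32, mode="w+", shape=(10_000, 100))
placeholder.flush()

# Re-open read-only: only the slices that are indexed are read into memory.
data = np.memmap("big_dataset.dat", dtype=np.float32, mode="r", shape=(10_000, 100))

batch_size = 1024
for start in range(0, data.shape[0], batch_size):
    batch = np.asarray(data[start:start + batch_size])  # only this batch is loaded
    # feed `batch` to the neural network here, e.g. model.train_on_batch(batch)  (hypothetical call)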
A similar batching approach works for models that support incremental learning,
e.g. scikit-learn estimators that expose a partial_fit method.
However, you could actually face such an issue in reality. Having said that,
let's move on to some questions on deep learning.
DEEP LEARNING:
Q69. What do you mean by Deep Learning?
Q71. What, in your opinion, is the reason for the popularity of Deep
Learning in recent times?
Now although Deep Learning has been around for many years, the major
breakthroughs from these techniques came just in recent years. This is
because of two main reasons: the huge increase in the amount of data being
generated, and the growth in the hardware resources required to run these models.
GPUs are multiple times faster and they help us build bigger and deeper deep
learning models in comparatively less time than we required previously.
A neural network adapts to changing inputs so that it generates the best possible
result without needing to redesign the output criteria.
There are two methods here: we can either initialize the weights to zero or
assign them randomly.
Initializing all weights to 0: This makes your model similar to a linear model.
All the neurons and every layer perform the same operation, giving the same
output and making the deep net useless.
Initializing all weights randomly: Here, the weights are assigned randomly by
initializing them very close to 0. It gives better accuracy to the model since
every neuron performs different computations. This is the most commonly
used method.
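A minimal NumPy sketch contrasting the two initializations for one layer:

import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_neurons = 4, 3

zero_weights = np.zeros((n_inputs, n_neurons))                       # every neuron identical -> useless deep net
random_weights = rng.standard_normal((n_inputs, n_neurons)) * 0.01   # small random values close to zero
print(random_weights)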
When your learning rate is too low, training of the model will progress very
slowly, as we are making minimal updates to the weights. It will take many
updates to reach the minimum point.
If the learning rate is set too high, this causes undesirable divergent behaviour
in the loss function due to drastic updates to the weights. The model may fail
to converge or even diverge (the updates are too chaotic for the network to train).
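A minimal sketch of the effect, using plain gradient descent on f(w) = w**2:

def gradient_descent(lr, steps=20, w=5.0):
    for _ in range(steps):
        grad = 2 * w       # derivative of w**2
        w = w - lr * grad  # weight update scaled by the learning rate
    return w

print(gradient_descent(lr=0.001))  # too low:  w barely moves toward the minimum at 0
print(gradient_descent(lr=0.1))    # suitable: w converges close to 0
print(gradient_descent(lr=1.1))    # too high: w oscillates and diverges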
Both these networks, RNN and feed-forward, are named after the way they channel
information through a series of mathematical operations performed at the nodes
of the network. One feeds information straight through (never touching the
same node twice), while the other cycles it through a loop; the latter are
called recurrent.
Recurrent networks, on the other hand, take as their input not just the
current input example they see, but also what they have perceived
previously in time.
The decision a recurrent neural network reached at time t-1 affects the
decision that it will reach one moment later at time t. So recurrent networks
have two sources of input, the present and the recent past, which combine to
determine how they respond to new data, much as we do in life.
The error they generate will return via backpropagation and be used to adjust
their weights until the error can't go any lower. Remember, the purpose of
recurrent nets is to accurately classify sequential input. We rely on the
backpropagation of error and gradient descent to do so.
Except for the input layer, each node in the other layers uses a nonlinear
activation function: each node adds up the weighted inputs coming into it and
passes the result through the activation function to produce its output. MLP
uses a supervised learning method called backpropagation.
This has the effect of making your model unstable and unable to learn from your
training data.
While training an RNN, your slope (gradient) can become too small; this makes
the training difficult. When the slope is too small, the problem is known as a
Vanishing Gradient. It leads to long training times, poor performance, and low
accuracy.
Pytorch
TensorFlow
Microsoft Cognitive Toolkit
Keras
Caffe
Chainer
TensorFlow provides both C++ and Python APIs, making it easier to work with,
and it has a faster compilation time compared to other deep learning libraries
like Keras and Torch. TensorFlow supports both CPU and GPU computing
devices.
Suppose there is a wine shop purchasing wine from dealers, which they resell
later. But some dealers sell fake wine. In this case, the shop owner should be
able to distinguish between fake and authentic wine.
The forger will try different techniques to sell fake wine and make sure specific
techniques go past the shop owner's check. The shop owner would probably
get some feedback from wine experts that some of the wine is not original. The
owner would have to improve how he determines whether a wine is fake or
authentic.
The forger's goal is to create wines that are indistinguishable from the
authentic ones, while the shop owner intends to tell accurately whether the
wine is real or fake.
There is a noise vector coming into the forger who is generating fake wine.
The Discriminator gets two inputs; one is the fake wine, while the other is the
real authentic wine. The shop owner has to figure out whether it is real or fake.
1. Generator
2. Discriminator
The generator is a CNN that keeps producing images that are closer and closer
in appearance to the real images, while the discriminator tries to determine the
difference between real and fake images. The ultimate aim is to make the
discriminator learn to identify real and fake images.
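A toy sketch of the two components (assumes PyTorch; fully connected layers rather than the CNNs used for real images, and no training loop):

import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64

generator = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, data_dim), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(data_dim, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1), nn.Sigmoid(),
)

noise = torch.randn(8, latent_dim)  # the noise vector fed to the "forger"
fake = generator(noise)             # fake samples (the forger's wine)
score = discriminator(fake)         # probability that each sample is real
print(fake.shape, score.shape)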
Q101. What are the important skills to have in Python with regard
to data analysis?
The following are some of the important skills to possess which will come
handy when performing data analysis using Python.
The sigmoid function is defined as sigmoid(x) = 1 / (1 + e^(-x)); its graph is an
S-shaped curve that maps any real-valued input to a value between 0 and 1.
A random forest is built up of a number of decision trees. If you split the data
into different packages and make a decision tree in each of the different
groups of data, the random forest brings all those trees together.
1. Randomly select 'k' features from a total of 'm' features where k << m
2. Among the 'k' features, calculate the node D using the best split point
3. Split the node into daughter nodes using the best split
4. Repeat steps two and three until leaf nodes are finalized
5. Build forest by repeating steps one to four for 'n' times to create 'n' number
of trees
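A minimal sketch (synthetic data): scikit-learn's RandomForestClassifier carries out these feature-subsampling and tree-building steps internally:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print("number of trees:", len(forest.estimators_))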
Overfitting refers to a model that is tuned to a very small amount of data
and ignores the bigger picture. There are three main methods to avoid
overfitting: keeping the model simple (fewer variables and parameters), using
cross-validation techniques, and applying regularization (such as LASSO).
Univariate
Univariate data contains only one variable. The purpose of the univariate
analysis is to describe the data and find patterns that exist within it.
Example data for a single variable (heights in cm, say):
164, 167.3, 170, 174.2, 178, 180
Bivariate
Bivariate data involves two different variables. The analysis of this type of data
deals with causes and relationships and the analysis is done to determine the
relationship between the two variables.
Temperature   Sales
20            2,000
25            2,100
26            2,300
28            2,400
30            2,600
36            3,100
Here, it is visible from the table that temperature and sales are directly
proportional to each other: the hotter the temperature, the better the sales.
Multivariate
Multivariate data involves three or more variables, it is categorized under
multivariate. It is similar to a bivariate but contains more than one dependent
variable.
No. of rooms   Floors   Area (sq ft)   Price
2              0        900            $400,000
3              2        1,100          $600,000
4              3        2,100          $1,200,000
The patterns can be studied by drawing conclusions using mean, median, and
mode, dispersion or range, minimum, maximum, etc. You can start describing
the data and using it to guess what the price of the house will be.
Q108. What are the feature selection methods used to select the
right variables?
Filter Methods
This involves ranking features with criteria such as their correlation with the
target, the Chi-Square test, ANOVA, or information gain.
Wrapper Methods
This involves:
Forward Selection: We test one feature at a time and keep adding them
until we get a good fit
Backward Selection: We test all the features and start removing them to see
what works better
Recursive Feature Elimination: Recursively looks through all the different
features and how they pair together
Wrapper methods are very labor-intensive, and high-end computers are
needed if a lot of data analysis is performed with the wrapper method.
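A minimal sketch (synthetic data) of one wrapper method, recursive feature elimination, with scikit-learn:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)
print("selected features:", selector.support_)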
Q109. Write a program that prints the numbers from one to 50. But for
multiples of three, print "Fizz" instead of the number, and for the multiples
of five, print "Buzz." For numbers which are multiples of both three and five,
print "FizzBuzz."
The code is shown below:
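A minimal Python sketch:

for i in range(1, 51):
    if i % 15 == 0:        # multiple of both three and five
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)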
Note that range(1, 51) is used so that the numbers covered are one to 50, as
asked in the question; range(51) would start the count from zero.
The output prints the numbers from one to 50, with multiples of three replaced
by "Fizz", multiples of five by "Buzz", and multiples of both by "FizzBuzz".
Q110. You are given a data set consisting of variables with more
than 30 percent missing values. How will you deal with them?
If the data set is large, we can just simply remove the rows with missing data
values. It is the quickest way, and we then use the rest of the data to predict the values.
For smaller data sets, we can substitute missing values with the mean or
average of the rest of the data using a pandas DataFrame in Python. There
are different ways to do so, such as computing the column means with df.mean()
and filling them in with df.fillna(df.mean()).
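A minimal pandas sketch (hypothetical column):

import pandas as pd

df = pd.DataFrame({"age": [23.0, None, 31.0, None, 27.0]})
df["age"] = df["age"].fillna(df["age"].mean())  # fill gaps with the mean of the non-missing values
print(df)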
Q111. For the given points, how will you calculate the Euclidean
distance in Python?
plot1 = [1,3]
plot2 = [2,5]
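One way to compute it, using the two points given above:

import math

plot1 = [1, 3]
plot2 = [2, 5]

euclidean_distance = math.sqrt((plot1[0] - plot2[0]) ** 2 + (plot1[1] - plot2[1]) ** 2)
print(euclidean_distance)  # sqrt(1 + 4) ~= 2.236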
This reduction helps in compressing data and reducing storage space. It also
reduces computation time as fewer dimensions lead to less computing. It
removes redundant features; for example, there's no point in storing a value in
two different units (meters and inches).
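A minimal sketch (random data) of one common dimensionality-reduction technique, PCA, in scikit-learn:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # 100 samples with 10 features

pca = PCA(n_components=3)        # keep the 3 directions of largest variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)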
Consider the 3×3 matrix A:
-2  -4   2
-2   1   2
 4   2   5
Expanding the determinant of (A - λI):
-λ³ + 4λ² + 27λ - 90 = 0, i.e. λ³ - 4λ² - 27λ + 90 = 0
Checking λ = 3: 3³ - 4 × 3² - 27 × 3 + 90 = 27 - 36 - 81 + 90 = 0
Hence, (λ - 3) is a factor: λ³ - 4λ² - 27λ + 90 = (λ - 3)(λ - 6)(λ + 5), so the
eigenvalues are 3, 6 and -5.
For the eigenvector corresponding to λ = 3, set X = 1 in (A - 3I)v = 0 with v = (X, Y, Z):
-5 - 4Y + 2Z = 0,
-2 - 2Y + 2Z = 0
Subtracting the second equation from the first gives -3 - 2Y = 0, i.e. 3 + 2Y = 0, so
Y = -(3/2)
Z = -(1/2)
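A quick numerical check of the result with NumPy:

import numpy as np

A = np.array([[-2, -4, 2],
              [-2,  1, 2],
              [ 4,  2, 5]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(np.round(eigenvalues, 3))  # 3, 6 and -5, in some order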
Monitor
Constant monitoring of all models is needed to determine their performance
accuracy. When you change something, you want to figure out how your
changes are going to affect things. This needs to be monitored to ensure it's
doing what it's supposed to do.
Evaluate
Evaluation metrics of the current model are calculated to determine if a new
algorithm is needed.
Compare
The new models are compared to each other to determine which model
performs the best.
Rebuild
The best performing model is re-built on the current state of data.
Collaborative filtering
As an example, Last.fm recommends tracks that other users with similar
interests play often. This is also commonly seen on Amazon after making a
purchase; customers may notice the following message accompanied by
product recommendations: "Users who bought this also bought…"
Content-based filtering
As an example: Pandora uses the properties of a song to recommend music
with similar properties. Here, we look at content, instead of looking at who
else is listening to music.
RMSE and MSE are two of the most common measures of accuracy for a
linear regression model.
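A minimal sketch (made-up numbers) of both measures with NumPy:

import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.0])

mse = np.mean((y_true - y_pred) ** 2)  # mean squared error
rmse = np.sqrt(mse)                    # root mean squared error
print(mse, rmse)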
We use the elbow method to select k for k-means clustering. The idea of the
elbow method is to run k-means clustering on the data set for a range of values
of k (the number of clusters) and, for each k, compute the within-cluster sum of
squares (WSS), defined as the sum of the squared distances between each member
of a cluster and its centroid.
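A minimal sketch of the elbow method on synthetic blobs with scikit-learn:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 9):
    wss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_  # within-cluster sum of squares
    print(k, round(wss, 1))
# The "elbow", the k after which WSS stops dropping sharply, is the chosen number of clusters.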
A low p-value (≤ 0.05) indicates strong evidence against the null hypothesis, so
you reject the null hypothesis.
A high p-value (> 0.05) indicates weak evidence against the null hypothesis, so
you fail to reject the null hypothesis.
Example: height of an adult = abc ft. This cannot be true, as the height cannot
be a string value. In this case, outliers can be removed.
If the outliers have extreme values, they can be removed. For example, if all
the data points are clustered between zero to 10, but one point lies at 100,
then we can remove this point.
Try a different model. Data detected as outliers by linear models can be fit
by nonlinear models. Therefore, be sure you are choosing the correct
model.
Try normalizing the data. This way, the extreme data points are pulled to a
similar range.
You can use algorithms that are less affected by outliers; an example would
be random forests.
It is stationary when the variance and mean of the series are constant with
time.
In the first graph, the variance is constant with time. Here, X is the time factor
and Y is the variable. The value of Y goes through the same points all the time;
in other words, it is stationary.
In the second graph, the waves get bigger, which means it is non-stationary
and the variance is changing with time.
You can see the values for total data, actual values, and predicted values.
Accuracy = (True Positives + True Negatives) / Total Observations
= 609 / 650
= 0.937
Q122. Write the equation and calculate the precision and recall
rate.
Precision = True Positives / (True Positives + False Positives)
= 262 / 277
= 0.946
Recall = True Positives / (True Positives + False Negatives)
= 262 / 288
= 0.910
The engine makes predictions on what might interest a person based on the
preferences of other users. In this algorithm, item features are unknown.
For example, a sales page shows that a certain number of people buy a new
phone and also buy tempered glass at the same time. Next time, when a
person buys a phone, he or she may see a recommendation to buy tempered
glass as well.
Q124. You are given a dataset on cancer detection. You have built a
classification model and achieved an accuracy of 96 percent. Why
shouldn't you be happy with your model performance? What can
you do about it?
Cancer detection data sets are highly imbalanced, so a model can score a high
accuracy simply by predicting the majority (non-cancer) class; you should instead
look at class-sensitive metrics such as sensitivity, specificity, precision/recall or
the F1 score.
K-means clustering
Linear regression
When you're dealing with K-means clustering or linear regression, you need to
handle missing values in your pre-processing, otherwise they'll crash. Decision
trees also have the same problem, although there is some variance.
Q126. Below are the eight actual values of the target variable in the
train file. What is the entropy of the target variable?
[0, 0, 0, 1, 1, 1, 1, 1]
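The computation, checked quickly in Python (three 0s and five 1s out of eight values):

import math

p0, p1 = 3 / 8, 5 / 8
entropy = -(p0 * math.log2(p0) + p1 * math.log2(p1))
print(round(entropy, 3))  # ~0.954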
1. Logistic Regression
2. Linear Regression
3. K-means clustering
4. Apriori algorithm
The most appropriate algorithm for this case is A, logistic regression.
1. K-means clustering
2. Linear regression
3. Association rules
4. Decision trees
As we are looking for grouping people together specifically by four different
similarities, it indicates the value of k. Therefore, K-means clustering (answer
A) is the most appropriate algorithm for this study.
1. One-way ANOVA
2. K-means clustering
3. Association rules
4. Student's t-test
The answer is A: One-way ANOVA
Root cause analysis was initially developed to analyze industrial accidents but
is now widely used in other areas. It is a problem-solving technique used for
isolating the root causes of faults or problems. A factor is called a root cause if
its removal from the problem-fault-sequence prevents the final undesirable
event from recurring.
The goal of cross-validation is to hold out part of the training data as a
validation data set to test the model during the training phase, to limit problems
like overfitting and gain insight into how the model will generalize to an
independent data set.
Most recommender systems use this filtering process to find patterns and
information through collaboration among numerous viewpoints, data sources,
and agents.
They do not, because in some cases they reach a local minimum or a local
optimum point and never reach the global optimum point. This is governed
by the data and the starting conditions.
I hope this set of Data Science Interview Questions and Answers will help you
in preparing for your interviews. All the best!!