1. What is Machine Learning?
Machine Learning is a branch of artificial intelligence (AI) that enables
systems to automatically learn from data and improve their performance over time
without being explicitly programmed. It involves algorithms that identify patterns in
data, allowing systems to make predictions or decisions based on past experiences.
2. Mention the differences between Data Mining and Machine Learning.
The key differences between Data Mining and Machine Learning are:
Purpose:
Data Mining focuses on discovering patterns, relationships, and insights from large
datasets, typically used for data analysis and decision support.
Machine Learning aims to develop algorithms that allow systems to learn from data
and make predictions or decisions, enabling automation and predictive analytics.
Process:
Data Mining is often a manual or semi-automated process that involves cleaning,
transforming, and analyzing data using statistical methods.
Machine Learning automates the learning process by training models to recognize
patterns and make decisions with minimal human intervention.
Output:
Data Mining results in insights and patterns that can be interpreted by humans for
decision-making.
Machine Learning produces models that can make predictions or decisions
autonomously based on new data.
Dependency:
Data Mining can be a stand-alone process that may or may not use machine learning.
Machine Learning often uses data mining as a preliminary step to gather and
preprocess data for model training.
3. What are ‘Overfitting’ and ‘Underfitting’ in Machine Learning?
In Machine Learning:
Overfitting occurs when a model learns the training data too well, capturing noise
and specific patterns that do not generalize to new data. This leads to high accuracy
on training data but poor performance on unseen data.
Underfitting happens when a model is too simple to capture the underlying patterns
in the data, resulting in poor performance on both the training and test data. This
indicates that the model hasn’t learned enough from the data.
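As a minimal sketch of both failure modes, the snippet below fits polynomials of different degrees to noisy synthetic data with NumPy; the degrees, noise level, and sample size are arbitrary choices for illustration. The degree-1 fit underfits (high error everywhere), while the degree-14 fit tracks the noise and overfits (low training error, high test error).

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a smooth underlying function.
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 14):  # underfit, reasonable fit, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```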
4. Define Logistic regression
Logistic Regression is a statistical method used in machine learning for binary
classification tasks, where the goal is to predict the probability that a given input
belongs to one of two possible classes. It models the relationship between input
features and the probability of a specific outcome using a logistic (sigmoid) function,
which outputs values between 0 and 1. Logistic regression is particularly useful for
predicting binary outcomes, such as "yes/no," "spam/not spam," or "disease/no
disease."
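A minimal sketch with scikit-learn, using a made-up one-feature binary dataset, showing how the fitted sigmoid turns an input into a probability between 0 and 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy "spam/not spam" style data: one feature, binary label (values made up).
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns P(class 0) and P(class 1) via the sigmoid.
print(model.predict_proba([[2.0]]))  # probabilities between 0 and 1
print(model.predict([[2.0]]))        # hard 0/1 class label
```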
5. Why does overfitting happen?
Overfitting happens when a machine learning model is too complex relative to
the amount and quality of data available, resulting in the model capturing noise and
specific patterns in the training data that don’t generalize to new data.
Key reasons for overfitting include:
Excessive Model Complexity
Insufficient Training Data
Lack of Regularization
High Variability in Data
6. How can you avoid overfitting?
To avoid overfitting in machine learning models, several strategies can be
employed:
Simplify the Model
Regularization
Cross-Validation
Train with More Data
Early Stopping
Data Augmentation
Feature Selection
Use Dropout
Implementing these strategies can help improve a model's ability to generalize to
unseen data and reduce the likelihood of overfitting.
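As one concrete illustration, the sketch below combines two of these strategies, regularization (a Ridge penalty) and cross-validation, on synthetic data; the alpha values and polynomial degree are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (40, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.2, 40)

# A high-degree polynomial fit would overfit; the Ridge penalty (alpha)
# shrinks the coefficients, and cross-validation checks generalization.
for alpha in (1e-6, 1e-2, 1e2):  # tiny alpha ~ almost no regularization
    model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=alpha))
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"alpha={alpha}: mean CV MSE {-scores.mean():.3f}")
```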
7. What are five popular algorithms in Machine Learning?
Here are five popular algorithms in machine learning:
Linear Regression: models a continuous target as a linear combination of input features.
Logistic Regression: predicts the probability of a binary outcome using the sigmoid function.
Decision Trees: split the data on feature values to form an interpretable tree of decision rules.
Support Vector Machines (SVM): find the separating hyperplane that maximizes the margin between classes.
K-Nearest Neighbors (KNN): predicts a label from the majority vote (or average) of the closest training points.
These algorithms are foundational in machine learning and serve various applications
across different domains.
8. What are the different algorithm techniques in Machine Learning?
In machine learning, various algorithm techniques can be broadly categorized
based on their learning style and application. Here are the primary techniques:
Supervised Learning:
Involves training a model on labeled data, where the input-output pairs are
provided.
Examples:
Linear Regression
Logistic Regression
Decision Trees
Support Vector Machines (SVM)
Random Forest
Neural Networks
Unsupervised Learning:
Involves training a model on unlabeled data to discover underlying patterns or
groupings.
Examples:
K-Means Clustering
Hierarchical Clustering
Principal Component Analysis (PCA)
Autoencoders
t-SNE (t-distributed Stochastic Neighbor Embedding)
Semi-Supervised Learning:
Combines both labeled and unlabeled data during training, leveraging the
small amount of labeled data to improve learning from the larger unlabeled set.
Reinforcement Learning:
Involves training an agent to make decisions by taking actions in an
environment to maximize cumulative reward. The model learns through trial and
error.
Examples:
Q-Learning
Deep Q-Networks (DQN)
Proximal Policy Optimization (PPO)
Ensemble Learning:
Combines multiple models to improve performance and robustness over individual
models. It aims to leverage the strengths of different algorithms.
Examples:
Bagging (e.g., Random Forest)
Boosting (e.g., AdaBoost, Gradient Boosting)
Stacking
Deep Learning:
A subfield of machine learning that uses neural networks with many layers (deep
neural networks) to model complex patterns in large datasets.
Examples:
Convolutional Neural Networks (CNNs) for image processing
Recurrent Neural Networks (RNNs) for sequence data
Transformers for natural language processing tasks
These techniques can be applied to a wide range of problems, from image
classification and speech recognition to recommendation systems and predictive
analytics.
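To make the supervised/unsupervised distinction above concrete, here is a minimal unsupervised example: K-Means clustering on synthetic, unlabeled points. The cluster centers and sample counts are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Unlabeled points drawn around two centers; no labels are given to the model.
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)                   # cluster assignments discovered from data alone
print(kmeans.cluster_centers_)  # learned group centers
```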
9. What are the three stages to build the hypothesis or model in machine learning?
The three stages to build a hypothesis or model in machine learning typically
include:
Model Training:
In this stage, the machine learning algorithm learns from the training data. The
dataset is divided into features (inputs) and labels (outputs), and the model uses this
data to find patterns and relationships.
During training, the model's parameters are adjusted to minimize the error in
predictions using techniques like gradient descent or backpropagation (for neural
networks).
Model Validation:
After training, the model is evaluated on a separate validation dataset to assess
its performance. This step helps to ensure that the model generalizes well to unseen
data and is not overfitting to the training set.
Various metrics (like accuracy, precision, recall, F1 score, etc.) are used to evaluate
the model's performance on the validation set. Hyperparameters may also be tuned
during this stage.
Model Testing:
Finally, the model is tested on an independent test dataset to provide an
unbiased evaluation of its performance. This step assesses how well the model can
make predictions on new, unseen data.
The results from the testing phase provide insights into the model's effectiveness and
can help determine whether it is suitable for deployment in real-world applications.
These stages are crucial for developing robust and effective machine learning models
that perform well in practical scenarios.
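A minimal sketch of the three stages with scikit-learn, using the built-in iris dataset and a KNN classifier as stand-ins; the split ratios and candidate k values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Split once into train+validation and test, then again to carve out validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# Stage 1: training. Stage 2: validation, used here to tune a hyperparameter.
best_k, best_acc = None, 0.0
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = model.score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

# Stage 3: testing, an unbiased check on data never used for fitting or tuning.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_trainval, y_trainval)
print(f"chosen k={best_k}, test accuracy={final.score(X_test, y_test):.3f}")
```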
10. What is the standard approach to supervised learning?
The standard approach to supervised learning typically involves the following
steps:
Data Collection: gather a dataset relevant to the prediction task.
Data Preprocessing: clean the data, handle missing values, and scale or encode features.
Feature Selection/Engineering: choose or construct the features most informative for the target.
Choosing a Model: select an algorithm suited to the task and the data.
Model Training: fit the model's parameters to the labeled training data.
Model Validation: tune hyperparameters and check generalization on a validation set.
Model Evaluation: measure final performance on a held-out test set.
Model Deployment: integrate the trained model into a production system.
Monitoring and Maintenance: track performance over time and retrain as the data changes.
11. What are ‘Training set’ and ‘Test set’?
In machine learning, a training set and a test set are two distinct subsets of a
dataset used for developing and evaluating models. Here’s what each term means:
Training Set
Definition: The training set is a subset of the dataset used to train a machine learning
model. It contains input data along with the corresponding labels (outputs).
Purpose: The model learns from this data by identifying patterns and relationships
between the features (inputs) and the labels (outputs). It adjusts its parameters during
this phase to minimize prediction errors.
Example: If you are building a model to classify emails as spam or not spam, the
training set would consist of a large number of emails along with their labels
indicating whether each email is spam or not.
Test Set
Definition: The test set is a separate subset of the dataset that is not used during the
training phase. It is used to evaluate the model's performance after it has been trained.
Purpose: The test set helps assess how well the trained model generalizes to new,
unseen data. It provides an unbiased evaluation of the model's predictive capability.
Example: Continuing with the email classification example, the test set would consist
of a different set of emails that the model has not seen before, along with their actual
spam/not spam labels. The model's predictions on this set are compared to the actual
labels to measure performance metrics like accuracy, precision, and recall.
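A minimal sketch with scikit-learn showing the split and the held-out evaluation; the synthetic dataset stands in for a real labeled corpus such as emails.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labeled "spam/not spam" dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The training set fits the model; the held-out test set evaluates it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
```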
12. What is the difference between artificial intelligence and machine learning?
The terms "artificial intelligence" (AI) and "machine learning" (ML) are often
used interchangeably, but they refer to different concepts within the field of computer
science. Here are the key differences:
Definition: AI is a broad field focused on creating intelligent systems; ML is a subset of AI focused on learning from data.
Scope: AI encompasses various techniques, including ML, NLP, and more; ML involves algorithms and statistical models.
Goal: AI aims to mimic human cognitive functions and perform intelligent tasks; ML aims to build models that learn and improve from data.
Examples: AI includes chatbots, autonomous vehicles, and recommendation systems; ML includes decision trees, neural networks, and clustering algorithms.
In summary, while machine learning is an essential component of artificial
intelligence, AI encompasses a wider range of technologies and approaches aimed at
simulating human-like intelligence.
13. What is Bias and what are the types of Bias?
Bias in machine learning refers to systematic errors that result from incorrect
assumptions in the learning algorithm. It can lead to inaccuracies in the model's
predictions and affect its performance. Bias can manifest in various forms, often
influencing how well a model generalizes to unseen data.
Types of Bias
Algorithmic Bias: systematic error introduced by the design or assumptions of the algorithm itself.
Sample Bias: the training data does not represent the population the model will be applied to.
Confirmation Bias: data is collected or interpreted in ways that favor pre-existing beliefs.
Measurement Bias: features or labels are measured or recorded with systematic error.
Selection Bias: the way data is selected for training favors certain samples over others.
Exclusion Bias: relevant samples or features are dropped during data preparation.
Label Bias: labels are assigned inconsistently or reflect annotators' subjective judgments.
14. What is the key difference between supervised and unsupervised machine
learning?
The key difference between supervised and unsupervised machine learning lies
in the presence or absence of labeled data during the training process. Here’s a
breakdown of the differences:
Data Type: supervised learning uses labeled data (input-output pairs); unsupervised learning uses unlabeled data (inputs only).
Objective: supervised learning predicts outcomes for new data; unsupervised learning discovers patterns and structures in the data.
Applications: supervised learning covers classification and regression; unsupervised learning covers clustering, anomaly detection, and association.
Example Algorithms: supervised learning uses decision trees, SVM, and neural networks; unsupervised learning uses K-means, PCA, and hierarchical clustering.
15. What is Linear Regression?
Linear Regression is a fundamental statistical method used in machine
learning and data analysis to model the relationship between a dependent variable
(also called the target variable) and one or more independent variables (also known as
features or predictors). It is primarily used for predictive modeling and understanding
relationships among variables.
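A minimal sketch with scikit-learn, fitting a line to made-up (size, price)-style pairs and reading off the learned slope and intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One predictor (e.g., a size) and a continuous target (e.g., a price); values made up.
X = np.array([[50], [70], [90], [110], [130]])
y = np.array([150, 200, 260, 305, 360])

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for input 100:", model.predict([[100]])[0])
```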
16. What are the disadvantages of the linear regression model?
While linear regression is a widely used and powerful statistical tool, it has
several disadvantages and limitations.
Here are some of the key disadvantages:
Assumption of Linearity: real-world relationships are often non-linear.
Sensitivity to Outliers: a few extreme points can strongly distort the fitted line.
Multicollinearity: highly correlated predictors make coefficient estimates unstable.
Assumption of Independence: errors are assumed independent, which data such as time series often violate.
Homoscedasticity Assumption: assumes constant error variance across all input values.
Limited Capacity: cannot capture complex patterns or feature interactions on its own.
Overfitting with High-Dimensional Data: many features relative to samples can cause overfitting without regularization.
Interpretation Challenges: coefficients are hard to interpret when predictors are correlated or on different scales.
No Handling of Categorical Variables: categorical inputs must be encoded numerically before use.
Lack of Flexibility: a single global linear fit cannot adapt to local structure in the data.
17. What is the difference between classification and regression?
Classification and regression are both types of supervised learning tasks in
machine learning, but they differ fundamentally in their objectives, output types, and
applications. Here are the key differences between the two:
Summary of Differences
Definition: classification predicts categorical labels; regression predicts continuous values.
Output Type: classification yields discrete class labels; regression yields continuous numerical values.
Examples: classification covers spam detection and image recognition; regression covers house price prediction and temperature forecasting.
Common Algorithms: classification uses logistic regression and decision trees; regression uses linear regression and polynomial regression.
Evaluation Metrics: classification uses accuracy, precision, and recall; regression uses MAE, MSE, and R-squared.
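A minimal sketch of the contrast using scikit-learn decision trees, one classifier and one regressor, on built-in datasets:

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: the output is a discrete class label.
Xc, yc = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(Xc, yc)
print("class label:", clf.predict(Xc[:1]))      # a category index

# Regression: the output is a continuous numerical value.
Xr, yr = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(random_state=0).fit(Xr, yr)
print("continuous value:", reg.predict(Xr[:1]))
```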
18. What is cross-validation?
Cross-validation is a statistical technique used to assess the performance and
generalizability of a machine learning model. It involves partitioning a dataset into
multiple subsets (folds) and using these subsets to train and validate the model. This
approach helps to ensure that the model performs well not just on the training data but
also on unseen data.
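A minimal sketch of 5-fold cross-validation with scikit-learn on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, validate on the 5th, rotate.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```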
19. What is Bayesian Linear Regression, and what are its advantages?
Bayesian Linear Regression is a statistical method that extends traditional linear
regression by incorporating Bayesian inference. In this approach, instead of
estimating fixed coefficients, Bayesian linear regression treats the model parameters
as random variables and uses probability distributions to represent the uncertainty
about these parameters.
Key Features of Bayesian Linear Regression:
Prior Distribution: Specifies beliefs about the parameters before observing any data.
Likelihood: Represents the probability of the observed data given the model
parameters.
Posterior Distribution: Combines the prior distribution and the likelihood using
Bayes' theorem to update beliefs about the parameters after observing the data.
Advantages of Bayesian Linear Regression:
Uncertainty Quantification: produces distributions over parameters and predictions, not just point estimates.
Incorporation of Prior Knowledge: priors allow domain knowledge to inform the model.
Regularization: the prior acts as a natural regularizer, reducing overfitting.
Flexibility: priors and likelihoods can be chosen to suit the problem.
Prediction Intervals: predictions come with credible intervals rather than single values.
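A minimal sketch using scikit-learn's BayesianRidge, whose predict(..., return_std=True) option exposes the per-prediction uncertainty described above; the synthetic data and true coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, (50, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 1.0, 50)  # noisy line, slope 2, intercept 1

model = BayesianRidge()
model.fit(X, y)

# return_std=True yields a standard deviation alongside each prediction,
# i.e. the uncertainty quantification mentioned above.
mean, std = model.predict([[5.0]], return_std=True)
print(f"prediction: {mean[0]:.2f} +/- {std[0]:.2f}")
```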
20. Define Inductive Bias.
Inductive Bias refers to the set of assumptions that a machine learning algorithm
makes to predict outputs for inputs it has not encountered during training. It is the
preference of a learning algorithm for a particular type of hypothesis or model when
generalizing from a training dataset to unseen data. Inductive bias is crucial for the
learning process because it guides the algorithm in making predictions beyond the
observed data.
Key Points about Inductive Bias:
Role in Learning:
Inductive bias enables the algorithm to generalize from specific training examples to
broader conclusions. Without some form of bias, learning algorithms would struggle
to make predictions, as they would have no basis for inferring unseen data.
Types of Inductive Bias:
Different algorithms have different inductive biases. For example:
Linear Regression assumes a linear relationship between input and output variables.
Decision Trees assume that data can be split into subsets based on feature values.
Neural Networks assume that the underlying function can be approximated using
layers of interconnected nodes.
21. Define Linear Algebra and its applications in Machine Learning.
Linear Algebra is a branch of mathematics that deals with vectors, vector
spaces, matrices, and linear transformations. It provides the foundational concepts and
operations used to manipulate and analyze linear equations and systems. Linear
algebra is essential in many fields, including physics, engineering, computer science,
and machine learning.
Applications of Linear Algebra in Machine Learning:
Data Representation: datasets are stored and manipulated as vectors and matrices.
Linear Transformations: operations such as projection, rotation, and scaling of feature vectors.
Optimization: gradients and parameter updates are expressed as vector and matrix operations.
Solving Systems of Equations: for example, the normal equations behind linear regression.
Machine Learning Algorithms: methods such as PCA, SVMs, and neural networks are built on matrix computations.
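A minimal sketch of these ideas with NumPy; the matrices, weights, and targets are made up for illustration.

```python
import numpy as np

# Data representation: a dataset as a matrix (rows = samples, columns = features).
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
w = np.array([0.5, -1.0])  # model weights as a vector

# Linear transformation: predictions as a matrix-vector product.
y_hat = X @ w
print(y_hat)

# Solving a system of equations: the least-squares solution to X w = y
# (the normal equations behind linear regression).
y = np.array([1.0, 2.0, 2.5])
w_fit, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_fit)
```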
22. What is a Hypothesis?
In the context of machine learning and statistics, a hypothesis refers to a
proposed explanation or model that describes the relationship between input
variables (features) and output variables (targets). It represents a specific
assumption or prediction about the underlying data and is a fundamental
concept in the process of building and evaluating machine learning models.
23. What are the types of cross-validation?
Types of Cross-Validation:
k-Fold Cross-Validation: The most common method, where the dataset is divided
into k equal parts (folds). The model is trained k times, each time using a
different fold as the validation set and the remaining folds as the training set.
Stratified k-Fold Cross-Validation: Similar to k-fold, but ensures that each fold has
the same proportion of class labels as the original dataset. This is particularly useful
for imbalanced datasets.
Leave-One-Out Cross-Validation (LOOCV): A special case of k-fold where k
equals the number of data points in the dataset. Each training set is created by leaving
out one data point for validation.
Group k-Fold: Used when the dataset contains groups of related samples. It ensures
that the same group is not represented in both the training and validation sets.
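A minimal sketch of these variants using scikit-learn's splitter classes; the toy arrays, labels, and groups are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, LeaveOneOut, StratifiedKFold

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])

kf = KFold(n_splits=5)             # plain k-fold
skf = StratifiedKFold(n_splits=5)  # preserves class proportions per fold
loo = LeaveOneOut()                # k = number of samples
gkf = GroupKFold(n_splits=5)       # keeps each group on a single side of the split

for name, cv in [("k-fold", kf), ("stratified", skf), ("group", gkf)]:
    train_idx, val_idx = next(iter(cv.split(X, y, groups)))
    print(name, "first validation fold:", val_idx)
print("LOOCV folds:", loo.get_n_splits(X))  # one fold per sample
```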