Data Science for Civil Engineering Unit 4 Notes
Probabilistic models
Probabilistic models are a fundamental concept in machine learning and statistics. They are
used to model uncertainty and represent the likelihood of different events or outcomes. These
models are particularly useful when dealing with situations where there is inherent randomness
or uncertainty in the data.
Here are some key points about probabilistic models in machine learning:
1. Probability Distribution: Probabilistic models describe the probability distribution of random
variables. They help us understand how likely different outcomes are.
2. Generative Models: Some probabilistic models, like Gaussian Mixture Models (GMMs) or
Hidden Markov Models (HMMs), are generative models. They can generate new data points that
resemble the distribution of the training data.
3. Bayesian Inference: Bayesian models are a type of probabilistic model that uses Bayes'
theorem to update probabilities based on new evidence. They are widely used for tasks like
parameter estimation and hypothesis testing.
4. Uncertainty Quantification: Probabilistic models provide a way to quantify uncertainty in
predictions. For example, a model might not only make a binary classification but also provide a
probability score indicating the confidence in that classification.
5. Examples: Probabilistic models can include simple models like the Gaussian distribution for
modeling continuous data, or more complex models like Markov Random Fields (MRFs) for
structured data.
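As a small illustration of points 2 and 4 above, the following sketch fits a Gaussian Mixture Model, samples new points from it, and reports membership probabilities. It assumes scikit-learn and NumPy are available; the one-dimensional data is synthetic and purely illustrative.

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D measurements drawn from two different regimes
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(5.0, 1.0, 200),
                       rng.normal(12.0, 2.0, 200)]).reshape(-1, 1)

# Fit a two-component GMM (a generative probabilistic model)
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print("Component means:", gmm.means_.ravel())
print("Component weights:", gmm.weights_)

# Generative use: sample new points that resemble the training distribution
new_points, _ = gmm.sample(5)
print("Sampled points:", new_points.ravel())

# Uncertainty quantification: per-component membership probabilities
print("Membership probabilities:\n", gmm.predict_proba(data[:3]))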
Regression models
Regression models are a class of machine learning models used for predicting a continuous
numerical value based on input features. They are commonly employed in various fields,
including economics, finance, biology, and engineering. Here are some key points about
regression models:
1. Objective: The primary goal of regression models is to establish a relationship between the
input variables (features) and the target variable (output) in a way that allows us to make
predictions about the target variable for new, unseen data points.
2. Types of Regression:
- Linear Regression: Linear regression assumes a linear relationship between the features and
the target variable. It aims to find the best-fit line (in simple linear regression) or hyperplane (in
multiple linear regression) that minimizes the sum of squared errors.
- Polynomial Regression: Polynomial regression extends linear regression by introducing
polynomial terms to capture nonlinear relationships between features and the target.
- Ridge and Lasso Regression: These are techniques used to combat overfitting in linear
regression by introducing regularization terms.
- Logistic Regression: Despite its name, logistic regression is used for binary classification
tasks, where it models the probability of a sample belonging to a particular class.
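The sketch below contrasts ordinary least squares with ridge regression on synthetic data, assuming scikit-learn and NumPy are available; the feature meanings and the regularization strength alpha=1.0 are illustrative, not recommendations.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data: three features with a known linear relationship plus noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 3))
y = 4.0 * X[:, 0] - 2.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 1.0, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

ols = LinearRegression().fit(X_train, y_train)    # minimizes the sum of squared errors
ridge = Ridge(alpha=1.0).fit(X_train, y_train)    # adds an L2 penalty to shrink coefficients

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("OLS R^2 on test data:  ", ols.score(X_test, y_test))
print("Ridge R^2 on test data:", ridge.score(X_test, y_test))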
Classification models
Classification models are a type of machine learning model used to categorize input data into
predefined classes or labels. The k-Nearest Neighbors (k-NN) algorithm is a popular and simple
classification technique: it keeps the training data and assigns a new sample the majority class
among its k closest training points in feature space.
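A minimal k-NN sketch, assuming scikit-learn is available; the Iris dataset and k = 5 are used only for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# k-NN stores the training data and labels a new point by the majority class
# among its k nearest training points (Euclidean distance by default)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("Predicted classes for five test samples:", knn.predict(X_test[:5]))
print("Test accuracy:", knn.score(X_test, y_test))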
Decision Trees
Decision Trees are a popular and interpretable class of machine learning algorithms used for
both classification and regression tasks. They model decisions as a tree-like structure, where
each internal node represents a test on a feature, each branch represents an outcome of that test,
and each leaf node represents a class label (in classification) or a numerical value (in regression).
Here are some key points about Decision Trees:
1. Tree Structure:
- Decision Trees are hierarchical structures that start with a root node and split into internal
nodes and leaf nodes as you move down the tree.
- Internal nodes represent decisions or conditions based on feature values.
- Leaf nodes represent the predicted class (in classification) or the predicted value (in
regression).
2. Splitting Criteria:
- At each internal node, a Decision Tree algorithm selects a feature and a threshold (or value)
to split the data into two or more subsets.
- The choice of splitting criteria depends on the task. Common criteria include Gini impurity
(for classification) and mean squared error (for regression).
3. Training:
- The training process of a Decision Tree involves recursively splitting the data based on the
selected criteria until a stopping condition is met.
- The stopping conditions can include a maximum tree depth, a minimum number of samples
in a leaf node, or reaching a purity threshold in classification.
4. Pruning:
- Decision Trees are prone to overfitting, where they capture noise in the training data. Pruning
is a technique used to trim the tree by removing branches that do not significantly improve
predictive accuracy on validation data.
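The following sketch trains a depth-limited Decision Tree classifier, assuming scikit-learn is available; the dataset and the stopping parameters (max_depth, min_samples_leaf) are illustrative stand-ins for the criteria discussed above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion="gini",    # Gini impurity as the splitting criterion
                              max_depth=3,         # stopping condition: maximum tree depth
                              min_samples_leaf=5,  # stopping condition: minimum samples per leaf
                              random_state=0).fit(X_train, y_train)

print(export_text(tree))                           # readable view of the learned decision rules
print("Test accuracy:", tree.score(X_test, y_test))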
Hyperparameter Tuning
1. Hyperparameters: Hyperparameters are settings chosen before training that control the behavior
of a machine learning model; unlike model parameters, they are not learned from the data. Common
hyperparameters include the learning rate, regularization strength, the number of layers in a
neural network, and the depth of a decision tree. Tuning is the search for the hyperparameter
values that give the best validation performance, commonly via grid search, random search, or
cross-validation, as in the sketch below.
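A minimal tuning sketch using cross-validated grid search, assuming scikit-learn is available; the parameter grid here is illustrative rather than a recommended search space.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to try
param_grid = {"max_depth": [2, 3, 4, 5],
              "min_samples_leaf": [1, 5, 10]}

# Evaluate every combination with 5-fold cross-validation
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)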
Model-Based Diagnostics
Model-based diagnostics is a set of techniques used to assess the performance and reliability of
statistical and machine learning models. This involves examining the assumptions and evaluating
the quality of the model's predictions. Here are some key aspects:
1. Assumption Checking: In model-based diagnostics, it is essential to check the underlying
assumptions of the model. For example, in linear regression, assumptions like linearity,
independence of errors, and homoscedasticity should be validated.
2. Residual Analysis: Residuals, which are the differences between the observed values and the
predicted values, are analyzed to ensure they meet the assumptions of the model. This includes
checking for patterns or trends in the residuals.
3. Outlier Detection: Identifying outliers or influential data points that can significantly impact
the model's performance or assumptions is a crucial part of model-based diagnostics.
4. Model Fit Statistics: Various statistics, such as R-squared, AIC (Akaike Information
Criterion), and BIC (Bayesian Information Criterion), are used to evaluate the quality and fit of
the model.
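The sketch below computes residuals, R-squared, and a simple normality check for a fitted linear model, assuming scikit-learn, NumPy, and SciPy are available; the data is synthetic and the Shapiro-Wilk test stands in for a fuller set of assumption checks.

import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship plus noise
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(150, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, 150)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)               # observed minus predicted values

print("R-squared:", model.score(X, y))
print("Mean residual (should be close to 0):", residuals.mean())

# One simple assumption check: are the residuals approximately normal?
# Plotting residuals against predictions would also reveal trends or heteroscedasticity.
stat, p_value = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", p_value)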
Anomaly detection
Anomaly detection, also known as outlier detection, is a machine learning and data analysis
technique used to identify unusual patterns or data points in a dataset that do not conform to
expected behavior. Anomalies, also called outliers, can be indicative of errors, fraud, or other
interesting and potentially significant events. Here are key aspects of anomaly detection:
1. Types of Anomalies:
- Anomalies can be broadly categorized into three types:
- Point Anomalies: These are individual data points that deviate significantly from the norm.
Examples include fraudulent credit card transactions, sensor malfunctions, or rare disease cases.
- Contextual Anomalies: These anomalies are context-dependent, meaning they are
considered unusual in a specific context but not in others. For instance, a high temperature
reading may be normal in summer but anomalous if the same reading occurs in winter.
- Collective Anomalies: Collective anomalies involve a group of data points that exhibit
abnormal behavior when considered together but may appear normal individually. Examples
include network traffic spikes or click fraud by a coordinated group of users.
2. Detection Techniques:
- Anomaly detection can be performed using various techniques, including statistical methods,
machine learning algorithms, and domain-specific rule-based approaches.
- Statistical methods involve calculating summary statistics (e.g., mean, standard deviation)
and identifying data points that fall outside a predefined range.
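As a small illustration of the statistical approach, the sketch below flags point anomalies with a 3-sigma z-score rule, assuming NumPy is available; the sensor readings and injected outliers are synthetic.

import numpy as np

# Synthetic sensor readings with three injected point anomalies
rng = np.random.default_rng(3)
readings = rng.normal(50.0, 5.0, 500)
readings[[40, 200, 433]] = [95.0, 4.0, 110.0]

# Standardize and flag points more than 3 standard deviations from the mean
z_scores = (readings - readings.mean()) / readings.std()
anomalies = np.where(np.abs(z_scores) > 3)[0]

print("Anomalous indices:", anomalies)
print("Anomalous values:", readings[anomalies])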