Datasets in Machine Learning (Unit 2)
1) What is a Dataset?
A dataset is an organized collection of data used to train, validate, and test machine learning
models. Its main components are:
1. Instances (Samples):
o Each instance is a single data point or record.
o Examples: A row in a spreadsheet, a single image, a document.
2. Features (Attributes or Variables):
o Features are individual measurable properties or characteristics of the
instances.
o Examples: Age, height, and weight in a dataset about people; pixel values in
an image.
3. Labels (Targets or Outputs):
o Labels are the outcome or the variable that a model tries to predict (used in
supervised learning).
o Examples: Species of a flower in a dataset of iris flowers, the price of a house.
Types of Datasets :-
1. Structured Datasets:
o Organized in a tabular format with rows and columns.
o Example: Excel spreadsheets, SQL databases.
2. Unstructured Datasets:
o Do not have a predefined structure.
o Example: Text documents, images, audio files.
3. Semi-Structured Datasets:
o Contain elements of both structured and unstructured data.
o Example: JSON files, XML files.
Dataset Formats
1. CSV Files:
o Tabular data stored as comma-separated values in plain text.
o Example: A spreadsheet of records exported as a .csv file.
2. JSON Files:
o Semi-structured data stored as collections of key-value pairs.
o Example:
json
[
{"Name": "John", "Age": 28, "Salary": 50000},
{"Name": "Jane", "Age": 32, "Salary": 60000}
]
3. SQL Databases:
o Structured data stored in relational database management systems (RDBMS).
o Example: A table in an SQL database with columns and rows.
4. Images:
o Stored in formats like JPEG, PNG, and GIF.
o Example: A dataset of images of handwritten digits.
5. Text Files:
o Contain unstructured data in formats like TXT, DOCX, and PDF.
o Example: A dataset of movie reviews.
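The following is a minimal Python sketch of loading two of the formats above with the pandas library; the file names (people.csv, people.json) are hypothetical placeholders.
python
# A minimal sketch of loading two common dataset formats with pandas
# (assumes pandas is installed; the file names are hypothetical).
import pandas as pd

# Structured data: a CSV file with one row per instance
df_csv = pd.read_csv("people.csv")     # columns such as Name, Age, Salary

# Semi-structured data: a JSON file holding a list of records
df_json = pd.read_json("people.json")  # same records, parsed into a table

print(df_csv.head())
print(df_json.dtypes)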
How Datasets Are Used:-
1. Training:
o The dataset is used to train a machine learning model by learning patterns and
relationships in the data.
2. Validation:
o A separate portion of the dataset used to tune hyperparameters and evaluate
the model during training to prevent overfitting.
3. Testing:
o A final portion of the dataset used to assess the model's performance on
unseen data to ensure it generalizes well.
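As a minimal sketch, the three-way split described above can be produced with scikit-learn's train_test_split; the 70/15/15 proportions and the toy arrays below are illustrative assumptions.
python
# A minimal sketch of a train/validation/test split using scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)   # toy feature matrix: 20 instances, 2 features
y = np.arange(20) % 2              # toy binary labels

# First split off 30% of the data, then halve it into validation and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 14, 3, 3 instances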
Example: The Iris Dataset
Features:
o Sepal length
o Sepal width
o Petal length
o Petal width
Label:
o Species of the iris flower (Setosa, Versicolour, Virginica)
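A minimal sketch of inspecting these features and labels using the copy of the Iris dataset bundled with scikit-learn:
python
# Inspect the Iris dataset's features and label using scikit-learn.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)   # sepal length, sepal width, petal length, petal width
print(iris.target_names)    # setosa, versicolor, virginica
print(iris.data.shape)      # (150, 4): 150 instances, 4 features
print(iris.target[:5])      # integer-encoded species labels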
Sources of Datasets
Common sources include public repositories such as Kaggle and the UCI Machine Learning
Repository, government open-data portals, internal databases, APIs, and web scraping.
Importance of Datasets
Datasets are crucial for the development and evaluation of models and algorithms. The
quality and diversity of the dataset significantly impact the model's ability to learn and
generalize to new data. Properly curated and preprocessed datasets lead to more accurate and
reliable models.
Types of data in datasets:-
Datasets can include various types of data, each suitable for different analytical tasks and
applications. Here are the primary types of data found in datasets:
1. Numerical Data
Numerical data represents quantifiable measurements and is usually divided into two
subcategories:
Continuous Data:
o Can take any value within a range.
o Examples: Height, weight, temperature, time.
Discrete Data:
o Takes distinct, separate values.
o Examples: Number of children, number of cars, shoe size.
2. Categorical Data
Categorical data represents qualitative attributes and is divided into categories. It can be
further split into:
Nominal Data:
o Categories with no inherent order.
o Examples: Gender, color, nationality.
Ordinal Data:
o Categories with a meaningful order but no fixed interval between categories.
o Examples: Education level (high school, bachelor’s, master’s, PhD), customer
satisfaction ratings (poor, fair, good, excellent).
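A minimal pandas sketch of separating numerical and categorical columns; the example records are made up for illustration.
python
# Separate numerical and categorical columns in a pandas DataFrame.
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47],                            # discrete numerical
    "height_cm": [170.2, 165.5, 180.1],             # continuous numerical
    "nationality": ["IN", "US", "FR"],              # nominal categorical
    "satisfaction": ["poor", "good", "excellent"],  # ordinal categorical
})

numerical_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(include="object").columns
print(list(numerical_cols))    # ['age', 'height_cm']
print(list(categorical_cols))  # ['nationality', 'satisfaction']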
3. Time Series Data
Time series data consists of observations collected at regular intervals over time. This type of
data is crucial for tasks involving trends and patterns over time.
Examples: Stock prices, weather data, sales figures over months or years.
4. Text Data
Text data consists of words and sentences and is often unstructured or semi-structured.
5. Image Data
Image data includes visual information stored in formats such as JPEG, PNG, and GIF. Each
image can be considered as an array of pixel values.
6. Audio Data
Audio data consists of sound recordings, which can be in various formats like WAV, MP3,
and FLAC.
7. Video Data
Video data includes sequences of images (frames) along with audio, capturing motion and
sound over time.
8. Sensor Data
Sensor data is collected from various sensors that measure physical quantities.
9. Geospatial Data
Geospatial data includes information about geographic locations and features on the Earth’s
surface.
10. Binary Data
Binary data consists of data in a binary format, often used in computer systems and digital
communications.
11. Mixed Data Types
In many real-world datasets, you may encounter mixed data types where different types of
data are combined.
Example: A dataset containing patient records might include numerical data (age,
weight), categorical data (gender, diagnosis), and text data (doctor’s notes).
Understanding the types of data in a dataset is crucial because different types of data require
different preprocessing techniques, analytical methods, and model types. Identifying the data
type helps in selecting the appropriate algorithms and tools for analysis and modeling.
Need for Datasets:-
Datasets are fundamental in various fields, especially in data science, machine learning,
and artificial intelligence. Here are key reasons why datasets are crucial:
Pattern Recognition: Datasets provide the examples from which machine learning
models learn patterns, relationships, and dependencies in the data.
Feature Learning: Models learn which features are most important for making
predictions or classifications based on the data provided.
Validation: A portion of the dataset is used to tune model parameters and select the
best model configuration.
Testing: Datasets are essential for evaluating the performance of a trained model on
unseen data, ensuring that the model generalizes well to new, real-world data.
Data Analysis: Datasets allow for the analysis of trends, patterns, and insights that
inform decision-making processes in businesses and organizations.
Reporting: Datasets are used to generate reports, dashboards, and visualizations that
communicate important information to stakeholders.
User Behavior Analysis: Companies use datasets to understand how users interact
with their products and services, leading to improvements and new features.
Personalization: Datasets enable personalized experiences for users by analyzing
their preferences and behaviors.
Scenario Simulation: Datasets are used to simulate different scenarios and test the
outcomes, which is particularly useful in fields like finance, healthcare, and
engineering.
Stress Testing: Datasets help in stress testing models and systems to ensure they can
handle extreme conditions and rare events.
Bias Detection: Datasets are analyzed to detect and mitigate biases in machine
learning models and decision-making processes.
Fair Representation: Diverse and representative datasets help ensure that models are
fair and do not discriminate against any particular group.
In essence, datasets are the backbone of data-driven processes, enabling the training,
validation, and testing of machine learning models, driving research and development,
informing decisions, and facilitating innovation. Their importance cannot be overstated, as
they provide the foundation for understanding and leveraging data in various applications and
industries.
The Machine Learning Lifecycle:-
The machine learning lifecycle encompasses the stages involved in developing, deploying,
and maintaining machine learning models. Here’s a detailed overview of each phase:
1. Problem Definition
Objective Setting: Clearly define the problem you want to solve and set the
objectives.
Understanding the Context: Understand the business or application context,
including constraints and requirements.
2. Data Collection
Gathering Data: Collect relevant data from various sources such as databases, APIs,
and web scraping.
Data Integration: Combine data from different sources to create a comprehensive
dataset.
3. Data Preparation
Data Cleaning: Handle missing values, remove duplicates, correct errors, and deal
with outliers.
Data Transformation: Normalize, standardize, or encode data into suitable formats
for analysis.
Feature Engineering: Create new features from existing data that can help improve
model performance.
Data Splitting: Split the data into training, validation, and test sets to ensure unbiased
evaluation of the model.
5. Model Selection
Algorithm Choice: Choose candidate algorithms suited to the problem type (e.g.,
classification, regression, clustering) and the characteristics of the data.
6. Model Training
Training the Model: Use the training data to train the chosen model.
Hyperparameter Tuning: Optimize model hyperparameters to improve
performance.
Cross-Validation: Use cross-validation techniques to ensure the model’s robustness
and generalization ability.
7. Model Evaluation
Performance Metrics: Evaluate the model using metrics such as accuracy, precision,
recall, F1-score, RMSE, etc.
Validation Set: Assess the model’s performance on the validation set to fine-tune and
select the best model.
Test Set: Evaluate the final model on the test set to measure its generalization
performance.
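A minimal sketch of computing these metrics with scikit-learn; the true and predicted labels below are hypothetical values for illustration.
python
# Compute common classification metrics with scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))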
8. Model Deployment
Deployment: Integrate the trained model into a production environment where it can
make predictions on new data.
9. Model Interpretation and Communication
Understanding Model Predictions: Use techniques like SHAP, LIME, and feature
importance to explain model predictions.
Stakeholder Communication: Communicate model insights and performance to
stakeholders in an understandable manner.
10. Monitoring and Continuous Improvement
Feedback Loop: Incorporate feedback from users and stakeholders to improve the
model and its applications.
Research and Development: Stay updated with the latest advancements in machine
learning to continually improve the model.
The machine learning lifecycle is an iterative process that involves defining the problem,
collecting and preparing data, selecting and training models, deploying the model, and
continuously monitoring and improving it. Each stage is crucial for building robust, reliable,
and effective machine learning applications.
Data Preprocessing:-
Data preprocessing is a critical step in the machine learning pipeline, involving various
techniques to clean and transform raw data into a format suitable for analysis and modeling.
Proper data preprocessing ensures that the model can learn effectively from the data and
produce accurate predictions. Here are the main steps involved in data preprocessing:
1. Data Cleaning
Handling Missing Values:
Removal: Remove rows or columns with missing values if they are not significant or
if the dataset is large.
Imputation: Fill missing values with mean, median, mode, or use more sophisticated
methods like K-nearest neighbors (KNN) imputation or regression imputation.
Forward/Backward Fill: For time series data, fill missing values with the previous
or next observation.
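A minimal pandas sketch of the three strategies above (removal, imputation, forward fill); the small DataFrame is made up for illustration.
python
# Handle missing values with removal, mean imputation, and forward fill.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 47, 31], "sales": [100, 120, np.nan, 90]})

dropped = df.dropna()                             # removal: drop rows with missing values
imputed = df.fillna(df.mean(numeric_only=True))   # imputation: fill with column means
ffilled = df.ffill()                              # forward fill (useful for time series)

print(imputed)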
Removing Duplicates:
Identify and remove duplicate records to avoid redundancy and bias in the dataset.
Handling Outliers:
Identification: Use statistical methods (like Z-score, IQR) or visualization tools (like
box plots) to identify outliers.
Removal/Transformation: Remove outliers or transform them using techniques like
log transformation to reduce their impact.
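A minimal sketch of IQR-based outlier detection on a single numeric column; the data values are hypothetical.
python
# Identify and remove outliers using the interquartile range (IQR) rule.
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])   # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
cleaned = s[(s >= lower) & (s <= upper)]
print(outliers.tolist())   # [95]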
2. Data Transformation
Normalization/Standardization:
Normalization: Scale data to a range, typically [0, 1], using min-max scaling.
Standardization: Transform data to have a mean of 0 and a standard deviation of 1
using z-score normalization.
Feature Scaling:
Ensures that features are on a similar scale to improve the performance of gradient-
based algorithms.
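A minimal sketch contrasting min-max normalization and z-score standardization using scikit-learn scalers; the feature values are illustrative.
python
# Compare min-max normalization with z-score standardization.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

X_norm = MinMaxScaler().fit_transform(X)    # each column scaled to [0, 1]
X_std = StandardScaler().fit_transform(X)   # each column: mean 0, std 1

print(X_norm)
print(X_std.mean(axis=0), X_std.std(axis=0))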
3. Feature Engineering
Domain Knowledge: Use domain knowledge to create new features that capture
important aspects of the data.
Polynomial Features: Create polynomial combinations of existing features to capture
non-linear relationships.
Feature Selection:
Filter Methods: Use statistical tests to select features with high correlation to the
target variable.
Wrapper Methods: Use algorithms like Recursive Feature Elimination (RFE) to
select the best subset of features.
Embedded Methods: Use algorithms that have built-in feature selection mechanisms
like Lasso and Ridge regression.
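A minimal sketch of wrapper-style feature selection with Recursive Feature Elimination (RFE) on a synthetic dataset; the model choice and parameter values are assumptions for illustration.
python
# Select a subset of features with Recursive Feature Elimination (RFE).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 marks features kept in the final subset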
4. Data Reduction
Dimensionality Reduction:
Use techniques such as Principal Component Analysis (PCA) to reduce the number of
features while retaining most of the useful information.
Sampling:
Work with a representative subset of the records when the full dataset is too large to
process efficiently.
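A minimal sketch of PCA-based dimensionality reduction and simple random sampling; the component count and sample fraction are illustrative choices.
python
# Reduce dimensionality with PCA and take a random sample of the rows.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                                # 150 instances, 4 features

X_reduced = PCA(n_components=2).fit_transform(X)    # project onto 2 components
print(X_reduced.shape)                              # (150, 2)

sample = pd.DataFrame(X).sample(frac=0.2, random_state=0)   # 20% random sample
print(sample.shape)                                 # (30, 4)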
5. Data Splitting
Train-Validation-Test Split:
Divide the data into separate training, validation, and test sets so that hyperparameter
tuning and final evaluation are performed on data the model has not seen during training.
Cross-Validation:
K-Fold Cross-Validation: Split the data into k subsets and train the model k times,
each time using a different subset as the validation set and the remaining as the
training set.
Stratified K-Fold Cross-Validation: Ensures that each fold has a proportional
representation of each class, useful for imbalanced datasets.
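A minimal sketch of stratified k-fold cross-validation with scikit-learn; the choice of model and k = 5 are assumptions for illustration.
python
# Evaluate a model with stratified 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores)          # accuracy on each of the 5 folds
print(scores.mean())   # average cross-validated accuracy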
Data preprocessing is essential for preparing raw data for analysis and modeling. It involves
cleaning the data, transforming it into a suitable format, engineering new features, reducing
dimensionality, and splitting it into training, validation, and test sets. Proper preprocessing
ensures that the machine learning model can learn effectively and generalize well to new
data.
Artificial Intelligence vs. Machine Learning:-
Artificial Intelligence (AI) and Machine Learning (ML) are closely related fields, but they are
distinct in their scope and objectives. Here's a detailed comparison to highlight their
differences:
Artificial Intelligence (AI)
Definition:
AI is the broad field of building systems that can perform tasks normally requiring
human-like intelligence, such as reasoning, perception, and decision-making.
Goals:
To create systems that can reason, learn, perceive, and interact with the environment
in a human-like manner.
Techniques:
Expert Systems: Use rules and logic to mimic the decision-making ability of a
human expert.
Natural Language Processing (NLP): Enable machines to understand and respond
to human language.
Computer Vision: Enable machines to interpret and understand visual information
from the world.
Robotics: Involves designing and building robots that can carry out tasks
autonomously.
Speech Recognition: Enable machines to understand and process human speech.
Applications:
Self-driving cars
Virtual assistants (e.g., Siri, Alexa)
Recommendation systems
Healthcare diagnostics
Game playing (e.g., chess, Go)
Machine Learning (ML)
Definition:
ML is a subset of AI that focuses on developing algorithms that learn patterns from data
in order to make predictions or decisions.
Scope:
A specific approach within the broader AI field, emphasizing data-driven learning and
adaptation.
Goals:
To develop systems that can automatically improve and adapt their performance by
learning from experience without being explicitly programmed.
Techniques:
Supervised Learning: Learn from labeled examples for tasks such as classification and
regression.
Unsupervised Learning: Discover structure in unlabeled data, such as clustering.
Reinforcement Learning: Learn by interacting with an environment and receiving rewards.
Applications:-
Fraud detection
Recommendation engines
Image and speech recognition
Key Differences
Scope:
AI: Encompasses a broad range of technologies and goals aimed at mimicking human
intelligence.
ML: A specific branch of AI focused on data-driven learning and prediction.
Techniques:
AI: Includes rule-based expert systems, search and planning, robotics, and other
non-learning approaches in addition to ML.
ML: Relies on data-driven learning algorithms for classification, regression, clustering,
and reinforcement learning.
Applications:
AI: Applications are broader and can include non-ML techniques (e.g., expert systems
in healthcare, robotics).
ML: Applications are specifically about making predictions or decisions based on
data (e.g., fraud detection, recommendation engines).
Development Focus:
AI: Involves creating systems with broader cognitive abilities, including reasoning,
planning, learning, and natural language understanding.
ML: Focuses on optimizing specific algorithms for tasks like classification,
regression, clustering, and reinforcement learning.
AI is a broad field aiming to create intelligent systems that can perform tasks requiring
human-like intelligence. Machine learning is a subset of AI that focuses on developing
algorithms that learn from data to make predictions or decisions. While all machine learning
is a form of AI, not all AI is machine learning.
Neural Networks:-
A neural network is a computational model inspired by the way biological neural networks in
the human brain process information. It consists of interconnected units called neurons,
which work together to solve specific problems.
Key Components of a Neural Network
1. Neurons (Nodes):
o Basic units of a neural network.
o Each neuron receives inputs, processes them, and produces an output.
o Neurons are organized into layers.
2. Layers:
o Input Layer: Receives the initial data.
o Hidden Layers: Intermediate layers between the input and output layers
where computations are performed. The number of hidden layers and the
number of neurons in each layer define the depth and capacity of the network.
o Output Layer: Produces the final output of the network.
3. Weights and Biases:
o Weights: Parameters that determine the strength of the connection between
neurons. Each connection has an associated weight.
o Biases: Additional parameters that help adjust the output along with the
weighted sum of inputs.
4. Activation Functions:
o Functions applied to the output of each neuron. They introduce non-linearity,
allowing the network to model complex relationships.
o Common activation functions include:
Sigmoid: σ(x) = 1 / (1 + e^(-x))
Tanh: tanh(x) = 2 / (1 + e^(-2x)) - 1
ReLU (Rectified Linear Unit): ReLU(x) = max(0, x)
Leaky ReLU: LeakyReLU(x) = max(0.01x, x)
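A minimal NumPy sketch of the activation functions listed above (the test values are arbitrary):
python
# NumPy implementations of common activation functions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)                  # equivalent to 2 / (1 + e^(-2x)) - 1

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x), sep="\n")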
How Neural Networks Work
1. Forward Propagation:
o Input data is passed through the network layer by layer.
o Each neuron computes a weighted sum of its inputs, applies an activation
function, and passes the result to the next layer.
o The final output is produced in the output layer.
2. Loss Function:
o Measures the difference between the predicted output and the actual target
values.
o Common loss functions include Mean Squared Error (MSE) for regression and
Cross-Entropy Loss for classification.
3. Backward Propagation (Backpropagation):
o The process of updating the weights and biases based on the loss.
o Uses gradient descent to minimize the loss function.
o Gradients are calculated using the chain rule of calculus to propagate errors
backward through the network.
4. Training Process:
o Initialize weights and biases randomly or using a specific initialization
technique.
o Repeat the following steps for a number of epochs or until convergence:
Perform forward propagation to compute the output.
Compute the loss.
Perform backward propagation to compute the gradients.
Update weights and biases using gradient descent or its variants (e.g.,
stochastic gradient descent, Adam optimizer).
Mathematical Representation
1. Forward Propagation:
o Let X be the input vector.
o Let W1 and b1 be the weights and biases for the hidden layer.
o Let W2 and b2 be the weights and biases for the output layer.
Hidden Layer:
Z1 = W1·X + b1, A1 = activation(Z1)
Output Layer:
Z2 = W2·A1 + b2, A2 = σ(Z2)
2. Loss Calculation:
Loss = Binary Cross-Entropy(A2, Y)
3. Backward Propagation:
o Compute gradients of the loss with respect to weights and biases.
o Update weights and biases using the gradients.
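A minimal NumPy sketch of one forward and backward pass for a network with a single hidden layer, a sigmoid output, and binary cross-entropy loss; the layer sizes, learning rate, and toy data are all illustrative assumptions.
python
# One forward and backward pass for a tiny two-layer network.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                          # 4 instances, 3 features
Y = np.array([[0.0], [1.0], [1.0], [0.0]])           # binary targets

W1, b1 = rng.normal(size=(3, 5)), np.zeros((1, 5))   # hidden layer parameters
W2, b2 = rng.normal(size=(5, 1)), np.zeros((1, 1))   # output layer parameters

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Forward propagation
Z1 = X @ W1 + b1
A1 = np.tanh(Z1)
Z2 = A1 @ W2 + b2
A2 = sigmoid(Z2)

# Binary cross-entropy loss
loss = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

# Backward propagation (gradients via the chain rule)
m = X.shape[0]
dZ2 = (A2 - Y) / m
dW2, db2 = A1.T @ dZ2, dZ2.sum(axis=0, keepdims=True)
dZ1 = (dZ2 @ W2.T) * (1 - A1 ** 2)                   # derivative of tanh is 1 - tanh^2
dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0, keepdims=True)

# Gradient descent update
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print(loss)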
Neural networks are powerful models inspired by the human brain, consisting of
interconnected neurons organized into layers. They use forward propagation to compute
outputs, a loss function to measure errors, and backward propagation to update parameters
and minimize the loss. With various architectures like FNNs, RNNs, and CNNs, neural
networks can tackle a wide range of tasks, from image and speech recognition to natural
language processing and beyond.