
Unit 2

Datasets in Machine Learning

1) What is a dataset?

A dataset is a collection of data that is organized in a structured or semi-structured manner, typically used for analysis, research, or machine learning. It consists of individual data points, known as instances or samples, which can include a variety of attributes or features. Datasets are essential for developing and evaluating algorithms and models in various fields.

Key Components of a Dataset

1. Instances (Samples):
o Each instance is a single data point or record.
o Examples: A row in a spreadsheet, a single image, a document.
2. Features (Attributes or Variables):
o Features are individual measurable properties or characteristics of the
instances.
o Examples: Age, height, and weight in a dataset about people; pixel values in
an image.
3. Labels (Targets or Outputs):
o Labels are the outcome or the variable that a model tries to predict (used in
supervised learning).
o Examples: Species of a flower in a dataset of iris flowers, the price of a house.

Types of Datasets :-

1. Structured Datasets:
o Organized in a tabular format with rows and columns.
o Example: Excel spreadsheets, SQL databases.
2. Unstructured Datasets:
o Do not have a predefined structure.
o Example: Text documents, images, audio files.
3. Semi-Structured Datasets:
o Contain elements of both structured and unstructured data.
o Example: JSON files, XML files.

Dataset Formats

1. CSV (Comma-Separated Values):
o A plain text format where each line represents a record, and columns are separated by commas.
o Example (a short Python loading sketch follows this list):

Name, Age, Salary
John, 28, 50000
Jane, 32, 60000

2. JSON (JavaScript Object Notation):
o A lightweight data-interchange format that is easy for humans to read and write.
o Example:

[
  {"Name": "John", "Age": 28, "Salary": 50000},
  {"Name": "Jane", "Age": 32, "Salary": 60000}
]

3. SQL Databases:
o Structured data stored in relational database management systems (RDBMS).
o Example: A table in an SQL database with columns and rows.
4. Images:
o Stored in formats like JPEG, PNG, and GIF.
o Example: A dataset of images of handwritten digits.
5. Text Files:
o Contain unstructured data in formats like TXT, DOCX, and PDF.
o Example: A dataset of movie reviews.
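To tie the formats above together, here is a minimal Python sketch for reading the CSV and JSON examples into a table, assuming the pandas library is available; the file names employees.csv and employees.json are hypothetical.

import pandas as pd

# Read a CSV file (hypothetical path); each line becomes a row, each comma-separated
# field becomes a column.
df_csv = pd.read_csv("employees.csv")

# Read the equivalent JSON file (a list of records, as in the example above).
df_json = pd.read_json("employees.json")

print(df_csv.head())    # preview the first rows
print(df_json.dtypes)   # inspect the inferred column types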

Usage in Machine Learning

1. Training:
o The dataset is used to train a machine learning model by learning patterns and
relationships in the data.
2. Validation:
o A separate portion of the dataset used to tune hyperparameters and evaluate
the model during training to prevent overfitting.
3. Testing:
o A final portion of the dataset used to assess the model's performance on
unseen data to ensure it generalizes well.

Example of a Dataset: The Iris Dataset

The Iris dataset is a classic example used in machine learning:

 Features:
o Sepal length
o Sepal width
o Petal length
o Petal width
 Label:
o Species of the iris flower (Setosa, Versicolour, Virginica)
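As a minimal sketch, the Iris dataset can be loaded directly with scikit-learn (assumed installed):

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data             # features: sepal length/width, petal length/width
y = iris.target           # labels: 0 = Setosa, 1 = Versicolour, 2 = Virginica

print(X.shape)            # (150, 4): 150 instances, 4 features
print(iris.target_names)  # the three species names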

Sources of Datasets

1. UCI Machine Learning Repository: A collection of databases, domain theories, and datasets.
2. Kaggle: A platform for data science competitions that hosts various datasets.
3. Google Dataset Search: A tool to find datasets stored across the web.
4. Amazon Web Services (AWS) Public Datasets: Datasets available for use in AWS.

Importance of Datasets

Datasets are crucial for the development and evaluation of models and algorithms. The
quality and diversity of the dataset significantly impact the model's ability to learn and
generalize to new data. Properly curated and preprocessed datasets lead to more accurate and
reliable models.

Types of data in datasets:-

Datasets can include various types of data, each suitable for different analytical tasks and
applications. Here are the primary types of data found in datasets:

1. Numerical Data

Numerical data represents quantifiable measurements and is usually divided into two
subcategories:

 Continuous Data:
o Can take any value within a range.
o Examples: Height, weight, temperature, time.
 Discrete Data:
o Takes distinct, separate values.
o Examples: Number of children, number of cars, shoe size.

2. Categorical Data

Categorical data represents qualitative attributes and is divided into categories. It can be
further split into:

 Nominal Data:
o Categories with no inherent order.
o Examples: Gender, color, nationality.
 Ordinal Data:
o Categories with a meaningful order but no fixed interval between categories.
o Examples: Education level (high school, bachelor’s, master’s, PhD), customer
satisfaction ratings (poor, fair, good, excellent).

3. Time Series Data

Time series data consists of observations collected at regular intervals over time. This type of
data is crucial for tasks involving trends and patterns over time.

 Examples: Stock prices, weather data, sales figures over months or years.

4. Text Data

Text data consists of words and sentences and is often unstructured or semi-structured.

 Examples: Customer reviews, social media posts, articles, books.

5. Image Data

Image data includes visual information stored in formats such as JPEG, PNG, and GIF. Each
image can be considered as an array of pixel values.

 Examples: Photos, X-rays, satellite images.

6. Audio Data

Audio data consists of sound recordings, which can be in various formats like WAV, MP3,
and FLAC.

 Examples: Speech recordings, music tracks, environmental sounds.

7. Video Data

Video data includes sequences of images (frames) along with audio, capturing motion and
sound over time.

 Examples: Movies, surveillance footage, video lectures.

8. Sensor Data

Sensor data is collected from various sensors that measure physical quantities.

 Examples: Temperature sensors, accelerometers, gyroscopes, GPS data.

9. Geospatial Data

Geospatial data includes information about geographic locations and features on the Earth’s
surface.

 Examples: Maps, satellite imagery, GPS coordinates.

10. Binary Data

Binary data consists of data in a binary format, often used in computer systems and digital
communications.

 Examples: Digital images in binary format, machine code.

Mixed Data Types

In many real-world datasets, you may encounter mixed data types where different types of
data are combined.

 Example: A dataset containing patient records might include numerical data (age,
weight), categorical data (gender, diagnosis), and text data (doctor’s notes).

Understanding the types of data in a dataset is crucial because different types of data require
different preprocessing techniques, analytical methods, and model types. Identifying the data
type helps in selecting the appropriate algorithms and tools for analysis and modeling.

Need for Datasets:-

Datasets are fundamental in various fields, especially in data science, machine learning,
and artificial intelligence. Here are key reasons why datasets are crucial:

1. Training Machine Learning Models

 Pattern Recognition: Datasets provide the examples from which machine learning
models learn patterns, relationships, and dependencies in the data.
 Feature Learning: Models learn which features are most important for making
predictions or classifications based on the data provided.

2. Model Validation and Testing

 Validation: A portion of the dataset is used to tune model parameters and select the
best model configuration.
 Testing: Datasets are essential for evaluating the performance of a trained model on
unseen data, ensuring that the model generalizes well to new, real-world data.

3. Research and Development

 Algorithm Development: Researchers use datasets to develop and test new algorithms.
 Benchmarking: Standard datasets are used to compare the performance of different
algorithms and approaches, providing a common ground for evaluation.

4. Insights and Decision Making

 Data Analysis: Datasets allow for the analysis of trends, patterns, and insights that
inform decision-making processes in businesses and organizations.
 Reporting: Datasets are used to generate reports, dashboards, and visualizations that
communicate important information to stakeholders.

5. Product and Service Improvement

 User Behavior Analysis: Companies use datasets to understand how users interact
with their products and services, leading to improvements and new features.
 Personalization: Datasets enable personalized experiences for users by analyzing
their preferences and behaviors.

6. Simulation and Testing

 Scenario Simulation: Datasets are used to simulate different scenarios and test the
outcomes, which is particularly useful in fields like finance, healthcare, and
engineering.
 Stress Testing: Datasets help in stress testing models and systems to ensure they can
handle extreme conditions and rare events.

7. Innovation and Creativity

 New Applications: Datasets inspire the development of new applications and innovations by providing raw material for experimentation and discovery.
 Data-Driven Products: Many modern products and services are data-driven, relying
on datasets for their core functionality, such as recommendation systems and
predictive analytics.

8. Education and Training

 Learning Resources: Datasets are used as educational tools in academic and professional training programs to teach data analysis, machine learning, and statistical methods.
 Hands-On Practice: Students and professionals use datasets to gain hands-on
experience and practical skills in handling real-world data.

9. Legal and Regulatory Compliance

 Compliance Monitoring: Datasets help organizations monitor compliance with legal and regulatory requirements by tracking relevant data points.
 Audit Trails: Maintaining datasets with proper records can provide an audit trail for
compliance and accountability purposes.

10. Enhancing Fairness and Reducing Bias

 Bias Detection: Datasets are analyzed to detect and mitigate biases in machine
learning models and decision-making processes.
 Fair Representation: Diverse and representative datasets help ensure that models are
fair and do not discriminate against any particular group.

In essence, datasets are the backbone of data-driven processes, enabling the training,
validation, and testing of machine learning models, driving research and development,
informing decisions, and facilitating innovation. Their importance cannot be overstated, as
they provide the foundation for understanding and leveraging data in various applications and
industries.

Machine Learning Lifecycle

The machine learning lifecycle encompasses the stages involved in developing, deploying,
and maintaining machine learning models. Here’s a detailed overview of each phase:

1. Problem Definition

 Objective Setting: Clearly define the problem you want to solve and set the
objectives.
 Understanding the Context: Understand the business or application context,
including constraints and requirements.

2. Data Collection

 Gathering Data: Collect relevant data from various sources such as databases, APIs,
and web scraping.
 Data Integration: Combine data from different sources to create a comprehensive
dataset.

3. Data Preparation

 Data Cleaning: Handle missing values, remove duplicates, correct errors, and deal
with outliers.
 Data Transformation: Normalize, standardize, or encode data into suitable formats
for analysis.
 Feature Engineering: Create new features from existing data that can help improve
model performance.
 Data Splitting: Split the data into training, validation, and test sets to ensure unbiased
evaluation of the model.

4. Exploratory Data Analysis (EDA)

 Descriptive Statistics: Summarize and understand the main characteristics of the data.
 Visualization: Use plots and graphs to identify patterns, trends, and relationships in
the data.
 Hypothesis Testing: Form and test hypotheses to gain deeper insights into the data.

5. Model Selection

 Algorithm Choice: Choose appropriate machine learning algorithms based on the problem type (classification, regression, clustering, etc.).
 Model Comparison: Compare different models to select the one with the best
performance metrics.

6. Model Training

 Training the Model: Use the training data to train the chosen model.
 Hyperparameter Tuning: Optimize model hyperparameters to improve
performance.
 Cross-Validation: Use cross-validation techniques to ensure the model’s robustness
and generalization ability.
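The hyperparameter tuning and cross-validation steps above can be sketched with scikit-learn as follows; the synthetic data and the parameter grid are illustrative assumptions rather than part of the original material.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; in practice X_train / y_train come from the prepared dataset.
X_train, y_train = make_classification(n_samples=500, n_features=10, random_state=42)

# Candidate hyperparameters (illustrative values only).
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 5, 10]}

# 5-fold cross-validation over the grid; the best configuration is refit on all the data.
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)
model = search.best_estimator_   # tuned model, ready for evaluation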

7. Model Evaluation

 Performance Metrics: Evaluate the model using metrics such as accuracy, precision,
recall, F1-score, RMSE, etc.
 Validation Set: Assess the model’s performance on the validation set to fine-tune and
select the best model.
 Test Set: Evaluate the final model on the test set to measure its generalization
performance.
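A minimal sketch of computing the evaluation metrics listed above with scikit-learn; the model and data here are synthetic stand-ins for a real trained model and held-out test set.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic stand-in data split into training and test portions.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)                 # predictions on unseen data

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))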

8. Model Deployment

 Deployment Environment: Choose the appropriate environment for deployment (cloud, on-premises, edge devices).
 Integration: Integrate the model into the existing system or application.
 API Creation: Create APIs for the model to be accessed by other applications or
services.

9. Monitoring and Maintenance

 Performance Monitoring: Continuously monitor the model’s performance in production to detect any degradation.
 Model Retraining: Retrain the model periodically with new data to maintain its
accuracy and relevance.
 Alerts and Logging: Implement alerting and logging mechanisms to detect issues in
real-time.

10. Model Interpretation and Explainability

 Understanding Model Predictions: Use techniques like SHAP, LIME, and feature
importance to explain model predictions.
 Stakeholder Communication: Communicate model insights and performance to
stakeholders in an understandable manner.

11. Documentation and Reporting

 Documentation: Document the entire process, including data sources, preprocessing steps, model selection rationale, and performance metrics.
 Reporting: Generate reports and dashboards to present the findings and model
performance to stakeholders.

12. Continuous Improvement

 Feedback Loop: Incorporate feedback from users and stakeholders to improve the
model and its applications.
 Research and Development: Stay updated with the latest advancements in machine
learning to continually improve the model.

The machine learning lifecycle is an iterative process that involves defining the problem,
collecting and preparing data, selecting and training models, deploying the model, and
continuously monitoring and improving it. Each stage is crucial for building robust, reliable,
and effective machine learning applications.

Data Pre-processing in machine learning :-

Data preprocessing is a critical step in the machine learning pipeline, involving various
techniques to clean and transform raw data into a format suitable for analysis and modeling.
Proper data preprocessing ensures that the model can learn effectively from the data and
produce accurate predictions. Here are the main steps involved in data preprocessing:

1. Data Cleaning

Handling Missing Values:

 Removal: Remove rows or columns with missing values if they are not significant or
if the dataset is large.
 Imputation: Fill missing values with mean, median, mode, or use more sophisticated
methods like K-nearest neighbors (KNN) imputation or regression imputation.
 Forward/Backward Fill: For time series data, fill missing values with the previous
or next observation.

Removing Duplicates:

 Identify and remove duplicate records to avoid redundancy and bias in the dataset.

Handling Outliers:

 Identification: Use statistical methods (like Z-score, IQR) or visualization tools (like
box plots) to identify outliers.
 Removal/Transformation: Remove outliers or transform them using techniques like
log transformation to reduce their impact.
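A minimal pandas sketch of the cleaning steps above (imputation, duplicate removal, and IQR-based outlier handling); the small DataFrame and its columns are hypothetical.

import numpy as np
import pandas as pd

# Hypothetical raw data with a missing age, a duplicate row, and an outlier salary.
df = pd.DataFrame({"age":    [25, 32, np.nan, 47, 47, 29],
                   "salary": [50000, 60000, 55000, 48000, 48000, 900000]})

# Imputation: fill the missing age with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Removing duplicates.
df = df.drop_duplicates()

# Handling outliers with the IQR rule: keep salaries within 1.5 * IQR of the quartiles.
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(df)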

2. Data Transformation

Normalization/Standardization:

 Normalization: Scale data to a range, typically [0, 1], using min-max scaling.
 Standardization: Transform data to have a mean of 0 and a standard deviation of 1
using z-score normalization.

Encoding Categorical Variables:

 Label Encoding: Convert categorical values into integer labels.


 One-Hot Encoding: Convert categorical variables into binary vectors.
 Binary Encoding: Combine the benefits of both label encoding and one-hot encoding
for high cardinality categorical variables.

Feature Scaling:

 Ensures that features are on a similar scale to improve the performance of gradient-
based algorithms.
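Here is a minimal sketch of scaling and categorical encoding with pandas and scikit-learn; the column names and values are hypothetical.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data with one numeric and one categorical feature.
df = pd.DataFrame({"age": [25, 32, 47], "city": ["Pune", "Delhi", "Pune"]})

# Normalization: min-max scaling to the [0, 1] range.
df["age_norm"] = MinMaxScaler().fit_transform(df[["age"]]).ravel()

# Standardization: zero mean, unit standard deviation (z-scores).
df["age_std"] = StandardScaler().fit_transform(df[["age"]]).ravel()

# One-hot encoding: expand the categorical column into binary indicator columns.
df = pd.get_dummies(df, columns=["city"])

print(df)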

3. Feature Engineering

Creating New Features:

 Domain Knowledge: Use domain knowledge to create new features that capture
important aspects of the data.
 Polynomial Features: Create polynomial combinations of existing features to capture
non-linear relationships.

Feature Selection:

 Filter Methods: Use statistical tests to select features with high correlation to the
target variable.
 Wrapper Methods: Use algorithms like Recursive Feature Elimination (RFE) to
select the best subset of features.

 Embedded Methods: Use algorithms that have built-in feature selection mechanisms
like Lasso and Ridge regression.
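As a minimal sketch of the feature engineering and selection ideas above (polynomial features plus a filter method), assuming scikit-learn and synthetic stand-in data:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Polynomial features: add squared and pairwise interaction terms.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# Filter method: keep the 10 features most associated with the target (ANOVA F-test).
X_selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X_poly, y)

print(X.shape, X_poly.shape, X_selected.shape)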

4. Data Reduction

Dimensionality Reduction:

 Principal Component Analysis (PCA): Reduce the number of features by transforming the data into a lower-dimensional space (a short PCA sketch follows this list).
 Linear Discriminant Analysis (LDA): Similar to PCA but considers class labels to
maximize the separation between classes.
 t-Distributed Stochastic Neighbor Embedding (t-SNE): Useful for visualizing
high-dimensional data in 2D or 3D.
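A minimal PCA sketch with scikit-learn, using the Iris data as a stand-in:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Project the four original features onto two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component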

Sampling:

 Undersampling: Reduce the size of the majority class in imbalanced datasets.


 Oversampling: Increase the size of the minority class using techniques like SMOTE
(Synthetic Minority Over-sampling Technique).
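A minimal over-sampling sketch with SMOTE, assuming the third-party imbalanced-learn package is installed; the imbalanced data is synthetic.

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced dataset: roughly a 90% / 10% class split.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# SMOTE synthesizes new minority-class samples instead of simply duplicating them.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

print("before:", Counter(y))
print("after :", Counter(y_res))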

5. Data Splitting

Train-Validation-Test Split:

 Training Set: Used to train the model.


 Validation Set: Used to tune hyperparameters and evaluate the model during training.
 Test Set: Used to evaluate the final model’s performance on unseen data.

Cross-Validation:

 K-Fold Cross-Validation: Split the data into k subsets and train the model k times,
each time using a different subset as the validation set and the remaining as the
training set.
 Stratified K-Fold Cross-Validation: Ensures that each fold has a proportional
representation of each class, useful for imbalanced datasets.
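A minimal sketch of the train-validation-test split and stratified k-fold cross-validation with scikit-learn, using the Iris data as a stand-in:

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = load_iris(return_X_y=True)

# Carve out a test set first, then split the remainder into training and validation
# (roughly 60% train, 20% validation, 20% test).
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2,
                                                  stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25,
                                                  stratify=y_temp, random_state=0)

# Stratified 5-fold cross-validation indices over the training portion.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
    print(f"fold {fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows")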

Data preprocessing is essential for preparing raw data for analysis and modeling. It involves
cleaning the data, transforming it into a suitable format, engineering new features, reducing
dimensionality, and splitting it into training, validation, and test sets. Proper preprocessing
ensures that the machine learning model can learn effectively and generalize well to new
data.

Difference between Artificial Intelligence and Machine Learning:-

Artificial Intelligence (AI) and Machine Learning (ML) are closely related fields, but they are
distinct in their scope and objectives. Here's a detailed comparison to highlight their
differences:

Artificial Intelligence (AI)

Definition:

 AI is a broad field of computer science focused on creating systems capable of performing tasks that typically require human intelligence.

Scope:

 Encompasses a wide range of technologies and approaches, including logic, rule-based systems, decision trees, genetic algorithms, and machine learning.

Goals:

 To create systems that can reason, learn, perceive, and interact with the environment
in a human-like manner.

Techniques:

 Expert Systems: Use rules and logic to mimic the decision-making ability of a
human expert.
 Natural Language Processing (NLP): Enable machines to understand and respond
to human language.
 Computer Vision: Enable machines to interpret and understand visual information
from the world.
 Robotics: Involves designing and building robots that can carry out tasks
autonomously.
 Speech Recognition: Enable machines to understand and process human speech.

Applications:

 Self-driving cars
 Virtual assistants (e.g., Siri, Alexa)
 Recommendation systems
 Healthcare diagnostics
 Game playing (e.g., chess, Go)

Machine Learning (ML)

Definition:

 ML is a subset of AI that focuses on developing algorithms and statistical models that enable computers to learn from and make predictions or decisions based on data.

Scope:

 A specific approach within the broader AI field, emphasizing data-driven learning and
adaptation.

Goals:

 To develop systems that can automatically improve and adapt their performance by
learning from experience without being explicitly programmed.

Techniques:

 Supervised Learning: Algorithms learn from labeled data (input-output pairs) to predict outcomes for new data.
o Examples: Linear regression, decision trees, support vector machines, neural networks.
 Unsupervised Learning: Algorithms find patterns and structures in unlabeled data.
o Examples: Clustering (e.g., K-means, hierarchical clustering), dimensionality
reduction (e.g., PCA, t-SNE).
 Semi-Supervised Learning: Combines labeled and unlabeled data to improve
learning accuracy.
 Reinforcement Learning: Algorithms learn by interacting with an environment and
receiving feedback in the form of rewards or penalties.
o Examples: Q-learning, Deep Q-Networks (DQN).

Applications:-

 Image and speech recognition


 Fraud detection
 Predictive maintenance
 Recommendation systems
 Personalized marketing

Key Differences

Scope:

 AI: Encompasses a broad range of technologies and goals aimed at mimicking human
intelligence.
 ML: A specific branch of AI focused on data-driven learning and prediction.

Techniques:

 AI: Includes a variety of techniques beyond machine learning, such as rule-based systems, expert systems, and symbolic AI.
 ML: Primarily focuses on developing algorithms that can learn from data, including
supervised, unsupervised, semi-supervised, and reinforcement learning.

Applications:

 AI: Applications are broader and can include non-ML techniques (e.g., expert systems
in healthcare, robotics).
 ML: Applications are specifically about making predictions or decisions based on
data (e.g., fraud detection, recommendation engines).

Development Focus:

 AI: Involves creating systems with broader cognitive abilities, including reasoning,
planning, learning, and natural language understanding.
 ML: Focuses on optimizing specific algorithms for tasks like classification,
regression, clustering, and reinforcement learning.

AI is a broad field aiming to create intelligent systems that can perform tasks requiring
human-like intelligence. Machine learning is a subset of AI that focuses on developing
algorithms that learn from data to make predictions or decisions. While all machine learning
is a form of AI, not all AI is machine learning.

Basics of Neural Networks

Neural networks are a key component of many modern machine learning systems, especially
deep learning. Here are the basics of neural networks, including their structure, components,
and how they work:

What is a Neural Network?

A neural network is a computational model inspired by the way biological neural networks in
the human brain process information. It consists of interconnected units called neurons,
which work together to solve specific problems.

Components of a Neural Network

1. Neurons (Nodes):
o Basic units of a neural network.
o Each neuron receives inputs, processes them, and produces an output.
o Neurons are organized into layers.
2. Layers:
o Input Layer: Receives the initial data.
o Hidden Layers: Intermediate layers between the input and output layers
where computations are performed. The number of hidden layers and the
number of neurons in each layer define the depth and capacity of the network.
o Output Layer: Produces the final output of the network.
3. Weights and Biases:
o Weights: Parameters that determine the strength of the connection between
neurons. Each connection has an associated weight.
o Biases: Additional parameters that help adjust the output along with the
weighted sum of inputs.
4. Activation Functions:
o Functions applied to the output of each neuron. They introduce non-linearity,
allowing the network to model complex relationships.
o Common activation functions include:
 Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
 Tanh: $\tanh(x) = \frac{2}{1 + e^{-2x}} - 1$
 ReLU (Rectified Linear Unit): $\text{ReLU}(x) = \max(0, x)$
 Leaky ReLU: $\text{LeakyReLU}(x) = \max(0.01x, x)$
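The activation functions above can be written directly in NumPy; this is a minimal sketch for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0   # equivalent to np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x):
    return np.maximum(0.01 * x, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x), sep="\n")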

Structure of a Neural Network

1. Feedforward Neural Network (FNN):


o The simplest type of artificial neural network.
o Information flows in one direction, from input to output.
o No cycles or loops in the network.
2. Recurrent Neural Network (RNN):
o Neurons are connected in a cyclic manner, allowing the network to maintain a
'memory' of previous inputs.
o Used for sequential data like time series and natural language processing.
3. Convolutional Neural Network (CNN):
o Specialized for processing grid-like data such as images.
o Uses convolutional layers to automatically and adaptively learn spatial
hierarchies of features.

How Neural Networks Work

1. Forward Propagation:
o Input data is passed through the network layer by layer.
o Each neuron computes a weighted sum of its inputs, applies an activation
function, and passes the result to the next layer.
o The final output is produced in the output layer.
2. Loss Function:
o Measures the difference between the predicted output and the actual target
values.
o Common loss functions include Mean Squared Error (MSE) for regression and
Cross-Entropy Loss for classification.
3. Backward Propagation (Backpropagation):
o The process of updating the weights and biases based on the loss.
o Uses gradient descent to minimize the loss function.
o Gradients are calculated using the chain rule of calculus to propagate errors
backward through the network.
4. Training Process:
o Initialize weights and biases randomly or using a specific initialization
technique.
o Repeat the following steps for a number of epochs or until convergence:
 Perform forward propagation to compute the output.
 Compute the loss.
 Perform backward propagation to compute the gradients.
 Update weights and biases using gradient descent or its variants (e.g.,
stochastic gradient descent, Adam optimizer).

Example of a Simple Neural Network

Consider a simple feedforward neural network with one hidden layer:

1. Input Layer: 2 neurons (for two input features).


2. Hidden Layer: 3 neurons with ReLU activation.
3. Output Layer: 1 neuron with sigmoid activation (for binary classification).

Mathematical Representation

1. Forward Propagation:
o Let $X$ be the input vector.
o Let $W_1$ and $b_1$ be the weights and biases for the hidden layer.
o Let $W_2$ and $b_2$ be the weights and biases for the output layer.

Hidden Layer:

$Z_1 = X \cdot W_1 + b_1$, $A_1 = \text{ReLU}(Z_1)$

Output Layer:

$Z_2 = A_1 \cdot W_2 + b_2$, $A_2 = \text{Sigmoid}(Z_2)$

2. Loss Calculation:

$\text{Loss} = \text{Binary Cross-Entropy}(A_2, Y)$

Where $Y$ is the actual target value.

3. Backward Propagation:
o Compute gradients of the loss with respect to weights and biases.
o Update weights and biases using the gradients.
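The forward pass and loss above can be sketched in NumPy as follows; the randomly initialized weights and the single input instance are illustrative assumptions for the 2-3-1 example network.

import numpy as np

rng = np.random.default_rng(0)

# Example network: 2 inputs -> 3 hidden ReLU units -> 1 sigmoid output.
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

X = np.array([[0.5, -1.2]])      # one input instance
Y = np.array([[1.0]])            # its true label

# Forward propagation.
Z1 = X @ W1 + b1
A1 = np.maximum(0.0, Z1)         # ReLU
Z2 = A1 @ W2 + b2
A2 = 1.0 / (1.0 + np.exp(-Z2))   # Sigmoid

# Binary cross-entropy loss.
loss = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
print("prediction:", A2.ravel(), "loss:", float(loss))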

Neural networks are powerful models inspired by the human brain, consisting of
interconnected neurons organized into layers. They use forward propagation to compute
outputs, a loss function to measure errors, and backward propagation to update parameters
and minimize the loss. With various architectures like FNNs, RNNs, and CNNs, neural
networks can tackle a wide range of tasks, from image and speech recognition to natural
language processing and beyond.
