Unit-II Notes
Introduction and Preprocessing
Data Mining Introduction: An overview of Data Mining – Kinds of data and patterns to be mined – Technologies – Targeted Applications – Major Issues in Data Mining – Data Objects and Attribute Types – Measuring Data Similarity and Dissimilarity
Data Preprocessing: Data Cleaning – Data Integration – Data Reduction – Data Transformation – Data Discretization
Data Mining is the process of extracting useful information from large datasets. It involves the
application of statistical, machine learning, and database techniques to identify patterns, trends, and
relationships that would be difficult or impossible to discover by human inspection alone.
In short, data mining is about finding needles in haystacks. It helps us to:
Discover hidden patterns: Identify trends, correlations, or anomalies that are not immediately
apparent.
Make predictions: Forecast future events or behaviors based on historical data.
Improve decision-making: Provide insights that can inform strategic planning and operational
decisions.
Common data mining techniques include:
Classification: Assigning data instances to predefined categories.
Regression: Predicting numerical values.
Clustering: Grouping data instances based on similarity.
Association rule mining: Discovering relationships between items in a dataset.
Outlier detection: Identifying data points that deviate significantly from the norm.
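For instance, clustering groups records by similarity without predefined labels. A minimal Python sketch of this idea (assuming NumPy and scikit-learn are installed; the customer values are made up for illustration):

```python
# Minimal clustering sketch: group customers into segments by similarity.
# Assumes scikit-learn and NumPy are available; values are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

# Toy dataset: [annual income (k$), spending score] for six customers.
customers = np.array([
    [15, 80], [16, 78], [17, 82],   # low income, high spending
    [70, 20], [72, 18], [75, 25],   # high income, low spending
])

model = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = model.fit_predict(customers)

print(labels)                  # e.g. [0 0 0 1 1 1]: two customer segments
print(model.cluster_centers_)  # the "typical" customer of each segment
```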
Data mining is used in a wide range of applications, including:
Business: Customer segmentation, market basket analysis, fraud detection
Healthcare: Disease diagnosis, patient risk assessment, drug discovery
Science: Scientific discovery, climate modeling, genomics
Government: Law enforcement, intelligence analysis, social science research.
Data warehouse: A data warehouse is a centralized repository that consolidates large volumes of
structured and unstructured data from various sources. It supports business intelligence activities,
enabling efficient querying and analysis for decision-making, reporting, and trend analysis, facilitating
insights across an organization.
Advanced Data Sets and Advanced Applications:
Data Streams and Sensor Data: Data streams are continuous flows of data generated in real time, while sensor data consists of measurements collected from IoT devices. Such data is typically processed as it arrives and is used for monitoring, predictive maintenance, and automation.
Time-Series Data, Temporal Data, Sequence Data: Time-series data records measurements over time, temporal data relates to time-stamped events, and sequence data tracks ordered events or items. Examples include stock prices, weather data, and sensor readings, which are useful for trend analysis and forecasting.
Spatial Data And Spatiotemporal Data: Spatial data represents geographic locations and features, while
spatiotemporal data combines spatial information with time, capturing changes across locations over time.
Multimedia Database: A multimedia database stores various data types, including images, audio, and
video, enabling efficient retrieval and management of rich content.
World-Wide Web: The World-Wide Web is an interconnected network of information accessed via the
internet, utilizing hyperlinks, web pages, and browsers.
Technologies Used:
(Figure: Data mining shown as the confluence of multiple technologies: database technology, high-performance computing, algorithms, visualization, and applications, together with the fields described below.)
Machine Learning: Machine learning is a branch of artificial intelligence that enables systems to learn
from data, improving performance without explicit programming.
Pattern Recognition: Pattern recognition is the process of identifying patterns and regularities in data,
often used in image analysis, speech recognition, and classification.
Statistics: Statistics is the science of collecting, analyzing, interpreting, and presenting data, helping to
uncover trends, patterns, and relationships in information.
Visualization: Visualization is the graphical representation of data, enabling easier interpretation and
insight through charts, graphs, and interactive displays to highlight patterns.
High-Performance Computing (HPC): High-performance computing (HPC) involves using powerful
processors and parallel processing to solve complex computational problems rapidly, enabling advanced
simulations and analyses.
Database Technology: Database Technology encompasses systems and tools for storing, managing, and
retrieving data efficiently, supporting applications like transactions, analytics, and reporting.
Algorithm: An Algorithm is a step-by-step procedure or set of rules for solving a problem or performing
a specific task systematically.
Data Objects:
Data objects represent individual entities or instances in a dataset. Each object is characterized by its
attributes and can vary in complexity. Common examples include:
Customers: Each customer can be an object with attributes like name, age, and purchase history.
Products: Products can be characterized by attributes such as price, category, and supplier.
Transactions: Each transaction can include attributes like transaction ID, date, amount, and items
purchased.
Attribute Types
Attributes are the properties or characteristics that define data objects. They can be classified into several
types:
1. Nominal Attributes
o Description: Categorical data without any intrinsic ordering.
o Examples: Gender (male, female), product category (electronics, clothing), and color (red,
blue).
2. Ordinal Attributes
o Description: Categorical data with a meaningful order but no fixed interval between
values.
o Examples: Customer satisfaction ratings (low, medium, high), education level (high
school, bachelor’s, master’s).
3. Interval Attributes
o Description: Numeric data with meaningful intervals, but no true zero point, allowing for
addition and subtraction.
o Examples: Temperature in Celsius or Fahrenheit, dates (where differences are meaningful
but ratios are not).
4. Ratio Attributes
o Description: Numeric data with a meaningful zero point, allowing for all arithmetic
operations (addition, subtraction, multiplication, and division).
o Examples: Weight, height, income, and age.
5. Binary Attributes
o Description: A special case of nominal attributes with only two possible values (often
represented as 0 and 1).
o Examples: Yes/No questions, presence/absence of a feature.
6. Text Attributes
o Description: Attributes containing textual data that may require processing, such as
natural language processing for analysis.
o Examples: Reviews, comments, and descriptions.
7. Date/Time Attributes
o Description: Attributes representing date and time information, which can be analyzed for
trends over time.
o Examples: Timestamp of transactions, user login times.
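A minimal pandas sketch of how these attribute types might appear as columns of a small table (pandas assumed installed; the values are made up for illustration):

```python
# Illustrative table with one column per attribute type described above.
import pandas as pd

df = pd.DataFrame({
    "gender":        ["male", "female", "female"],                    # nominal
    "satisfaction":  ["low", "high", "medium"],                       # ordinal
    "temperature_c": [21.5, 23.0, 19.8],                              # interval
    "income":        [42000, 58000, 61000],                           # ratio
    "is_member":     [1, 0, 1],                                       # binary
    "review":        ["great service", "ok", "will return"],          # text
    "signup":        pd.to_datetime(["2023-01-05", "2023-02-11",
                                     "2023-03-20"]),                  # date/time
})

# Mark the ordinal column so its ordering (low < medium < high) is preserved.
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["low", "medium", "high"], ordered=True)

print(df.dtypes)
```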
Measuring Data Similarity and Dissimilarity:
Measuring data similarity and dissimilarity refers to the process of quantifying how alike or different two data objects (or instances) are based on their attributes. This concept is fundamental in data mining, machine learning, and statistics, as it underpins tasks such as clustering, classification, and recommendation.
Key Concepts
1. Similarity:
o Definition: Indicates how closely two data objects resemble each other.
o Purpose: High similarity suggests that the objects share common characteristics, which
can be useful for clustering, classification, or recommendations.
2. Dissimilarity:
o Definition: Indicates how different two data objects are.
o Purpose: High dissimilarity implies that the objects do not share common features, which
can help in identifying outliers or distinct categories.
Importance
Clustering: In clustering algorithms, similar data points are grouped together, while dissimilar
points are separated. Measures of similarity and dissimilarity help determine the groups.
Classification: In classification tasks, understanding similarity allows algorithms to assign labels
based on the proximity of data points to known categories.
Recommendation Systems: Similarity measures help recommend products or content by finding
items that are alike based on user preferences.
1. Similarity Measures
Similarity measures quantify how alike two data objects are. Common methods include:
Euclidean Distance:
o Formula: d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
o Used for: Continuous numerical data. It calculates the straight-line distance between two points in Euclidean space; smaller distances indicate greater similarity.
Cosine Similarity:
o Formula: cos(x, y) = (x · y) / (‖x‖ ‖y‖)
o Used for: High-dimensional data, particularly text data represented as vectors. It measures
the cosine of the angle between two vectors, indicating how similar they are regardless of
magnitude.
Jaccard Similarity:
o Formula: J(A, B) = |A ∩ B| / |A ∪ B|
o Used for: Binary data or sets. It measures the ratio of the number of elements the two sets share (intersection) to the total number of distinct elements across both sets (union).
2. Dissimilarity Measures
Dissimilarity measures quantify how different two data objects are. Common methods include:
Manhattan Distance:
Formula: d(x, y) = Σᵢ |xᵢ − yᵢ|
Used for: Continuous numerical data. It calculates the distance as the sum of absolute differences between corresponding attribute values.
Minkowski Distance:
Generalized distance formula: d(x, y) = ( Σᵢ |xᵢ − yᵢ|^r )^(1/r)
Used for: Continuous data, where r can be adjusted to represent different types of distance (e.g., r = 1 for Manhattan, r = 2 for Euclidean).
Mahalanobis Distance:
Formula: d(x, y) = √( (x − y)ᵀ S⁻¹ (x − y) ), where S is the covariance matrix of the data.
Used for: Continuous data with correlated attributes. It accounts for the distribution of the dataset and is useful for identifying outliers.
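A minimal NumPy sketch of the measures above (the vectors and sample data are made up for illustration):

```python
# Compute the similarity and dissimilarity measures described above.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))          # straight-line distance
manhattan = np.sum(np.abs(x - y))                  # sum of absolute differences
r = 3
minkowski = np.sum(np.abs(x - y) ** r) ** (1 / r)  # r = 1 Manhattan, r = 2 Euclidean
cosine_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))  # angle-based

# Jaccard similarity on binary vectors: |intersection| / |union|
a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 1, 0, 0])
jaccard = np.sum(a & b) / np.sum(a | b)

# Mahalanobis distance needs the covariance of a data sample.
data = np.random.default_rng(0).normal(size=(100, 3))
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
diff = x - y
mahalanobis = np.sqrt(diff @ inv_cov @ diff)

print(euclidean, manhattan, minkowski, cosine_sim, jaccard, mahalanobis)
```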
Data Preprocessing:
Data preprocessing involves preparing raw data for analysis by cleaning, transforming, and
normalizing it to enhance quality and ensure consistency, enabling more accurate and effective data
mining and machine learning outcomes.
Data cleaning:
Data cleaning, also known as data cleansing or data scrubbing, is a crucial step in the data preparation
process. It involves identifying and correcting errors, inconsistencies, and inaccuracies in datasets to
improve data quality.
Importance of Data Cleaning
1. Improves Data Quality: Ensures accuracy, completeness, and consistency, leading to more
reliable analyses and insights.
2. Enhances Decision-Making: High-quality data supports better business decisions and strategies.
3. Reduces Costs: Cleaning data can save time and resources by preventing errors in analysis or
reporting.
4. Facilitates Compliance: Helps meet regulatory requirements by ensuring that data is accurate and
up to date.
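A minimal pandas sketch of common cleaning steps (pandas and NumPy assumed installed; the records are made up for illustration):

```python
# Remove duplicates, fix inconsistent text, and handle missing/impossible values.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [34, np.nan, np.nan, 29, 250],   # missing and impossible values
    "city":        ["Chennai", "chennai ", "chennai ", "Madurai", "Salem"],
})

clean = (
    raw.drop_duplicates(subset="customer_id")                      # duplicate records
       .assign(city=lambda d: d["city"].str.strip().str.title())   # inconsistent text
)
clean.loc[clean["age"] > 120, "age"] = np.nan                      # impossible ages -> missing
clean["age"] = clean["age"].fillna(clean["age"].median())          # impute missing ages

print(clean)
```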
Data Integration:
Data integration is the process of combining data from different sources to create a unified view, enabling
comprehensive analysis and reporting. It involves various techniques and tools to ensure consistency,
accuracy, and accessibility of the integrated data.
Key Components of Data Integration
1. Data Sources:
o Diverse origins like databases, spreadsheets, cloud services, APIs, and flat files.
2. ETL Process:
o Extract: Retrieving data from different sources.
o Transform: Cleaning, normalizing, and converting data into a suitable format.
o Load: Storing the transformed data into a target system, such as a data warehouse.
3. Data Mapping:
o Aligning data fields from different sources to ensure consistency and compatibility.
4. Data Quality Assurance:
o Ensuring the accuracy and reliability of the integrated data through validation and cleaning
processes.
Techniques for Data Integration
Manual Integration: Combining data through manual processes, suitable for small datasets.
Automated Tools: Using software solutions (e.g., Talend, Apache Nifi) for automating data
integration tasks.
Middleware: Implementing middleware solutions that facilitate communication between different
data sources and applications.
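A minimal pandas sketch of the extract-transform-load idea (the tables, column names, and values are made up for illustration):

```python
# Combine two sources with different schemas into one unified view.
import pandas as pd

# "Extract": two sources describing the same customers.
crm     = pd.DataFrame({"cust_id": [1, 2], "full_name": ["Asha R", "Vijay K"]})
billing = pd.DataFrame({"customer": [1, 2], "amount_inr": [1200.0, 850.0]})

# "Transform": data mapping, i.e. align field names to a common schema.
billing = billing.rename(columns={"customer": "cust_id", "amount_inr": "amount"})

# "Load": store the unified result, e.g. as a data warehouse table.
unified = crm.merge(billing, on="cust_id", how="left")
print(unified)
```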
Data Reduction:
Data reduction is the process of minimizing the volume of data while maintaining its integrity and
essential information. This is crucial in data mining and analysis to improve efficiency, reduce storage
costs, and enhance processing speed.
Objectives of Data Reduction
1. Reduce Storage Space: Decrease the amount of space needed to store data.
2. Improve Processing Speed: Enhance the performance of algorithms by working with smaller
datasets.
3. Maintain Data Integrity: Ensure that the essential characteristics and information of the data are
preserved.
4. Facilitate Analysis: Simplify data handling and visualization.
Techniques for Data Reduction
1. Dimensionality Reduction:
o Principal Component Analysis (PCA): Transforms data to a lower-dimensional space
while preserving variance.
o t-Distributed Stochastic Neighbor Embedding (t-SNE): Useful for visualizing high-
dimensional data in two or three dimensions.
2. Data Aggregation:
o Summarizing data by combining multiple records into a single record (e.g., calculating
average sales per region).
3. Data Sampling:
o Selecting a representative subset of the data for analysis, which can reduce the dataset size
while maintaining essential characteristics.
4. Feature Selection:
o Identifying and retaining only the most relevant features or attributes, thereby discarding
redundant or irrelevant ones.
5. Compression:
o Using algorithms to compress data files (e.g., ZIP) without losing information, making
storage more efficient.
6. Clustering:
o Grouping similar data points together and representing them with a single cluster centroid,
thus reducing the number of data points.
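A minimal scikit-learn sketch of two of these techniques, dimensionality reduction with PCA and data sampling (randomly generated data for illustration):

```python
# Reduce 10 attributes to 2 principal components, then sample 10% of the records.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
data = rng.normal(size=(1000, 10))        # 1000 records, 10 attributes

reduced = PCA(n_components=2).fit_transform(data)   # keep the 2 highest-variance components
print(reduced.shape)                                # (1000, 2)

sample = data[rng.choice(len(data), size=100, replace=False)]  # representative subset
print(sample.shape)                                 # (100, 10)
```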
Data Transformation:
Data Transformation is the process of converting data from one format or structure into another to make it
suitable for analysis, storage, or processing. It is a crucial step in data preprocessing, especially in data
integration and data warehousing. Here’s a detailed overview:
Objectives of Data Transformation
1. Standardization: Ensure consistency in data formats, making it easier to analyse and compare.
2. Normalization: Adjust data to a common scale without distorting differences in the ranges of
values.
3. Enhancement: Improve the quality and usefulness of the data by applying transformations that
derive new attributes or features.
4. Integration: Prepare data from various sources to be combined into a unified dataset.
Common Data Transformation Techniques
1. Scaling:
o Min-Max Scaling: Rescales data to a range of [0, 1].
o Standardization (Z-score normalization): Transforms data to have a mean of 0 and a
standard deviation of 1.
2. Aggregation:
o Combining multiple records into a single summary record (e.g., calculating total sales per
month).
3. Encoding Categorical Variables:
o One-Hot Encoding: Converts categorical variables into a binary format.
o Label Encoding: Assigns numerical values to categories, preserving order where
applicable.
4. Data Binning:
o Dividing continuous data into discrete bins or intervals to simplify analysis (e.g., age
groups).
5. Data Parsing:
o Splitting a single data field into multiple fields (e.g., separating a full name into first and
last names).
6. Data Type Conversion:
o Changing the data type of a field (e.g., converting strings to datetime objects).
7. Feature Engineering:
o Creating new features from existing ones to enhance predictive modeling (e.g., deriving a
"year" feature from a date).
8. String Manipulation:
o Modifying text data by trimming, concatenating, or replacing substrings.
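A minimal pandas sketch of three of the transformations above: min-max scaling, z-score standardization, and one-hot encoding (the values are made up for illustration):

```python
# Apply scaling, standardization, and one-hot encoding to a toy table.
import pandas as pd

df = pd.DataFrame({
    "income":   [20000, 35000, 50000, 80000],
    "category": ["electronics", "clothing", "clothing", "grocery"],
})

# Min-max scaling: rescale income to the range [0, 1].
df["income_minmax"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Z-score standardization: mean 0, standard deviation 1.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

# One-hot encoding of the categorical attribute.
df = pd.get_dummies(df, columns=["category"])
print(df)
```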
Data Discretization:
Data discretization is the process of converting continuous data into discrete categories or intervals. This
technique is commonly used in data mining and machine learning to simplify data analysis and enhance
the performance of algorithms.
Objectives of Data Discretization
1. Simplification: Reduces the complexity of data by converting continuous variables into a
manageable number of categories.
2. Improved Interpretability: Makes the data easier to understand and interpret, especially for
decision-making processes.
3. Enhanced Model Performance: Certain algorithms, particularly classification algorithms, may
perform better with discrete data.
4. Handling Outliers: By grouping continuous values into bins, the impact of outliers can be
minimized.
Common Techniques for Data Discretization
1. Equal Width Binning:
o Divides the range of the continuous variable into equal-width intervals.
o Example: If the range is from 0 to 100 and you want 5 bins, each bin would be 20 units wide (0–20, 20–40, 40–60, 60–80, 80–100).
2. Equal Frequency Binning (Quantile Binning):
o Divides the data into bins so that each bin contains approximately the same number of
observations.
o Example: If you have 100 data points and create 5 bins, each bin will ideally contain 20
data points.
3. Clustering-Based Discretization:
o Uses clustering algorithms (like K-means) to group similar data points, creating bins based
on clusters.
o This method can adapt to the data distribution more flexibly than fixed-width or frequency
bins.
4. Decision Tree-Based Discretization:
o Uses decision tree algorithms to find optimal cut points for discretizing continuous
variables based on target outcomes.
o This method is particularly effective for ensuring that the bins are informative with respect
to the target variable.
5. Custom Binning:
o Defining specific intervals based on domain knowledge or business rules.
o Example: Grouping age into categories like "Child," "Teen," "Adult," and "Senior."
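A minimal pandas sketch of equal-width, equal-frequency, and custom binning (the ages are made up for illustration):

```python
# Discretize a continuous attribute (age) three different ways.
import pandas as pd

ages = pd.Series([3, 9, 15, 22, 34, 41, 58, 63, 70, 85])

equal_width = pd.cut(ages, bins=5)          # 5 intervals of equal width
equal_freq  = pd.qcut(ages, q=5)            # 5 bins with roughly equal counts
custom      = pd.cut(ages, bins=[0, 12, 19, 64, 120],
                     labels=["Child", "Teen", "Adult", "Senior"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width,
                    "equal_freq": equal_freq, "custom": custom}))
```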