Chapter 4 – Introduction to Data Mining
(SOLUTION)
1. Define the term Data mining.
ANS:
Data Mining is a process used by organizations to extract specific
data from huge databases to solve business problems. It primarily
turns raw data into useful information.
The process of extracting information from huge sets of data to identify
patterns, trends, and useful data that allow a business to take
data-driven decisions is called Data Mining or Knowledge
Discovery in Databases (KDD).
OR
Data Mining is the process of investigating or searching for hidden
patterns of information from various perspectives, in order to categorize
it into useful data.
OR
Data mining is the act of automatically searching large stores of
information to find trends and patterns that go beyond simple analysis
procedures.
2. Describe any four Challenges of Data mining.
ANS: Data mining systems face many challenges and issues in
today’s world. Some of them are mining methodology and user
interaction issues, performance issues, and issues relating to the
diversity of database types.
1. Mining Methodology and User Interaction Issues:
Mining different Kinds of Knowledge in Databases: Different
users need different kinds of knowledge, presented in different ways.
Because each client wants a different kind of information, it becomes
difficult to cover the vast range of data needed to meet every client's
requirements.
Interactive Mining of Knowledge at Multiple Levels of
Abstraction: Interactive mining allows users to focus the search for
patterns from different angles. The data mining process should be
interactive because it is difficult to know what can be discovered
within a database.
Incorporation of Background Knowledge: Background knowledge
is used to guide discovery process and to express the discovered
patterns.
Dhrupesh Sir 9699692059 DWM
Query Languages and Ad-hoc Mining: Relational query languages
(such as SQL) allow users to pose ad-hoc queries for data retrieval.
A data mining query language should be well integrated with the
query language of the data warehouse.
Handling Noisy or Incomplete Data: In a large database, many of
the attribute values will be incorrect, due to human error or
instrument failure. Data cleaning methods and data analysis methods
are used to handle noisy data.
2. Performance Issues:
Efficiency and Scalability of Data Mining Algorithms: To
effectively extract information from a huge amount of data in databases,
data mining algorithms must be efficient and scalable.
Parallel, Distributed and Incremental Mining Algorithms: The
huge size of many databases, the wide distribution of data, and
complexity of some data mining methods are factors motivating the
development of parallel and distributed data mining algorithms. Such
algorithms divide the data into partitions, which are processed in parallel.
3. Issues Related to the Diversity of Database Types:
Handling of Relational and Complex Types of Data: There are
many kinds of data stored in databases and data warehouses. It is not
possible for one system to mine all of these kinds of data; different
data mining systems should be constructed for different kinds of data.
Mining Information from Heterogeneous Databases and
Global Information Systems: Data is fetched from different
data sources over a Local Area Network (LAN) or Wide Area Network
(WAN), so the discovery of knowledge from such heterogeneous,
differently structured sources is a great challenge for data mining.
4. Accuracy of Data Issues:
Data mining techniques are not 100 percent accurate. Earlier, while
collecting information about certain elements, one usually sought
help directly from clients, but nowadays everything has changed:
mining technology and its methods have made the process of
information collection much easier. A possible limitation of a data
mining system is that it can guarantee the accuracy of data only
within its own limits.
3. Explain steps involved in KDD process with
diagram.
OR
4. Explain in detail Knowledge Discovery of Database
(KDD).
ANS: Knowledge Discovery of Database (KDD):
Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data.
Data mining is used by companies to learn customer preferences,
determine the prices of their products and services, and analyse the
market.
Data mining is also known as knowledge discovery in Database
(KDD).
Steps in the KDD process:
1. Data cleaning:
Data cleaning removes noise (errors) and inconsistent data.
2. Data integration:
Multiple data sources may be combined into a single unit.
3. Data selection:
The data relevant to the analysis task are retrieved from the database.
4. Data transformation:
The data are transformed and consolidated into forms appropriate for
mining by performing summary or aggregation operations; i.e., data of
varied types from different sources can be converted into a single
standard format.
5. Data mining:
Data mining is the process in which intelligent methods or algorithms are
applied on data to extract useful data patterns.
6. Pattern evaluation:
This process identifies the truly interesting patterns representing actual
knowledge based on user requirements for analysis.
7. Knowledge presentation:
In this process, visualization and knowledge representation techniques are
used to present mined knowledge to users for analysis.
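The seven steps above can be sketched end-to-end on toy data. This is a minimal illustration in Python; every function name, variable, and data value below is invented for the example and is not part of any standard library:

```python
# A minimal sketch of KDD steps 1-4 on toy tuples; all names here
# are illustrative, not from any specific library.

def clean(rows):
    # 1. Data cleaning: drop tuples with missing (None) values.
    return [r for r in rows if None not in r]

def integrate(*sources):
    # 2. Data integration: combine multiple sources into one set.
    return [r for source in sources for r in source]

def select(rows, cols):
    # 3. Data selection: keep only the task-relevant attributes.
    return [tuple(r[c] for c in cols) for r in rows]

def transform(rows):
    # 4. Data transformation: scale the numeric field into [0, 1].
    top = max(r[1] for r in rows)
    return [(name, value / top) for name, value in rows]

# Toy source tables: (item, units sold, store id)
source_a = [("milk", 40, "S1"), ("bread", None, "S1")]
source_b = [("eggs", 80, "S2")]

data = transform(select(clean(integrate(source_a, source_b)), cols=(0, 1)))
print(data)  # [('milk', 0.5), ('eggs', 1.0)]
```

The remaining steps (mining, pattern evaluation, presentation) would then operate on `data`.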
5. Explain various data objects and attributes types.
ANS: Data Objects:
Data sets are made up of data objects.
A data object represents an entity.
Example: in a sales database, the objects may be customers, store
items, and sales; in a medical database, the objects may be patients.
Data objects are typically described by attributes.
If the data objects are stored in a database, they are data tuples. That
is, the rows of a database correspond to the data objects, and the
columns correspond to the attributes.
Attribute:
Attribute is a data field that represents characteristics or features of a
data object.
For a customer object, attributes can be customer ID, address, etc.
A set of attributes is used to describe an object.
Types of attributes:
1. Qualitative Attributes
2. Quantitative Attributes
1. Qualitative Attributes:
a. Nominal Attributes (N):
These attributes are related to names.
The values of a nominal attribute are names of things or symbols.
They represent some category or state, which is why nominal
attributes are also referred to as categorical attributes; there is
no order (rank, position) among the values of a nominal attribute.
Example: hair colour (black, brown, grey), occupation.
b. Binary Attributes (B):
Binary data has only 2 values/states.
Example: yes or no, affected or unaffected, true or false.
Symmetric: both values are equally important (e.g., Gender).
Asymmetric: both values are not equally important (e.g., the Result
of a medical test, where the positive outcome matters more).
c. Ordinal Attributes (O):
The Ordinal Attributes contains values that have a meaningful sequence
or ranking(order) between them.
Example: grades (A, B, C), ranks (first, second, third), size (small,
medium, large).
2. Quantitative Attributes:
a. Numeric:
A numeric attribute is quantitative because it is a measurable
quantity, represented in integer or real values.
Example: age in years, salary.
b. Discrete:
Discrete data have finite values; they can be numerical or
categorical. These attributes have a finite or countably infinite
set of values.
Example: number of students in a class, zip codes.
c. Continuous:
Continuous data have an infinite number of states and are of float
type; there can be infinitely many values between 2 and 3.
Example: height, weight, temperature.
6. Explain Data preprocessing technique in data
mining
OR
7. Explain major tasks in data preprocessing.
ANS: The major tasks in Data Preprocessing:
1. Data Cleaning.
2. Data Integration.
3. Data Transformation.
4. Data Reduction.
5. Data Discretization.
Data Cleaning
Real-world data tend to be incomplete, noisy, and inconsistent. Data
cleaning (or data cleansing) routines attempt to fill in missing values,
smooth out noise while identifying outliers, and correct inconsistencies in
the data.
a. Handling Missing Values.
b. Cleaning of noisy data.
Data Integration
Data integration is one of the steps of data pre-processing that
involves combining data residing in different sources and providing
users with a unified view of these data.
It merges the data from multiple data stores (data sources). It includes
multiple databases, cubes or flat files.
There are mainly two major approaches for data integration
commonly known as "tight coupling approach" and "loose coupling
approach".
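A loose-coupling style integration can be sketched as a query-time merge of two hypothetical data stores on a shared key. The store contents and field names below are invented purely for illustration:

```python
# Hedged sketch: merging customer records from two hypothetical
# stores on a shared customer-id key, producing a unified view.

store_a = {101: {"name": "Asha"}, 102: {"name": "Ravi"}}
store_b = {101: {"city": "Pune"}, 103: {"city": "Surat"}}

unified = {}
for key in set(store_a) | set(store_b):
    record = {}
    record.update(store_a.get(key, {}))  # fields from source A, if any
    record.update(store_b.get(key, {}))  # fields from source B, if any
    unified[key] = record

print(unified[101])  # {'name': 'Asha', 'city': 'Pune'}
```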
Data Transformation
In data mining preprocessing, and especially in metadata and data
warehousing, data transformation is used to convert data from a
source data format into the destination format.
For example, the values -2, 32, 100, 59, 48 can be normalized to
-0.02, 0.32, 1.00, 0.59, 0.48.
Here, the data are transformed or consolidated into forms appropriate for
mining. Data Transformation operations would contribute toward the
success of the mining process.
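The transformation in the numeric example above is decimal-scaling normalization, which can be sketched as follows (the function name is illustrative):

```python
# Decimal-scaling normalization: divide each value by 10^j, where j
# is the smallest integer making every |v| / 10^j <= 1.

def decimal_scale(values):
    j = 0
    while max(abs(v) for v in values) / (10 ** j) > 1:
        j += 1
    return [v / (10 ** j) for v in values]

print(decimal_scale([-2, 32, 100, 59, 48]))
# [-0.02, 0.32, 1.0, 0.59, 0.48]
```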
Data Reduction
A database or data warehouse may store terabytes of data, so it may
take a very long time to perform data analysis and mining on such
huge amounts of data.
Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume, yet
closely maintains the integrity of the original data.
That is, mining on the reduced data set should be more efficient yet
produce the same (or almost the same) analytical results.
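One simple reduction technique is random sampling without replacement, sketched below on toy data; the seed and the sizes are arbitrary choices made for the illustration:

```python
import random

# Sketch: simple random sampling as a data-reduction technique.
# The reduced set is much smaller, yet for large data it preserves
# summary statistics approximately.

random.seed(42)                         # fixed seed for repeatability
full_data = list(range(10_000))         # stand-in for a huge data set
sample = random.sample(full_data, 100)  # reduced representation

print(len(sample))  # 100
```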
Data Discretization
Discretization and concept hierarchy generation are powerful tools for
data mining, in that they allow the mining of data at multiple levels of
abstraction.
Data discretization and concept hierarchy generation are also forms of
data reduction. The raw data are replaced by a smaller number of
interval or concept labels. This simplifies the original data and makes
the mining more efficient.
The resulting patterns mined are typically easier to understand.
Concept hierarchies are also useful for mining at multiple abstraction
levels.
In discretization, the raw values of a numeric attribute (e.g., age)
are replaced by interval labels (e.g., 0-10, 11-20, etc.) or conceptual
labels (e.g., youth, adult, senior). The labels, in turn, can be
recursively organized into higher-level concepts, resulting in a concept
hierarchy for the numeric attribute.
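Both kinds of labels can be sketched in a few lines; the bin width and the age cut-offs below are illustrative choices, not fixed definitions:

```python
# Discretization sketch: raw ages replaced by interval labels, then
# rolled up into conceptual labels (youth / adult / senior).

def interval_label(age, width=10):
    # Map an age to an equal-width interval label, e.g. 34 -> "30-39".
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def concept_label(age):
    # Illustrative cut-offs for the higher-level concept labels.
    if age <= 20:
        return "youth"
    elif age <= 60:
        return "adult"
    return "senior"

ages = [7, 34, 68]
print([interval_label(a) for a in ages])  # ['0-9', '30-39', '60-69']
print([concept_label(a) for a in ages])   # ['youth', 'adult', 'senior']
```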
8. Describe the need of data preprocessing.
ANS: Data preprocessing is the key step in data mining for identifying
missing values, inconsistencies, noise, errors, and outliers. Without
data preprocessing, these data errors would survive and lower the
quality of the data mining results. Preprocessing removes missing or
inconsistent data values resulting from human or computer error, which
improves the accuracy, quality, and reliability of a dataset.
9. List methods of data preprocessing.
ANS: The methods of data preprocessing are:
1. Data Cleaning.
2. Data Integration.
3. Data Transformation.
4. Data Reduction.
5. Data Discretization.
10. Explain any two data cleaning methods.
ANS: Data Cleaning:
Real-world data tend to be incomplete, noisy, and inconsistent.
Data cleaning (or data cleansing) routines attempt to fill in missing
values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
Method 1: Handling Missing Values:
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value: replace all
missing attribute values by the same constant, such as a label like
“Unknown”.
4. Use a measure of central tendency for the attribute (e.g., the mean
or median) to fill in the missing value.
5. Use the attribute mean or median for all samples belonging to the
same class as the given tuple.
6. Use the most probable value to fill in the missing value.
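Strategies 3 and 4 above can be sketched on a toy attribute, with missing values represented as None (the income figures are invented for the example):

```python
from statistics import mean, median

# Sketch: filling missing values with a global constant and with
# measures of central tendency (mean, median).

income = [3000, None, 4500, None, 6000]
observed = [v for v in income if v is not None]

by_constant = [v if v is not None else "Unknown" for v in income]
by_mean = [v if v is not None else mean(observed) for v in income]
by_median = [v if v is not None else median(observed) for v in income]

# Both central-tendency strategies fill the gaps with 4500 here.
print(by_mean, by_median)
```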
Method 2: Handling Noisy Data:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole
data is divided into segments (bins) of equal size, and each segment
is handled separately: all values in a segment can be replaced by the
segment's mean, or the boundary values of the segment can be used.
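Both variants can be sketched on nine sorted values split into three equal-size bins (the data values are illustrative):

```python
# Binning sketch: smoothing sorted data by bin means and by bin
# boundaries (each value snaps to its closest bin boundary).

data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

by_means = [[sum(b) / len(b)] * len(b) for b in bins]
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(by_means)       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```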
2. Regression:
Data smoothing can also be done by regression, a technique that
conforms data values to a function.
Linear regression involves finding the “best” line to fit two attributes
(or variables) so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression, where
more than two attributes are involved and the data are fit to a
multidimensional surface.
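Linear regression can be sketched with the closed-form least-squares formulas; the data points below are invented, scattered around y = 2x:

```python
# Least-squares line fit: one attribute (y) is smoothed by predicting
# it from another (x) via the fitted line.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]          # noisy values around y = 2x
slope, intercept = fit_line(xs, ys)
smoothed = [slope * x + intercept for x in xs]
print(round(slope, 2), round(intercept, 2))  # 1.94 0.15
```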
3. Clustering and Outlier analysis:
Clustering groups the similar data in a cluster. Outliers may be
detected by clustering, for example, where similar values are
organized into groups, or “clusters.” Intuitively, values that fall
outside of the set of clusters may be considered outliers.
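A minimal sketch of clustering-based outlier detection in one dimension: values are grouped into clusters by a distance threshold, and singleton clusters are flagged as outliers (the gap threshold and data are illustrative choices):

```python
# Sketch: 1-D clustering by a distance gap; values that end up alone,
# far from every cluster, are intuitively outliers.

def cluster_1d(values, gap=5):
    values = sorted(values)
    clusters, current = [], [values[0]]
    for v in values[1:]:
        if v - current[-1] <= gap:
            current.append(v)      # close enough: same cluster
        else:
            clusters.append(current)
            current = [v]          # start a new cluster
    clusters.append(current)
    return clusters

values = [10, 11, 12, 30, 31, 95]
clusters = cluster_1d(values)
outliers = [c[0] for c in clusters if len(c) == 1]
print(outliers)  # [95]
```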
11. Explain Data cleaning.
ANS: Data Cleaning
Real-world data tend to be incomplete, noisy, and inconsistent. Data
cleaning (or data cleansing) routines attempt to fill in missing values,
smooth out noise while identifying outliers, and correct inconsistencies in
the data.
a. Handling Missing Values.
b. Cleaning of noisy data.
Fig: Data Cleaning
12. Explain Data Cleaning as a Process.
ANS: Data Cleaning as a Process
Up to now, under data cleaning, we have seen different techniques for
handling missing data and for smoothing data. Missing values, noise,
outliers, and inconsistencies all contribute to inaccurate data.
An outlier is a data object that deviates significantly from the rest of
the objects, as if it were generated by a different mechanism.
The first step in data cleaning as a process is discrepancy detection.
Discrepancies can be caused by several factors:
1. Poor data entry by human beings or certain data may not be
considered important at the time of entry.
2. Inconsistent with other recorded data or inconsistencies due to data
integration.
3. Data not entered due to misunderstanding.
4. History or changes of the data were not registered.
5. Errors in instrumentation devices that record data and system errors.
6. Errors can also occur when the data are (inadequately) used for
purposes other than originally intended.
7. Experimental errors (data extraction or experiment
planning/executing errors).