2nd Slides

The document discusses the importance of understanding data types and attributes for effective data mining, emphasizing the need to analyze data characteristics such as distribution and quality. It categorizes attributes into nominal, binary, ordinal, and numeric types, explaining their properties and examples. Additionally, it covers statistical measures of central tendency and dispersion to assess data distributions and identify outliers.


• To conduct successful data mining, the first step is to become familiar with your data.
• You may want to know the following: What are the types of attributes or fields that
make up your data?
• What kind of values does each attribute have?
• How are the values distributed?
• How can we measure the similarity of some data objects with respect to others?
• Gaining such insights into the data will help with the subsequent analysis.
• Moreover, real-world data are typically noisy, enormous in volume (often several
gigabytes or more), and may originate from a hodgepodge of heterogeneous
sources.
• How can we measure the quality of data?
• How can we clean and integrate data from multiple heterogeneous sources?
• How can we normalize, compress, or transform the data?
• How can we reduce the dimensionality of data to help subsequent analysis?
Data types
• Data sets are made up of data objects.
• A data object represents an entity—in a sales database, the objects
may be customers, store items, and sales; in a medical database, the
objects may be patients; in a university database, the objects may be
students, professors, and courses.
• Data objects are typically described by attributes.
• Data objects can also be referred to as samples, examples, instances,
data points, or objects.
• If the data objects are stored in a database, they are data tuples.
• That is, the rows of a database correspond to the data objects, and
the columns correspond to the attributes.
• What is an attribute?
• An attribute is a data field, representing a characteristic or feature of a data object.
• The nouns attribute, dimension, feature, and variable are often used interchangeably in the
literature.
• The term dimension is commonly used in data warehousing.
• Machine learning literature tends to use the term feature, whereas statisticians prefer the term
variable.
• Data mining and database professionals commonly use the term attribute, and we do here as
well.
• Attributes describing a customer object can include, for example, customer_ID, name, and
address.
• Observed values for a given attribute are known as observations.
• A set of attributes used to describe a given object is called an attribute vector (or feature vector).
• The distribution of data involving one attribute (or variable) is called univariate.
• A bivariate distribution involves two attributes, and so on.
• The type of an attribute is determined by the set of possible values—nominal, binary, ordinal, or
numeric—the attribute can have.
Nominal attributes
• Nominal means “relating to names.”
• The values of a nominal attribute are symbols or names of things.
• Each value represents some kind of category, code, or state, and so nominal
attributes are also referred to as categorical.
• The values do not have any meaningful order. In computer science, the values
are also known as enumerations.
• Example of nominal attributes:
• Suppose that hair_color and marital_status are two attributes describing person
objects.
• In our application, possible values for hair_color are black, brown, blond, red, auburn,
gray, and white.
• The attribute marital_status can take on the values single, married, divorced, and
widowed. Both hair_color and marital_status are nominal attributes.
• Another example of a nominal attribute is occupation, with the values teacher, dentist,
programmer, farmer, and so on.
• Although we said that the values of a nominal attribute are symbols or “names
of things,” it is possible to represent such symbols or “names” with numbers.
• With hair_color, for instance, we can assign a code of 0 for black, 1 for brown,
and so on.
• Another example is customer_ID, with possible values that are all numeric.
However, in such cases, the numbers are not intended to be used
quantitatively.
• That is, mathematical operations on values of nominal attributes are not
meaningful.
• It makes no sense to subtract one customer ID number from another, unlike,
say, subtracting an age value from another (where age is a numeric attribute).
• Even though a nominal attribute may have integers as values, it is not
considered a numeric attribute because the integers are not meant to be used
quantitatively.
• Because nominal attribute values do not have any meaningful order
about them and are not quantitative, it makes no sense to find the
mean (average) value or median (middle) value for such an attribute,
given a set of objects.
• One thing that is of interest, however, is the attribute’s most
commonly occurring value.
• This value, known as the mode, is one of the measures of central
tendency.
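For a nominal attribute, the mode can be found with a simple frequency count. A minimal sketch, using hypothetical hair_color observations:

```python
from collections import Counter

# Hypothetical hair_color observations for a set of person objects
hair_color = ["black", "brown", "brown", "blond", "black", "brown", "red"]

# The mode is the most frequently occurring value
mode, count = Counter(hair_color).most_common(1)[0]
print(mode, count)  # brown 3
```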
Binary attributes
• A binary attribute is a nominal attribute with only two categories or
states: 0 or 1, where 0 typically means that the attribute is absent,
and 1 means that it is present.
• Binary attributes are referred to as Boolean if the two states
correspond to true and false.
• Example of binary attributes.
• Given the attribute smoker describing a patient object, 1 indicates that the
patient smokes, whereas 0 indicates that the patient does not.
• Similarly, suppose the patient undergoes a medical test that has two possible
outcomes.
• The attribute medical_test is binary, where a value of 1 means the result of
the test for the patient is positive, whereas 0 means the result is negative.
• A binary attribute is symmetric if both of its states are equally
valuable and carry the same weight; that is, there is no preference on
which outcome should be coded as 0 or 1.
• One such example could be the attribute gender having the states
male and female.
• A binary attribute is asymmetric if the outcomes of the states are not
equally important, such as the positive and negative outcomes of a
medical test for HIV.
• By convention, we code the most important outcome, which is usually
the rarer one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV
negative).
Ordinal attributes
• An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
• Example of ordinal attributes.
• Suppose that drink_size corresponds to the size of drinks available at a fast-food
restaurant.
• This ordinal attribute has three possible values: small, medium, and large.
• The values have a meaningful sequence (which corresponds to increasing drink
size); however, we cannot tell from the values how much bigger, say, a large is
than a medium.
• Other examples of ordinal attributes include grade (e.g., A+, A, A−, B+, and so
on) and professional_rank.
• Professional ranks can be enumerated in a sequential order: for example,
assistant, associate, and full for professors, and private, private second class,
private first class, specialist, corporal, sergeant, . . . for army ranks.
• Ordinal attributes are useful for registering subjective assessments of
qualities that cannot be measured objectively; thus ordinal attributes
are often used in surveys for ratings.
• In one survey, participants were asked to rate how satisfied they were
as customers.
• Customer satisfaction had the following ordinal categories: 1: very
dissatisfied, 2: dissatisfied, 3: neutral, 4: satisfied, and 5: very
satisfied.
• Ordinal attributes may also be obtained from the discretization of
numeric quantities by splitting the value range into a finite number of
ordered categories.
• The central tendency of an ordinal attribute can be represented by its
mode and its median (the middle value in an ordered sequence), but
the mean cannot be defined.
• Note that nominal, binary, and ordinal attributes are qualitative.
• That is, they describe a feature of an object without giving an actual
size or quantity.
• The values of such qualitative attributes are typically words
representing categories.
• If integers are used, they represent computer codes for the
categories, as opposed to measurable quantities (e.g., 0 for small
drink size, 1 for medium, and 2 for large).
• We will look at numeric attributes, which provide quantitative
measurements of an object as given below:
Numeric attributes
• A numeric attribute is quantitative; that is, it is a measurable
quantity, represented in integer or real values.
• Numeric attributes can be interval-scaled or ratio-scaled.
Interval-scaled attributes:
• Interval-scaled attributes are measured on a scale of equal-size units.
• The values of interval-scaled attributes have order and can be positive, 0, or negative.
• Thus, in addition to providing a ranking of values, such attributes allow us to compare and quantify the
difference between values.
• Example of interval-scaled attributes.
• A temperature attribute is interval-scaled.
• Suppose that we have the outdoor temperature values for a number of different days, where each day is
an object.
• By ordering the values, we obtain a ranking of the objects with respect to temperature.
• In addition, we can quantify the difference between values.
• For example, a temperature of 20 ◦C is five degrees higher than a temperature of 15 ◦C.
• Calendar dates are another example.
• For instance, the years 2012 and 2020 are eight years apart.
• Temperatures in Celsius and Fahrenheit do not have a true zero-point, that is,
neither 0 ◦C nor 0 ◦F indicates “no temperature.”
• (On the Celsius scale, for example, the unit of measurement is 1/100 of the
difference between the melting and boiling temperatures of water at
atmospheric pressure.)
• Although we can compute the difference between temperature values, we
cannot talk of one temperature value as being a multiple of another.
• Without a true zero, we cannot say, for instance, that 10 ◦C is twice as warm as 5
◦C.
• That is, we cannot speak of the values in terms of ratios.
• Similarly, there is no true zero-point for calendar dates.
• (The year 0 does not correspond to the beginning of time.)
• This brings us to ratio-scaled attributes, for which a true zero-point exists.
• Because interval-scaled attributes are numeric, we can compute their mean
value, in addition to the median and mode measures of central tendency.
Ratio-scaled attributes
• A ratio-scaled attribute is a numeric attribute with an inherent zero-point.
• That is, if a measurement is ratio-scaled, we can speak of a value as being a
multiple (or ratio) of another value.
• In addition, the values are ordered, and we can also compute the difference
between values, as well as the mean, median, and mode.
• Example of ratio-scaled attributes.
• Unlike temperatures in Celsius and Fahrenheit, the Kelvin (K) temperature scale has
what is considered a true zero-point (0 K=−273.15 ◦C):
• It is the point at which all thermal motion ceases in the classical description of
thermodynamics.
• Other examples of ratio-scaled attributes include count attributes such as
years_of_experience (e.g., the objects are employees) and number_of_words (e.g., the
objects are documents).
• Additional examples include attributes to measure weight, height, and speed, and
monetary quantities (e.g., you are 100 times richer with $100 than with $1).
Discrete vs. continuous attributes
• In our presentation, we have organized attributes into nominal,
binary, ordinal, and numeric types.
• There are many ways to organize attribute types.
• The types are not mutually exclusive.
• Classification algorithms developed from the field of machine learning
often consider attributes as being either discrete or continuous.
• Each type may be processed differently.
• A discrete attribute has a finite or countably infinite set of values,
which may or may not be represented as integers.
• The attributes hair_color, smoker, medical_test, and drink_size each
have a finite number of values, and so are discrete.
• Note that discrete attributes may have numeric values, such as 0 and
1 for binary attributes, or the values 0 to 110 for the attribute age.
• An attribute is countably infinite if the set of possible values is infinite
but the values can be put in a one-to-one correspondence with
natural numbers.
• For example, the attribute customer_ID is countably infinite.
• The number of customers can grow to infinity, but in reality, the
actual set of values is countable (where the values can be put in one-
to-one correspondence with the set of integers).
• Zip codes are another example.
• If an attribute is not discrete, it is continuous.
• The terms numeric attribute and continuous attribute are often used
interchangeably in the literature.
• (This can be confusing because, in the classic sense, continuous values
are real numbers, whereas numeric values can be either integers or
real numbers.)
• In practice, real values are represented using a finite number of digits.
• Continuous attributes are typically represented as floating-point
variables.
Statistics of data
• For data preprocessing to be successful, it is essential to have an
overall picture of your data.
• Basic statistical descriptions can be used to identify properties of the
data and highlight which data values should be treated as noise or
outliers.
• We will discuss three areas of basic statistical descriptions.
• We start with measures of central tendency, which measure the
location of the middle or center of a data distribution.
• Intuitively speaking, given an attribute, where do most of its values
fall?
• In particular, we discuss the mean, median, mode, and midrange.
• In addition to assessing the central tendency of our data set, we also would like to
have an idea of the dispersion of the data.
• That is, how are the data spread out?
• The most common data dispersion measures are the range, quartiles (e.g., Q1,
which is the first quartile, i.e., the 25th percentile), and interquartile range; the five-
number summary and boxplots; and the variance and standard deviation of the
data.
• These measures are useful for identifying outliers.
• To facilitate the description of relations among multiple variables, the concepts of
co-variance and correlation coefficient for numerical data and χ2 correlation test for
nominal data will also be explained.
• We will mention graphic displays of basic statistical descriptions to visually inspect
our data.
• Most statistical or graphical data presentation software packages include bar charts, pie
charts, and line graphs.
• Other popular displays of data summaries and distributions include quantile plots, quantile-
quantile plots, histograms, and scatter plots.
Measuring the central tendency
• We will look at various ways to measure the central tendency of data.
• Suppose that we have some attribute X, like salary, which has been
recorded for a set of objects. Let x1, x2, . . . , xN be the set of N
observed values or observations for X.
• Here, these values may also be referred to as the data set (for X).
• If we were to plot the observations for salary, where would most of
the values fall?
• This gives us an idea of the central tendency of the data.
• Measures of central tendency include the mean, median, mode, and
midrange.
• Although the mean is the single most useful quantity for describing a data set, it
is not always the best way of measuring the center of the data.
• A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.
• Even a small number of extreme values can corrupt the mean.
• For example, the mean salary at a company may be substantially pushed up by
that of a few highly paid managers.
• Similarly, the mean score of a class in an exam could be pulled down quite a bit
by a few very low scores.
• To offset the effect caused by a small number of extreme values, we can instead
use the trimmed mean, which is the mean obtained after chopping off values at
the high and low extremes.
• For example, we can sort the values observed for salary and remove the top and
bottom 2% before computing the mean.
• We should avoid trimming too large a portion (such as 20%) at both ends, as this
can result in the loss of valuable information.
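The trimmed mean described above can be sketched in a few lines (the helper name and the sample values are illustrative, not from the source):

```python
def trimmed_mean(values, proportion=0.02):
    """Mean after chopping off the given proportion of values at each extreme."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values to drop at each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

# Trimming 20% from each end of this small hypothetical sample drops 1 and 100
print(trimmed_mean([1, 2, 3, 4, 100], proportion=0.2))  # 3.0
```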
• For skewed (asymmetric) data, a better measure of the center of data is
the median, which is the middle value in a set of ordered data values.
• It is the value that separates the higher half of a data set from the
lower half.
• In probability and statistics, the median generally applies to numeric
data; however, we may extend the concept to ordinal data.
• Suppose that a given data set of N values for an attribute X is sorted in
ascending order.
• If N is odd, then the median is the middle value of the ordered set.
• If N is even, then the median is not unique; it is the two middlemost
values and any value in between.
• If X is a numeric attribute in this case, by convention, the median is
taken as the average of the two middlemost values.
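As a minimal sketch (the salary values are hypothetical), Python's statistics module covers the mean, median, and mode directly; the midrange is a one-liner:

```python
import statistics

# Hypothetical salary observations (in $1000s) for a set of employee objects
salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(salary)              # sum of values / N
median = statistics.median(salary)          # average of the two middle values (N is even)
modes = statistics.multimode(salary)        # this data set is bimodal
midrange = (min(salary) + max(salary)) / 2  # average of smallest and largest values
print(mean, median, modes, midrange)
```

Note how the single extreme value 110 pulls the mean (58) above the median (54), illustrating the mean's sensitivity to outliers.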
• Mode is another measure of central tendency.
• The mode for a set of data is the value that occurs most frequently in the set.
• Therefore, it can be determined for qualitative and quantitative attributes.
• It is possible for the greatest frequency to correspond to several different
values, which results in more than one mode.
• Data sets with one, two, or three modes are respectively called unimodal,
bimodal, and trimodal.
• In general, a data set with two or more modes is multimodal.
• The midrange can also be used to assess the central tendency of a
numeric data set.
• It is the average of the largest and smallest values in the set.
• In a unimodal frequency curve with a perfectly symmetric data
distribution, the mean, median, and mode are all at the same center
value, as shown in Fig. (a) below.
• Data in most real applications are not symmetric.
• They may instead be either positively skewed, where the mode occurs at a
value smaller than the median (Fig. (b) below), or negatively skewed,
where the mode occurs at a value greater than the median (Fig. (c) below).

Figure: Mean, median, and mode of symmetric vs. positively and negatively skewed data.
Measuring the dispersion of data
• We now look at measures to assess the dispersion or spread of
numeric data.
• The measures include range, quantiles, quartiles, percentiles, and the
interquartile range.
• The five-number summary, which can be displayed as a boxplot, is
useful in identifying outliers.
• Variance and standard deviation also indicate the spread of a data
distribution.
Range, quartiles, and interquartile range
• To start off, let’s study the range, quantiles, quartiles, percentiles, and
the interquartile range as measures of data dispersion.

• Suppose that the data for attribute X are sorted in ascending numeric order.
• Imagine that we can pick certain data points so as to split the data distribution into equal-size consecutive
sets, as in the following figure:
• These data points are called quantiles.
• Quantiles are points taken at regular intervals of a data distribution,
dividing it into essentially equal-size consecutive sets.
• (We say “essentially” because there may not be data values of X that
divide the data into exactly equal-size subsets.
• For readability, we will refer to them as equal.)
• The kth q-quantile for a given data distribution is the value x such that
at most k/q of the data values are less than x and at most (q −k)/q of
the data values are more than x, where k is an integer such that 0< k
<q.
• There are q − 1 q-quantiles.
• The 2-quantile is the data point dividing the lower and upper halves of
the data distribution.
• It corresponds to the median.
• The 4-quantiles are the three data points that split the data
distribution into four equal parts; each part represents one-fourth of
the data distribution.
• They are more commonly referred to as quartiles.
• The 100-quantiles are more commonly referred to as percentiles;
they divide the data distribution into 100 equal-size consecutive sets.
• The median, quartiles, and percentiles are the most widely used
forms of quantiles.
• The quartiles give an indication of a distribution’s center, spread, and
shape.
• The first quartile, denoted by Q1, is the 25th percentile.
• It cuts off the lowest 25% of the data.
• The third quartile, denoted by Q3, is the 75th percentile—it cuts off
the lowest 75% (or highest 25%) of the data.
• The second quartile is the 50th percentile.
• Being the median, it gives the center of the data distribution.
• The distance between the first and third quartiles is a simple measure
of spread that gives the range covered by the middle half of the data.
• This distance is called the interquartile range (IQR) and is defined as
IQR = Q3 − Q1.
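The quartiles and IQR can be computed with Python's statistics.quantiles, which with n=4 returns [Q1, Q2, Q3] (the data values here are hypothetical):

```python
import statistics

data = [4, 7, 8, 12, 15, 18, 21, 25, 30, 40]  # hypothetical, already sorted

# With n=4, statistics.quantiles returns the three quartiles [Q1, Q2, Q3]
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 9.0 16.5 24.0 15.0
```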
Five-number summary, boxplots, and outliers
• No single numeric measure of spread (e.g., IQR) is very useful for
describing skewed distributions.
• In the symmetric distribution, the median (and other measures of
central tendency) splits the data into equal-size halves.
• This does not occur for skewed distributions.
• Therefore it is more informative to also provide the two quartiles Q1
and Q3, along with the median.
• A common rule of thumb for identifying suspected outliers is to single
out values falling at least 1.5×IQR above the third quartile or below
the first quartile.
• Because Q1, the median, and Q3 together contain no information about the
endpoints (e.g., tails) of the data, a fuller summary of the shape of a
distribution can be obtained by providing the lowest and highest data
values as well.
• This is known as the five-number summary.
• The five-number summary of a distribution consists of the median (Q2), the
quartiles Q1 and Q3, and the smallest and largest individual observations,
written in the order of Minimum, Q1, Median, Q3, Maximum.
• Boxplots are a popular way of visualizing a distribution.
• A boxplot incorporates the five-number summary as follows:
• Typically, the ends of the box are at the quartiles so that the box length is the
interquartile range.
• The median is marked by a line within the box.
• Two lines (called whiskers) outside the box extend to the smallest (Minimum) and
largest (Maximum) observations.
• When dealing with a moderate number of observations, it is
worthwhile to plot potential outliers individually.
• To do this in a boxplot, the whiskers are extended to the extreme low
and high observations only if these values are less than 1.5×IQR
beyond the quartiles.
• Otherwise, the whiskers terminate at the most extreme observations
occurring within 1.5×IQR of the quartiles.
• The remaining cases are plotted individually.
• Boxplots can be used in the comparisons of several sets of compatible
data.
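The 1.5×IQR rule for flagging suspected outliers, which also decides where boxplot whiskers terminate, can be sketched as follows (the helper name and the price values are hypothetical):

```python
import statistics

def suspected_outliers(values):
    """Flag values falling more than 1.5 x IQR beyond Q1 or Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

prices = [10, 12, 13, 14, 15, 15, 16, 17, 18, 95]  # hypothetical unit prices
print(suspected_outliers(prices))  # [95]
```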
Figure: Boxplot of the unit price data for items sold at four branches of an online store during a given time period.
Variance and standard deviation
• Variance and standard deviation are measures of data dispersion
• They indicate how spread out a data distribution is.
• A low standard deviation means that the data observations tend to be
very close to the mean, whereas a high standard deviation indicates
that the data are spread out over a large range of values.
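A quick sketch using Python's statistics module (the observations are hypothetical; for this particular data set the population variance works out to 4 and the standard deviation to 2):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations; mean is 5

variance = statistics.pvariance(data)  # population variance (sigma squared)
stdev = statistics.pstdev(data)        # population standard deviation (sigma)
print(variance, stdev)
```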
Covariance and correlation analysis
Covariance of numeric data
• Note that correlation does not imply causality.
• That is, if A and B are correlated, this does not necessarily imply that A causes B or that B causes A.
• For example, in analyzing a demographic database, we may find that attributes representing the number of
hospitals and the number of car thefts in a region are correlated.
• This does not mean that one causes the other. Both are actually causally linked to a third attribute, namely,
population.
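Covariance and the Pearson correlation coefficient can be sketched directly from their definitions (the helper names and the per-region counts are hypothetical; as noted above, a high correlation here would not mean that hospitals cause car thefts):

```python
import math

def covariance(x, y):
    """Sample covariance: sum((xi - mean_x)(yi - mean_y)) / (n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation coefficient: cov(x, y) / (std_x * std_y)."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical per-region counts: the two attributes correlate strongly,
# but both are driven by a third attribute (population), not by each other.
hospitals = [2, 4, 6, 8, 10]
car_thefts = [30, 55, 80, 120, 150]
r = correlation(hospitals, car_thefts)
```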
Similarity and distance measures
• In data mining applications, such as clustering, outlier analysis, and
nearest-neighbor classification, we need ways to assess how alike or
unalike objects are in comparison to one another.
• For example, a store may want to search for clusters of customer
objects, resulting in groups of customers with similar characteristics
(e.g., similar income, area of residence, and age).
• Such information can then be used for marketing.
• A cluster is a collection of data objects such that the objects within a
cluster are similar to one another and dissimilar to the objects in
other clusters.
• Outlier analysis also employs clustering-based techniques to identify
potential outliers as objects that are highly dissimilar to others.
• Knowledge of object similarities can also be used in nearest-neighbor
classification schemes where a given object (e.g., a patient) is
assigned a class label (relating to, say, a diagnosis) based on its
similarity toward other objects in the model.
• This section presents similarity and dissimilarity measures, which are
referred to as measures of proximity.
• Similarity and dissimilarity are related.
• A similarity measure for two objects, i and j, will typically return value
0 if the objects are completely unalike.
• The higher the similarity value, the greater the similarity between
objects. (Typically, a value of 1 indicates complete similarity, that is,
the objects are identical.)
• A dissimilarity measure works the opposite way.
• It returns a value of 0 if the objects are the same (and therefore, far from being dissimilar).
• The higher the dissimilarity value, the more dissimilar the two objects are.
• We present two data structures that are commonly used in the above types of applications:
• the data matrix (used to store the data objects) and
• the dissimilarity matrix (used to store dissimilarity values for pairs of objects).
• We also switch to a different notation for data objects than we previously used since now we
are dealing with objects described by more than one attribute.
• We will then discuss how object dissimilarity can be computed for objects described by
nominal attributes, by binary attributes, by numeric attributes, by ordinal attributes, or by
combinations of these attribute types.
• We will provide similarity measures for very long and sparse data vectors, such as term-
frequency vectors representing documents in information retrieval.
• Finally, we will discuss how to measure the difference between two probability distributions
over the same variable x, and introduce a measure, called the Kullback-Leibler divergence, or
simply the KL divergence, which is widely used in the data mining literature.
• Knowing how to compute dissimilarity is useful in studying attributes
and will also be referenced in later topics on clustering, outlier
analysis, and nearest-neighbor classification.
Data matrix vs. dissimilarity matrix
• Previously, we looked at ways of studying the central tendency,
dispersion, and spread of observed values for some attribute X.
• Our objects there were one-dimensional, that is, described by a single
attribute.
• Here we will talk about objects described by multiple attributes.
• Therefore we need a change in notation.
• Suppose that we have n objects (e.g., persons, items, or courses)
described by p attributes (also called measurements or features, such
as age, height, weight, or gender).
• The objects are x1 = (x11, x12, . . . , x1p), x2 = (x21, x22, . . . , x2p),
and so on, where xij is the value for object xi of the jth attribute.
• For brevity, we hereafter refer to object xi as object i.
• The objects may be tuples in a relational database and are also referred to as data samples or feature vectors.
• Main memory-based clustering and nearest-neighbor algorithms
typically operate on either of the following two data structures:
• A data matrix is made up of two entities or “things,” namely rows (for objects) and columns (for attributes).
• Therefore, the data matrix is often called a two-mode matrix.
• The dissimilarity matrix contains one kind of entity (dissimilarities) and so is called a one-mode matrix.
• Many clustering and nearest-neighbor algorithms operate on a dissimilarity matrix.
• Data in the form of a data matrix can be transformed into a dissimilarity matrix before applying such algorithms.
Proximity measures for nominal attributes
• A nominal attribute can take on two or more states.
• For example, map_color is a nominal attribute that may have, say, five
states: red, yellow, green, pink, and blue.
• “How is dissimilarity computed between objects described by nominal
attributes?”
• The dissimilarity between two objects i and j can be computed based
on the ratio of mismatches:

d(i, j) = (p − m) / p,

where m is the number of matches (i.e., the number of attributes for which i and j are in the same state),
and p is the total number of attributes describing the objects.
• Weights can be assigned to increase the effect of m or to assign greater weight to the matches in
attributes having a larger number of states.
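The ratio-of-mismatches dissimilarity d(i, j) = (p − m)/p can be sketched as follows (the attribute values are hypothetical):

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Ratio of mismatches: d(i, j) = (p - m) / p."""
    p = len(obj_i)                                 # total number of attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))  # number of matching states
    return (p - m) / p

# Two hypothetical person objects: (hair_color, marital_status, occupation)
d = nominal_dissimilarity(("black", "single", "teacher"),
                          ("black", "married", "farmer"))
print(d)  # one match out of three attributes: (3 - 1) / 3
```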
• Proximity between objects described by nominal attributes can be
computed using an alternative encoding scheme.
• Nominal attributes can be encoded using asymmetric binary
attributes by creating a new binary attribute for each of the M states.
• For an object with a given state value, the binary attribute
representing that state is set to 1, whereas the remaining binary
attributes are set to 0.
• For example, to encode the nominal attribute map_color, a binary
attribute can be created for each of the five colors previously listed.
• For an object having the color yellow, the yellow attribute is set to 1,
whereas the remaining four attributes are set to 0.
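The asymmetric binary encoding of a nominal attribute (commonly known as one-hot encoding) can be sketched for map_color as follows:

```python
MAP_COLOR_STATES = ["red", "yellow", "green", "pink", "blue"]

def one_hot(value, states=MAP_COLOR_STATES):
    """Encode a nominal value as M asymmetric binary attributes, one per state."""
    return [1 if state == value else 0 for state in states]

print(one_hot("yellow"))  # [0, 1, 0, 0, 0]
```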
Dissimilarity of numeric data: Minkowski distance
• We describe distance measures that are commonly used for computing the
dissimilarity of objects described by numeric attributes.
• These measures include the Euclidean, Manhattan, and Minkowski distances.
• In some cases, the data are normalized before applying distance calculations.
• This involves transforming the data to fall within a smaller or common range,
such as [−1.0, 1.0] or [0.0, 1.0].
• Consider a height attribute, for example, which could be measured in either
meters or inches.
• In general, expressing an attribute in smaller units will lead to a larger range
for that attribute and thus tend to give such attributes greater effect or
“weight.”
• Normalizing the data attempts to give all attributes an equal weight.
• It may or may not be useful in a particular application.
(We have kept p as the number of attributes to be consistent with the rest of the notation used here.)
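The Minkowski distance of order h, d(x, y) = (Σ|xi − yi|^h)^(1/h), generalizes both of the other measures: h = 1 gives the Manhattan distance and h = 2 the Euclidean distance. A minimal sketch (the points are hypothetical):

```python
def minkowski(x, y, h):
    """Minkowski distance of order h: h=1 is Manhattan, h=2 is Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

x, y = (1, 2), (3, 5)
manhattan = minkowski(x, y, 1)  # |1-3| + |2-5| = 5
euclidean = minkowski(x, y, 2)  # sqrt(2**2 + 3**2) = sqrt(13)
```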
Cosine similarity
• Cosine similarity measures the similarity between two vectors of an
inner product space.
• It is measured by the cosine of the angle between two vectors and
determines whether two vectors are pointing in roughly the same
direction.
• It is often used to measure document similarity in text analysis.
• A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as a keyword) or
phrase in the document.
• Thus each document is an object represented by what is called a
term-frequency vector.
• Term-frequency vectors are typically very long and sparse (i.e., they
have many 0 values).
• Applications using such structures include information retrieval, text
document clustering, and biological data analysis.
• The traditional distance measures that we have studied until now do
not work well for such sparse numeric data.
• For example, two term-frequency vectors may have many 0 values in
common, meaning that the corresponding documents do not share
many words, but this does not make them similar.
• We need a measure that will focus on the words that the two
documents do have in common, and the occurrence frequency of such
words.
• In other words, we need a measure for numeric data that ignores zero-
matches.
• Cosine similarity is a measure of similarity that can be used to
compare documents or, say, give a ranking of documents with respect
to a given vector of query words.
• Let x and y be two vectors for comparison.
• Using the cosine measure as a similarity function, we have

sim(x, y) = (x · y) / (‖x‖ ‖y‖),

where ‖x‖ is the Euclidean norm of vector x = (x1, x2, . . . , xp), defined as √(x1² + x2² + · · · + xp²).
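As a sketch, cosine similarity between two hypothetical term-frequency vectors:

```python
import math

def cosine_similarity(x, y):
    """sim(x, y) = (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical term-frequency vectors for two documents
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

Note that the many shared zeros contribute nothing to either the dot product or the norms, which is exactly the zero-match-ignoring behavior the text calls for.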
