Lecture 9 Statistical Learning

statistical learning in data mining

Uploaded by

Ifra Luqman

CIT- 652. DATA MINING.

COURSE INSTRUCTOR : Sheza Naeem

Lecture# 12

Statistical methods in data mining;

 Statistical Inference:-
The totality of the observations with which we are concerned in a statistical analysis, whether their number is finite or infinite, constitutes what we call a population. The term
refers to anything of statistical interest, whether it is a group of people, objects, or events. The
number of observations in the population is defined as the size of the population.
In general, populations may be finite or infinite, but some finite populations are so
large that, in theory, we assume them to be infinite. In the field of statistical inference, we are
interested in arriving at conclusions concerning a population when it is impossible or impractical to
observe the entire set of observations that make up the population. For example, in attempting to
determine the average length of the life of a certain brand of light bulbs, it would be practically
impossible to test all such bulbs. Therefore, we must depend on a subset of observations from the
population for most statistical-analysis applications. In statistics, a subset of a population is called a
sample, or data set.
From a given data set, we build a statistical model of the population that will help us to
make inferences concerning that same population. If our inferences from the data set are to be
valid, we must obtain samples that are representative of the population. Very often, we are
tempted to choose a data set by selecting the most convenient members of the population. But
such an approach may lead to erroneous inferences concerning the population. Any sampling
procedure that produces inferences that consistently overestimate or underestimate some
characteristics of the population is said to be biased. To guard against bias in the
sampling procedure, it is desirable to choose a random data set in the sense that the observations
are made independently and at random. The main purpose of selecting random samples is to elicit
information about unknown population parameters.
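
The difference between a convenient sample and a random one can be illustrated with a small simulation. This is only a sketch: the population of bulb lifetimes (normally distributed around 1000 hours), the sample size of 100, and all variable names are assumptions made for illustration, not data from the lecture.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: lifetimes (in hours) of 10,000 bulbs. It is sorted
# so that the "most convenient" members (the first ones) are the shortest-lived.
population = sorted(random.gauss(1000, 100) for _ in range(10_000))

# Convenience sample: simply take the first 100 members (biased).
convenience_sample = population[:100]

# Random sample: every member has the same chance of being selected.
random_sample = random.sample(population, 100)

population_mean = statistics.fmean(population)
convenience_mean = statistics.fmean(convenience_sample)
random_mean = statistics.fmean(random_sample)

print(f"population mean:         {population_mean:.1f}")
print(f"convenience sample mean: {convenience_mean:.1f}")
print(f"random sample mean:      {random_mean:.1f}")
```

In this setup the convenience sample underestimates the population mean by construction, while the random sample lands much closer to it.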
Statistical inference is the main form of reasoning relevant to data analysis. The
theory of statistical inference consists of those methods by which one makes inferences or
generalizations about a population. These methods may be categorized into two major areas:
estimation and tests of hypotheses.
In estimation, one wants to come up with a plausible value or a range of plausible values
for the unknown parameters of the system. The goal is to gain information from a data set T in
order to estimate one or more parameters w belonging to the model of the real-world system.
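
As a rough sketch of estimation, the snippet below computes a single plausible value (a point estimate) and a range of plausible values (an approximate 95% confidence interval) for an unknown population mean. The data set T holds made-up bulb lifetimes, and the interval uses a normal approximation; for a sample this small a t-distribution would usually be preferred, so the numbers are illustrative only.

```python
import math
from statistics import NormalDist, fmean, stdev

# Hypothetical data set T: observed lifetimes (hours) of ten bulbs.
T = [985.0, 1012.0, 968.0, 1033.0, 1001.0, 990.0, 1020.0, 979.0, 1008.0, 995.0]
n = len(T)

# Point estimate of the unknown parameter w (here: the population mean).
w_hat = fmean(T)

# Interval estimate: w_hat +/- z * s / sqrt(n), using the normal approximation.
s = stdev(T)                      # sample standard deviation
z = NormalDist().inv_cdf(0.975)   # about 1.96 for a 95% interval
half_width = z * s / math.sqrt(n)
interval = (w_hat - half_width, w_hat + half_width)

print(f"point estimate: {w_hat:.1f}")
print(f"approx. 95% interval: ({interval[0]:.1f}, {interval[1]:.1f})")
```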
In statistical testing, on the other hand, one has to decide whether a hypothesis
concerning the value of the population characteristic should be accepted or rejected in the light of
an analysis of the data set. A statistical hypothesis is an assertion or conjecture concerning one or
more populations. The truth or falsity of a statistical hypothesis can never be known with absolute
certainty, unless we examine the entire population. This, of course, would be impractical in most
situations, sometimes even impossible. Instead, we test a hypothesis on a randomly selected data
set. Evidence from the data set that is inconsistent with the stated hypothesis leads to a rejection of
the hypothesis, whereas evidence consistent with the hypothesis leads to its acceptance, in the sense that the data give no grounds for rejecting it.
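
A minimal sketch of such a test is shown below, assuming a two-sided z-test for a population mean with a 5% significance level. The function name z_test_mean, the data, and the hypothesized values are all hypothetical; the lecture does not prescribe a particular test.

```python
import math
from statistics import NormalDist, fmean, stdev

def z_test_mean(data, mu0, alpha=0.05):
    """Two-sided z-test of the hypothesis 'population mean == mu0'.

    Returns (z statistic, p-value, reject flag). Uses the sample standard
    deviation and a normal approximation, so it is only a rough sketch.
    """
    n = len(data)
    standard_error = stdev(data) / math.sqrt(n)
    z = (fmean(data) - mu0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Hypothetical data set: 30 bulb lifetimes centered on 1000 hours.
data = [1000 + d for d in [-20, -10, 0, 10, 20] * 6]

# Hypothesis consistent with the data: not rejected.
z_same, p_same, reject_same = z_test_mean(data, mu0=1000)

# Hypothesis inconsistent with the data: rejected.
z_far, p_far, reject_far = z_test_mean(data, mu0=900)

print(reject_same, reject_far)
```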

 ASSESSING DIFFERENCES IN DATA SETS:-


For many data-mining tasks, it would be useful to learn the more general characteristics
about the given data set, regarding both central tendency and data dispersion. These simple parameters
of data sets are obvious descriptors for assessing differences between different data sets. Typical
measures of central tendency include mean, median, and mode, while measures of data dispersion
include variance and standard deviation. The most common and effective numeric measure of the center
of the data set is the mean value (also called the arithmetic mean). For the set of n numeric values x1, x2, …, xn of a given feature X, the mean is

$$\text{mean} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

and it is a built-in function (like all other descriptive statistical measures) in most modern statistical
software tools. For each numeric feature in the n-dimensional set of samples, it is possible to calculate the
mean value as a central tendency characteristic for this feature. Sometimes, each value xi in a set may be
associated with a weight wi, which reflects the frequency of occurrence, significance, or importance
attached to the value. In this case, the weighted arithmetic mean or the weighted average value is

$$\text{mean} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
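
The arithmetic mean and the weighted arithmetic mean can be computed as in the following sketch. The values x and the weights w are made-up numbers chosen so that the arithmetic is easy to check by hand.

```python
from statistics import fmean

# Hypothetical feature values and their weights (e.g. frequencies).
x = [2.0, 4.0, 4.0, 6.0, 9.0]
w = [1.0, 2.0, 2.0, 1.0, 4.0]

# Arithmetic mean: sum of the values divided by n.
mean = fmean(x)  # (2 + 4 + 4 + 6 + 9) / 5

# Weighted arithmetic mean: sum of w_i * x_i divided by the sum of the weights.
weighted_mean = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

print(mean, weighted_mean)  # 5.0 6.0
```

The large weight on the value 9 pulls the weighted mean above the plain mean.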

Although the mean is the most useful quantity that we use to describe a set of data, it is not the only one.
For skewed data sets, a better measure of the center of data is the median. It is the middle value of the
ordered set of feature values if the set consists of an odd number of elements, and it is the average of the
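middle two values if the number of elements in the set is even. If x1, x2, …, xn represent a data set of size
n, arranged in ascending order of magnitude, then the median is defined by

$$\text{median} = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right) & \text{if } n \text{ is even} \end{cases}$$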
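
The definition of the median translates directly into code. The helper below is a hand-written sketch (Python's standard library already offers statistics.median); the function name and the sample values are illustrative assumptions.

```python
from statistics import fmean

def median(values):
    """Middle value of the ordered data (odd n), or the average of the two
    middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical skewed data set: one very large value drags the mean upward,
# while the median stays near the bulk of the data.
skewed = [3, 1, 8, 5, 40]
print(median(skewed))        # 5
print(fmean(skewed))         # 11.4
print(median([3, 1, 8, 5]))  # 4.0
```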

Another measure of the central tendency of a data set is the mode. The mode for the set of data is the
value that occurs most frequently or repeatedly in the set.
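
As a short sketch, the mode can be obtained with Python's standard statistics module. The two lists are invented examples; multimode returns every most-frequent value, which matters when more than one value ties for the highest count.

```python
from statistics import mode, multimode

# Hypothetical data sets.
one_peak = [3, 5, 5, 6, 7, 5, 9]    # the value 5 occurs three times
two_peaks = [1, 2, 2, 3, 4, 4, 5]   # the values 2 and 4 both occur twice

single_mode = mode(one_peak)        # the single most frequent value
all_modes = multimode(two_peaks)    # list of all most frequent values

print(single_mode)  # 5
print(all_modes)    # [2, 4]
```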
We classify data sets as unimodal (with only one mode) and multimodal (with two or more modes).
Multimodal data sets may be more precisely described as bimodal, trimodal, etc. For unimodal frequency curves
that are moderately asymmetrical, we have the following useful empirical relation for numeric data sets:

$$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$$

The degree to which numeric data tend to spread is called dispersion of the data, and the most common
measures of dispersion are the standard deviation σ and the variance σ². The variance of n numeric values
x1, x2, …, xn is

$$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \text{mean})^2$$

The standard deviation σ is the square root of the variance σ².
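
The sketch below computes the variance and the standard deviation for a small made-up data set. It assumes the n − 1 divisor (the sample variance); statistics.pvariance is shown for comparison because it divides by n instead.

```python
import math
from statistics import fmean, pvariance, stdev, variance

# Hypothetical feature values.
x = [2.0, 4.0, 4.0, 6.0, 9.0]
n = len(x)
mean = fmean(x)  # 5.0

# Variance computed by hand with the n - 1 divisor (an assumption here).
var_by_hand = sum((xi - mean) ** 2 for xi in x) / (n - 1)  # 28 / 4

# The standard deviation is the square root of the variance.
std_by_hand = math.sqrt(var_by_hand)

print(var_by_hand, variance(x))  # 7.0 7.0
print(std_by_hand, stdev(x))     # both about 2.6458
print(pvariance(x))              # variance with divisor n: 28 / 5
```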
