Classification is the task of assigning a class label to an input pattern.
The class label indicates
one of a given set of classes. The classification is carried out with the help of a model obtained
using a learning procedure. According to the type of learning used, there are two categories of
classification, one using supervised learning and the other using unsupervised learning.
Supervised learning makes use of a set of examples which already have the class labels
assigned to them. Unsupervised learning attempts to find inherent structures in the
data. Semi-supervised learning makes use of a small number of labeled examples and a large
number of unlabeled examples to learn the classifier.
Sensing
The sensors in a system are what receive the input data, and they may vary depending
on the purpose of the system. They are usually some form of transducer, such as a camera or
a microphone.
Segmentation
After receiving the input data, the different patterns need to be separated. Segmentation is
one of the toughest problems of pattern recognition because many patterns tend to
overlap and intermingle. For example, trying to recognize the pattern of the individual sound
"s" in the two words "see" and "son" would prove difficult, because the sound is pronounced
differently in the two words, and using the same model to segment the "s" would not be
accurate.
Feature Extraction and Classification
Here, the goal is to characterize the data to be recognized by measurements that will give
the same results for data in the same category and different results for data in different
categories. This leads to finding distinguishing features that are invariant to any
transformations of the data. How well the input can be classified into different categories depends
on the features of the data. While perfect classification is often impossible, an easier task is to
find the probability of the data fitting one of the categories.
Post Processing
The post-processor uses the output of the classifier to decide on the recommended action on
the data.
The image to the right shows the various components of a pattern recognition
system.
The design of a pattern recognition system also involves repeating the design cycle, which
contains different activities. The different activities involved are:
Data Collection
Feature and Model Choices: The choice of the distinguishing features we will be looking
for is a critical step. Prior knowledge about the incoming data also helps in selecting the
right features.
Training: The process of using the data to determine the classifier is known as training the
classifier.
Evaluation: Evaluation is important to measure the performance of the system and also
indicate any room for improvement.
The image to the right shows an example of the design cycle for a pattern recognition
system.
Bayesian Decision Theory
Introduction
Bayesian decision theory is a fundamental statistical approach to the problem of pattern
classification. It is considered the ideal case in which the probability structure underlying the
categories is known perfectly. While this sort of situation rarely occurs in practice, it permits
us to determine the optimal (Bayes) classifier against which we can compare all other
classifiers. Moreover, in some problems it enables us to predict the error we will get when we
generalize to novel patterns.
This approach is based on quantifying the tradeoffs between various classification decisions
using probability and the costs that accompany such decisions. It makes the assumption that
the decision problem is posed in probabilistic terms, and that all of the relevant probability
values are known.
One of the most well-known equations in the world of statistics and probability is Bayes’
Theorem (see formula below). The basic intuition is that the probability of some class or event
occurring, given some feature (i.e. attribute), is calculated based on the likelihood of the
feature’s value and any prior information about the class or event of interest. This seems like a
lot to digest, so I will break it down for you. First off, the case of cancer detection is a two-
class problem. The first class, ω1, represents the event that a tumor is present, and ω2
represents the event that a tumor is not present.
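For this two-class problem, Bayes’ Theorem can be written as

P(ωi|x) = p(x|ωi) P(ωi) / p(x), for i = 1, 2

where P(ωi) is the prior, p(x|ωi) is the likelihood, p(x) is the evidence, and P(ωi|x) is the posterior.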
Prior
There are four parts to Bayes’ Theorem: Prior, Evidence, Likelihood, and Posterior. The
priors, P(ω1) and P(ω2), define how likely it is for event ω1 or ω2 to occur in nature. It is
important to realize that the priors vary depending on the situation. Since the objective is to detect
cancer, it is safe to say that the probability of a tumor being present is pretty low: P(ω1) < P(ω2).
Likelihood
From a high level, a CT scan is produced by applying x-rays in a circular motion. One of the key
metrics that is produced is attenuation — a measurement of x-ray absorption. Objects with a
higher density have a higher attenuation and vice-versa. Therefore, a tumor is more likely to
have a high attenuation compared to lung tissue.
Suppose you only look at attenuation values to help make your decision between ω1 and ω2.
Each class has a class-conditional probability density, p(x|ω1) and p(x|ω2), called likelihoods.
The figure below shows a hypothetical class-conditional probability density for p(x|ω). These
distributions are extracted by analyzing your training data; however, it is always good to have
domain expertise to check the validity of the data.
Evidence
The best way to describe the evidence, p(x), is through the law of total probability. This law
states that if you have mutually exclusive events (e.g. ω1 and ω2) whose probabilities of
occurrence sum to 1, then the probability of some feature (e.g. attenuation) is the likelihood
times the prior, summed across all mutually exclusive events.
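In symbols, for the two-class problem the evidence is

p(x) = p(x|ω1) P(ω1) + p(x|ω2) P(ω2).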
Posterior
The result of using Bayes’ Theorem is called the posterior, P(ω1|x) and P(ω2|x). The posterior
represents the probability that an observation falls into class ω1 or ω2 (i.e. tumor present or
not) given the measurement x (e.g. attenuation). Each observation receives a posterior
probability for every class, and all the posteriors must add up to 1. With regard to the cancer
detection problem we are trying to solve, there are two posterior probabilities. The image
below is a hypothetical scenario of how the posterior values could change with respect to a
measurement x. In addition to being driven by the likelihoods, the posteriors can be heavily
affected by the priors P(ω1) and P(ω2).
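As a purely illustrative calculation with made-up numbers, suppose P(ω1) = 0.05 and P(ω2) = 0.95,
and that for some observed attenuation x the likelihoods are p(x|ω1) = 0.9 and p(x|ω2) = 0.1. Then

p(x) = 0.9 × 0.05 + 0.1 × 0.95 = 0.14
P(ω1|x) = (0.9 × 0.05) / 0.14 ≈ 0.32
P(ω2|x) = (0.1 × 0.95) / 0.14 ≈ 0.68

so even a strongly tumor-like measurement only raises the posterior probability of ω1 to about 0.32,
because the prior probability of a tumor is so small.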
Decision Rules
Now that we have a good understanding of Bayes’ theorem, it’s time to see how we can use it
to make a decision boundary between our two classes. There are two methods for determining
whether a patient has a tumor present or not. The first is a basic approach that only uses the
prior probability values to make a decision. The second way utilizes the posteriors, which
takes advantage of the priors and class-conditional probability distributions.
Using the Priors
Suppose we only make a decision based on the natural prior probabilities. This means we
forget about all the other factors in Bayes’ Theorem. Since the probability of having a tumor,
P(ω1), is far less than not having one P(ω2), our model/system will always decide that every
patient does not have a tumor. Even though the model/system will be correct most of the time,
it will not identify the patients who actually have a tumor and need proper medical attention.
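Formally, this prior-only rule is: decide ω1 if P(ω1) > P(ω2), and decide ω2 otherwise. Since
P(ω1) < P(ω2) for cancer detection, the rule decides ω2 (no tumor) for every patient.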
Using the Posteriors
Now let’s take a more comprehensive approach by using the posteriors, P(ω1|x) and P(ω2|x).
Since the posteriors are a result of Bayes’ Theorem, the impact of the priors is mitigated by the
class-conditional probability densities, p(x|ω1) and p(x|ω2). If our model/system is looking at
a region with a higher attenuation than ordinary tissue, then the probability of a tumor being
present increases despite the natural prior probabilities. Let’s assume there is a 75% chance
that a specific region contains a tumor; that would mean there is a 25% chance there is no
tumor at all. That 25% chance is our probability of error, also known as risk.
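As a rough illustration of this posterior-based rule, the Python sketch below assumes, purely for
the sake of example, that the two class-conditional densities of attenuation are Gaussians with
made-up means and variances, and that the priors are the hypothetical values used above; none
of these numbers come from real data.

import numpy as np
from scipy.stats import norm

# Hypothetical priors: tumors are rare
prior = {"tumor": 0.05, "no_tumor": 0.95}

# Hypothetical class-conditional (likelihood) models of attenuation,
# assumed Gaussian purely for illustration
likelihood = {
    "tumor": norm(loc=60.0, scale=10.0),      # tumors: higher attenuation
    "no_tumor": norm(loc=30.0, scale=10.0),   # ordinary tissue: lower attenuation
}

def posteriors(x):
    """Apply Bayes' theorem to a measured attenuation value x."""
    joint = {c: likelihood[c].pdf(x) * prior[c] for c in prior}
    evidence = sum(joint.values())            # law of total probability
    return {c: joint[c] / evidence for c in joint}

def decide(x):
    """Bayes decision rule: pick the class with the largest posterior."""
    post = posteriors(x)
    label = max(post, key=post.get)
    error = 1.0 - post[label]                 # probability of error for this x
    return label, error

print(decide(35.0))   # low attenuation: decides "no_tumor"
print(decide(65.0))   # high attenuation: posterior shifts toward "tumor"

The reported error is exactly the kind of risk described above: if the posterior for the chosen
class is 0.75, the probability of error is 0.25.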
K-means clustering is one of the simplest and most popular unsupervised machine learning
algorithms.
Typically, unsupervised algorithms make inferences from datasets using only input vectors
without referring to known, or labelled, outcomes.
The objective of K-means is simple: group similar data points together and discover
underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of
clusters in a dataset.
A cluster refers to a collection of data points aggregated together because of certain
similarities.
Define a target number k, which refers to the number of centroids in the dataset. A centroid is
the imaginary or real location representing the center of the cluster.
Every data point is allocated to one of the clusters by reducing the in-cluster sum of
squares.
In other words, the K-means algorithm identifies k centroids, and then allocates
every data point to the nearest centroid, while keeping the clusters as compact as possible.
The ‘means’ in the K-means refers to averaging of the data; that is, finding the centroid.
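Written out, the quantity being reduced is the within-cluster sum of squares

J = Σk Σ(xi in cluster k) ||xi − μk||²

where μk is the centroid (mean) of cluster k and the outer sum runs over all k clusters.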
How the K-means algorithm works
To process the learning data, the K-means algorithm in data mining starts with a first group of
randomly selected centroids, which are used as the beginning points for every cluster, and then
performs iterative (repetitive) calculations to optimize the positions of the centroids.
It halts creating and optimizing clusters when either:
• The centroids have stabilized — there is no change in their values because the
clustering has been successful.
• The defined number of iterations has been achieved.
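A minimal NumPy sketch of this iterative loop (not a production implementation, and assuming
every cluster keeps at least one point) might look like:

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Tiny K-means sketch: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Start with k randomly selected data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign every point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Halt when the centroids have stabilized
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

Here the loop stops early once the centroids stop changing, which matches the first halting
condition above; otherwise it simply runs for the defined number of iterations.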
K-means algorithm
Step 1: Import libraries
• Pandas for reading and writing spreadsheets
• Numpy for carrying out efficient computations
• Matplotlib for visualization of data
Step 2: Generate random data
Step 3: Use Scikit-Learn
Step 4: Finding the centroid
Step 5: Testing the algorithm
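Putting the five steps above together, one possible end-to-end sketch using scikit-learn (with
randomly generated two-dimensional blobs standing in for a real dataset, and k fixed at 2 purely
for illustration) is:

# Step 1: import libraries (Pandas would be used to read a real spreadsheet;
# here the data are generated instead)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Step 2: generate random data as two rough blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(-5, -5), scale=1.0, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=1.0, size=(100, 2)),
])

# Step 3: use Scikit-Learn to fit K-means with k = 2
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 4: find the centroids
centroids = kmeans.cluster_centers_
print(centroids)

# Step 5: test the algorithm on new points
print(kmeans.predict([[-4.5, -5.2], [4.8, 5.1]]))

# Visualize the clusters and their centroids
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=20)
plt.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="x", s=100)
plt.show()

In practice the value of k would be chosen based on the data rather than fixed at 2 in advance.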
K-means clustering is an extensively used technique for data cluster analysis.
It is easy to understand, especially if you accelerate your learning using a K-means clustering
tutorial. Furthermore, it delivers training results quickly.
However, its performance is usually not as competitive as that of other, more sophisticated
clustering techniques, because slight variations in the data can lead to high variance.
Furthermore, clusters are assumed to be spherical and evenly sized, which may
reduce the accuracy of the K-means clustering results.