DATA MINING
COURSE INSTRUCTOR: Sheza Naeem
Lecture # 28
Genotype Representation:
One of the most important decisions to make while implementing a genetic algorithm is choosing the
representation we will use for our solutions. It has been observed that an improper representation
can lead to poor performance of the GA.
Therefore, choosing a proper representation and properly defining the mapping between the
phenotype and genotype spaces is essential for the success of a GA.
In this section, we present some of the most commonly used representations for genetic algorithms. However,
representation is highly problem specific, and the reader might find that another representation, or a mix of the
representations mentioned here, suits their problem better.
Binary Representation
This is one of the simplest and most widely used representations in GAs. In this type of representation, the
genotype consists of bit strings.
For some problems, when the solution space consists of Boolean decision variables (yes or no), the binary
representation is natural. Take for example the 0/1 Knapsack Problem. If there are n items, we can represent a
solution by a binary string of n elements, where the xth element tells whether item x is picked (1) or not (0).
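As a brief illustration (the bit pattern and item weights below are made-up values, not from any particular knapsack instance), such a chromosome can be stored directly as a list of 0/1 genes:

# A hypothetical 0/1 Knapsack solution with n = 5 items.
# chromosome[x] == 1 means item x is picked, 0 means it is left out.
chromosome = [1, 0, 1, 1, 0]

weights = [12, 7, 11, 8, 9]   # made-up item weights
picked_weight = sum(w for w, gene in zip(weights, chromosome) if gene == 1)
print(picked_weight)          # 31 for this example chromosome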
For other problems, specifically those dealing with numbers, we can represent the numbers by their binary
encoding. The problem with this kind of encoding is that different bits have different significance, and
therefore mutation and crossover operators can have undesired consequences. This can be resolved to some
extent by using Gray coding, in which a change of one bit does not have a massive effect on the encoded value.
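A minimal sketch of the idea, assuming the standard binary-reflected Gray code (computed as n XOR (n >> 1)); consecutive integers differ in exactly one bit, so a single-bit mutation corresponds to a small change in the encoded number:

def to_gray(n: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

for n in range(8):
    print(n, format(n, "03b"), format(to_gray(n), "03b"))
# Gray codes 000, 001, 011, 010, 110, 111, 101, 100: neighbouring values
# always differ in a single bit, unlike plain binary (e.g. 011 -> 100).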
Real Valued Representation
For problems where we want to define the genes using continuous rather than discrete variables, the real
valued representation is the most natural. The precision of these real valued (floating point) numbers is,
however, limited by the computer.
Integer Representation
For discrete valued genes, we cannot always limit the solution space to a binary ‘yes’ or ‘no’. For example, if we
want to encode the four directions – North, South, East and West – we can encode them as {0, 1, 2, 3}. In such
cases, integer representation is desirable.
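As a small sketch (the direction-to-integer mapping follows the example above, and the move sequence is made up), a path of moves can then be encoded as a list of integer genes:

# Integer representation: each gene is one of the four directions.
DIRECTIONS = {0: "North", 1: "South", 2: "East", 3: "West"}

chromosome = [2, 2, 0, 3, 1]   # a made-up sequence of moves
print([DIRECTIONS[g] for g in chromosome])
# ['East', 'East', 'North', 'West', 'South']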
Permutation Representation
In many problems, the solution is represented by an order of elements. In such cases permutation
representation is the most suited.
A classic example of this representation is the travelling salesman problem (TSP). Here, the salesman has to
take a tour of all the cities, visiting each city exactly once, and come back to the starting city. The total distance
of the tour has to be minimized. A solution to the TSP is naturally an ordering, or permutation, of all the cities,
and therefore a permutation representation makes sense for this problem.
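A brief sketch under made-up assumptions (five cities with invented coordinates): the chromosome is simply a permutation of city indices, and its quality is the length of the closed tour it describes:

import math

cities = [(0, 0), (2, 1), (5, 2), (3, 4), (1, 3)]   # made-up city coordinates

# A permutation chromosome: visit every city exactly once, in this order.
chromosome = [0, 2, 4, 1, 3]

def tour_length(order, coords):
    """Total length of the closed tour, returning to the starting city."""
    total = 0.0
    for i in range(len(order)):
        x1, y1 = coords[order[i]]
        x2, y2 = coords[order[(i + 1) % len(order)]]   # wrap around to the start
        total += math.hypot(x2 - x1, y2 - y1)
    return total

print(round(tour_length(chromosome, cities), 2))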
Genetic Algorithms - Population
Population is a subset of solutions in the current generation. It can also be defined as a set of chromosomes.
There are several things to be kept in mind when dealing with GA population −
● The diversity of the population should be maintained otherwise it might lead to premature
convergence.
● The population size should not be kept very large as it can cause a GA to slow down, while a smaller
population might not be enough for a good mating pool. Therefore, an optimal population size needs to
be decided by trial and error.
The population is usually stored as a two dimensional array of size population size x chromosome size.
Population Initialization
There are two primary methods to initialize a population in a GA. They are −
● Random Initialization − Populate the initial population with completely random solutions.
● Heuristic initialization − Populate the initial population using a known heuristic for the problem.
It has been observed that the entire population should not be initialized using a heuristic, as it can result in the
population having similar solutions and very little diversity. It has been experimentally observed that the
random solutions are the ones to drive the population to optimality. Therefore, with heuristic initialization, we
just seed the population with a couple of good solutions, filling up the rest with random solutions rather than
filling the entire population with heuristic based solutions.
It has also been observed that heuristic initialization in some cases only affects the initial fitness of the
population; in the end, it is the diversity of the solutions which leads the population to optimality.
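A minimal sketch of such mixed initialization for a bit-string GA; the function name, the seed solutions and the population and chromosome sizes are illustrative assumptions, and the population is stored as a list of lists of the population size x chromosome size shape described earlier:

import random

def init_population(pop_size, chrom_size, seed_solutions=()):
    """Seed a few heuristic solutions, fill the rest with random bit strings."""
    population = [list(s) for s in seed_solutions][:pop_size]
    while len(population) < pop_size:
        population.append([random.randint(0, 1) for _ in range(chrom_size)])
    return population

# e.g. two heuristic (greedy) solutions as seeds, the remaining 48 random
seeds = [[1, 0, 1, 1, 0], [1, 1, 0, 1, 0]]
population = init_population(pop_size=50, chrom_size=5, seed_solutions=seeds)
print(len(population), len(population[0]))   # 50 5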
Population Models
There are two population models widely in use −
Steady State
In a steady state GA, we generate one or two offspring in each iteration and they replace one or two
individuals in the population. A steady state GA is also known as an Incremental GA.
Generational
In a generational model, we generate ‘n’ offspring, where n is the population size, and the entire population
is replaced by the new one at the end of the iteration.
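A toy sketch of the generational model on a made-up bit-counting objective (the 2% mutation rate, the population size and the random choice of parents are assumptions kept for brevity; a real GA would pick parents with one of the selection schemes described later):

import random

def fitness(individual):
    return sum(individual)            # toy objective: count of 1-bits

def make_offspring(population):
    # pick two parents (here simply at random), apply one-point crossover,
    # then a small bit-flip mutation on the child
    p1, p2 = random.sample(population, 2)
    cut = random.randint(1, len(p1) - 1)
    child = p1[:cut] + p2[cut:]
    return [g ^ 1 if random.random() < 0.02 else g for g in child]

random.seed(0)
population = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]

# Generational model: build a full new population of the same size and
# replace the old one wholesale at the end of each iteration. A steady state
# GA would instead create one or two offspring and replace one or two
# individuals per iteration.
for generation in range(50):
    population = [make_offspring(population) for _ in range(len(population))]

print(max(fitness(ind) for ind in population))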
Genetic Algorithms - Fitness Function:
The fitness function, simply defined, is a function which takes a candidate solution to the problem as input and
produces as output how “fit” or how “good” the solution is with respect to the problem under consideration.
Calculation of fitness value is done repeatedly in a GA and therefore it should be sufficiently fast. A slow
computation of the fitness value can adversely affect a GA and make it exceptionally slow.
In most cases the fitness function and the objective function are the same as the objective is to either
maximize or minimize the given objective function. However, for more complex problems with multiple
objectives and constraints, an Algorithm Designer might choose to have a different fitness function.
A fitness function should possess the following characteristics −
● The fitness function should be sufficiently fast to compute.
● It must quantitatively measure how fit a given solution is or how fit individuals can be produced from
the given solution.
In some cases, calculating the fitness function directly might not be possible due to the inherent complexities
of the problem at hand. In such cases, we do fitness approximation to suit our needs.
As an example, consider the fitness calculation for a solution of the 0/1 Knapsack. It is a simple fitness
function which just sums the profit values of the items being picked (those with a 1), scanning the elements
from left to right till the knapsack is full.
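A minimal sketch of that fitness function; the profits, weights and capacity below are made-up values, and stopping the scan at the first picked item that no longer fits is just one simple way of handling solutions that would overflow the knapsack:

def knapsack_fitness(chromosome, profits, weights, capacity):
    """Sum the profits of picked items (genes equal to 1), scanning left to
    right and stopping once the next picked item no longer fits."""
    total_profit, total_weight = 0, 0
    for gene, profit, weight in zip(chromosome, profits, weights):
        if gene == 1:
            if total_weight + weight > capacity:
                break                 # the knapsack is full
            total_profit += profit
            total_weight += weight
    return total_profit

profits = [10, 5, 15, 7, 6]           # made-up instance
weights = [2, 3, 5, 7, 1]
print(knapsack_fitness([1, 0, 1, 1, 0], profits, weights, capacity=10))   # 25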
Genetic Algorithms - Parent Selection:
Parent Selection is the process of selecting the parents which mate and recombine to create offspring for the
next generation. Parent selection is very crucial to the convergence rate of the GA, as good parents drive
individuals towards better and fitter solutions.
However, care should be taken to prevent one extremely fit solution from taking over the entire population in
a few generations, as this leads to the solutions being close to one another in the solution space thereby
leading to a loss of diversity. Maintaining good diversity in the population is extremely crucial for the success
of a GA. This taking up of the entire population by one extremely fit solution is known as premature
convergence and is an undesirable condition in a GA.
Fitness Proportionate Selection
Fitness Proportionate Selection is one of the most popular ways of parent selection. In this scheme, every individual can
become a parent with a probability which is proportional to its fitness. Therefore, fitter individuals have a
higher chance of mating and propagating their features to the next generation. Such a selection
strategy thus applies a selection pressure towards the fitter individuals in the population, evolving better individuals
over time.
Consider a circular wheel. The wheel is divided into n pies, where n is the number of individuals in the
population. Each individual gets a portion of the circle which is proportional to its fitness value.
Two implementations of fitness proportionate selection are possible −
Roulette Wheel Selection
In roulette wheel selection, the circular wheel is divided as described before. A fixed point is chosen on the
wheel circumference and the wheel is rotated. The region of the wheel which comes in front of the
fixed point is chosen as the parent. For the second parent, the same process is repeated.
It is clear that a fitter individual has a greater pie on the wheel and therefore a greater chance of landing in
front of the fixed point when the wheel is rotated. Therefore, the probability of choosing an individual depends
directly on its fitness.
Implementation wise, we use the following steps (a sketch follows the list) −
● Calculate S = the sum of all fitnesses.
● Generate a random number r between 0 and S.
● Starting from the top of the population, keep adding the fitnesses to a partial sum P, as long as P < r.
● The individual whose fitness takes P past r is the chosen individual.
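A minimal sketch of these steps; the example population labels and fitness values are made up:

import random

def roulette_wheel_select(population, fitnesses):
    """Pick one parent with probability proportional to its fitness."""
    S = sum(fitnesses)                    # S = the sum of all fitnesses
    r = random.uniform(0, S)              # a random point on the wheel
    P = 0.0
    for individual, fit in zip(population, fitnesses):
        P += fit                          # keep adding fitnesses to P
        if P >= r:                        # the individual that takes P past r
            return individual             # is the chosen one
    return population[-1]                 # guard against floating-point drift

population = ["A", "B", "C", "D"]         # made-up individuals
fitnesses = [8.1, 8.0, 8.05, 7.95]
print(roulette_wheel_select(population, fitnesses))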
Stochastic Universal Sampling (SUS)
Stochastic Universal Sampling is quite similar to roulette wheel selection; however, instead of having just one
fixed point, we have multiple, evenly spaced fixed points on the wheel. Therefore, all the parents are
chosen in just one spin of the wheel. Such a setup also encourages the highly fit individuals to be chosen at
least once.
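A compact sketch of SUS under the same made-up population as above; the pointers are evenly spaced along the total fitness and offset by a single random number:

import random

def stochastic_universal_sampling(population, fitnesses, num_parents):
    """Choose num_parents individuals with evenly spaced pointers on the
    fitness wheel, using a single random offset (one 'spin')."""
    total = sum(fitnesses)
    spacing = total / num_parents
    start = random.uniform(0, spacing)             # the only random draw
    pointers = [start + i * spacing for i in range(num_parents)]

    parents, cumulative, idx = [], fitnesses[0], 0
    for p in pointers:
        while cumulative < p:                      # advance the wheel until the
            idx += 1                               # pointer lands in a segment
            cumulative += fitnesses[idx]
        parents.append(population[idx])
    return parents

population = ["A", "B", "C", "D"]
fitnesses = [8.1, 8.0, 8.05, 7.95]
print(stochastic_universal_sampling(population, fitnesses, num_parents=2))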
It is to be noted that fitness proportionate selection methods don’t work for cases where the fitness can take a
negative value.
Tournament Selection
In K-Way tournament selection, we select K individuals from the population at random and select the best out
of these to become a parent. The same process is repeated for selecting the next parent. Tournament
Selection is also extremely popular in literature as it can even work with negative fitness values.
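A short sketch of K-way tournament selection (K = 3 here); it compares fitnesses only, so the made-up negative values below are handled without any trouble:

import random

def tournament_select(population, fitnesses, k=3):
    """K-way tournament: sample k individuals at random, return the fittest."""
    contestants = random.sample(range(len(population)), k)
    best = max(contestants, key=lambda i: fitnesses[i])
    return population[best]

population = ["A", "B", "C", "D", "E"]
fitnesses = [-2.0, 3.5, 1.0, -0.5, 2.2]   # negative fitness values are fine
print(tournament_select(population, fitnesses, k=3))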
Rank Selection
Rank Selection also works with negative fitness values and is mostly used when the individuals in the
population have very close fitness values (this usually happens towards the end of the run). Close fitness
values lead to each individual having an almost equal share of the pie (as in fitness proportionate selection),
and hence every individual, no matter how fit relative to the others, has approximately the same probability
of getting selected as a parent. This in turn leads to a loss of selection pressure towards fitter individuals,
making the GA make poor parent selections in such situations.
In rank selection, we remove the concept of a raw fitness value while selecting a parent. Instead, every
individual in the population is ranked according to its fitness, and the selection of the parents depends on the
rank of each individual rather than on the fitness itself. The higher ranked individuals are preferred over the
lower ranked ones.
Chromosome    Fitness Value    Rank
A             8.1              1
B             8.0              4
C             8.05             2
D             7.95             6
E             8.02             3
F             7.99             5
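A minimal sketch of rank selection on the fitness values from the table above; the linear rank-to-probability weighting used here (weight n for rank 1 down to weight 1 for rank n) is one common choice among several:

def rank_selection_probabilities(fitnesses):
    """Selection probabilities that depend only on rank (1 = fittest),
    not on the raw, possibly very close or negative, fitness values."""
    n = len(fitnesses)
    order = sorted(range(n), key=lambda i: fitnesses[i], reverse=True)
    ranks = [0] * n
    for position, i in enumerate(order, start=1):
        ranks[i] = position               # position 1 goes to the fittest
    total = n * (n + 1) / 2               # sum of the weights 1..n
    return ranks, [(n - r + 1) / total for r in ranks]

fitnesses = [8.1, 8.0, 8.05, 7.95, 8.02, 7.99]   # chromosomes A..F as above
ranks, probs = rank_selection_probabilities(fitnesses)
print(ranks)                              # [1, 4, 2, 6, 3, 5], matching the table
print([round(p, 3) for p in probs])       # A gets the largest share, D the smallest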
Random Selection
In this strategy we randomly select parents from the existing population. There is no selection pressure
towards fitter individuals and therefore this strategy is usually avoided.