Advanced Image Processing Techniques
for Remotely Sensed Hyperspectral Data

P. K. Varshney • M. K. Arora

Springer
Professor Dr. Pramod K. Varshney
Syracuse University
Department of Electrical Engineering
and Computer Science
Syracuse, NY 13244
U.S.A.
This work is subject to copyright. All rights are reserved, whether the whole or part of this material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitations, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or
parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its
current version, and permission for use must always be obtained from Springer-Verlag Berlin Heidelberg GmbH.
Violations are liable to prosecution under the German Copyright Law.
springeronline.com
Over the last fifty years, a large number of spaceborne and airborne sensors
have been employed to gather information regarding the earth's surface and
environment. As sensor technology continues to advance, remote sensing data
with improved temporal, spectral, and spatial resolution is becoming more
readily available. This widespread availability of enormous amounts of data
has necessitated the development of efficient data processing techniques for
a wide variety of applications. In particular, great strides have been made in the
development of digital image processing techniques for remote sensing data.
The goal has been efficient handling of vast amounts of data, fusion of data
from diverse sensors, classification for image interpretation, and development
of user-friendly products that allow rich visualization.
This book presents some new algorithms that have been developed for high-
dimensional datasets, such as multispectral and hyperspectral imagery. The
contents of the book are based primarily on research carried out by some
members and alumni of the Sensor Fusion Laboratory at Syracuse University.
Early chapters, which provide an overview of multispectral and hyperspectral
sensing and of digital image processing, have been prepared by other leading
experts in the field. The intent of this book is to present the material at a level
suitable for a diverse audience ranging from beginners to advanced remote
sensing researchers and practitioners. This book can be used as a text or
a reference in courses on image analysis and remote sensing. It can also be
used as a reference book by remote sensing researchers, scientists and users,
as well as by resource managers and planners.
We would like to thank all the authors for their enthusiasm and active
cooperation during the course of this project. The research conducted at Syra-
cuse University was supported by NASA under grant NAG5-11227. We thank
Syracuse University for assistance in securing this grant. We also thank the
Indian Institute of Technology, Roorkee for granting a postdoctoral leave that
enabled Manoj K. Arora to participate in this endeavor. We are grateful to sev-
eral organizations for providing and allowing the use of remote sensing data
in the illustrative examples presented in this book. They include the Labora-
tory for Applications of Remote Sensing - Purdue University (AVIRIS data),
USGS (Landsat ETM+ data), Eastman Kodak (IRS PAN, Radarsat SAR, and
HyMap data as well as digital aerial photographs), NASA (IKONOS data) and
DIRS laboratory of the Center for Imaging Science at Rochester Institute of
Technology.
Introduction
The Challenge
What is Hyperspectral Imaging?
Structure of the Book

Part I General

Part II Theory

3 Mutual Information: A Similarity Measure for Intensity Based Image Registration
3.1 Introduction
3.2 Mutual Information Similarity Measure
3.3 Joint Histogram Estimation Methods
3.3.1 Two-Step Joint Histogram Estimation
3.3.2 One-Step Joint Histogram Estimation
3.4 Interpolation Induced Artifacts
3.5 Generalized Partial Volume Estimation of Joint Histograms
3.6 Optimization Issues in the Maximization of MI
3.7 Summary
Manoj K. Arora
Department of Civil Engineering
Indian Institute of Technology Roorkee
Roorkee 247667, India
Tel: +91-1332-285417; Fax: +91-1332-273560
E-mail: manojfce@iitr.ernet.in
Hua-mei Chen
Department of Computer Science and Engineering
The University of Texas at Arlington
Box 19015, Arlington, TX 76019-0015
Tel: +1-817-272-1394; Fax: +1-817-272-3784
E-mail: hchen@cse.uta.edu
Teerasit Kasetkasem
Electrical Engineering Department
Kasetsart University, 50 Phanonyothin Rd.
Chatuchak, Bangkok 10900, Thailand
E-mail: fengtsk@ku.ac.th
Richard M. Lucas
Institute of Geography and Earth Sciences,
The University of Wales, Aberystwyth,
Aberystwyth, Ceredigion,
SY23 3DB, Wales, UK
Tel: +44-1970-622612; Fax: +44-1970-622659
E-mail: rml@aber.ac.uk
Ray Merton
School of Biological, Earth & Environmental Sciences
The University of New South Wales
Sydney NSW 2052, Australia
Tel: +61-2-93858713;
Fax: +61-2-93851558
E-mail: r.merton@unsw.edu.au
Olaf Niemann
Department of Geography
University of Victoria
PO Box 3050 STN CSC
Victoria, B.C., V8W 3P5, Canada
Tel: +1-250-4724624; Fax: +1-250-7216216
E-mail: oniemann@office.geog.uvic.ca
Mahesh Pal
Department of Civil Engineering
National Institute of Technology
Kurukshetra, 136119, Haryana, India
Tel: +91-1744-239276; Fax: +91-1744-238050
E-mail: mpce_pal@yahoo.co.uk
Raghuveer M. Rao
Department of Electrical Engineering
Rochester Institute of Technology
79 Lomb Memorial Drive
Rochester NY 14623-5603
Tel: +1-585-4752185; Fax: +1-585-4755845
E-mail: mrreee@rit.edu
Stefan A. Robila
301 Richardson Hall
Department of Computer Science
Montclair State University
1 Normal Ave, Montclair, NJ 07043
E-mail: robilas@mail.montclair.edu
Aled Rowlands
Institute of Geography and Earth Sciences,
The University of Wales, Aberystwyth,
Aberystwyth, Ceredigion, SY23 3DB, Wales, UK
Tel: +44-1970-622598; Fax: +44-1970-622659
E-mail: alr@aber.ac.uk
Chintan A. Shah
Department of Electrical Engineering and Computer Science
121 Link Hall
Syracuse University, Syracuse, NY, 13244, USA
E-mail: cashah@ecs.syr.edu
Pramod K. Varshney
Department of Electrical Engineering and Computer Science
121 Link Hall
Syracuse University, Syracuse, NY, 13244, USA
Tel: +1-315-4434013; Fax: +1-315-4432583
E-mail: varshney@syr.edu
Pakorn Watanachaturaporn
Department of Electrical Engineering and Computer Science
121 Link Hall
Syracuse University, Syracuse, NY, 13244, USA
E-mail: pwatanac@syr.edu
Introduction
The Challenge
From time immemorial, humans have had the urge to see the unseen, to peer
beneath the earth, and to see distant bodies in the heavens. This primordial
curiosity, embedded deep in the human psyche, led to the birth of satellites and
space programs. Satellite images, due to their synoptic view, map-like format,
and repetitive coverage, are a viable source of extensive information.
In recent years, the extraordinary developments in satellite remote sensing
have transformed this science from an experimental application into a tech-
nology for studying many aspects of earth sciences. These sensing systems
provide us with data critical to weather prediction, agricultural forecasting,
resource exploration, land cover mapping and environmental monitoring, to
name a few. In fact, no segment of society has remained untouched by this
technology.
Over the last few years, there has been a remarkable increase in the number
of remote sensing sensors on-board various satellite and aircraft platforms. No-
ticeable is the availability of data from hyperspectral sensors such as AVIRIS,
HYDICE, HyMap and HYPERION. The hyperspectral data together with geo-
graphical information system (GIS) derived ancillary data form an exceptional
spatial database for any scientific study related to Earth's environment. Thus,
significant advances have been made in remote sensing data acquisition, stor-
age and management capabilities.
The availability of huge spatial databases brings in new challenges for the
extraction of quality information. The sheer increase in the volume of data
available has created the need for the development of new techniques that can
automate extraction of useful information to the greatest degree. Moreover,
these techniques need to be objective, reproducible, and feasible to implement
within available resources (DeFries and Chan, 2000). A number of image analy-
sis techniques have been developed to process remote sensing data with varied
amounts of success. A majority of these techniques have been standardized and
implemented in various commercial image processing software systems such
as ERDAS Imagine, ENVI and ER Mapper. These techniques are suitable
for the processing of multispectral data but have limitations when it comes to
an efficient processing of the large amount of hyperspectral data available in
hundreds of bands. Thus, the conventional techniques may be inappropriate
for hyperspectral data analysis.
What is Hyperspectral Imaging?
Since the initial acquisition of satellite images, remote sensing technology has
not looked back. A number of earth satellites have been launched to advance our
understanding of Earth's environment. The satellite sensors, both active and
passive, capture data from visible to microwave regions of the electromagnetic
spectrum. The multispectral sensors gather data in a small number of bands
(also called features) with broad wavelength intervals. No doubt, multispectral
sensors are innovative. However, due to relatively few spectral bands, their
spectral resolution is insufficient for many precise earth surface studies.
When spectral measurement is performed using hundreds of narrow con-
tiguous wavelength intervals, the resulting image is called a hyperspectral
image, which is often represented as a hyperspectral image cube (see Fig. 1)
(JPL, NASA). In this cube, the x and y axes specify the size of the images,
whereas the z axis denotes the number of bands in the hyperspectral data.
An almost continuous spectrum can be generated for a pixel and hence hy-
perspectral imaging is also referred to as imaging spectrometry. The detailed
spectral response of a pixel permits more accurate and precise extraction
of information than is possible with multispectral imaging. Reduction in the
cost of sensors as well as advances in data storage and transmission technolo-
gies have made the hyperspectral imaging technology more readily available.
Much like improvements in spectral resolution, spatial resolution has also been
dramatically increased by the installation of hyperspectral sensors on aircraft
(airborne imagery), opening the door for a wide array of applications.
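To make the cube representation concrete, the following minimal sketch (added here for illustration, not from the book; the dimensions and values are hypothetical) stores a hyperspectral image as a three-dimensional array and extracts the near-continuous spectrum of a single pixel:

```python
import numpy as np

# A hyperspectral cube: x (columns) and y (rows) give the image size,
# while z gives the number of spectral bands (hundreds for hyperspectral data).
rows, cols, bands = 512, 614, 224          # e.g., an AVIRIS-like scene
cube = np.random.rand(rows, cols, bands)   # placeholder for real radiance data

# The spectrum of one pixel is a 1-D slice along the band axis; plotted
# against wavelength it approximates a continuous spectral signature.
r, c = 100, 200
spectrum = cube[r, c, :]                   # shape: (224,)
print(spectrum.shape)
```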
Nevertheless, the processing of hyperspectral data remains a challenge since
it is very different from multispectral processing. Specialized, cost effective
and computationally efficient procedures are required to process hundreds
of bands acquiring 12-bit and 16-bit data. The whole process of hyperspectral
imaging may be divided into three steps: preprocessing, radiance to reflectance
transformation and data analysis.
Preprocessing is required for the conversion of raw radiance into at-sensor
radiance. This is generally performed by the data acquisition agencies and the
user is supplied with the at-sensor radiance data. The processing steps involve
operations like spectral calibration, geometric calibration and geocoding, sig-
nal to noise adjustment, and de-striping. Since the radiometric and geometric
accuracy of hyperspectral data varies significantly from one sensor to another,
users are advised to discuss these issues with the data-providing agencies
before the purchase of data.
Further, due to topographical and atmospheric effects, many spectral and
spatial variations may occur in at-sensor radiance. Therefore, the at-sensor
data need to be normalized in the second step for accurate determination of
the reflectance values in different bands. A number of atmospheric models
and correction methods have been developed to perform this operation. Since
the focus of this book is on data analysis aspects, the reader is referred to (van
der Meer 1999) for a more detailed overview of steps 1 and 2.
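For readers who want a feel for the second step, the sketch below illustrates one common normalization approach, the empirical line method, in which a per-band linear mapping from at-sensor radiance to surface reflectance is fitted from ground targets of known reflectance. It is illustrative only; the target reflectances and band count are assumptions, and operational correction relies on the atmospheric models cited above.

```python
import numpy as np

bands = 224
# At-sensor radiance of two field-calibration targets (bright and dark),
# and their reflectance as measured on the ground (hypothetical values).
radiance_bright = np.random.rand(bands) * 90 + 10
radiance_dark = np.random.rand(bands) * 5
refl_bright = np.full(bands, 0.6)
refl_dark = np.full(bands, 0.05)

# Fit reflectance = gain * radiance + offset independently in each band.
gain = (refl_bright - refl_dark) / (radiance_bright - radiance_dark)
offset = refl_dark - gain * radiance_dark

def to_reflectance(radiance_pixel):
    """Apply the per-band empirical line mapping to one pixel spectrum."""
    return gain * radiance_pixel + offset
```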
Part I: General
Chapter 1 provides a description of a number of hyperspectral sensors onboard
various aircraft and space platforms that have been, are, or soon will be in
operation. The spectral, spatial, temporal and radiometric characteris-
tics of these sensors have also been discussed, which may provide sufficient
guidance on the selection of appropriate hyperspectral data for a particular
application. A section on ground based spectroscopy has also been included
where the utility of laboratory and field based sensors has been indicated.
Laboratory and field measurements of spectral reflectance form an important
component of understanding the nature of hyperspectral data. These measure-
ments help in the creation of spectral libraries that may be used for calibration
and validation purposes. A list of some commercially available software pack-
ages and tools has also been provided. Finally, some application areas have
been identified where hyperspectral imaging may be used successfully.
An overview of basic image processing tasks that are necessary for multi
and hyperspectral data analysis is given in Chap. 2. Various sections in this
SVM has also been given due consideration and a separate section is written
to discuss the merits and demerits of existing optimization methods that have
been used in SVM classification.
Chapter 6 provides the theoretical setting of Markov random field (MRF)
models that have been used by statistical physicists to explain various phe-
nomena occurring among neighboring particles because of their ability to
describe local interactions between them. The concept of MRF model suits
image analysis because many image properties, such as texture, depend highly
on the information obtained from the intensity values of neighboring pixels,
as these are known to be highly correlated. As a result of this, MRF models
have been found useful in image classification, fusion and change detection
applications. A section in this chapter is devoted to a detailed discussion of
MRF and its equivalent form (i.e. Gibbs fields). Some approaches for the use
of MRF modeling are explained. Several widely used optimization methods
including simulated annealing are also introduced and discussed.
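As a flavour of what an MRF prior looks like in image analysis, the sketch below (illustrative only, not the formulation developed in Chap. 6) computes the Potts-model energy of a label image, which simply penalises disagreements between 4-neighbouring pixel labels; optimization methods such as simulated annealing seek labelings of low energy.

```python
import numpy as np

def potts_energy(labels: np.ndarray, beta: float = 1.0) -> float:
    """Potts MRF energy: beta times the number of unlike 4-neighbour pairs."""
    horiz = np.sum(labels[:, 1:] != labels[:, :-1])   # left-right disagreements
    vert = np.sum(labels[1:, :] != labels[:-1, :])    # up-down disagreements
    return beta * float(horiz + vert)

# A smooth two-class labeling has lower energy than a noisy one.
smooth = np.zeros((8, 8), dtype=int)
smooth[:, 4:] = 1
noisy = np.random.default_rng(3).integers(0, 2, size=(8, 8))
print(potts_energy(smooth), potts_energy(noisy))
```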
ship probability for each pixel. The complete ICAMM algorithm has been
explained in a simplified manner. Unsupervised classification of hyperspectral
data is performed using ICAMM and its performance is evaluated vis-à-vis the
most widely used K-means algorithm.
Chapter 10 describes the application of SVM for supervised classification of
multi and hyperspectral data. Several issues, which may have a bearing on the
performance of SVM, have been considered. The effect of a number of kernel
functions, multi class methods and optimization techniques on the accuracy
and efficiency of classification has been assessed.
The two classification algorithms, unsupervised ICAMM and supervised
SVM, discussed in Chaps. 9 and 10 respectively, are regarded as per pixel
classifiers, as they allocate each pixel of an image to one class only. Often the
images are dominated by mixed pixels, which contain more than one class.
Since a mixed pixel displays a composite spectral response, which may be
dissimilar to each of its component classes, the pixel may not be allocated to
any of its component classes. Therefore, error is likely to occur in the classi-
fication of mixed pixels, if per pixel classification algorithms are used. Hence,
sub-pixel classification methods such as fuzzy c-means, linear mixture mod-
eling and artificial neural networks have been proposed in the literature. In
Chap. 11, a novel method based on MRF models has been introduced for
sub-pixel mapping of hyperspectral data. The method is based on an opti-
mization algorithm whereby raw coarse resolution images are first used to
generate an initial sub-pixel classification, which is then iteratively refined to
accurately characterize the spatial dependence between the class proportions
of the neighboring pixels. Thus, spatial relations within and between pixels
are considered throughout the process of generating the sub-pixel map. The
implementation of the complete algorithm is discussed and it is illustrated by
means of an example.
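To illustrate the idea behind sub-pixel analysis, the sketch below (illustrative only, and not the MRF-based method of Chap. 11) solves the linear mixture model: the observed pixel spectrum is modelled as a weighted sum of pure class spectra (endmembers), with weights that are the class proportions. A sum-to-one constraint is imposed by augmenting the system; the non-negativity constraint is omitted for brevity.

```python
import numpy as np

# Endmember matrix E: one column of pure spectral signatures per class
# (hypothetical values; bands x classes).
bands, classes = 224, 3
rng = np.random.default_rng(0)
E = rng.random((bands, classes))

# A mixed pixel: 60% class 0, 30% class 1, 10% class 2, plus noise.
true_f = np.array([0.6, 0.3, 0.1])
pixel = E @ true_f + rng.normal(0, 0.01, bands)

# Least-squares unmixing with a sum-to-one row appended to the system.
A = np.vstack([E, np.ones((1, classes))])
b = np.append(pixel, 1.0)
fractions, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.round(fractions, 3))   # approximately [0.6, 0.3, 0.1]
```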
The spatial dependency concept of MRF models is further extended to
change detection and image fusion applications in Chap. 12. Image change
detection is one of the basic image analysis tools and is frequently used in
many remote sensing applications to quantify temporal information, whereas
image fusion is primarily intended to improve, enhance and highlight certain
features of interest in remote sensing images for extracting useful information.
Individual MRF model based algorithms for these two applications have been
described and illustrated through experiments on multi and hyperspectral
data.
References
Campbell JB (2002) Introduction to remote sensing, 3rd edition. Guilford Press, New York
DeFries RS, Chan JC (2000) Multiple criteria for evaluating machine learning algorithms for
land cover classification from satellite data. Remote Sensing of Environment 74: 503-515
Jensen JR (1996) Introductory digital image processing: a remote sensing perspective, 2nd
edition. Prentice Hall, Upper Saddle River, N.J.
JPL, NASA. AVIRIS image cube. http://aviris.jpl.nasa.gov/html/aviris.cube.html
1.1
Introduction
Since the beginning of remote sensing observation, scientists have created
a "toolbox" with which to observe the varying dimensions of the Earth's dy-
namic surface. Hyperspectral imaging represents one of the later additions
to this toolbox, emerging from the fields of aerial photography, ground spec-
troscopy and multi-spectral imaging. This new tool provides capacity to char-
acterise and quantify, in considerable detail, the Earth's diverse environments.
This chapter has two goals. The first is to chronicle the development of
hyperspectral remote sensing. In doing so, an overview of the past, present and
future ground, airborne and spaceborne sensors and their unique attributes
which render them most suited for specific applications is presented. The
second is to provide an overview of the applications where hyperspectral
remote sensing has, to date, made the greatest impact. Key areas considered
include the atmosphere, snow and ice, marine and freshwater environments,
vegetation, soils, geology, environmental hazards (e. g., fire) and anthropogenic
activity (e.g., land degradation).
1.2
Multi-spectral Scanning Systems (MSS)
Since the mid 1950s, a diverse range of airborne and spaceborne sensors have
recorded the reflectance of the Earth's surface in the spectral wavelength region
extending from ~ 400 to ~ 2500 nanometres (nm). Early sensors collected data
initially onto black and white and subsequently colour infrared film although
later, filters and multi-camera systems were developed for creating four "color"
image combinations (blue, green, red and near infrared (NIR)). With the
advent and advancement of more readily available computer technology and
the declassification of sensing technology developed for military use, a number
of civilian airborne (Table 1.1) and, later, spaceborne sensors were developed
and deployed. These initial digital sensors acquired data in broad, widely-
spaced spectral ranges which were referred to commonly as bands or channels.
Such multi-spectral scanning systems (MSS) have been, and continue to be,
the backbone of the optical data acquisition systems.
The LANDSAT Multispectral Scanner (MSS), for example, acquired data in four
broad bands, two in the visible and two in the NIR. The successors of the MSS, the LANDSAT
Thematic Mapper (TM on board LANDSAT 4-5) and Enhanced Thematic
Mapper (ETM+ on board LANDSAT 7) provided additional measurements in
the shortwave infrared (SWIR) region centred on 1600 nm and 2100 nm and
at a finer (30 m) spatial resolution and level of quantization. The provision
of a 15 m panchromatic band on the LANDSAT ETM+ also facilitated greater
resolving of ground features and opportunities for data fusion. The SPOT-
series of sensors, developed in France, were first launched in 1986. Initially,
the SPOT 1-3 High Resolution Visible (HRV) recorded reflectance (at 20 m
spatial resolution) in three visible and NIR (VNIR) channels, although the
subsequent High Resolution Visible InfraRed (HRVIR; SPOT 4) supported
a SWIR waveband. The launch of SPOT-5, in 2002, heralded a new era in
fine spatial resolution remote sensing, with the panchromatic (PAN) sensor
increasing in spatial resolution from 10 m to 5 m, with the potential to obtain 2.5 m
spatial resolution data through simultaneous observation. The resolution of
the SPOT-5 multispectral (XS) sensor also increased to 10 m (maintaining 20 m
in the SWIR region). The IRS series of spacecraft, which were first launched by
India in 1988, complemented the Landsat and SPOT-series, as several sensors
were in orbit, observing in four VNIR bands and at spatial resolutions of
approximately 23.5-36.25 m in XS mode (70 m in SWIR) and 6 m in PAN mode.
With time, these sensors have provided consistent historical data with which
to observe and quantify intra-annual, inter-annual or decadal changes occur-
ring on the Earth's surface. The addition of other sensors (e.g., IKONOS and
Quickbird) observing at a variety of spatial and temporal resolutions and also
band configurations and quantizations has served to extend the time-series and
to provide additional information on Earth surface processes and dynamics.
The continuation and development of these Earth observing programs is testa-
ment to the important role that data from these sensors play in understanding
the longer-term dynamics of both natural and anthropogenic change. This also
indicates the commitment of national and international organisations to Earth
observation for environmental monitoring.
A fundamental limitation of these multispectral sensors, however, is that
data are acquired in a few broad (typically 100-200 nm in width) irregularly
spaced spectral bands (van der Meer and de Jong 2001). As such, narrow spectral
features may not be readily discriminated and tend to be either averaged across
the spectral sampling range or masked by stronger proximal features (Kumar
et al. 2001), thereby resulting in a reduction or loss in the environmentally
relevant information that can potentially be extracted. This deficiency in the
spectral, radiometric, and spatial resolution of many multispectral systems
therefore renders them unsuitable for identifying many surface materials and
features, particularly minerals (Goetz 1995).
1.3
Hyperspectral Systems
In recognising the limitations of multispectral systems, an improved tech-
nology based on spectroscopy was developed whereby the limited number
of discrete spectral bands was enhanced by a series (nominally >50 bands)
of narrow (Full-Width-Half-Maximum; FWHM 2-20 nm) contiguous bands
(Aspinall et al. 2002) over the VNIR and SWIR wavelength regions. Similar
advances were also made in the thermal infrared (TIR) regions. Matrices of
spectral samples were then designed to be built up on a line-by-line basis to
form a two dimensional image (x- and y-axes), with a third dimension (z-
axis) holding the spectral data for each sample (pixel). These new imaging
spectrometers were subsequently able to retrieve near-laboratory quality re-
flectance spectra such that the data associated with each pixel approximated
the true spectral signature of a target material, with sufficiently high signal-
to-noise ratio (SNR) across the full contiguous wavelength range (nominally
400-2500 nm) to ultimately form a 3-dimensional datacube.
Within these hyperspectral images, molecular absorption and particle scat-
tering signatures of materials could be detected to unambiguously identify
and quantify the abundance of surface constituents (Buckingham et al. 2002).
Specific reflectance or absorption features could also be associated with differ-
ent minerals or even biological and chemical processes, thereby providing an
opportunity to better characterise surface environments and dynamics. With
this advancement, the era of hyperspectral imaging was developed with air-
borne sensors deployed initially albeit largely for research purposes. Several
spaceborne hyperspectral sensors have also been deployed although, to date,
the data have been used largely for technology demonstration and research.
The following sections chronicle the development of these airborne and
spaceborne hyperspectral sensors and subsequently consider the character-
istics of several, in the context of temporal, spatial, spectral and radiometric
resolutions, which render them unique for environmental applications.
1.3.1
Airborne sensors
Airborne hyperspectral remote sensing has been available since the early 1980s
(Table l.2). Early developments were characterized by small, purpose-built
sensors. Amongst the earliest of the scanning imaging spectrometers was a one-
dimensional profiler developed by the Geophysical Environmental Research
Company (GER) in 1981. This sensor gathered data in 576 channels over the
400-2500 nm wavelength range. In the same year, the Shuttle Multi-spectral
Infrared Radiometer (SMIRR) became operational and Canada's Department
of Fisheries and Oceans introduced the Fluorescence Line Imager (FLI). In 1983,
the Airborne Imaging Spectrometer (AIS) was first flown following a three year
period of development at NASA's Jet Propulsion Laboratory (Goetz 1995). The
principal driving force behind many of these initial developments came from
geological disciplines.
Table 1.2. Basic characteristics of airborne hyperspectral sensors

AAHIS (Advanced Airborne Hyperspectral Imaging Sensor). Available: 1994. Bands: 288. Spectral range: 432-832 nm. Band width at FWHM: 6 nm. References: (Hochberg and Atkinson 2003); STI Industries, www.sti-industries.com

AHI (Airborne Hyperspectral Imager). Available: 1994. Bands: 256. Spectral range: 7500-11700 nm. Band width: 100 nm. Notes: 12 bits. Reference: Hawaii Institute of Geophysics and Planetology, www.higp.hawaii.edu

AIS-1/2. Available: 1982-1985 and 1985-1987. Bands: 128. Spectral range: 900-2100 nm and 800-2400 nm. Band width: 9.3 nm and 10.6 nm respectively. Reference: (Vane et al. 1984)

AISA+ / AISA Eagle (Airborne Imaging Spectrometer for Different Applications). Available: 1997 and 2002. Bands: 244. Spectral range: 400-970 nm. Band width: 2.9 nm. Notes: AISA Eagle has a 1000 pixel swath, compared with 500 for AISA+. References: (Kallio et al. 2001); www.specim.fi

AISA Hawk. Available: 2003. Bands: 240. Spectral range: 1000-2400 nm. Band width: 8 nm. Reference: www.specim.fi

APEX (Airborne Prism Experiment). Available: 2005. Bands: programmable to a maximum of 300. Spectral range: 380-2500 nm. Band width: 10 nm. Notes: 312 spectral rows in the VNIR and 195 spectral rows in the SWIR. Reference: (Schaepman et al. 2003)

ASTER Simulator. Available: 1992. Bands: 1 (700-1000 nm, 300 nm width), 3 (3000-5000 nm, 600-700 nm), 20 (8000-12000 nm, 200 nm). Notes: 16 bits. References: (Mills et al. 1993); www.cis.rit.edu/class/simg707/Web_Pages/Survey_report.htm#_Toc404106559

AVIRIS (Airborne Visible/Infrared Imaging Spectrometer). Available: 1987. Bands: 224. Spectral range: 400-2450 nm. Band width: 9.4-16 nm. Notes: 10 bits until 1994, 12 bits from 1995. References: (Curran and Dungan 1989; Curran and Dungan 1990); makalu.jpl.nasa.gov/aviris.html

GER EPS-H. Notes: number of spectrometers, detector types and band placements are tailored to user requirements. Reference: GER EPS-H Airborne Imaging Spectrometer System Technical Description, 2003

HyMap. Available: 1996. Bands: 126. Spectral range: 450-2500 nm. Band width: 15-20 nm. Notes: 12-16 bits. References: (Cocks et al. 1998); Intspec (www.intspec.com); HyVista (www.hyvista.com)

MAIS (Modular Airborne Imaging Spectrometer). Available: 1991. Bands: 32 (450-1100 nm, 20 nm width), 32 (1400-2500 nm, 30 nm), 7 (8200-12200 nm, 400-800 nm). Notes: 12 bits. References: NASA (ltpwww.gsfc.nasa.gov/ISSSR-95/modulara.htm); (van der Meer et al. 1997; Yang et al. 2000)

MAS (MODIS Airborne Simulator). Available: 1993. Bands: 9 (529-969 nm, 31-55 nm width), 16 (1395-2405 nm, 47-57 nm), 16 (2925-5325 nm, 142-151 nm), 9 (8342-14521 nm, 352-517 nm). Notes: 12 bits (pre-1995: 8 bits). References: (King et al. 1996); mas.arc.nasa.gov

MIVIS (Multispectral Infrared and Visible Spectrometer). Available: 1993. Bands: 20 (433-833 nm, 20 nm width), 8 (1150-1550 nm, 50 nm), 64 (2000-2500 nm, 8 nm), 10 (8200-12700 nm, 400-500 nm). Notes: 12 bits. Reference: Sensytech (www.sensystech.com)

OMIS (Operative Modular Airborne Imaging Spectrometer). Available: 1999. Bands: 64 (460-1100 nm, 10 nm width), 16 (1060-1700 nm, 40 nm), 32 (2000-2500 nm, 15 nm), 8 (3000-5000 nm, 250 nm), 8 (8000-12500 nm, 500 nm). Notes: 12 bits. Reference: (Huadong et al. 2001)

PROBE-1. Bands: 100-200. Spectral range: 440-2543 nm. Band width: 11-18 nm. Reference: (McGwire et al. 2000)

ROSIS (Reflective Optics System Imaging Spectrometer). Available: 1993. Bands: 128 selectable spectral bands. Spectral range: 440-850 nm. Band width: 5 nm. Notes: 12 bits. References: (Kunkel et al. 1991; Su et al. 1997); DLR Institute of Optoelectronics, www.op.dlr.de/ne-oe/fo/rosis/home.html

SASI (Shortwave Infrared Airborne Spectrographic Sensor). Available: 2002. Bands: 160. Spectral range: 850-2450 nm. Band width: 10 nm. Notes: 14 bits. References: www.itres.com; telsat.belspo.be/documents/casi2003.html

SFSI (Short Wavelength Infrared Full Spectrum Imager). Available: 1994. Bands: 22-120. Spectral range: 1230-2380 nm. Band width: 10.3 nm. References: (Nadeau et al. 2002); Canada Center for Remote Sensing, www.ccrs.nrcan.gc.ca

VIFIS (Variable Interference Filter Imaging Spectrometer). Available: 1994. Bands: 64. Spectral range: 420-870 nm. Band width: 10.12 nm. Notes: 8 bits. References: (Sun and Anderson 1993; Gu et al. 1999); www.aeroconcepts.com/Tern/Instrument.html
1.3.1.1
Spectral Regions, Resolutions and Bidirectional Capability
Although hyperspectral remote sensing implies observation across the full
spectral region (i.e., 400-2500 nm), many sensors operate only within the
VNIR regions. As examples, the ASAS and CASI record reflectance in the
1.3.1.2
Spatial Resolution and Coverage
The spatial resolution of an observing sensor refers to the distance between the
nearest objects that can be resolved; it is given in units of length (e.g., meters) and
depends on the instantaneous field of view (IFOV). For many airborne sensors,
the spatial resolution is dictated largely by the flying height of the aircraft as
well as the configuration of the sensor and, in some cases, the aircraft platform
needs to be changed to achieve the required resolution (Vane et al. 1993). The
flying height of the aircraft also influences the width of the scan and hence the
extent of coverage. The lens optics and the integration time will similarly limit
the potential spatial resolution of the image, although data can be acquired at
spatial resolutions finer than 1 m.
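As a rough sketch of these relationships (the numbers below are hypothetical, not from the text): for a nadir-viewing scanner, the ground sample distance is approximately the flying height multiplied by the IFOV in radians, and the swath width is roughly that distance multiplied by the number of across-track samples.

```python
def ground_sample_distance(flying_height_m: float, ifov_mrad: float) -> float:
    """Approximate nadir pixel size: flying height times IFOV (small-angle form)."""
    return flying_height_m * ifov_mrad * 1e-3

# A 1 mrad IFOV flown at 2000 m gives roughly 2 m pixels; with 1000
# across-track samples the swath is roughly 2 km wide.
gsd_m = ground_sample_distance(2000.0, 1.0)
print(f"GSD ~ {gsd_m:.1f} m, swath ~ {gsd_m * 1000 / 1000.0:.1f} km")
```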
The level of detail able to be resolved at different spatial resolutions is
indicated in Fig. 1.1a-d, which compares data acquired by several airborne
and also spaceborne sensors over an area of subtropical woodland in central
Queensland, Australia. Using aerial photography, tree crowns can be resolved
easily through differentiation between photosynthetic vegetation (PV), non-
photosynthetic vegetation (NPV; e.g., branches) and soil background, and
different tree species can be distinguished. Using 1 m spatial resolution CASI
Fig. 1.1a-d. Observations of the same area of mixed species subtropical woodlands near
Injune, central Queensland, Australia, observed using a stereo colour aerial photography,
b CASI, c HyMap, and d Hyperion data at spatial resolutions of <1 m, 1 m, 2.8 m and 30 m
respectively. For a colored version of this figure, see the end of the book
data, tree crowns are still resolved and species can be differentiated although
within-canopy components cannot be isolated and pixels containing a mix of
ground and vegetation spectra are common. Even so, relatively pure spectra
relating to ground (e. g., bare soil) and vegetation (e. g., leaf components) can
be extracted. At ~2.5-5 m spatial resolution, which is typical of the HyMap
sensor, only larger tree crowns are discernible and, due to the openness of the
woodland canopy, most pixels contain a mix of ground and vegetation spectra.
At coarser (~5 m to >20 m) spatial resolutions, tree crowns cannot be differen-
tiated and the averaging of the signal is such that pure spectral signatures
for specific surfaces cannot be extracted from the imagery, and only broad
vegetation and surface categories can be distinguished. Sensors observing at
resolutions from 20 to 30 m include the airborne AVIRIS (Vane et al. 1993),
DAIS-7915 (Ben-Dor et al. 2002) and MODIS-ASTER simulator (MASTER) as
well as the spaceborne Hyperion. These observations illustrate the difficul-
ties associated with obtaining "pure" spectral reflectance data from specific
surfaces and materials, particularly in complex environments.
1.3.1.3
Temporal Resolution
The temporal resolution refers to the frequency of observation by sensors.
Until the advent of hyperspectral data from spaceborne sensors, observations
were obtained on an 'as needs' basis. Due primarily to limitations of cost, few
multitemporal datasets have been produced. As a result, the complex time-
series of hyperspectral dataset acquisitions targeting geology, soil, and plant
applications constructed primarily of AVIRIS, HyMap, and CHRIS data (1992-
current) over Jasper Ridge in California (Merton 1999) perhaps represents one
of the few comprehensive datasets available.
The lack of temporal observations from hyperspectral sensors has proved par-
ticularly limiting for understanding, mapping or monitoring dynamic environ-
ments that experience rapid or marked changes in, for example, seasonal leaf
cover and chemistry (e. g., temperate woodlands), water status (e. g., wetlands)
or snow cover (e. g., high mountains). In many environments, such as arid or
semi-arid zones, temporal changes are less significant which accounts partly
for the success of hyperspectral data in, for example, geological exploration
and mapping.
1.3.1.4
Radiometric Resolution
Radiometric resolution, or quantization, is defined as the sensitivity of a sen-
sor to differences in strength of the electromagnetic radiation (EMR) signal
and determines the smallest difference in intensity of the signal that can be
distinguished. In other words, it is the amount of energy required to increase
a pixel value by one count. In contrast to many MSS, which typically recorded
data at up to 8 bits, hyperspectral sensors have been optimised to record
data using at least 10 to 12 bits. As an example, AVIRIS recorded data in 10 bits
prior to the 1993 flight season and 12 bits thereafter (Vane et al. 1993). The
consequence of increasing the quantization has been to increase the sensitivity
of the sensor to variations in the reflected signal, thereby allowing more subtle
reflectance differences from surfaces to be detected and recorded. If the number
of quantization levels is too small then these differences can be lost. The acquisition of
data by sensors that support a larger number of quantization levels is therefore
important, especially when dealing with some of the applications discussed
later in this chapter (e. g., determination of vegetation stress and health and
discrimination of minerals).
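A quick worked example of what quantization means in practice (a sketch; the 0-1 full-scale range is an assumption for illustration): an n-bit sensor distinguishes 2^n levels, so the smallest recordable difference shrinks sharply as quantization increases.

```python
# Smallest recordable step for an n-bit sensor over a 0-1 reflectance range.
for bits in (8, 10, 12, 16):
    levels = 2 ** bits
    print(f"{bits}-bit: {levels} levels, step ~ {1.0 / levels:.6f}")
# 8 bits resolve steps of ~0.0039, while 12 bits resolve ~0.00024, so
# subtle reflectance differences lost at 8 bits are preserved at 12 bits.
```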
A fundamental requirement of hyperspectral remote sensing has been the
need to maximise the signal as opposed to the noise; in other words, the
SNR. The SNR is a measure of how the signal from surfaces compares to the
background values (i. e., noise) and is determined typically by estimating the
signal from pixel values averaged over a homogeneous target and dividing this
by the noise estimated from the standard deviation of the pixel values. SNR
vary considerably between sensors and spectral regions. Manufacturers of
CASI, for example, state that SNR values are greater than 480:1. Manufacturers
of HyMap state that SNR values are above 500:1 at 2200 nm, and 1000:1 in
the VNIR region. For the GER sensor, values of 5000:1 for the visible and
NIR regions and 500:1 for the SWIR region have been reported (Mackin and
Munday 1988). The SNR also varies depending upon the nature of the surface.
In a study of soils in Israel (Ben-Dor and Levin 2000), the SNR for sensors
varied from 20:1 to 120:1 for light targets and 1:1 to 5:1 for dark targets.
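The SNR estimation procedure described above translates directly into a few lines of code. This sketch assumes a homogeneous target has already been delineated as a rectangular region of a single-band image (all names, sizes and values are hypothetical):

```python
import numpy as np

def estimate_snr(band_image: np.ndarray, region: tuple) -> float:
    """SNR as mean over standard deviation of pixels on a homogeneous target."""
    r0, r1, c0, c1 = region
    target = band_image[r0:r1, c0:c1].astype(float)
    signal = target.mean()           # signal: average over the target
    noise = target.std(ddof=1)       # noise: standard deviation over the target
    return signal / noise

# Example with synthetic data: a uniform surface plus additive sensor noise.
rng = np.random.default_rng(1)
band = 1000.0 + rng.normal(0.0, 2.0, size=(300, 300))
print(f"SNR ~ {estimate_snr(band, (50, 100, 50, 100)):.0f}:1")   # ~500:1
```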
The development and use of airborne hyperspectral sensors has continued
as they provide flexibility in the acquisition of data, particularly in terms of
temporal frequency and spatial coverage. Specifically, data can be acquired
when conditions (e. g., local weather) are optimal, and parameters (e. g., the
area of coverage, waveband configurations and spatial resolutions) can be de-
termined prior to acquisition. Even so, the expense of acquiring data from
airborne sensors still limits the construction of multi-temporal datasets and
generally reduces coverage to relatively restricted areas. Many studies using
such data have therefore been developed for specific and often narrow applica-
tions. A further disadvantage of using airborne sensors has been that aircraft
motion and sun-surface-sensor geometries impact negatively on the quality
of the data acquired. For these reasons, there has been a drive to advance
spaceborne missions with hyperspectral capability.
1.3.2
Spaceborne sensors
1.3.2.1
Spectral Regions, Resolutions and Bidirectional Capability
Although not providing a continuous spectral profile, the MODIS and ASTER
image the spectral region from the visible through to the SWIR and TIR regions
and at variable sampling intervals. The MERIS, although observing in only the
390-1040 nm region and at band widths of 2.5-12.5 nm, is fully programmable
(in terms of width and location) such that observations can be made in up to 15
moveable bands with these selected through telecommands. This increases the
utility of the instrument as selected portions of the spectrum with contiguous
channels can be viewed. The ASTER sensor also has channels centred on
selected absorption features. Specifically, band 6 (SWIR) is centred on a clay
absorption feature (associated with hydrothermal alteration), whilst bands 8
(SWIR) and 14 (TIR) are centred on a carbonate feature, thereby allowing
discrimination of limestones and dolomites. Bands 10 to 12 (TIR) are designed
to detect sulphate and silica spectral features, whilst band 10 (with 6) allows
discrimination of common minerals, including alunite and anhydrite (Ellis
1999).
Only the Hyperion and CHRIS systems provide a continuous spectral profile
for each pixel value (Table 1.2). The Hyperion observes in 220 bands extending
from 400 to 2500 nm with a FWHM of 10 nm. The CHRIS observes in the
410-1050 nm range in 63 bands with bandwidths ranging variably from 1-3 nm
at the blue end of the spectrum to about 12 nm at the NIR end. The red-edge
region (~690-740 nm) is sampled at a bandwidth of approximately 7 nm.
Although a continuous spectral profile is useful, it can be argued that for
many applications, the information required can be easily extracted from
a few specific spectral bands, particularly since many are highly correlated
and often redundant. Feature extraction and selection therefore becomes an
important image processing issue under such circumstances. Aspects of over
sampling are discussed in more detail in Chapters 8 and 9.
A particular benefit of CHRIS is that multi-angle viewing can be achieved
in the five along track view angles. For example, using PROBA's agile steering
capabilities in both along and across track directions, observations of targets
outside the nominal field of view of 1.3° can be obtained. From CHRIS, surface
biophysical parameters are being estimated using a number of techniques
ranging from traditional vegetation indices to more advanced techniques such
as BRDF model inversion. MERIS also provides directional viewing capability.
1.3.2.2
Spatial Resolution and Coverage
The spatial resolution of spaceborne hyperspectral sensors is coarser compared
to most airborne sensors, due largely to their greater altitude. MODIS and
MERIS sensors provide data at spatial resolutions ranging from 250-300 m to
1000 m. Hyperion observes at 30 m spatial resolution whilst CHRIS acquires
data at spatial resolutions of 36 m and 19 m with 63 and 18 spectral bands
respectively. MODIS provides data of the same location at variable spatial
resolutions of 250 m-1000 m (Table 1.3). As suggested earlier, coarse spatial
resolution can limit the ability to extract pure spectral signatures associated
with different surfaces. Orbview 4 (launch failure) was to include a finer (8 m)
spatial resolution 200 band hyperspectral sensor.
The swath width, and hence the area of coverage, also varies between satel-
lite sensors. MODIS and MERIS have swath widths of 2330 km and 1150 km
respectively, allowing wide area coverage within single scenes.

Table 1.3. Basic characteristics of selected spaceborne hyperspectral and multispectral sensors

ASTER. Launch: Dec 1999. Platform: Terra (EOS). Bands: 14. Spectral range: 520-11650 nm. Band width at FWHM: 40-100 nm. Spatial resolution: 15 m (VNIR), 30 m (SWIR), 90 m (TIR). Reference: terra.nasa.gov

CHRIS. Launch: Oct 2001. Platform: ESA PROBA. Bands: up to 62. Spectral range: 410-1050 nm. Band width at FWHM: 5-12 nm. Spatial resolution: 18-36 m. Reference: www.chris-proba.org

Such data are
therefore more suitable for use in regional to global studies. Hyperion has
a nominal altitude of 705 km giving a narrow (7.5 km) swath width, whilst
CHRIS has a nominal 600 km orbit and images the Earth with a 14 km swath
(with a spatial resolution of 18 m). This coverage generally limits use to local
or landscape-scale studies.
1.3.2.3
Temporal Resolution
1.4
Ground Spectroscopy
Data acquired by airborne or spaceborne sensors cannot be considered in
isolation since effective data interpretation requires a detailed understanding
of the processes and interactions occurring at the Earth's surface. In this re-
spect, a fundamental component of understanding hyperspectral sensors is
the laboratory and field measurement of the spectral reflectance of different
surfaces. A number of portable field and laboratory spectroradiometers have
been developed for this purpose, ranging from the Milton spectroradiometer
to the more advanced spectroradiometers that include the Analytical Spectral
Devices (ASD) Fieldspec Pro FR and the IRIS spectroradiometer developed by
GER (Table 1.4).
Technological advances in a number of areas have led to improvements in
the field of spectroscopy. Advances in spectroradiometer technology, specifi-
cally with respect to the electro-optical systems, have resulted in an increase
in sensor sensitivity and a decrease in scan times, permitting greater data
collection in a shorter period of time. This has enabled researchers to acquire
high quality reflectance data rapidly, both in the field and under laboratory
conditions. A second advancement has been the increase in the processing
sophistication of computer technology and the reduction in the cost of data
storage. These technological improvements have also reduced the overall costs
so that a greater number of users are able to acquire portable spectrometers,
particularly for use in the field.
1.S
Software for Hyperspectral Processing
The datasets originating from hyperspectral sensors are complex and cannot
be adequately analyzed using the more traditional image processing packages.
Significant research has therefore been carried out to develop algorithms that
can be incorporated into commercially available software. At the time of writing,
two of the more commonly used packages for hyperspectral image processing
have been produced by Research Systems Inc. (RSI), U.S.A. and PCI Geomatics,
Canada. Other examples of software that state hyperspectral capability are
listed in Table 1.5. It should be noted that many of the algorithms used for
hyperspectral imaging applications presented in this book are of an advanced
nature and are still under research and development.
Table 1.5. Software with hyperspectral processing capability

Environment for Visualizing Images (ENVI): Research Systems Inc., USA
EASI-PACE: PCI Geomatics, Canada
Imagine: ERDAS, USA
Hyperspectral Product Generation System (HPGS): Analytical Imaging and Geophysics, USA
Hyperspectral Image Processing and Analysis System (HIPAS): Chinese Academy of Sciences, China
Spectral Image Processing System (SIPS): University of Colorado, USA
1.6
Applications
Although imaging spectrometry has been used in military applications (e. g.,
distinguishing between camouflage and actual vegetation) for many years,
the classified nature of the information has resulted in few published papers
regarding their origins (van der Meer and de Jong 2001). Therefore, as hy-
perspectral remote sensing data became available to the civilian community,
the early phases of analysis focused largely on understanding the information
content of the data over a disparate range of environments. Subsequently, the
development of algorithms specifically designed to more effectively manipu-
late the enormous quantities of data generated became a priority. As familiarity
with the data increased, the potential benefits of using hyperspectral imaging
became apparent. Today, hyperspectral data are increasingly used for applica-
tions ranging from atmospheric characterisation and climate research, snow
and ice hydrology, monitoring of coastal environments, understanding the
structure and functioning of ecosystems, mineral exploration and land use,
land cover and vegetation mapping. The following provides a brief overview
of these applications.
1.6.1
Atmosphere and Hydrosphere
Within the atmospheric sciences, hyperspectral remote sensing has been used
primarily to investigate the retrieval of atmospheric properties, thereby allow-
ing the development and implementation of correction techniques for airborne
and spaceborne remote sensing data (Curran 1994; Green et al. 1998b; Roberts
et al. 1998). In particular, certain regions of the electromagnetic spectrum
are sensitive to different atmospheric constituents. For example, absorption
of water vapour occurs most strongly at 820 nm, 940 nm, 1130 nm, 1380 nm
and 1880 nm (Gao and Goetz 1991) whilst CO2 absorption is prominent at
1600 nm but particularly at ~2080 nm. O2 absorption occurs at 760 nm and
1270 nm. Using this information, estimates of these constituents can be de-
rived using hyperspectral data. As an example, water vapour can be estimated
using the Continuum Interpolated Band Ratio (CIBR) and Narrow/Wide and
Atmospheric Pre-Corrected Differential Absorption (APDA), with these mea-
sures utilising differential absorption techniques and applied across specific
atmospheric water vapour absorption features (e. g. 940 and 1130 nm) (Rodger
and Lynch 2001). The EO-1 also supported an Atmospheric Corrector for facil-
itating correction for atmospheric water vapour, thereby optimising retrieval
of surface features. Coarser spatial resolution datasets have also been used to
characterise the atmosphere. For example, MODIS data have been used in the
measurement of cloud optical thickness, cloud top pressures, total precipitable
water and both coarse and fine aerosol contents in the atmosphere (Baum et
al. 2000; Seemann et al. 2003).
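To make the differential-absorption idea concrete, the following sketch computes a Continuum Interpolated Band Ratio for the 940 nm water vapour feature: the radiance in the absorption band is divided by a continuum value linearly interpolated from two flanking window bands. The window wavelengths and radiance values here are illustrative assumptions, and the conversion from CIBR to precipitable water (via a sensor-specific calibration curve) is omitted.

```python
def cibr(l_measure, l_left, l_right,
         wl_measure=940.0, wl_left=870.0, wl_right=1010.0):
    """Continuum Interpolated Band Ratio for a gaseous absorption feature.

    l_measure: radiance in the absorption band (e.g., 940 nm water vapour);
    l_left, l_right: radiances in atmospheric window bands on either side.
    """
    # Weights for linear interpolation of the continuum at wl_measure.
    w_left = (wl_right - wl_measure) / (wl_right - wl_left)
    w_right = (wl_measure - wl_left) / (wl_right - wl_left)
    continuum = w_left * l_left + w_right * l_right
    return l_measure / continuum

# Strong absorption pulls the 940 nm radiance below the continuum: CIBR < 1.
print(cibr(l_measure=48.0, l_left=80.0, l_right=76.0))
```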
For studies of the cryosphere, hyperspectral techniques have been used to
retrieve information relating to the physical and chemical properties of snow
including grain size, fractional cover, impurities, and snow mass liquid water
content (Dozier and Painter 2004). Few airborne studies have been undertaken
in high latitude regions due to the difficulty in acquiring data. Snow products
are also routinely and automatically derived from orbital platforms such as
MODIS, and include snow cover maps from individual scenes to spatial and
temporal composites (Hall et al. 1995; Klein and Barnett 2003).
Within marine and freshwater environments, hyperspectral data have been
used to characterize and map coral reefs and submerged aquatic vegetation
(Kutser et al. 2003; Malthus and Mumby 2003), including sea grasses (Fyfe
2003), and for quantifying sediment loads and water quality (Fraser 1998). At
the land-sea interface, applications have included characterization of aquatic
vegetation (Zacharias et al. 1992), salt marshes (Schmidt and Skidmore 2003;
Silvestri et al. 2003) and mangroves (Held et al. 2003). As an example, Zacharias
et al. (1992) were able to detect submerged kelp beds using airborne hyper-
spectral data at several spatial resolutions (Fig. 1.2). The benefits of using
hyperspectral CASI data for identifying and mapping different mangrove
species and communities have been highlighted in studies of Kakadu National
Park, northern Australia (Fig. 1.3). When the derived maps were compared in
a time-series with those generated from historical black and white and true
colour stereo aerial photography, changes in both the extent of mangroves
and their contained species and communities were evident (Mitchell 2004).
These changes suggested a long-term problem of saltwater intrusion as a result
of coastal environmental change. The study also emphasized the benefits of
using hyperspectral data at fine (<1 m) spatial resolution for characterizing and
mapping mangrove communities.
The application of remotely sensed data to the study of coral reefs was pro-
posed in the mid 1980s (Kuchler 1986) although difficulties associated with
wavelength-specific penetration of light in water, mixed pixels and atmospheric
attenuation caused initial disillusionment amongst some users (Green et al.
1996; Holden and LeDrew 1998). Even so, the science developed dramatically
in the 1990s, due partly to concerns arising from the impacts on coral reefs
Fig. 1.2. Example of detection of aquatic vegetation using airborne imaging spectrometers:
partially submerged kelp beds (bright red) as observed using a 2 m CASI and b 20 m AVIRIS
data. For a colored version of this figure, see the end of the book
Fig. 1.3. a Colour composite (837 nm, 713 nm and 446 nm in RGB) of fourteen 1 km x
~15 km strips of CASI data acquired over the West Alligator River mangroves, Kakadu
National Park, Australia by BallAIMS (Adelaide). The data were acquired at 1 m spatial
resolution in the visible and NIR wavelength (446 nm-838 nm) region. b Full resolution
image of the west bank near the river mouth with main species/communities indicated
(Mitchell 2004). For a coloured version of this figure, see the end of the book
1.6.2
Vegetation
1.6.2.1
Reflectance Characteristics of Vegetation
(Figure: typical spectral reflectance curve of a green leaf from 0.4 to 2.6 µm. Leaf pigments, cell structure and water content are the dominant factors controlling leaf reflectance in the visible, NIR and SWIR regions respectively; the primary absorption bands (chlorophyll in the visible, water in the SWIR) and the NIR plateau are marked, with reflectance (%) plotted against wavelength.)
that is scattered within the leaf is reflected back through the leaf surface, with
a proportion transmitted through the leaf (Kumar 1998). SWIR reflectance
is determined largely by moisture content (Kaufman and Remer 1994), with
water absorption features centred primarily at 2660 nm and 2730 nm (Kumar
et al. 2001). SWIR reflectance is also influenced by leaf biochemicals such as
lignin, cellulose, starch, proteins and nitrogen (Guyot et al. 1989; Kumar et al.
2001) but absorption is relatively weak and often masked by the more dominant
water absorption features (Kumar 1998).
Key descriptive elements of the vegetation spectral reflectance signatures
include the green peak, the chlorophyll well, the red-edge, the NIR plateau, and
the water absorption features (Kumar 1998). Of particular importance is the
red-edge, which is defined as the rise of reflectance at the boundary between
the chlorophyll absorption feature in red wavelengths and leaf scattering in
NIR wavelengths (Treitz and Howarth 1999). The red-edge, which displays
the greatest change in reflectance per change in wavelength of any green leaf
spectral feature in the VNIR regions (Elvidge 1985), is identified typically
using the red-edge inflection point (REIP) or point of maximum slope and is
located between 680 and 750 nm regardless of species (Kumar et al. 2001).
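Since the REIP is simply the wavelength of maximum slope between the red chlorophyll absorption and the NIR plateau, it can be located numerically from a reflectance spectrum by taking the first derivative, as in this sketch (the synthetic sigmoidal spectrum stands in for real data):

```python
import numpy as np

def red_edge_inflection(wavelengths_nm, reflectance, lo=680.0, hi=750.0):
    """Wavelength of maximum first derivative within the red-edge window."""
    slope = np.gradient(reflectance, wavelengths_nm)
    window = (wavelengths_nm >= lo) & (wavelengths_nm <= hi)
    idx = np.argmax(slope[window])
    return wavelengths_nm[window][idx]

# Synthetic green-leaf spectrum: a sigmoidal rise centred near 715 nm.
wl = np.arange(650.0, 801.0, 2.0)
refl = 0.05 + 0.45 / (1.0 + np.exp(-(wl - 715.0) / 10.0))
print(red_edge_inflection(wl, refl))   # ~715 nm
```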
Using hyperspectral data, key factors that determine the likelihood of re-
trieving a 'pure' reflectance spectra include the spatial resolution of the sen-
sor, the density of the canopy, the area and angle distribution of leaves and
branches, and the proportion of shade in the canopy (Adams et al. 1995). In
most cases, the spectral signature becomes mixed with that of other surfaces,
although the influence of multiple reflective elements within a pixel can be reduced.
1.6.2.2
Foliar Biochemistry and Biophysical Properties
A major focus of hyperspectral remote sensing has been the retrieval of foliage
chemicals, particularly those that play critical roles in ecosystem processes,
such as chlorophyll, nitrogen and carbon (Daughtry et al. 2000; Coops et al.
2001). Most retrievals have been based on simple, conventional stepwise or
multiple regression between chemical quantities (measured using wet or dry
chemistry based techniques) and reflectance (R) or absorbance (calculated as
log10(1/R)) data or band ratios of spectral reflectance, which themselves are
usually correlated to measures of green vegetation amount, cover or function-
ing (Gitelson and Merzlyak 1997). Modified partial least squares (MPLS), neu-
ral network, statistical methods (Niemann and Goodenough 2003) and model
inversion techniques (Demarez and Gastellu-Etchegorry 2000; Jacquemoud et
al. 2000) have also been used to assist retrieval. MPLS has proved particularly
useful as the information content of hundreds of bands can be concentrated
within a few variables, although the optimal predictive bands still need to
be identified using correlograms of spectra and biochemical concentration or
regression equations from the MPLS analysis.
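A minimal sketch of the conventional empirical approach (illustrative only; the data are synthetic and the chosen band positions are hypothetical): reflectance is converted to absorbance, log10(1/R), and the chemical concentration is regressed on absorbance in a few selected bands.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_bands = 40, 224

# Synthetic leaf reflectance spectra and a "measured" chemical concentration
# driven by two (hypothetical) absorption bands plus noise.
R = rng.uniform(0.05, 0.6, size=(n_samples, n_bands))
A = np.log10(1.0 / R)                        # absorbance
conc = 2.0 * A[:, 50] + 1.2 * A[:, 170] + rng.normal(0, 0.05, n_samples)

# Multiple linear regression on the selected bands (design matrix with intercept).
X = np.column_stack([np.ones(n_samples), A[:, 50], A[:, 170]])
coef, *_ = np.linalg.lstsq(X, conc, rcond=None)
print(np.round(coef, 2))   # approximately [0.00, 2.00, 1.20]
```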
Although a wide range of foliar chemicals exist in the leaf, hyperspectral
remote sensing has shown greatest promise for the retrieval of chlorophyll
a and b (Clevers 1994; Gitelson and Merzlyak 1997), nitrogen (Curran 1989;
Matson et al. 1994; Gastellu-Etchegorry et al. 1995; Johnson and Billow 1996),
carbon (Ustin et al. 2001), cellulose (Zagolski et al. 1996), lignin (Gastellu-
Etchegorry et al. 1995), anthocyanin, starch and water (Curran et al. 1992;
Serrano et al. 2000), and sideroxylonal-A (Blackburn 1999; Ebbers et al. 2002).
Other biochemicals which exhibit clearly identifiable absorption features and
are becoming increasingly important to the understanding of photosynthetic
and other leaf biochemical processes include: yellow carotenes and pale yellow
xanthophyll pigments (strong absorptions in the blue wavelengths), carotene
absorption (strong absorption at ~ 450 nm), phycocyanin pigment (absorbs
primarily in the green and red regions at ~ 620 nm) and phycoerythrin (strong
absorption at ~ 550 nm). However, chlorophyll pigments are dominant and
normally mask these pigments. During senescence or severe stress, the chloro-
phyll dominance may be lost, causing the other pigments to become dominant
(e. g. at leaf fall). Anthocyanin may also be produced in autumn causing leaves
to appear bright red when observed in visible wavelengths.
A number of studies have successfully retrieved foliar chemicals from air-
borne and spaceborne hyperspectral data. As examples, foliar nitrogen concen-
tration has been retrieved using AVIRIS, HyMap and Hyperion data (Johnson
1994; Matson et al. 1994; LaCapra et al. 1996; Martin and Aber 1997) and lignin,
among other biochemicals, has been quantified using AVIRIS data (Johnson
1994). Indices such as the Water Band Index (WBI) (Penuelas et al. 1997) and
the Normalised Difference Water Band Index (NDWBI) have also been used as
indicators of leaf and canopy moisture content (Ustin et al. 2001).
Despite successes with the retrieval of foliar chemicals, the process is com-
plicated by the distortion effects of the atmosphere, low SNR, complicated tree
canopy characteristics and the influence of the underlying soils and topog-
raphy. Many understorey species, which exhibit wide ranging foliar chemical
diversity, also cannot be observed. Even so, continued efforts at developing retrieval algorithms for quantifying foliar biochemicals are advocated given the importance
of many (e. g., N and C) in global cycles.
Key biophysical attributes that can be retrieved using hyperspectral data
include measures of foliage and canopy cover (e. g., Foliage Projected Cover (FPC) or Leaf Area Index (LAI)) (Spanner et
al. 1990a; Spanner et al. 1990b; Gong et al. 1995), the fraction of absorbed
photosynthetically active radiation (fAPAR) and also measures of canopy ar-
chitecture, including leaf angle distributions. Although woody attributes (e. g.,
branch and trunk biomass) cannot be sensed directly, these can often be in-
ferred (Asner et al. 1999). Approaches to retrieving biophysical properties
from spectral reflectance measurements include the integration of canopy ra-
diation models with leaf optical models (Otterman et al. 1987; Franklin and
Strahler 1988; Goel and Grier 1988; Ustin et al. 2001) or the use of empirical
relationships and vegetation indices (Treitz and Howarth 1999). Other spe-
cific techniques include derivative spectra, continuum removal, hierarchical
foreground/background analysis and SMA (Gamon and Qiu 1999), with most
associated with either physically-based canopy radiation models (Treitz and
Howarth 1999) or empirical spectral relationships (indices).
1.6.2.3
Vegetation Mapping
1.6.3
Soils and Geology
1.6.4
Environmental Hazards and Anthropogenic Activity
Hyperspectral remote sensing offers many opportunities for monitoring natu-
ral hazards such as bushfires, volcanic activity and observing anthropogenic-
induced activities and impacts such as acidification, land clearing and degra-
dation (de Jong and Epema 2001), biomass burning (Green et al. 1998a), water
pollution (Bianchi et al. 1995a; Bianchi et al. 1995b), atmospheric fallout of
dust and particulate matter emitted by industry (Ong et al. 2001) and soil
salinity (Metternicht and Zinck 2003). As an example, Chisholm (2001) indi-
cated that spectral indices could be used to detect vegetation water content at
the canopy level which could be used subsequently to assess fire fuel loading
and hence assist in the prediction of wildfires. Asner et al. (1999) also indicated
that observed variations in AVIRIS hyperspectral signatures were indicative
of the 3-dimensional variation in LAI and dry carbon and that simultaneous
observations of both LAI and dry carbon area index (NPVAI) allowed the pro-
duction of maps of both structural and functional vegetation types as well as
fire fuel load.
1.7
Summary
References
Aber JD, Bolster KL, Newman SD, Soulia M, Martin ME (1994) Analysis of forest foliage II:
Measurement of carbon fraction and nitrogen content by end-member analysis. Journal
of Near Infrared Spectroscopy 2: 15-23
Adams JB, Sabol DE, Kapos V, Filho RA, Roberts DA, Smith MO (1995) Classification
of multispectral images based on fractions of endmembers: application to land-cover
change in the Brazilian Amazon. Remote Sensing of Environment 52: 137-154
Adams JB, Smith MO (1986) Spectral mixture modeling: a new analysis of rock and soil
types at the Viking Lander 1 site. Journal of Geophysical Research 91(B8): 8098-8112
Ahn CW, Baumgardner MF, Biehl LL (1999) Delineation of soil variability using geostatistics
and fuzzy clustering analyses of hyperspectral data. Soil Science Society of America
Journal 63(1): 142-150
Asner GP, Heidebrecht KB (2002) Spectral unmixing of vegetation, soil and dry carbon cover
in arid regions: comparing multispectral and hyperspectral observations. International
Journal of Remote Sensing 23(19): 3939-3958
Asner GP, Townsend AR, Bustamante MMC (1999) Spectrometry of pasture condition and
biogeochemistry in the Central Amazon. Geophysical Research Letters 26(17): 2769-
2772
Aspinall RJ, Marcus WA, Boardman JW (2002) Considerations in collecting, processing, and
analysing high spatial resolution hyperspectral data for environmental investigations.
Journal of Geographical Systems 4: 15-29
Baugh WM, Kruse FA, Atkinson WW (1998) Quantitative geochemical mapping of ammo-
nium minerals in the southern Cedar Mountains, Nevada, using the Airborne Visible In-
frared Imaging Spectrometer (AVIRIS). Remote Sensing of Environment 65(3): 292-308
Baum BA, Kratz DP, Yang P, Ou SC, Hu YX, Soulen PF, Tsay SC (2000) Remote sensing of
cloud properties using MODIS airborne simulator imagery during SUCCESS 1. Data
and models. Journal of Geophysical Research-Atmospheres 105(D9): 11767-11780
Ben-Dor E, Banin A (1994) Visible and near-infrared (0.4-1.1 µm) analysis of arid and
semi-arid soils. Remote Sensing of Environment 48(3): 261-274
Ben-Dor E, Levin N (2000) Determination of surface reflectance from raw hyperspectral data
without simultaneous ground data measurements: a case study of the GER 63-channel
sensor data acquired over Naan, Israel. International Journal of Remote Sensing 21(10):
2053-2074
Ben-Dor E, Patkin K, Banin A, Karnieli A (2002) Mapping of several soil properties us-
ing DAIS-7915 hyperspectral scanner data - a case study over clayey soils in Israel.
International Journal of Remote Sensing 23(6): 1043-1062
Bianchi R, Castagnoli A, Cavalli RM, Marino CM, Pignatti S, Poscolieri M (1995a) Use of
airborne hyperspectral images to assess the spatial distribution of oil spilled during
the Trecate blow-out (Northern Italy). Remote Sensing for Agriculture, Forestry and
Natural Resources. E. T. Engman, Guyot, G. and Marino, C. M., SPIE Proceedings 2585,
pp 352-362
Bianchi R, Castagnoli A, Cavalli RM, Marino CM, Pignatti S, Zilioli E (1995b) Preliminary
analysis of aerial hyperspectral data on shallow lacustrine waters. Remote Sensing for
Agriculture, Forestry and Natural Resources. E. T. Engman, Guyot, G. and Marino, C.
M., SPIE Proceedings 2585, pp 341-351
Bierwirth P, Huston D, Blewett R (2002) Hyperspectral mapping of mineral assemblages
associated with gold mineralization in the Central Pilbara, Western Australia. Economic
Geology and the Bulletin of the Society of Economic Geologists 97(4): 819-826
Blackburn GA (1999) Relationships between spectral reflectance and pigment concentra-
tions in stacks of deciduous broadleaves. Remote Sensing of Environment 70(2): 224-237
Buckingham B, Staenz K, Hollinger A (2002) Review of Canadian airborne and space
activities in hyperspectral remote sensing. Canadian Aeronautics and Space Journal
48(1): 115-121
Carranza EJM, Hale M (2002) Mineral imaging with Landsat Thematic Mapper data for
hydrothermal alteration mapping in heavily vegetated terrain. International Journal of
Remote Sensing 23(22): 4827-4852
Cervelle B (1991) Application of mineralogical constraints to remote-sensing. European
Journal of Mineralogy 3(4): 677-688
Chabrillat S, Goetz AFH, Krosley L, Olsen HW (2002) Use of hyperspectral images in the
identification and mapping of expansive clay soils and the role of spatial resolution.
Remote Sensing of Environment 82(2-3): 431-445
Chica-Olmo M, Abarca F, Rigol JP (2002) Development of a Decision Support System based
on remote sensing and GIS techniques for gold-rich area identification in SE Spain.
International Journal of Remote Sensing 23(22): 4801-4814
Chisholm LA (2001) Characterisation and evaluation of moisture stress in E. camaldulensis
using hyperspectral remote sensing. Sydney, University of New South Wales
Christensen PR, Bandfield JL, Hamilton VE, Howard DA, Lane MD, Piatek JL, Ruff SW,
Stefanov WL (2000) A thermal emission spectral library of rock-forming minerals.
Journal of Geophysical Research-Planets 105(E4): 9735-9739
Clevers JPGW (1994) Imaging spectrometry in agriculture - plant vitality and yield indi-
cators. Imaging Spectrometry - a Tool for Environmental Observations. Hill JAM (ed),
Kluwer Academic Publishers, Dordrecht The Netherlands, pp 193-219.
Cochrane MA (2000) Using vegetation reflectance variability for species level classification
of hyper spectral data. International Journal of Remote Sensing 21(10): 2075-2087
Cocks T, Jenssen R, Stewart A, Wilson I, Shields T (1998) The HyMap™ airborne hyperspectral sensor: the system, calibration and performance. 1st EARSEL Workshop on
Imaging Spectroscopy, Zurich.
Cole MM (1991) Remote-sensing, geobotany and biogeochemistry in detection of Thalanga
zinc lead copper-deposit near Charters-Towers, Queensland, Australia. Transactions of
the Institution of Mining and Metallurgy Section B-Applied Earth Science 100: B1-B8
Coops NC, Smith M-L, Martin ME, Ollinger SV, Held A, Dury SJ (2001) Assessing the
performance of HYPERION in relation to eucalypt biochemistry: preliminary project
design and specifications. Proceedings of International Geoscience and Remote Sensing
Symposium (IGARSS 2001), CD
Culvenor DS (2002) TIDA: an algorithm for the delineation of tree crowns in high spatial
resolution remotely sensed imagery. Computers and Geosciences 28(1): 33-44
Curran PJ (1989) Remote-sensing of foliar chemistry. Remote Sensing of Environment 30(3):
271-278
Curran PJ (1994) Imaging spectrometry. Progress in Physical Geography 18(2): 247-266
Curran PJ, Dungan JL (1989) Estimation of signal-to-noise - a new procedure applied to
AVIRIS data. IEEE Transactions on Geoscience and Remote Sensing 27(5): 620-628
Curran PJ, Dungan JL (1990) An image recorded by the Airborne Visible Infrared Imaging
Spectrometer (AVIRIS). International Journal of Remote Sensing 11(6): 929-931
Curran PJ, Dungan JL, Macler BA, Plummer SE, Peterson DL (1992) Reflectance spectroscopy
of fresh whole leaves for the estimation of chemical concentration. Remote Sensing of
Environment 39(2): 153-166
Dabney PW, Irons JR, Travis JW, Kasten MS, Bhardwaj S (1994) Impact of recent enhance-
ments and upgrades of the Advanced Solid-state Array Spectroradiometer (ASAS).
Proceedings of International Geoscience and Remote Sensing Symposium (IGARSS),
Pasadena, US, Digest, 3, pp 1649-1651
Daughtry CST, Walthall CL, Kim MS, de Colstoun EB, McMurtrey JE (2000) Estimating
corn leaf chlorophyll concentration from leaf and canopy reflectance. Remote Sensing
of Environment 74(2): 229-239
de Jong SM, Epema GF (2001) Imaging spectrometry for surveying and modeling land
degradation. In: van der Meer FD, de Jong SM, Imaging spectrometry: basic principles
and prospective applications, Kluwer Academic Publishers, Dordrecht, The Netherlands,
pp 65-86
Goetz AFH (ed) (1995) Imaging spectrometry for remote sensing: vision to reality in 15
years. Proceedings of SPIE Society for Optical Engineers, Orlando, Florida
Goetz AFH, Herring M (1989) The High-Resolution Imaging Spectrometer (HIRIS) for EOS.
IEEE Transactions on Geoscience and Remote Sensing 27(2): 136-144
Gong P, Pu RL, Miller JR (1995) Coniferous forest leaf-area index estimation along the Oregon transect using Compact Airborne Spectrographic Imager data. Photogrammetric
Engineering and Remote Sensing 61(9): 1107-1117
Green EP, Clark CD, Mumby PJ, Edwards AJ, Ellis AC (1998a) Remote sensing techniques
for mangrove mapping. International Journal of Remote Sensing 19(5): 935-956
Green EP, Mumby PJ, Edwards AJ, Clark CD (1996) A review of remote sensing for the
assessment and management of tropical coastal resources. Coastal Management 24(1):
1-40
Green RO, Eastwood ML, Sarture CM, Chrien TG, Aronsson M, Chippendale BJ, Faust JA,
Pavri BE, Chovit CJ, Solis MS, Olah MR, Williams O (1998b) Imaging spectroscopy
and the Airborne Visible Infrared Imaging Spectrometer (AVIRIS). Remote Sensing of
Environment 65(3): 227-248
Gu Y, Anderson JM, Monk JGC (1999) An approach to the spectral and radiometric calibra-
tion of the VIFIS system. International Journal of Remote Sensing 20(3): 535-548
Guyot G, Guyon D, Riom J (1989) Factors affecting the spectral response of forest canopies:
a review. Geocarto International 4(3): 3-18
Hall DK, Riggs GA, Salomonson VV (1995) Development of methods for mapping global
snow cover using moderate resolution imaging spectroradiometer data. Remote Sensing
of Environment 54: 127-140
Held A, Ticehurst C, Lymburner L, Williams N (2003) High resolution mapping of tropical
mangrove ecosystems using hyperspectral and radar remote sensing. International
Journal of Remote Sensing 24(13): 2739-2759
Hochberg EJ, Atkinson MJ (2003) Capabilities of remote sensors to classify coral, algae, and
sand as pure and mixed spectra. Remote Sensing of Environment 85(2): 174-189
Hochberg EJ, Atkinson MJ, Andrefouet S (2003) Spectral reflectance of coral reef bottom-
types worldwide and implications for coral reef remote sensing. Remote Sensing of
Environment 85(2): 159-173
Holden H, LeDrew E (1998) The scientific issues surrounding remote detection of submerged
coral ecosystems. Progress in Physical Geography 22(2): 190-221
Hollinger AB, Gray LH, Gowere JFR, Edel H (1987) The fluorescence line imager: an imaging
spectrometer for land and ocean remote sensing. Proceedings of the SPIE, 834
Horig B, Kuhn F, Oschutz F, Lehmann F (2001) HyMap hyperspectral remote sensing to
detect hydrocarbons. International Journal of Remote Sensing 22(8): 1413-1422
Huadong G, Jianmin X, Guoqiang N, Jialing M (2001) A new airborne earth observing
system and its applications. International Geoscience and Remote Sensing Symposium
(IGARSS), Sydney, Australia, CD
Hunt GR (1989) Spectroscopic properties of rock and minerals. In: Carmichael RC (ed),
Practical handbook of physical properties of rocks and minerals, C.R.C. Press Inc., Boca
Raton, Florida, pp 599-669
Irons JR, Ranson KJ, Williams DL, Irish RR, Huegel FG (1991) An off-nadir-pointing imaging
spectroradiometer for terrestrial ecosystem studies. IEEE Transactions on Geoscience
and Remote Sensing 29(1): 66-74
Jacquemoud S, Bacour C, Poilve H, Frangi JP (2000) Comparison of four radiative transfer
models to simulate plant canopies reflectance-direct and inverse mode. Remote Sensing
of Environment 74: 741-781
LaCapra VC, Melack JM, Gastil M, Valeriano D (1996) Remote sensing of foliar chemistry
of inundated rice with imaging spectrometry. Remote Sensing of Environment 55(1):
50-58
Levesque J, Staenz K, Szeredi T (2000) The impact of spectral band characteristics on un-
mixing of hyperspectral data for monitoring mine tailings site rehabilitation. Canadian
Journal of Remote Sensing 26(3): 231-240
Longhi I, Sgavetti M, Chiari R, Mazzoli C (2001) Spectral analysis and classification of
metamorphic rocks from laboratory reflectance spectra in the 0.4-2.5 µm interval:
a tool for hyperspectral data interpretation. International Journal of Remote Sensing
22(18):3763-3782
Mackin S, Drake N, Settle J, Briggs S (1991) Curve shape matching, end-member selection
and mixture modeling of AVIRIS and GER data for mapping surface mineralogy and
vegetation communities. Proceedings of the 2nd JPL Airborne Earth Science Workshop,
JPL Publication, Pasadena, CA, pp 158-162
Mackin S, Munday TJ (1988) Imaging spectrometry in environmental science research and
applications - preliminary results from the analysis of GER II imaging spectrometer data - Australia and the USA, Department of Geological Sciences, University of Durham
Maeder J, Narumalani S, Rundquist DC, Perk RL, Schalles J, Hutchins K, Keck J (2002) Clas-
sifying and mapping general coral-reef structure using Ikonos data. Photogrammetric
Engineering and Remote Sensing 68(12): 1297-1305
Malthus TJ, Mumby PJ (2003) Remote sensing of the coastal zone: an overview and priorities
for future research. International Journal of Remote Sensing 24(13): 2805-2815
Martin ME, Aber JD (1997) High spectral resolution remote sensing of forest canopy lignin,
nitrogen, and ecosystem processes. Ecological Applications 7(2): 431-443
Matson P, Johnson L, Billow C, Miller J, Pu RL (1994) Seasonal patterns and remote spectral
estimation of canopy chemistry across the Oregon transect. Ecological Applications
4(2): 280-298
Mauser W (2003) The airborne visible/infrared imaging spectrometer AVIS-2 - multian-
gular and hyperspectral data for environmental analysis. International Geoscience and
Remote Sensing Symposium (IGARSS), Toulouse, France, CD
McGwire K, Minor T, Fenstermaker L (2000) Hyperspectral mixture modeling for quanti-
fying sparse vegetation cover in arid environments. Remote Sensing of Environment
72(3): 360-374
Merton RN (1999) Multi-temporal analysis of community scale vegetation stress with imag-
ing spectroscopy. Unpublished PhD Thesis, Department of Geography, The University
of Auckland.
Merton RN, Silver E (2000) Tracking vegetation spectral trajectories with multi-temporal
hysteresis models. Proceedings of the Ninth Annual JPL Airborne Earth Science Work-
shop, Jet Propulsion Laboratory, Pasadena, CA.
Metternicht GI, Zinck JA (2003) Remote sensing of soil salinity: potentials and constraints.
Remote Sensing of Environment 85(1): 1-20
Mills F, Kannari Y, Watanabe H, Sano M, Chang SH (1993) Thermal Airborne Multispectral
Aster Simulator and its preliminary-results. Remote Sensing of Earths Surface and
Atmosphere. 14: 49-58
Mitchell A (2004) Remote sensing techniques for the assessment of mangrove forest
structure, species composition and biomass, and response to environmental change.
School of Biological Earth Environmental Sciences. Sydney, University of New South
Wales.
Mumby PJ, Edwards AJ (2002) Mapping marine environments with IKONOS imagery:
enhanced spatial resolution can deliver greater thematic accuracy. Remote Sensing of
Environment 82(2-3): 248-257
Nadeau C, Neville RA, Staenz K, O'Neill NT, Royer A (2002) Atmospheric effects on the
classification of surface minerals in an arid region using Short-Wave Infrared (SWIR)
hyperspectral imagery and a spectral unmixing technique. Canadian Journal of Remote
Sensing 28(6): 738-749
Niemann KO (1995) Remote-sensing of forest stand age using airborne spectrometer data.
Photogrammetric Engineering and Remote Sensing 61(9): 1119-1127
Niemann KO, Goodenough DG (2003) Estimation of foliar chemistry in Western Hemlock
using hyperspectral data. In: Wulder M, Franklin S (eds), Remote sensing of forest
environments: concepts and case Studies, Kluwer Academic Publishers, Norwell, MA,
pp447-467
Ong C, Cudahy T, Caccetta M, Hick P, Piggott M (2001) Quantifying dust loading on man-
groves using hyperspectral techniques. International Geoscience and Remote Sensing
Symposium (IGARSS), Sydney, Australia, CD
Oppenheimer C, Rothery DA, Pieri DC, Abrams MJ, Carrere V (1993) Analysis of Airborne
Visible Infrared Imaging Spectrometer (AVIRIS) data of volcanic hot-spots. Interna-
tional Journal of Remote Sensing 14(16): 2919-2934
Otterman J, Strebel DE, Ranson KJ (1987) Inferring Spectral Reflectances of Plant-Elements
by Simple Inversion of Bidirectional Reflectance Measurements. Remote Sensing of
Environment 21(2): 215-228
Palacios-Orueta A, Ustin SL (1998) Remote sensing of soil properties in the Santa Monica
Mountains I. Spectral analysis. Remote Sensing of Environment 65(2): 170-183
Penuelas J, Pinol J, Ogaya R, Filella I (1997) Estimation of plant water concentration by
the reflectance water index WI (R900/R970). International Journal of Remote Sensing
18(13): 2869-2875
Perry CT (2003) Reef development at Inhaca Island, Mozambique: Coral communities and
impacts of the 1999/2000 Southern African floods. Ambio 32(2): 134-139
Ranson KJ, Irons JR, Williams DL (1994) Multispectral bidirectional reflectance of northern forest canopies with the Advanced Solid-state Array Spectroradiometer (ASAS). Re-
mote Sensing of Environment 47(2): 276-289
Resmini RG, Kappus ME, Aldrich WS, Harsanyi JC, Anderson M (1997) Mineral mapping
with HYperspectral Digital Imagery Collection Experiment (HYDICE) sensor-data at
Cuprite, Nevada, USA. International Journal of Remote Sensing 18(7): 1553-1570
Riegl B (2002) Effects of the 1996 and 1998 positive sea-surface temperature anomalies on
corals, coral diseases and fish in the Arabian Gulf (Dubai, UAE). Marine Biology 140(1):
29-40
Roberts DA, Batista G, Pereira J, Waller E, Nelson B (1998) Change identification using
multi temporal spectral mixture analysis: applications in eastern Amazonia. In: Elvidge
C, Ann Arbor LR, Remote sensing change detection: environmental monitoring appli-
cations and methods, Ann Arbor Press, Michigan, pp 137-161
Rodger A, Lynch MJ (2001) Determining atmospheric column water vapour in the 0.4-2.5 µm spectral region. Proceedings of the 10th JPL Airborne Earth Science Workshop,
Pasadena, CA, pp 321-330
Rowan LC, Crowley JK, Schmidt RG, Ager CM, Mars JC (2000) Mapping hydrothermally
altered rocks by analyzing hyperspectral image (AVIRIS) data of forested areas in the
Southeastern United States. Journal of Geochemical Exploration 68(3): 135-166
Sakami T (2000) Effects of temperature, irradiance, salinity and inorganic nitrogen concen-
tration on coral zooxanthellae in culture. Fisheries Science 66(6): 1006-1013
Schaepman ME, Itten KI, Schlapfer D, Kaiser JW, Brazile J, Debruyn W, Neukom A, Feusi
H, Adolph P, Moser R, Schilliger T, De Vos L, Brandt G, Kohler P, Meng M, Piesbergen
J, Strobl P, Gavira J, Ulbrich G, Meynart R (2003) Status of the airborne dispersive
pushbroom imaging spectrometer APEX (Airborne Prism Experiment). International
Geoscience and Remote Sensing Symposium (IGARSS), Toulouse, France, CD
Schmidt KS, Skidmore AK (2003) Spectral discrimination of vegetation types in a coastal
wetland. Remote Sensing of Environment 85(1): 92-108
Schreier H, Wiart R, Smith S (1988) Quantifying organic-matter degradation in agricultural
fields using PC-based image-analysis. Journal of Soil and Water Conservation 43(5):
421-424
Seemann SW, Li J, Menzel WP, Gumley LE (2003) Operational retrieval of atmospheric
temperature, moisture, and ozone from MODIS infrared radiances. Journal of Applied
Meteorology 42(8): 1072-1091
Serrano L, Ustin SL, Roberts DA, Gamon JA, Penuelas J (2000) Deriving water content of
chaparral vegetation from AVIRIS data. Remote Sensing of Environment 74(3): 570-581
Silvestri S, Marani M, Marani A (2003) Hyperspectral remote sensing of salt marsh vege-
tation, morphology and soil topography. Physics and Chemistry of the Earth 28(1-3):
15-25
Spanner MA, Pierce LL, Peterson DL, Running SW (1990a) Remote-sensing of temperate
coniferous forest Leaf-Area Index - the influence of canopy closure, understory veg-
etation and background reflectance. International Journal of Remote Sensing 11(1):
95-111
Spanner MA, Pierce LL, Running SW, Peterson DL (1990b) The seasonality of AVHRR data
of temperate coniferous forests - relationship with Leaf-Area Index. Remote Sensing of
Environment 33(2): 97-112
Strobl P, Nielsen A, Lehmann F, Richter R, Mueller A (1996) DAIS system performance, first
results from the 1995 evaluation campaigns. 2nd International Airborne Remote Sensing
Conference and Exhibition, San Francisco, USA.
Su Z, Troch PA, DeTroch FP (1997) Remote sensing of bare surface soil moisture using
EMAC/ESAR data. International Journal of Remote Sensing 18(10): 2105-2124
Sun XH, Anderson JM (1993) A spatially-variable light-frequency-selective component-
based, airborne pushbroom imaging spectrometer for the water environment. Pho-
togrammetric Engineering and Remote Sensing 59(3): 399-406
Thenkabail PS, Smith RB, de Pauw E (2000) Hyperspectral vegetation indices and their
relationships with agricultural crop characteristics. Remote Sensing of Environment
71: 158-182
Ticehurst C, Lymburner L, Held A, Palylyk C, Martindale D, Sarosa W, Phinn S, Stanford M
(2001) Mapping tree crowns using hyperspectral and high spatial resolution imagery.
Proceedings of 3rd International Conference on Geospatial Information in Agriculture
and Forestry, Denver, Colorado, CD
Treitz PM, Howarth PJ (1999) Hyperspectral remote sensing for estimating biophysical
parameters of forest ecosystems. Progress in Physical Geography 23(3): 359-390
Ustin SL, Zarco-Tejada PJ, Asner GP (2001) The role of hyperspectral data in understanding the global carbon cycle. Proceedings of the 10th JPL Airborne Earth Science Workshop,
Pasadena, CA, pp 397-410
van der Meer F, Bakker W (1998) Validated surface mineralogy from high-spectral resolution
remote sensing: a review and a novel approach applied to gold exploration using AVIRIS
data. Terra Nova 10(2): 112-119
van der Meer F, Fan LH, Bodechtel J (1997) MAIS imaging spectrometer data analysis for Ni-
Cu prospecting in ultramafic rocks of the Jinchuan group, China. International Journal
of Remote Sensing 18(13): 2743-2761
van der Meer FD, de Jong SM (2001) Introduction. In: van der Meer FD, de Jong SM (eds),
Imaging spectrometry: basic principles and prospective applications, Kluwer Academic
Publishers, Dordrecht, The Netherlands, pp xxi-xxiii
van der Meer FD, de Jong SM, Bakker W (eds) (2001) Imaging spectrometry: basic analyt-
ical techniques. Imaging spectrometry: basic principles and prospective applications,
Kluwer Academic Publishers, Dordrecht, The Netherlands
Vane G, Goetz AFH, Wellman JB (1984) Airborne imaging spectrometer - a new tool for
remote-sensing. IEEE Transactions on Geoscience and Remote Sensing 22(6): 546-549
Vane G, Green RO, Chrien TG, Enmark HT, Hansen EG, Porter WM (1993) The Airborne
Visible Infrared Imaging Spectrometer (AVIRIS). Remote Sensing of Environment 44(2-
3): 127-143
Vos RJ, Hakvoort JHM, Jordans RWJ, Ibelings BW (2003) Multiplatform optical monitor-
ing of eutrophication in temporally and spatially variable lakes. Science of the Total
Environment 312(1-3): 221-243
Yang H, Zhang J, van der Meer F, Kroonenberg SB (2000) Imaging spectrometry data
correlated to hydrocarbon microseepage. International Journal of Remote Sensing 21(1):
197-202
Yentsch CS, Yentsch CM, Cullen JJ, Lapointe B, Phinney DA, Yentsch SW (2002) Sunlight and
water transparency: cornerstones in coral research. Journal of Experimental Marine
Biology and Ecology 268(2): 171-183
Zacharias M, Niemann KO, Borstad G (1992) An assessment and classification of a multispectral bandset for the remote sensing of intertidal seaweeds. Canadian Journal of
Remote Sensing 18(4): 263-273
Zagolski F, Pinel V, Romier J, Alcayde D, Fontanari J, Gastellu-Etchegorry JP, Giordano
G, Marty G, Mougin E, Joffre R (1996) Forest canopy chemistry with high spectral
resolution remote sensing. International Journal of Remote Sensing 17(6): 1107-1128
Zhou Y, Yan G, Zhou Q, Tang S (2003) New airborne multi-angle high resolution sensor
AMTIS LAI inversion based on neural network. International Geoscience and Remote
Sensing Symposium (IGARSS), Toulouse, France, CD
CHAPTER 2
Overview of Image Processing
Raghuveer M. Rao, Manoj K. Arora
2.1
Introduction
The current mode of image capture in remote sensing of the earth by aircraft or
satellite based sensors is in digital form. The pixels correspond to localized spa-
tial information while the quantization levels in each spectral band correspond
to the quantized radiometric measurements. It is most logical to regard each
image as a vector array, that is, the pixels are arranged on a rectangular grid
but the value of the image at each pixel is a vector whose elements correspond
to radiometric levels (also known as intensity values or digital numbers) of the
different bands. The image formed for each band is a monochrome image and
for this reason, we refer to the quantized values of such an image as gray levels.
The images may be acquired in a few bands (i.e. multi-spectral image) or in
hundreds of bands (i. e. hyperspectral image). Image processing operations
for both multispectral and hyperspectral images can therefore be either scalar
image oriented, that is, each band is processed separately as an independent
image, or vector image oriented, where the operations take into account the
vector nature of each pixel. Image processing can take place at several different
levels. At its most basic level, the processing enhances an image or highlights
specific objects for the analyst to view. At higher levels, processing can take on
the form of automatically detecting objects in the image and classifying them.
The aim of this chapter is to provide an overview of image processing of mul-
tispectral and hyperspectral data from the basic to the advanced techniques.
While a few excellent books are available on the subject (Jensen 1996; Mather
1999; Richards and Jia 1999; Campbell 2002), the purpose here is to provide
a ready reference to go with the other chapters of the book. It will be seen that
the bulk of processing techniques for image enhancement are those that have
been developed for consumer and document image processing and are thus
generic. They do not differ substantially in their applicability as a function
of the imaging platform, be it hyperspectral or otherwise, and many of the
illustrations in this chapter use images in the visible band.
2.2
Image File Formats
Digital images are organized into records. Appropriate header data indicate
the organization. The earliest storage medium for remotely sensed images
was a magnetic tape. Consequently, several image format schemes are tailored
towards the sequential access mode of this medium. With the increased use of
optical storage media, some of the schemes have fallen out of favor. Because
hyperspectral data are gathered in terms of spatial location, in the form of
pixel coordinates, and spectral bands (i. e. gray values of the pixel in different
bands), image format schemes typically differ in how they handle data with
respect to location and bands.
The band interleaved by pixel format (BIP) is pixel oriented in the sense that
all data corresponding to a pixel are first stored before moving on to the next
pixel. Thus, the first record corresponds to the first pixel of the first band of
the first scan line. The second record corresponds to the first pixel of the first
line but data in band 2 and so on. After data from all the bands are recorded
sequentially, we move on to pixel 2 of the first line.
In the band interleaved by line format (BIL) the data are organized in line-
oriented fashion. The first record is identical to that of the BIP image. However,
the second record corresponds to the second pixel of band 1 of the first line.
Once all pixel values of band 1 of line 1 have been recorded, we begin recording data of pixel 1 of band 2 followed by pixel 2 of band 2 and so on. We begin
recording of line 2 data after completion of recording of all line 1 data.
The band sequential format (BSQ) requires storing of the data of each band
as a separate file. This format is most useful when one might process just one
or a few bands of data. With the availability of random access capability in
optical storage media, such a separation into band-based image files does not
provide a substantial increase in search speed over the other formats when one
is interested in a small number of bands.
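To make the three interleaving conventions concrete, the sketch below (a minimal NumPy example; the file name, dimensions and data type are assumed purely for illustration) shows how the same flat stream of gray values is indexed under each layout. BSQ is shown here as band-sequential blocks within a single file, although in practice each band is often written to its own file.

```python
import numpy as np

# Assumed scene dimensions and file name, for illustration only.
lines, pixels, bands = 100, 120, 50
flat = np.memmap("scene.raw", dtype=np.uint16, mode="r",
                 shape=(lines * pixels * bands,))

# The same flat stream interpreted under each interleave convention:
bip = flat.reshape(lines, pixels, bands)   # BIP: band index varies fastest
bil = flat.reshape(lines, bands, pixels)   # BIL: one band's line at a time
bsq = flat.reshape(bands, lines, pixels)   # BSQ: one complete band at a time

# Gray value of pixel 20 on line 10 in band 5 under each layout:
v_bip = bip[10, 20, 5]
v_bil = bil[10, 5, 20]
v_bsq = bsq[5, 10, 20]
```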
A versatile data format called the hierarchical data format (HDF) has been
developed by the National Center for Supercomputing Applications with the
aim of facilitating storage and exchange of scientific data. This format provides
for different data models and supports single file and multifile modes. The fol-
lowing different data structures are supported in HDF4: raster images, color
palettes, texts, scientific data sets (which are multidimensional arrays), Vdata
(2D arrays of different data types) and Vgroup (a structure for associating
sets of data objects). In the newer HDF5, only two fundamental structures are
supported: groups and dataspaces. HDF-EOS an extension ofHDF5 developed
by NASA for the Earth Observing Systems, additional structures are used.
These are point, swath and grid. Some advantages of HDF are: portability to
platform independence, self-documentation, efficient storage and access, ac-
commodation of multiple data types within the same file and ease of extending
the format to suit specific needs (Folk et al. 1999; Ullman 1999).
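As a minimal sketch of navigating such a group hierarchy, the following example (using the h5py package; the file name is hypothetical) walks an HDF5 file and reports every dataset with its shape and type.

```python
import h5py

# Hypothetical file name; any HDF5 file can be inspected this way.
with h5py.File("scene.h5", "r") as f:
    def report(name, obj):
        # Datasets are the multidimensional arrays; groups form the hierarchy.
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(report)
```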
2.3
Image Distortion and Rectification
Obviously, the ideal image is an accurate reproduction of ground reflectance
in the various spectral bands. However, there are various factors that make
this impossible. Thus, most recorded images suffer from various degrees of
distortion. There are two main types of distortion, radiometric and geometric.
The former is the inaccurate translation of ground reflectance to gray values,
typically arising due to atmospheric interference and/or sensor defects. The
latter refers to shape and scale distortions owing usually to errors due to
perspective projection and instrumentation. We provide an overview of these
distortions along with schemes used to rectify the errors.
2.3.1
Radiometric Distortion
This refers to inaccurate representation of relative brightness levels (or intensity
values) either in a given band across pixels or across bands for a given pixel. The
ideal measured brightness by a sensor at a pixel in a given band is proportional
to the product of the surface reflectance and the sun's spectral irradiance
assuming uniform surface reflectance over the area contributing to the pixel.
Thus the ideal image captures the relative brightness as measured at the ground.
However, the intervening atmosphere between the ground and a space based
sensor causes inaccuracies in the recording of the relative levels.
Atmosphere induced radiometric distortion is due primarily to scattering of
electromagnetic radiation by either the air molecules or suspended particles.
Part of the down-going radiation from the sun is scattered back and away from
the earth by the atmosphere itself. Also, apart from the radiation reflected by
the area covered by the pixel of interest, radiation reflected from neighboring
regions is scattered by the atmosphere into the pixel of interest. The main effect
of both types of scattering is loss of image detail. It also causes inaccurate
representation of radiometric levels across the bands due to the wavelength
dependent nature of scattering.
Correction of atmospheric distortion can be done in either a detailed or
gross fashion. Detailed correction requires information regarding the incident
angle of the sunlight and atmospheric constituents at the time of imaging. The
scattering effects are then modeled into atmospheric attenuation as a function
of path length and wavelength. There are a number of atmospheric models
that allow at-sensor image data to be converted into ground reflectance. Some of the methods are flat field correction, internal average
relative reflectance and empirical line method (Rast et al. 1991; Smith and
Milton 1999). Because detailed atmospheric information is hard to obtain, gross
or bulk corrections are more common. Bulk brightness correction attempts
mainly to compensate for at-sensor radiance or path radiance. Path radiance
effect for a given pixel refers to interfering scatter from the atmosphere as well
as from other pixels. A typical approach, referred to as dark object subtraction,
is to subtract the lowest gray value of each band from the values of other pixels
in that band. The lowest gray value in each band provides an estimate of the
contribution to each pixel's gray value in that band from neighboring pixels.
In the visible region, atmospheric scattering injects extraneous radiation at
higher frequencies. Consequently the bulk correction just described tends to
reduce the blue component.
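A minimal sketch of dark object subtraction, assuming the scene is held as a NumPy array in band-sequential order, is given below; each band's minimum gray value serves as the estimate of the additive path-radiance contribution.

```python
import numpy as np

def dark_object_subtraction(cube):
    """cube: (bands, rows, cols) array of gray values.
    Subtract each band's minimum as a bulk path-radiance estimate."""
    dark = cube.reshape(cube.shape[0], -1).min(axis=1)
    return cube - dark[:, None, None]
```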
The radiometric distortion may also occur due to the image capture system
itself. If the response of the detector is a nonlinear function of the input
radiance, it results in nonlinear distortion. Another source of instrumentation
error in multi-detector cameras is mismatch of detector characteristics, that is,
each detector may have a slightly different transfer characteristic (even if linear)
and offset. Offset refers to the fact that each detector outputs a different dark
current, which is the current from a radiation-sensing device even when there
is no radiation impinging on it. Yet another common source of contribution
of instruments to radiometric distortion is sensor noise, which shows up as
random deviations from the true intensity values.
Radiometric distortion due to detector mismatch in multidetector systems
typically shows up as a striping artifact. The correction usually performed is
that of adjusting the output gray values of the sensors to match, in mean and
variance to, that of one chosen as a reference. For example, let Pi; i = 1, ... , N,
be the respective means of the outputs of N sensors and 0i; i = 1, ... , N, be the
standard deviations. Suppose we choose sensor 1 as the reference. Then, if Ii
denotes the image from the ith sensor, the corrections are implemented as
(2.1)
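The sketch below applies the correction of (2.1) under the assumption that detector i contributed every Nth image line starting at line i, as in a multi-detector line scanner; the interleaving pattern of an actual sensor may differ.

```python
import numpy as np

def destripe(image, n_detectors):
    """Match each detector's mean and standard deviation to detector 1."""
    out = image.astype(np.float64)
    ref = out[0::n_detectors]                  # lines from the reference detector
    mu1, s1 = ref.mean(), ref.std()
    for i in range(1, n_detectors):
        rows = out[i::n_detectors]             # lines from detector i + 1
        mui, si = rows.mean(), rows.std()
        out[i::n_detectors] = (s1 / si) * (rows - mui) + mu1   # Eq. (2.1)
    return out
```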
2.3.2
Geometric Distortion and Rectification
The rotation of the Earth beneath the satellite as the satellite traverses causes
images to be rhombus-shaped when mapped back to the imaged region. In
other words, even if the acquired image is treated as square from the point of
view of storage and display, the actual region of the Earth that it corresponds
to is a rhombus. This is when the area covered in the image is small enough to
ignore the Earth's curvature. There is a further distortion introduced when this
curvature cannot be ignored. This happens with large swath widths on high
altitude satellites. The ground pixel size is larger at a swath edge than in the middle; this happens because of the geometry depicted in Fig. 2.1. Given that
the angular field of view (FOV) of space imaging systems is constant, there is
another type of geometric distortion called panoramic distortion, which refers
to pixel sizes being larger at the edge of a scan than at the nadir. This occurs
because the footprint of the FOV directly underneath the air or space platform
is smaller than that several hundred meters away owing to the fact that the
footprint size is proportional to the tangent of the scan angle rather than to
the scan angle itself.
The above distortions are systematic in the sense that the errors are pre-
dictable. For example, once a satellite's glide path is known, it is easy to predict
the square to rhombus transformation due to the Earth's rotation. Often, the
agency-supplied data are systematically corrected. However, there are also non-
systematic distortions. These arise mainly due to unpredictable occurrences
such as random variations in platform velocity, altitude and attitude. Obvi-
ously, such changes result in changing the FOV thus leading to distortion.
A common approach to correcting shape distortions is to geometrically
transform the coordinates of the image to coincide with ground control points
(GCP). A geo-referenced data set such as a map of the imaged area or an
actual ground survey, conducted these days using global positioning systems
(GPS), is required to collect the GCP. A GCP is a point on the reference data
set that is also identifiable in the recorded image. For example, this could be
a mountain peak or a cape or human-made features such as road intersections.
Several GCPs are identified and their geographical coordinates are noted.
A coordinate transformation is determined to relate the image coordinates to the true coordinates,

$(x_I, y_I) = R(x_T, y_T) .$   (2.2)

Here R is the rectification mapping that relates the true coordinates (subscripted T) to the corresponding image coordinates (subscripted I). The simplest coordi-
nate transformation that one can choose is linear or affine. While this works
for simple distortions such as the square to rhombus alluded to above, it is in-
sufficient for other distortions. Polynomial transformations are more general.
For example, a second order transformation has the form
$\begin{bmatrix} x_I \\ y_I \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} \begin{bmatrix} x_T^2 \\ x_T y_T \\ y_T^2 \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \begin{bmatrix} x_T \\ y_T \end{bmatrix} + \begin{bmatrix} c_{11} \\ c_{21} \end{bmatrix}$   (2.3)
Each GCP yields two linear equations, one for each coordinate, so choosing exactly six GCPs allows the twelve constants in (2.3) to be determined through 12 simultaneous linear equations. If we choose more GCPs we will have an overdetermined set of linear equations for which we can get a least squares solution.
Once the parameters of the mapping have been determined, we need to
adjust the intensity values in the rectified image. This is because pixel centers
in the original image do not necessarily map to pixel centers in the rectified
image. The widely used nearest neighbor resampling technique simply chooses
the gray value of that pixel of the original image whose center is closest to the
new pixel center. Other graphical interpolation techniques such as bilinear or
cubic interpolation may also be used (Richards and Jia 1999).
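The sketch below fits the second-order transformation of (2.3) to a set of GCPs by least squares and resamples with the nearest neighbor rule; the GCP coordinates are invented for illustration.

```python
import numpy as np

# Invented GCPs: (x, y) in true (map) coordinates and the matching
# positions located in the distorted image.
true_xy = np.array([[0., 0.], [100., 0.], [0., 100.],
                    [100., 100.], [50., 25.], [25., 75.]])
img_xy = np.array([[3., 2.], [105., 8.], [-4., 102.],
                   [99., 110.], [52., 28.], [22., 80.]])

def terms(xy):
    # The six terms of the second-order transformation (2.3).
    x, y = xy[:, 0], xy[:, 1]
    return np.column_stack([x**2, x * y, y**2, x, y, np.ones_like(x)])

A = terms(true_xy)
cx, *_ = np.linalg.lstsq(A, img_xy[:, 0], rcond=None)  # six constants for x_I
cy, *_ = np.linalg.lstsq(A, img_xy[:, 1], rcond=None)  # six constants for y_I

def rectify_nearest(image, out_shape):
    """For each pixel of the rectified grid, evaluate the fitted mapping to
    find its position in the distorted image and copy the nearest gray value."""
    rows, cols = np.indices(out_shape)
    t = terms(np.column_stack([cols.ravel(), rows.ravel()]).astype(float))
    sx = np.rint(t @ cx).astype(int)
    sy = np.rint(t @ cy).astype(int)
    ok = (sx >= 0) & (sx < image.shape[1]) & (sy >= 0) & (sy < image.shape[0])
    out = np.zeros(out_shape, dtype=image.dtype)
    out.ravel()[ok] = image[sy[ok], sx[ok]]
    return out
```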
The above approach may also be used to register images acquired from
two different sensors or from two different sources or at two different times.
The process of image registration has widespread application in image fusion,
change detection and GIS modeling, and is described next.
2.4
Image Registration
The registration of two images is the process by which they are aligned so
that overlapping pixels correspond to the same entity that has been imaged.
Either feature based or intensity based image registration may be performed.
The feature based registration technique is based on manual selection of GCP
as described in the previous section. The feature-based technique is, how-
ever, laborious, time intensive and a complex task. Some automatic algorithms
have been developed to automate the selection of GCP to improve the ef-
ficiency. However, the extraction of GCP may still suffer from the fact that
sometimes too few a points will be selected, and further the extracted points
may be inaccurate and unevenly distributed over the image. Hence, intensity
based registration techniques may be more appropriate than the feature based
techniques.
We provide a basic overview of the principles of intensity based image reg-
istration in this chapter. An advanced intensity based technique using mutual
information as the similarity measure is dealt with in Chaps. 3 and 7.
The problem of intensity based registration of two images containing trans-
lation errors can be posed for two scalar-valued images I1(m, n) and I2(m, n) as finding integers k and l such that I1(m - k, n - l) is as close as possible to I2(m, n).
If the images are of the same size and the relative camera movement between
the two image-capture positions is mostly translational, then registration can
be attempted by shifting until a close match is obtained.
If the mean squared error is used as a criterion of fit, it is then equivalent to maximizing the cross-correlation between the two images, that is, k and l are chosen to maximize the correlation between the images I1 and I2. Finding the cross-correlation over all possible offsets is computationally ex-
pensive. Therefore, knowing approximately the range of offsets within which
best registration is achieved helps to keep the computational load small. In
many instances, a control point is chosen in the scene and a window of pixels
around the control point is extracted from the image to be registered and is
then correlated with the other image in the vicinity of the estimated location of
the control point in it. Instead of computing the correlation, one can also com-
pute the sum of the least absolute differences between corresponding pixels of
the two images over the window. The window position for which the summed
difference is the smallest corresponds to the best translation.
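A minimal sketch of this window-search strategy follows, using the sum of absolute differences over a trimmed overlap and an assumed maximum offset of ten pixels.

```python
import numpy as np

def register_translation(ref, moving, max_shift=10):
    """Exhaustively search integer offsets (k, l), keeping the one that
    minimizes the sum of absolute differences over the trimmed overlap."""
    r = ref.astype(np.int64)
    m = moving.astype(np.int64)
    best, best_sad = (0, 0), np.inf
    t = max_shift  # trim the border so wrapped-around pixels are ignored
    for k in range(-max_shift, max_shift + 1):
        for l in range(-max_shift, max_shift + 1):
            shifted = np.roll(m, (k, l), axis=(0, 1))
            sad = np.abs(r[t:-t, t:-t] - shifted[t:-t, t:-t]).sum()
            if sad < best_sad:
                best_sad, best = sad, (k, l)
    return best  # shift to apply to `moving` for best alignment
```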
2.5
Image Enhancement
The aim of image enhancement is to improve the quality of the image so
that it is more easily interpreted (e. g. certain objects become more distinct).
The enhancement may, however, occur at the expense of the quality of other
objects, which may become subdued in the enhancement operation. Image
enhancement can be performed using point operations and geometric op-
erations, which are described next. It may, however, be mentioned that be-
fore applying these operations, the images are assumed to have been cor-
rected for radiometric distortions. Otherwise, these distortions will also be
enhanced.
2.5.1
Point Operations
Fig. 2.2. Left: input image (Conesus Lake, upstate New York; modular imaging spectrometer, gray version; http://www.cis.rit.edu/research/dirs/pages/images.html). Right: result after thresholding pixel values below 60 to 0 and those above 60 to 255
A point operation maps each pixel gray value to a new value that depends only on the value of that pixel, for example by thresholding values according to the range into which they fall. Thresholding is usually done because objects of interest often contribute to a specific range of radiometric values in the image. For example, the black pixels resulting after thresholding in Fig. 2.2 belong almost entirely to the water body in the original image.
2.5.1.1
Image Histograms
The histogram of an M × N image with 256 gray levels records the number of pixels n_k at each gray level k. Since all spatial arrangement is discarded, a histogram does not determine an image uniquely: there are

$\binom{MN}{n_0} \binom{MN - n_0}{n_1} \cdots \binom{MN - \sum_{k=0}^{254} n_k}{n_{255}}$   (2.4)
distinct images with the same histogram. In spite of this limitation, image
histograms turn out to be very useful (Gonzalez and Woods 1992; Richards
and Jia 1999).
Histogram equalization is a common procedure for contrast enhancement.
The process creates a new image whose histogram is nearly flat across all gray
levels. An example is shown in Fig. 2.4 where the top panel shows the original
image along with its histogram and the bottom panel shows resulting image
and histogram after equalization.
Several objects, for example the river, are visible in the histogram-equalized image but not in the original image. The equalization process is based on transforming the PDF of a continuous random variable to the uniform distribution. Suppose a random variable x has PDF f_x(x) and cumulative distribution function F_x; then the transformed variable

$y = F_x(x)$   (2.5)

is uniformly distributed on [0, 1].
Fig. 2.4. a Original image. b Histogram. c Histogram-equalized image. d Equalized histogram
Fig. 2.5a,b. Landsat ETM+ image of the Finger Lakes region, upstate New York. a Original
image. b Piecewise contrast enhanced image
Fig. 2.6a,b. a Histogram of the image in Fig. 2.5a. b Piecewise linear mapping used for contrast enhancement
Intensity values below 125 are set to 0 and intensity values above 230 are set to 255. Intensity value transformations are then applied as

$y = \frac{7}{2}(x - 125) \quad \text{for } 125 \le x \le 175$   (2.6)

and

$y = \frac{16}{11}(x - 175) + 175 \quad \text{for } 175 \le x \le 230 .$   (2.7)
The effect of these transformations is to map values between 125 and 175
to 0 and 175, and to map values between 175 and 230 to 175 and 255. The
value of 175 itself is chosen because it is the third mode shown in the
histogram. The piecewise linear mapping corresponding to these operations
is also shown in Fig 2.6b. The operation can be seen as a form of stretching
in the sense that values between 125 and 230 are stretched to a new set of
values.
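A sketch of the piecewise stretch, implementing (2.6) and (2.7) together with the saturation of values below 125 and above 230, might look as follows.

```python
import numpy as np

def piecewise_stretch(image):
    x = image.astype(np.float64)
    y = np.zeros_like(x)                           # values <= 125 map to 0
    mid = (x > 125) & (x <= 175)
    y[mid] = (7.0 / 2.0) * (x[mid] - 125)          # Eq. (2.6)
    top = (x > 175) & (x <= 230)
    y[top] = (16.0 / 11.0) * (x[top] - 175) + 175  # Eq. (2.7)
    y[x > 230] = 255.0
    return np.round(y).astype(np.uint8)
```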
To the extent that a pointwise operator depends only on the local pixel value,
it can be implemented as a lookup table. Any such lookup table is by definition a pointwise operator, and for gray values in the range 0 to 255 there are 256^256 possible tables. However, only a few will lead to meaningful results.
Apart from the histogram equalization or the linear stretching for contrast
enhancement, which have been illustrated above and are dependent on the
Fig. 2.7. a Original image. b Image after nonlinear contrast enhancement
characteristics of the actual image and its histogram, one can apply generic
operators for contrast enhancement. For example, the mapping
results in the stretching of the mid-level gray values and compression of the ex-
tremes. The result is illustrated in Fig. 2.7. Notice that the contrast modification
curve given by the above expression is nonlinear.
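As an illustrative sketch, the mapping below uses a sine-based S-curve, one generic choice of such a nonlinear mapping (an assumed example, not necessarily the exact expression used to produce Fig. 2.7); mid-level values are stretched while the extremes are compressed.

```python
import numpy as np

def sine_stretch(image):
    """S-shaped contrast mapping on normalized gray values."""
    x = image.astype(np.float64) / 255.0
    y = 0.5 * (1.0 + np.sin(np.pi * (x - 0.5)))   # slope > 1 near mid-gray
    return np.round(255.0 * y).astype(np.uint8)
```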
2.5.2
Geometric Operations
Geometric operations involve modification of pixel gray values based not only
on the individual pixel value but also on the pixel values in its neighborhood.
In the extreme case, each new pixel value is a function of all pixel values in the
original image.
2.5.2.1
Linear Filters
The simplest and most common operation is linear finite impulse response
(FIR) filtering. Given an image I(m, n), a linear FIR filtered version J(m, n) is
obtained through the operation

$J(m, n) = \sum_{(k, l) \in A} h(k, l)\, I(m - k, n - l) ,$   (2.9)

where A is a finite set of paired indices. It is the finite support of the weighting sequence h(k, l) that gives the name finite impulse response filtering, that is, h(k, l) = 0 for (k, l) ∉ A. Most often A is symmetric and square. In other words, it is of the form

$A = \{ (k, l) : -K \le k \le K, \; -K \le l \le K \} .$   (2.10)
However, other geometries are also possible for A. Thus, as seen from (2.9),
at any pixel location (m, n), the linear filtered output is a weighted sum of the
input pixels at (m, n) and a local neighborhood as determined by A. The filter
form given in (2.9) is not only linear but is also space-invariant, that is, the filter
coefficients h(k, l) are independent of the pixel location (m, n). It is possible to
have coefficients that are dependent on pixel location in which case the filter
becomes space varying (Gonzalez and Woods 1992).
A particularly simple example of linear filtering is smoothing. Suppose we
average the values of an image over a 3 × 3 window, that is, we have

$h(k, l) = \begin{cases} 1/9 & -1 \le k, l \le 1 \\ 0 & \text{otherwise} . \end{cases}$   (2.11)
Fig. 2.8. a Input image. b Result after smoothing
Then, the effect is one of smoothing. This operation is most commonly em-
ployed for suppressing the effects of noise as demonstrated in Fig. 2.8. In this
instance, the filter (kernel) in (2.11) was employed. However, one may choose
other regions of support and a different set of coefficients. Smoothing is in-
variably accompanied by a loss of high-frequency information in the image.
Here frequency refers to spatial frequency (the rate of variation of gray levels across the image), not optical frequency.
The operation in (2.9) is a linear convolution in two dimensions regardless
of the extent of support of A. If A is not only finite but also small, the sequence
h(k, l), also known as the convolution kernel, is represented as a matrix, with k indexing the rows and l indexing the columns. For example, the uniform 3 × 3 smoothing kernel becomes

$h = \frac{1}{9} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} .$
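A smoothing pass with this kernel takes only a few lines; the sketch below uses SciPy's ndimage convolution with reflective border handling, a boundary treatment chosen here as an assumption since the text does not discuss borders.

```python
import numpy as np
from scipy.ndimage import convolve

def smooth3x3(image):
    """Convolve with the uniform 3 x 3 averaging kernel of (2.11)."""
    kernel = np.full((3, 3), 1.0 / 9.0)
    return convolve(image.astype(np.float64), kernel, mode="reflect")
```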
2.5.2.2
Edge Operations
The human visual system is sensitive to edges (Gonzalez and Woods 1992).
As such, edge detection, extraction and enhancement are employed for various purposes, including assisting manual interpretation of images and automatic object detection. For a continuous image I(x, y), the gradient is

$\nabla I = \frac{\partial I}{\partial x}\, \mathbf{i} + \frac{\partial I}{\partial y}\, \mathbf{j} ,$

where i and j are the unit vectors along the x and y directions respectively. For
digital images, several approximations to the gradient have been proposed.
The Sobel edge operator is a common and simple approximation to the
gradient. This approximates the horizontal and vertical derivatives using filter
kernels
$h_1(k, l) = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \quad \text{and} \quad h_2(k, l) = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} ,$   (2.13)

respectively. Suppose I1(m, n) and I2(m, n) are the respective outputs of applying these filters to an image I(m, n). The edge intensity is calculated as $\sqrt{I_1^2(m, n) + I_2^2(m, n)}$.
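A sketch of the Sobel edge intensity computation follows; the second kernel is taken as the transpose of the first, which differs from h2 in (2.13) only in sign and therefore leaves the magnitude unchanged.

```python
import numpy as np
from scipy.ndimage import convolve

def sobel_magnitude(image):
    """Edge intensity sqrt(I1^2 + I2^2) from the two Sobel kernels."""
    h1 = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    h2 = h1.T                                   # vertical-derivative kernel
    i1 = convolve(image.astype(np.float64), h1, mode="reflect")
    i2 = convolve(image.astype(np.float64), h2, mode="reflect")
    return np.hypot(i1, i2)
```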
An alternative to gradient approximations is provided by the Laplacian of
Gaussian (LoG) approach. For a continuous image I (x,y), the Laplacian is
provided by the operation
$L\{I\} = \left( \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} \right) I(x, y) .$   (2.14)

A common discrete approximation to the Laplacian is the kernel

$h(k, l) = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} .$   (2.15)

Because the Laplacian is sensitive to noise, the image is usually smoothed with a Gaussian first; applying the Laplacian to the Gaussian gives the LoG function

$h_{\mathrm{LoG}}(x, y) = -\frac{1}{\pi \sigma^4} \left[ 1 - \frac{x^2 + y^2}{2 \sigma^2} \right] e^{-\frac{x^2 + y^2}{2 \sigma^2}} .$   (2.16)
A surface plot of the function is shown in Fig. 2.9. The parameter σ controls the spread of the kernel. Discrete approximations can be generated for different values of σ.
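A discrete approximation can be obtained by sampling (2.16) on a grid, as in the sketch below; shifting the samples to zero mean is a common convention (an assumption here, not stated in the text) that suppresses any response on flat regions.

```python
import numpy as np

def log_kernel(sigma, half_width):
    """Sample the LoG function of (2.16) over a square grid."""
    ax = np.arange(-half_width, half_width + 1, dtype=np.float64)
    x, y = np.meshgrid(ax, ax)
    r2 = x**2 + y**2
    h = (-(1.0 / (np.pi * sigma**4))
         * (1.0 - r2 / (2.0 * sigma**2))
         * np.exp(-r2 / (2.0 * sigma**2)))
    return h - h.mean()                    # zero total response on flat areas
```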
2.5.2.3
Rank-Ordered Filtering
Filtering operations need not be linear. Nonlinear approaches are useful as well.
In particular, rank-ordered filtering, of which median filtering is a well-known
example, is widely used. Rank-ordered filtering replaces each pixel value in an
image by that pixel value in its specified neighborhood that occupies a specific
position in a list formed by arranging the pixel gray values in the neighborhood
in ascending or descending order. For example, one may choose to pick the
maximum in the neighborhood. Median filtering results when the value picked
is in the exact middle of the list.
Formally, given an image I(m, n), the median filter amounts to

$J(m, n) = \underset{(k, l) \in A}{\operatorname{median}} \; I(m - k, n - l) ,$   (2.17)
where A is the neighborhood over which the median is taken. Median filters
are most useful in mitigating the effects of "salt and pepper" noise that arises
typically due to isolated pixels incorrectly switching to extremes of opposite
intensity. This can happen during image acquisition or transmission where
black pixels erroneously become white and vice-versa due to overflows and
saturation.
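The sketch below injects salt-and-pepper noise into a synthetic image and removes it with a 3 × 3 median filter from SciPy; the noise rates are assumed for illustration.

```python
import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(0)
image = np.full((64, 64), 128, dtype=np.uint8)
# Flip isolated pixels to the intensity extremes ("salt and pepper").
image[rng.random(image.shape) < 0.05] = 255
image[rng.random(image.shape) < 0.05] = 0
# 3 x 3 median filtering restores most corrupted pixels.
denoised = median_filter(image, size=3)
```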
2.6
Image Classification
Image classification procedures help delineate regions in the image on the basis
of attributes of interest. For example, one might be interested in identifying
regions on the basis of vegetation or inhabitation. The classification problem
can be stated in formal terms as follows. Suppose we want to classify each
pixel in an image into one of N classes, say C1, C2, ..., CN. Then, decision rules are needed to assign each pixel, on the basis of its measured value, to one of the classes.
2.6.1
Supervised Classification
Suppose the class-conditional PDF of a pixel measurement x given Class 1 is f_{x|C1}, so that the probability of the measurement falling in a set A is

$P(x \in A \mid C_1) = \int_{y \in A} f_{x|C_1}(y)\, dy .$   (2.18)

The maximum likelihood (ML) rule assigns a pixel with measured value y to Class 1 if

$f_{x|C_1}(y) > f_{x|C_2}(y) .$   (2.19)

Otherwise, the rule assigns the pixel to Class 2. The principle for the rule is as
follows. If the two classes contribute equally in the image, then there is a greater
probability of Class 1 pixels being in the neighborhood of y than Class 2 pixels
for any y satisfying (2.19). However, the two classes may not contribute pixels
in equal proportion in which case one has to take their individual probabilities
of occurrence into account.
The maximum a posteriori (MAP) classification rule picks Class 1 if, for a measured pixel value of y,

$P(C_1 \mid y) > P(C_2 \mid y) .$   (2.20)
This means that given the measurement, the probability that it came from
a particular class is evaluated for purpose of classification. This is a reverse or
a posteriori probability as opposed to the forward probability in (2.18), which
gives the probability of obtaining a value for the measurement given the pixel
class. By Bayes theorem, which relates forward and reverse probabilities,
$P(C_1 \mid y) = \frac{f_{x|C_1}(y)\, P(C_1)}{f_x(y)} ,$   (2.21)
where f_x(y) is the PDF of the pixel measurement regardless of the class to which it belongs and P(C1) is the a priori probability that a randomly chosen pixel belongs to Class 1. A similar relationship obviously holds for P(C2 | y).
Thus, (2.20) yields, in conjunction with Bayes theorem, the MAP classification
rule:
Assign pixel measurement y to Class 1 if

$$f_{x|C_1}(y)\, P(C_1) > f_{x|C_2}(y)\, P(C_2)\,. \qquad (2.22)$$
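As an illustration, the two-class MAP rule of (2.22) with Gaussian class-conditional PDFs may be sketched as follows (a minimal Python version; the parameter names are ours):

import numpy as np
from scipy.stats import norm

def map_classify(y, mu1, sigma1, p1, mu2, sigma2, p2):
    """Assign y to Class 1 where f(y|C1) P(C1) > f(y|C2) P(C2), per (2.22)."""
    score1 = norm.pdf(y, mu1, sigma1) * p1   # forward PDF times prior, Class 1
    score2 = norm.pdf(y, mu2, sigma2) * p2   # forward PDF times prior, Class 2
    return np.where(score1 > score2, 1, 2)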
¹ A symmetric distribution about a mean of x implies that the PDF f_x(y) is a function only of the Euclidean distance between y and x.
2.6.2
Unsupervised Classification
The next step is the location of the centroid within each cluster. The centroid
is defined as that value for which the sum of its distances from each member
of the cluster is minimized. For the L1-distance this turns out to be the vector
each of whose components is the median of that component taken over the
members of the cluster. For example, suppose we are working with just two
bands and there is a cluster with three members, say, al = [2,34], az = [10,22]
and a3 = [17,36]. Then the centroid is the vector [10,34], which is obtained as
the median of the two components of these three vectors. Notice that the cen-
troid of a cluster may not be an actual input value. Following the calculation of
the centroids, the clustering process is repeated as in (2.24) with the centroids
replacing the initial values. A new set of centroids is then calculated following
the clustering. The two steps of clustering and centroid calculations are per-
formed until either the clusters are stable or there is no significant change in
clusters.
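A minimal sketch of this two-step iteration under the L1-distance follows (a fixed iteration count stands in for the convergence test; the names are ours):

import numpy as np

def cluster_l1(pixels, centroids, n_iter=20):
    """Alternate L1 assignment and component-wise median centroid updates."""
    for _ in range(n_iter):
        d = np.abs(pixels[:, None, :] - centroids[None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)                 # nearest centroid, L1 distance
        for j in range(len(centroids)):           # centroid = component-wise median
            if np.any(labels == j):
                centroids[j] = np.median(pixels[labels == j], axis=0)
    return labels, centroids

With the three members above, the medians of [2, 10, 17] and [34, 22, 36] indeed yield the centroid [10, 34].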
If the Euclidean or L2-distance is used instead of the absolute distance, then
the centroid simply becomes the arithmetic mean of all vectors. This measure
is appropriate if the distribution is Gaussian. It is close to the ML rule in that
Fig. 2.10. a Input color image. b 3D scatter plot of R, G and B components. c Pseudocolor
map of unsupervised classification. For a colored version of this figure, see the end of the
book
heuristics. We have explained just one clustering approach. There are other
approaches to be found in the literature (Gersho and Gray 1992).
An illustration of forming classes through clustering is shown in Fig. 2.10.
A color image has been divided into five classes through unsupervised cluster-
ing. A pseudocolor image is then created where each class is assigned a color.
The resulting classes reflect to a large degree actual ground objects such as
water, buildings and fields.
2.6.3
Crisp Classification Algorithms
Clearly, the objective of classification is to allocate each pixel of an image to
one of the classes. This is known as per-pixel, crisp or hard classification.
The maximum likelihood decision rule is most commonly used for per-pixel
supervised classification when the data satisfy the Gaussian distribution
assumption. However, the spectral properties of the classes are often far from the
assumed distribution. For example, images acquired in complex and heteroge-
neous environments actually record the reflectance from a number of different
classes. Many pixels are, therefore, mixed because the boundaries of mutually
exclusive classes meet within a pixel (boundary pixels) or a small proportion
of the classes exist within the major classes (sub-pixel phenomenon). The oc-
currence of mixed pixels is a major problem and has been discussed under
fuzzy classification algorithms.
Another issue associated with the statistical algorithm is the selection of
training data of appropriate size for accurate estimation of parameters govern-
ing that algorithm. In the absence of sufficient number of pure pixels to define
training data (generally 10 to 30 times the number of bands (Swain and Davis
1978», the performance of parametric or statistical classification algorithms
may deteriorate. Further, these classification algorithms have limitations in
integrating remote sensing data with ancillary data. Digital classification per-
formed solely on the basis of the statistical analysis of spectral reflectance
values may create problems in classifying certain areas. For example, remote
sensing data are an ideal source for land cover mapping in hilly areas as they
minimize accessibility problems. Hilly areas, however, contain shadows cast
by high peaks and ridges, which alter the reflectance characteristics of the
objects underneath. To reduce the effect of shadows in a remote sensing image,
the inclusion of information from the digital
elevation model (DEM) can be advantageous (Arora and Mathur 2001). There-
fore, the ancillary (non-spectral) information from other sources such as DEM
(Bruzzone et al. 1997; Saha et al. 2004), and/or geological and soil maps (Gong
et al. 1993) may be more powerful in characterizing the classes of interest.
Moreover, due to the widespread availability of GIS, digital spatial data have
become more accessible than before. Therefore, greater attention is now being
paid to the use of ancillary data in classification using remote sensing imagery,
where parametric classifiers may result in inappropriate classification.
Non-parametric classification may be useful when the data distribution
assumptions are not satisfied. A number of nonparametric classification al-
[Figure: a feed-forward neural network with Landsat TM bands 1-6 as input units and the classes agriculture, built-up, forest, sand and water as output units]
Too many hidden units may also result in the phenomenon called overtraining (Chi 1995). The optimal
number of hidden units is often determined experimentally using a trial and
error method although some basic geometrical arguments may be used to give
an approximate indication (Bernard et al. 1995).
Thus, the units in a neural network are representations of the biological
concept of a neuron, with weighted paths connecting them (Schalkoff 1997).
The data supplied to a unit's input are multiplied by the path's weight and
are summed to derive the net input to that unit. The net input (NET) is then
transformed by an activation function (f) to produce an output for the unit. The
most common form of the activation function is the sigmoid function, defined
as

$$f(\mathrm{NET}) = \frac{1}{1 + \exp(-\mathrm{NET})}\,, \qquad (2.25)$$
and accordingly, the output of the unit is

$$O = f(\mathrm{NET})\,, \qquad (2.26)$$
where NET is the sum of the weighted inputs to the processing unit and may
be expressed as
$$\mathrm{NET} = \sum_{i=1}^{n} x_i w_i\,, \qquad (2.27)$$
where Xi is the magnitude of the ith input and Wi is the weight of the intercon-
nected path (Foody 1998).
The determination of the appropriate weights is referred to as learning
or training. Learning algorithms may be supervised and unsupervised, as
discussed earlier. Generally a backpropagation algorithm is applied which
is a supervised algorithm and has been widely used in applications related
to neural network classification in remote sensing. The algorithm iteratively
minimizes an error function over the network outputs and target outputs
for a sample of training pixels c. The process continues until the error value
converges to a minimum. Conventionally, the error function is given as
$$E = 0.5 \sum_{i=1}^{c} (T_i - O_i)^2\,, \qquad (2.28)$$
where Ti is the target output and Oi is the network output, also known as the
activation levels.
Thus, the magnitudes of the weights are determined by an iterative training
procedure in which the network repeatedly tries to learn the correct output for
each training sample. The procedure involves modifying the weights of the
layers connecting units until the network is able to characterize the training
data (Foody 1995b). Once the network is trained, the adjusted weights are used
to classify the unknown dataset.
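One training step of this procedure for a single-hidden-layer network may be sketched as follows (a minimal NumPy version; the architecture, array shapes and learning rate are our assumptions):

import numpy as np

def sigmoid(net):                                  # activation function of (2.25)
    return 1.0 / (1.0 + np.exp(-net))

def train_step(x, t, w_hidden, w_out, lr=0.1):
    """One backpropagation step minimizing the error E of (2.28)."""
    h = sigmoid(x @ w_hidden)                      # hidden outputs; NET per (2.27)
    o = sigmoid(h @ w_out)                         # network outputs (activation levels)
    delta_o = (o - t) * o * (1 - o)                # output-layer error term
    delta_h = (delta_o @ w_out.T) * h * (1 - h)    # backpropagated hidden error
    w_out -= lr * np.outer(h, delta_o)             # gradient-descent weight updates
    w_hidden -= lr * np.outer(x, delta_h)
    return 0.5 * np.sum((t - o) ** 2)              # E of (2.28)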
2.6.4
Fuzzy Classification Algorithms
For a large country, mapping from remote sensing is often carried out at the
regional level. This necessitates the use of coarse spatial resolution images
from a number of satellite based sensors such as MODIS, MERIS and AVHRR
that provide data ranging from 250 m to 1.1 km spatial resolution. The images
obtained from these sensors are frequently contaminated by mixed pixels.
These are pixels that do not represent a single homogeneous class; instead,
two or more classes are present within a single pixel area. The result is large
variations in the spectral reflectance value of the pixel. Thus, occurrence of
mixed pixels in an image is a major problem.
The crisp classification algorithms force the mixed pixels to be allocated
to one and only one class thereby resulting in erroneous classification. The
problem may be resolved either by ignoring or removing the mixed pixels
from the classification process. This may, however, be undesirable since it may
result in loss of pertinent information hidden in these pixels. Moreover, since
the mixed pixel displays a composite spectral response, which may be dissimilar
to each of its component classes, the pixel may not be allocated to any of its
component classes. Therefore, error is likely to occur in the classification of
images that contain a large proportion of mixed pixels (Foody and Arora 1996).
Alternative approaches for classifying images dominated by mixed pixels are,
therefore, highly desirable.
The output of the maximum likelihood classification may be fuzzified to
represent multiple class memberships for each pixel. Thus, for example, the
a posteriori probabilities from MLC may reflect, to some extent, the class
composition of a mixed pixel (Foody 1992). Another widely used method to
derive sub-pixel class memberships is the fuzzy c-means (FCM) clustering
algorithm (Bezdek et al. 1984), which is based on the minimization of the
objective function

$$J_m = \sum_{i=1}^{n}\sum_{j=1}^{c} (\mu_{ij})^m\, \|x_i - v_j\|^2\,, \qquad (2.29)$$

where v_j is the vector of cluster centers (i.e. class means), μ_ij are the class
membership values of a pixel, c and n are the numbers of classes and pixels
respectively, ‖x_i − v_j‖² is the squared distance d²_ij between the spectral
response of a pixel x_i and the class mean v_j, and m is a weighting exponent,
which controls the degree of fuzziness. The value of m varies from 1 (no
fuzziness) to ∞ (complete fuzziness). Earlier studies have shown that there is
no optimal value of m, but a value in the range 1.5 to 3 can generally be
adopted. The class membership μ_ij is computed from

$$\mu_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij}^2 / d_{ik}^2 \right)^{1/(m-1)}}\,. \qquad (2.30)$$
These class membership values of a pixel denote the class proportions and
are treated as sub-pixel classifications. The sub-pixel classifications are rep-
resented in the form of fraction or proportion images, equal to the number
of classes being mapped. In Chap. 11, a Markov random field {MRF} model
based approach for sub-pixel classification of multi and hyperspectral data is
presented.
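The membership computation of (2.30) can be vectorized as follows (a minimal NumPy sketch; the array layout, one row per pixel and one column per class, is our convention):

import numpy as np

def fcm_memberships(pixels, means, m=2.0):
    """Class memberships mu_ij of (2.30) for given class means."""
    d2 = ((pixels[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # d_ij^2
    d2 = np.maximum(d2, 1e-12)                     # guard against zero distances
    ratio = (d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)                 # memberships sum to 1 per pixel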
2.6.5
Classification Accuracy Assessment
Table 2.1. A typical error matrix (nij are the pixels of agreement and disagreement, N is the
total number of testing pixels)
The overall accuracy (OA), producer's accuracy (PA) and user's accuracy (UA)
do not take into account the agreement between the data sets (i.e. classified
image and reference data) that arises due to chance alone. Thus, these
measures tend to overestimate the classification accuracy (Ma and Redmond 1995).
The kappa coefficient of agreement (K) has the ability to account
for chance agreement (Rosenfeld and Fitzpatrick-Lins 1986). The proportion
of agreement by chance is the result of the misclassifications represented by the
off-diagonal elements of the error matrix. Therefore, K uses all the elements
of the error matrix, and not just the diagonal elements (as is the case with
OA). When some classes have more confusion than others, weighted kappa
may be implemented since it does not treat all the misclassifications (disagree-
ments) equally and tends to give more weight to the confusions that are more
serious than others (Cohen 1968; Naesset 1996). To determine the accuracy
of individual classes, a conditional kappa may be computed (Rosenfeld and
Fitzpatrick-Lins 1986).
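As an illustration, kappa may be computed from an error matrix as follows (a minimal NumPy sketch; taking rows as classified labels and columns as reference labels is our convention):

import numpy as np

def kappa(error_matrix):
    """Kappa coefficient: observed agreement corrected for chance agreement."""
    n = error_matrix.sum()
    po = np.trace(error_matrix) / n                           # observed agreement (OA)
    pe = (error_matrix.sum(axis=0) * error_matrix.sum(axis=1)).sum() / n ** 2
    return (po - pe) / (1 - pe)                               # chance-corrected accuracy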
It may thus be seen that a number of measures may be computed from an
error matrix. Each measure, however, may be based on different assumptions
about the data and thus may evaluate different components of
accuracy (Lark 1995; Stehman 1999). Therefore, in general, it may be expe-
dient to provide an error matrix with the classified image and report more
than one measure of accuracy to fully describe the quality of that classification
(Stehman 1997).
The error matrix based measures inherently assume that each pixel is asso-
ciated with only one class in the classified image and only one class in the ref-
erence data. Use of these measures to assess the accuracy of fuzzy classification
may therefore under- or overestimate the accuracy, since a fuzzy classification
has to be degraded to adhere to this assumption. Moreover, in addition to the
classification output, often ambiguities exist in the reference data, which may
therefore be treated as fuzzy and thus other alternative measures need to be
adopted (Foody 1995a; Foody 1995b).
One of the simplest measures to assess the accuracy of fuzzy classification
is the entropy, which shows how the strength of class membership (i. e. fuzzy
memberships) in the classified image is partitioned between the classes for
each pixel. The entropy for a pixel is maximum when the pixel has equal class
memberships for all the classes. Conversely, its value is minimum, when the
pixel is entirely allocated to one class. It thus tells us the degree to which
a classification output is fuzzy or crisp. Entropy may, however, be appropriate
only for situations in which the output of the classification
is fuzzy (Binaghi et al. 1999). Often, there exist ambiguities in the refer-
ence data, which may therefore be treated as fuzzy or uncertain (Bastin et
al. 2002). Other measures may have to be applied when the reference data
and classified image are fuzzy. Under these circumstances, measures such
as the Euclidean distance, L1 distance and cross-entropy or directed divergence
may be adopted. These measures estimate the separation of two data sets
based on the relative extent or proportion of each class in the pixel (Foody
and Arora 1996). The lower the values of these measures, the higher the accuracy
of classification. To indicate the accuracy of an individual class in a fuzzy
classification, correlation coefficient (R) may be used (Maselli et al. 1996).
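The per-pixel entropy, for instance, may be computed as follows (a minimal NumPy sketch; the clipping constant is ours, to avoid taking log of zero):

import numpy as np

def pixel_entropy(memberships):
    """Entropy of fuzzy memberships, one row per pixel and one column per class.
    Maximum for equal memberships, zero when one class has membership 1."""
    p = np.clip(memberships, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)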
Once the fuzzy error matrix is generated, conventional error matrix based
measures such as OA, PA and UA may be computed in a similar fashion to
indicate the accuracy of a fuzzy classification and of individual classes. The
fuzzy error matrix based measures are thus consistent with the measures
derived from the conventional error matrix for crisp classification, and may
therefore be more appropriate than the distance and entropy based measures.
However, further research needs to be conducted to operationalize these
measures in the remote sensing community.
2.7
Image Change Detection
Objects on the earth's surface change with time due to economic, social
and environmental pressures. These changes need to be recorded periodi-
cally for planning, management and monitoring programs. Some programs
require estimation of changes over long time intervals whereas others need the
recording of short-term changes. The frequency at which a sensor can revisit
and record a scene is typically referred to as its temporal resolution.
To perform a systematic change detection study, two maps prepared at
different times are required. Aerial photographs and remote sensing images
that provide synoptic view of the terrain are regarded as attractive sources for
studying the changes on the earth's surface. This is due to the availability of these
data at many different temporal resolutions. While the temporal resolution
of data from aircrafts may be altered as per the needs of the project, satellite
based data products have fixed temporal resolution. Once the data at two
different times are available, these can be analyzed to detect the changes using
three approaches - manual compilation, photographic compilation and digital
processing. Manual and photographic compilations are laborious and time
consuming. Therefore, digital methods of change detection are gaining a lot of
importance.
Before detecting changes using digital images, these need to be accurately
registered with each other (see Sect. 2.4). For example, registration accuracy
of less than one-fifth of a pixel is required to achieve a change detection error
of less than 10% (Dai and Khorram 1999). Either feature based or intensity
based registration may be performed.
The fundamental assumption in applying any digital change detection al-
gorithm is that there exists a difference in the spectral response of a pixel (i.e.
its intensity value) on images of two dates if the object it covers has changed
from one date to the other. There are many digital change detection algorithms that
can be used to derive change maps; three widely used ones are,
1. Image differencing
2. Image ratioing
3. Post classification comparisons
In image differencing, one digital image is simply subtracted from another
digital image of the same area acquired at a different time. The difference of
intensity values is stored as a third image in which features with no change will
have near zero values. The third image is thresholded in such a way that pixels
showing change have a value one and the pixels with no change have a value
zero thereby creating a binary change image.
The image ratioing approach involves dividing the intensity values of the
pixels in one image by those in the other image. The ratios are used to generate a third
image. In this image, if the intensity values are close to 1, there is no change in
objects. This method has an important advantage in that by taking ratios the
variations in the illumination can be minimized.
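Both approaches can be sketched in a few lines (a minimal NumPy illustration; the threshold values are placeholders that must be tuned per scene):

import numpy as np

def difference_change_map(img1, img2, threshold):
    """Binary change image by differencing: 1 = change, 0 = no change."""
    diff = np.abs(img1.astype(float) - img2.astype(float))
    return (diff > threshold).astype(np.uint8)

def ratio_change_map(img1, img2, low=0.8, high=1.2):
    """Binary change image by ratioing: ratios near 1 indicate no change."""
    ratio = img1.astype(float) / np.maximum(img2.astype(float), 1e-6)
    return ((ratio < low) | (ratio > high)).astype(np.uint8)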
In post classification comparison, the two images are first classified using
any of the techniques mentioned in Sect. 2.6. These two classified images are
then compared pixel by pixel to produce a change image representing pixels
placed in different classes. For this method to be successful, the classification
of images should be performed very accurately.
Change detection has also been performed using many other algorithms
such as change vector analysis (Lambin and Strahler 1994), principal compo-
nent analysis (Mas 1999) and neural networks (Liu and Lathrop 2002). A re-
cently developed MRF model based change detection algorithm is presented
in Chap. 12.
2.8
Image Fusion
During the last several years, an enormous amount of data from a number of
remote sensing sensors and geo-spatial databases has become available to the
user community. Often, the information provided by an individual sensor
may be incomplete, inconsistent and imprecise for a given application (Si-
mone et al. 2002). The additional sources of data may provide complementary
information to remote sensing data analysis. Therefore, fusion of different
information may result in a better understanding of the environment. This is
primarily due to the fact that the merits of each source may be tapped so as
to produce a better quality data product. For example, a multispectral sensor
image (at fine spectral resolution) may be fused with a panchromatic sensor
image (at high spatial resolution) to generate a product with enhanced quality,
as spectral characteristics of one image are merged with spatial characteristics
of another image to increase the accuracy of detecting certain objects, which
otherwise may not be possible from individual sensor data. Thus, image fusion
may be defined as the process of merging data from multiple sources to achieve
refined information. The fusion of images may be performed for a number of purposes (Gupta 2003).
2.9
Automatic Target Recognition
In multiband imagery, it is almost impossible for a target and the background to offer identical reflectance
in each spectral band. Whereas natural backgrounds typically offer highly
correlated images in narrowly spaced spectral bands (Yu et al. 1997), human-
made objects are less correlated.
ATR may generally be seen as a two-stage operation: i) Detection of anoma-
lies and ii) Recognition. In the detection stage, pixels composing different
targets or a single target are identified. This requires classifying each pixel as
target or non-target using classification approaches described earlier. Thus the
detection stage requires operating on the entire image. On the other hand, the
recognition stage is confined to processing only those regions that have been
classified as target pixels. Recognition also involves prior knowledge of targets.
In many instances, the latter can also be cast as a classification problem. While
there is this broad categorization of ATR into two stages, actual application
might involve a hierarchy of classification stages. For example, one might go
through classifying objects first as human-made or natural (the detection of
potential targets), then classify the human-made objects into buildings, roads
and vehicles, and finally classify vehicles as tanks, trucks and so on.
The early approaches to the ATR problem in images consisted roughly of
thresholding single sensor imagery, doing a shape analysis on the objects of
interest and then comparing the extracted shape information to stored tem-
plates. It is easy to appreciate the complexity of the problem, given the many
views a single 3D object can present as a function of its position relative to the
sensor, with factors such as scaling and rotation thrown in. Furthermore, one
has to operate at poor signal to noise ratio (SNR) and contrast in highly cluttered
backgrounds, making traditional approaches such as matched filtering
virtually useless. Thus, we may conclude that approaches begun in the early
1980s have reached their limits.
On the other hand, multiband imagery offers many advantages, the chief one,
as already mentioned being the spectral separation that is seen between human-
made and natural objects. An important theme of current investigations is to
collect spectral signatures of targets in different bands. Classifier schemes
are then trained with target and background data for discriminating target
pixels across bands against background. The performance of hyperspectral
or multispectral ATR is yet to reach acceptable levels in field conditions. An
Independent Component Analysis (ICA) based feature extraction approach,
as discussed in Chap. 8, may be used for the ATR problem.
2.10
Summary
This chapter has provided an overview of image processing and analysis tech-
niques used with multiband image data. The intent of this chapter is to provide
a ready reference within the book for the basics of image analysis.
References
Arora MK, Foody GM (1997) Log-linear modeling for the evaluation of the variables affecting
the accuracy of probabilistic, fuzzy and neural network classifications. International
Journal of Remote Sensing 18(4): 785-798
Arora MK, Ghosh SK (1998) A comparative evaluation of accuracy measures for the classifi-
cation of remotely sensed data. Asian Pacific Remote Sensing and GIS Journal 10(2): 1-9
Arora MK, Mathur S (2001) Multi-source image classification using neural network in
a rugged terrain. Geo Carto International 16(3): 37-44
Bastin LPF, Fisher P, Wood J (2002) Visualizing uncertainty in multi-spectral remotely
sensed imagery. Computers and Geosciences 28: 337-350
Bernard AC, Kanellopoulos I, Wilkinson GG (1995) Neural network classification of mix-
tures. Proceedings of the International Workshop on Soft Computing in Remote Sensing
Data Analysis, Milan, Italy, pp 53-58
Bezdek JC, Ehrlich R, Full W (1984) FCM: the fuzzy c-means clustering algorithm. Comput-
ers and Geosciences 10: 191-203
Binaghi E, Brivio PA, Ghezzi P, Rampini A (1999) A fuzzy set-based accuracy assessment of
soft classification. Pattern Recognition Letters 20: 935-948
Bretschneider T, Kao O (1999) Image fusion in remote sensing. Report, Technical University
of Clausthal
Brodley CE, Friedl MA, Strahler AH (1996) New approaches to classification in remote
sensing using homogeneous and hybrid decision trees to map land cover. Proceedings
of the IEEE International Geoscience and Remote Sensing Symposium 1, pp 532-534
Brown, LG (1992) A survey of image registration techniques. ACM Computing Surveys 24:
325-376
Bruzzone L, Conese C, Maselli F, Roli F (1997) Multi-source classification of complex rural
areas by statistical and neural-network approaches. Photogrammetric Engineering and
Remote Sensing 63(5): 523-533
Campbell, JB (2002) Introduction to remote sensing. Guilford Press, New York
Chi Z (1995) MLP classifiers: overtraining and solutions. Proceedings of the IEEE Interna-
tional Conference on Neural Networks 5, pp 2821-2824
Cipollini P, Corsini G, Diani M, Grasso R (2001) Retrieval of sea water optically active
parameters from hyperspectral data by means of generalized radial basis function neural
networks. IEEE Transactions on Geoscience and Remote Sensing 39(7): 1508-1524
Cohen J (1968) Weighted kappa: nominal scale agreement with provision for scaled dis-
agreement or partial credit. Psychological Bulletin 70: 213-220
Congalton RG (1991) A review of assessing the accuracy of classifications of remotely sensed
data. Remote Sensing of the Environment 37: 35-46
Dai X, Khorram S (1999) A feature-based image registration algorithm using improved
chain-code representation combined with invariant moments. IEEE Transactions on
Geoscience and Remote Sensing 37: 2351-2362
Folk M, McGrath RE, Yeager N (1999) HDF: an update and future directions. Proceedings
of the IEEE International Geoscience and Remote Sensing Symposium 1, pp 273-275
Foody GM (1992) A fuzzy sets approach to the representation of vegetation continua from
remotely sensed data: an example from lowland heath. Photogrammetric Engineering
and Remote Sensing 58: 221-225
Foody GM (1995a) Cross-entropy for the evaluation of the accuracy of a fuzzy land cover
classification with fuzzy ground data. ISPRS Journal of Photogrammetry and Remote
Sensing 50: 2-12
Foody GM (1995b) Using prior knowledge in artificial neural network classification with
a minimal training set. International Journal of Remote Sensing 16(2): 301-312
Foody GM (1998) Issues in training set selection and refinement for classification by a feed-
forward neural network. Proceedings of the IEEE International Geoscience and Remote
Sensing Symposium 1, pp 409-411
Foody GM, Arora MK (1996) Incorporating mixed pixels in the training, allocation and
testing stages of supervised classifications. Pattern Recognition Letters 17(13): 1389-
1398
Gersho A, Gray RM (1992) Vector quantization and signal compression. Kluwer Academic,
Boston, MA
Gong P, Miller JR, Spanner M (1993) Forest canopy closure from classification and spectral
mixing of scene components: multi-sensor evaluation of application to an open canopy.
Proceedings of the IEEE International Geoscience and Remote Sensing Symposium 2,
pp 747-749
Gonzalez RC, Woods RE (1992) Digital image processing. Addison-Wesley Publishing Com-
pany, Reading, Massachusetts
Gupta RP (2003) Remote sensing geology. Springer Verlag, Heidelberg
Hansen M, DeFries R, Townshend J, Sohlberg R (2000) Global land cover classification at
1 km spatial resolution using a classification tree approach. International Journal of
Remote Sensing 21(6&7): 1331-1364
Heermann PD, Khazenie N (1992) Classification of multispectral remote sensing data using
a back-propagation neural network. IEEE Transactions on Geoscience and Remote
Sensing 30(1): 81-88
Huang X, Jensen JR (1997) A machine learning approach to automated construction of
knowledge bases for image analysis expert systems that incorporate GIS data. Pho-
togrammetric Engineering and Remote Sensing 63(10): 1185-1194
Jensen, JR (1996) Introductory digital image processing: a remote sensing perspective.
Prentice-Hall, Upper Saddle River, NJ
Kavzoglu T, Mather PM (2003) The use of backpropagating artificial neural networks in
land cover classification. International Journal of Remote Sensing 24(23): 4907-4938
Kimes DS (1991) Inferring vegetation characteristics using a knowledge-based system.
Proceedings of the IEEE International Geoscience and Remote Sensing Symposium 4,
pp 2299-2302
Lambin EF, Strahler AH (1994) Indicators of land-cover change for change-vector analysis
in multitemporal space at coarse spatial scales. International Journal of Remote Sensing
15: 2099-2119
Lark RM (1995) Contribution of principal components to discrimination of classes of land
cover in multispectral imagery. International Journal of Remote Sensing 6: 779-784
Liu X, Lathrop RG (2002) Urban change detection based on an artificial neural network.
International Journal of Remote Sensing 23: 2513-2518
Ma Z, Redmond RL (1995) Tau coefficient for accuracy assessment of classification of remote
sensing data. Photogrammetric Engineering and Remote Sensing 61: 435-439
Mas JF (1999) Monitoring land-cover changes: a comparison of change detection techniques.
International Journal of Remote Sensing 20: 139-152
Maselli F, Conese C, Petkov L (1996) Use of probability entropy for the estimation and
graphical representation of the accuracy of maximum likelihood classifications. ISPRS
Journal of Photogrammetry and Remote Sensing 49(2): 13-20
Mather PM (1999) Computer processing of remotely-sensed images: an introduction. Wiley,
Chichester
3.1
Introduction
Fig. 3.1. General flow chart of intensity based image registration using mutual information
as the similarity measure
are explained. However, these methods, under certain conditions, suffer from
a phenomenon called interpolation-induced artifacts that hampers the op-
timization process and may thereby reduce registration accuracy (Pluim et
al. 2000). To overcome the problem of interpolation-induced artifacts, a new
algorithm called the Generalized Partial Volume joint histogram Estimation
(GPVE) algorithm (Chen and Varshney 2003) is also introduced.
3.2
Mutual Information Similarity Measure
Mutual information has its roots in information theory (Cover and Thomas
1991). It was developed to set fundamental limits on the performance of com-
munication systems. However, it has made vital contributions to many different
disciplines like physics, mathematics, economics, and computer science. In this
section, we introduce the use of MI for image registration.
The MI of two random variables A and B is defined by

$$I(A,B) = \sum_{a,b} P_{A,B}(a,b)\, \log\frac{P_{A,B}(a,b)}{P_A(a)\, P_B(b)}\,, \qquad (3.1)$$
where P A(a) and PB (b) are the marginal probability mass functions, and
PA,B(a, b) is the joint probability mass function. MI measures the degree of
dependence of A and B by measuring the distance between the joint probability
PA,B(a, b) and the probability associated with the case of complete indepen-
dence PA(a)PB(b), by means of the relative entropy or the Kullback-Leibler
Mutuallnformation:A Similarity Measure for Intensity Based Image Registration 91
where H(A) and H(B) are the entropies of A and Band H(A, B) is their joint
entropy. Considering A and B as two images, floating image (F) and reference
image (R) respectively, the MI based image registration criterion states that
the images shall be registered when I(F, R) is maximum. The entropies and the
joint entropy can be computed from

$$H(F) = -\sum_{f} P_F(f)\log P_F(f)\,, \qquad (3.3)$$

$$H(R) = -\sum_{r} P_R(r)\log P_R(r)\,, \qquad (3.4)$$

$$H(F,R) = -\sum_{f,r} P_{F,R}(f,r)\log P_{F,R}(f,r)\,, \qquad (3.5)$$

where P_F(f) and P_R(r) are the marginal probability mass functions and
P_{F,R}(f, r) is the joint probability mass function of the two images F and R. The
probability mass functions can be obtained from,
$$P_{F,R}(f,r) = \frac{h(f,r)}{\sum_{f,r} h(f,r)}\,, \qquad (3.6)$$

$$P_F(f) = \sum_{r} P_{F,R}(f,r)\,, \qquad (3.7)$$

$$P_R(r) = \sum_{f} P_{F,R}(f,r)\,, \qquad (3.8)$$

where h is the joint histogram of the image pair, an M × N array

$$h = \big[h(a,b)\big]\,, \quad a \in [0, M-1]\,,\ b \in [0, N-1]\,, \qquad (3.9)$$

in which h(a, b) is the number of pixel pairs
having intensity value a in the first image (i. e. F) and intensity value b in the
second image (i. e. R). It can thus be seen from (3.2) to (3.8) that the joint
histogram estimate is sufficient to determine the MI between two images.
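As an illustration, MI can be computed from a joint histogram along the lines of (3.1)-(3.8) (a minimal NumPy sketch; the binning choice is ours):

import numpy as np

def mutual_information(f_img, r_img, bins=256):
    """MI of two equally sized images from their joint histogram."""
    h, _, _ = np.histogram2d(f_img.ravel(), r_img.ravel(), bins=bins)
    p_fr = h / h.sum()                      # joint pmf, as in (3.6)
    p_f = p_fr.sum(axis=1)                  # marginal pmfs, as in (3.7)-(3.8)
    p_r = p_fr.sum(axis=0)
    nz = p_fr > 0                           # only non-zero entries contribute
    return np.sum(p_fr[nz] * np.log(p_fr[nz] / np.outer(p_f, p_r)[nz]))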
To interpret (3.2) in the context of image registration with random variables
F and R, let us first assume that both H(F) and H(R) are constant. Under this
assumption, maximization of MI in (3.2) is equivalent to the minimization of
a b c
Fig. 3.2a-c. Joint histograms of a pair of Landsat TM band 3 and band 7 images. a Images
are completely registered. b Images are shifted by one pixel vertically. c Images are shifted
by three pixels vertically
joint entropy. Figure 3.2a-c show the joint histograms of a pair of Landsat TM
band 3 and band 7 images. Figure 3.2a is obtained when the two images are
registered. Figure 3.2b is obtained when the first image is shifted vertically
by one pixel. Figure 3.2c shows the joint histogram when the first image is
shifted vertically by three pixels. From Fig. 3.2, we can observe that when
two images are registered, the joint histogram is very sharp whereas it gets
more dispersed as the vertical displacement is increased. Therefore, it seems
reasonable to devise a similarity measure that meaningfully represents the
sharpness or degree of dispersion of the joint histogram. Joint entropy is one
such metric that measures the uncertainty of the joint histogram. Thus, under
the assumption of constant entropies H(F) and H(R), (3.2) may be used to
interpret the mutual information criterion in that it tries to minimize the
uncertainty of the joint histogram of the two images to be registered.
When the assumption of constant entropies H(F) and H(R) is not satisfied,
we may employ the notion of conditional entropy to interpret the MI criterion.
We rewrite MI in terms of conditional entropy as

$$I(F,R) = H(F) - H(F|R) = H(R) - H(R|F)\,, \qquad (3.10)$$

which shows that maximizing MI amounts to maximally reducing the uncertainty
(entropy) of one image by the knowledge of the other image. In other words,
the MI criterion implies that when two images are registered, one gains the
most knowledge about one image by observing the other one. For example,
based on (3.10), the uncertainty of F without the knowledge of R is H(F) and
its uncertainty when R is known is H(FIR). Therefore, the reduction in the un-
certainty H(F) - H(FIR) defines the MI between F and R and its maximization
implies the most reduction in the uncertainty. Since its introduction, MI has
been accepted as one of the most accurate and robust similarity measures for
image registration.
3.3
Joint Histogram Estimation Methods
From (3.2) to (3.8), it is clear that the main task involved in determining the
mutual information between two images is to estimate the joint histogram of
the two images. Existing joint histogram estimation schemes may be catego-
rized into two groups, depending on whether or not an intermediate image
resampling procedure for the reference image is required.
3.3.1
Two-Step Joint Histogram Estimation
$$h\!\left(F(x_i),\ \mathrm{round}\!\big(\tilde{R}(T_\alpha(x_i))\big)\right) \mathrel{+}= 1 \qquad (3.13)$$

for every i such that $T_\alpha(x_i) \in \tilde{Y}$. In (3.13), round(·) represents the rounding off
operation, $\tilde{R}$ is the interpolated reference image and $\tilde{Y}$ is the continuous domain of the reference image.
The general drawback of this two-step procedure is that the resulting MI
registration function is usually not very smooth due to the rounding off op-
eration. Smoothness of the MI registration function facilitates the subsequent
optimization process. This is illustrated in Fig. 3.3a-c where a and b represent
Fig. 3.3. a Radarsat SAR image, b IRS PAN image and c MI registration function using
linear interpolation. In c, the rotation angle is presented in degrees
a pair of Radarsat Synthetic Aperture Radar (SAR) and IRS PAN images and
c shows the MI registration function using linear interpolation by varying the
rotation angle from −21.56° to −19.56° after registration while keeping the
values of the other transformation parameters fixed. From Fig. 3.3c, we can
observe that the MI registration function obtained using linear interpolation
is not smooth. To overcome this problem, Maes et al. (1997) devised a one-step
joint histogram estimation method that produces a very smooth MI curve and
is described next.
3.3.2
One-Step Joint Histogram Estimation
This method directly estimates the joint histogram without resorting to the
interpolation of intensity values at the transformed grid points on the reference
image. As a result, the rounding off operation involved in the two-step procedure
is avoided and the resulting MI registration function becomes very smooth.
This method is called partial volume interpolation (PVI) as it was originally
devised for 3D brain image registration problems. Its 2D implementation is
now described.
In Fig. 3.4, n1, n2, n3, and n4 are the grid points of the reference image that
are closest to the transformed point T_α(x). These points define a cell. T_α(x)
Fig. 3.4. Graphical illustration of partial volume interpolation (PVI) in two dimensions
Fig. 3.5. MI registration function obtained using the same pair of images as shown in Fig. 3.3
but employing 2D implementation of the PVI algorithm
splits the cell (n1, n2, n3, n4) into four sub-cells having areas w1, w2, w3, and
w4 with the constraint $\sum_i w_i(T_\alpha(x)) = 1$. The joint histogram is then obtained
by updating the four entries (F(x), R(n_i)), i = 1, ..., 4, as follows:

$$h\big(F(x),\, R(n_i)\big) \mathrel{+}= w_i\big(T_\alpha(x)\big)\,, \quad i = 1, \ldots, 4\,. \qquad (3.14)$$
The method results in a very smooth MI registration function, and does not
introduce any extra intensity values. That is, the indices of all the non-zero
entries in the estimated joint histogram are completely defined by the set of
intensity values in the floating image and the set of intensity values in the
reference image (see (3.9), where each element is specified by its 2D index
in [0, M−1] × [0, N−1]). In the two-step procedure described previously,
new intensity values, which are not in the original set of intensity values in
the reference image, may be introduced because of interpolation. In addition,
in the two-step procedure, if a more sophisticated interpolation method like
cubic convolution interpolation or cubic B-spline interpolation is employed,
the interpolated values may go beyond the range of the intensity values used
(e.g. 0 to 255 in general). In this case, extra effort may be required to obtain
a meaningful estimate of the joint histogram by keeping the interpolated values
between 0 and 255. Figure 3.5 shows the registration function using the same
image data as shown in Fig. 3.3 but with the 2D implementation of PVI employed.
Clearly, in this case, the MI registration function is much smoother.
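A minimal sketch of the 2D PVI update of (3.14) follows (assuming integer-valued images and precomputed transformed coordinates; the names and data layout are ours):

import numpy as np

def pvi_joint_histogram(f_img, t_points, r_img, bins=256):
    """Spread each transformed point over its four surrounding grid points.
    t_points holds the transformed (row, col) coordinates, one per pixel of F."""
    h = np.zeros((bins, bins))
    for f_val, (yi, yj) in zip(f_img.ravel(), t_points.reshape(-1, 2)):
        i0, j0 = int(np.floor(yi)), int(np.floor(yj))
        if not (0 <= i0 < r_img.shape[0] - 1 and 0 <= j0 < r_img.shape[1] - 1):
            continue                                    # point falls outside R
        di, dj = yi - i0, yj - j0
        for ii, jj, w in ((i0, j0, (1 - di) * (1 - dj)),
                          (i0, j0 + 1, (1 - di) * dj),
                          (i0 + 1, j0, di * (1 - dj)),
                          (i0 + 1, j0 + 1, di * dj)):
            h[int(f_val), r_img[ii, jj]] += w           # update per (3.14)
    return h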
3.4
Interpolation Induced Artifacts
Very often, when a new similarity measure is adopted for image registration,
it is a common practice to use a pair of identical images to test its efficacy.
Fig. 3.7a-d. MI artifacts for a pair of Landsat TM band 1 and band 3 images along the vertical
displacement (x axis) in pixels. a No extra noise is added. b Gaussian noise with variance
equal to 1 is added. c Gaussian noise with variance equal to 3 is added. d Gaussian noise
with variance equal to 7 is added
important role in the formation of the artifact pattern. As the noise in the
images is increased, the artifact patterns become more pronounced.
To gain some additional insight into the formation of the artifacts resulting
from linear interpolation, let us consider the following. From (3.2), we know
that three components are involved in the determination of MI. These are
entropy of the floating image, entropy of the reference image, and the joint
entropy of the floating and reference images. For simplicity, we can keep the
floating image always within the scope of the reference image. In this manner,
the overlap for the floating image remains the same and the corresponding
entropy H(F) remains a constant. Thus, only the entropy of the reference
image and the joint entropy vary as the transformation parameters change.
Figures 3.8 and 3.9 show the effect of noise on these two components when
linear interpolation is employed for joint histogram estimation. From these
figures, we can see that when the vertical displacement is non-integer, both
the entropy of the reference image and the joint entropy decrease as a result
of the blurring effect of linear interpolation. Moreover, this decrease is more
pronounced for the joint entropy than for the marginal entropy. Therefore, the pattern of
MI artifacts shown in Fig. 3.7 is basically the inverse of the pattern of the joint
entropy shown in Fig. 3.9.
Fig. 3.8a-d. Entropy of the reference image. a No extra noise is added. b Gaussian noise with
variance equal to 1 is added. c Gaussian noise with variance equal to 3 is added. d Gaussian
noise with variance equal to 7 is added. In each case, y axis represents entropy and x axis
the displacement in pixels
Fig. 3.9a-d. Joint entropy of the floating and reference images. a No extra noise is added. b
Gaussian noise with variance equal to 1 is added. c Gaussian noise with variance equal to 3
is added. d Gaussian noise with variance equal to 7 is added. In each case, x axis represents
the displacement in pixels
Fig. 3.10a,b. Artifact patterns resulting from the PVI algorithm. a Artifact pattern in the MI
registration function. b Artifact pattern in the joint entropy function. In each case, x axis
represents the displacement in pixels
Next, let us turn to the artifacts resulting from the PVI algorithm. The
mechanism that causes the artifacts in the PVI algorithm is the splitting of
a joint histogram entry into a few smaller entries, or equivalently, a probability
into a few smaller probabilities (Pluim et al. 2000). The result of this probability
split is an increase in entropy. Again, from (3.2) we know that there are three
components that determine the mutual information. Generally, the change in
the joint entropy of the two images caused by the probability split behaves
in a similar but more severe manner as the entropies of the two individual
images. Therefore, the artifact pattern in the MI registration function resulting
from PVI is dominated by the artifact pattern in the joint entropy, as is the
case in linear interpolation. Figure 3.10 shows the artifact patterns in the MI
registration function and the joint entropy resulting from PVI using the same
image pair used to generate Figs. 3.7d, 3.8d and 3.9d. Notice that the artifact
pattern in the joint entropy (Fig. 3.10b) is similar to that in the MI measure
(Fig. 3.10a) but is inverted. This is because of the negative sign in (3.2). For more
information about the mechanisms resulting in artifact patterns from linear
and partial volume interpolations, interested readers are referred to Pluim et
al. (2000). A new joint histogram estimation algorithm that can reduce the
interpolation artifact problem effectively is discussed next.
3.5
Generalized Partial Volume Estimation of Joint Histograms
Fig. 3.11. Another graphical illustration of PVI in two dimensions; compare this figure with
Fig. 3.4
Higher order interpolation methods may be employed in the two-step procedure in an attempt to avoid the artifacts. However, Chen and Varshney (2002) clearly showed that
such an attempt was not successful as even after applying cubic convolution
and cubic B-spline interpolations, the artifacts were still present. Although,
there are many other image interpolation algorithms (Lehmann et al. 1999),
the chances of devising a two-step artifact-free joint histogram estimation
scheme appear remote.
Alternatively, a one-step joint histogram estimation scheme that will reduce
the problem of artifacts may be devised. As discussed previously, the occur-
rence of artifacts in the PVI algorithm is due to the introduction of additional
joint histogram dispersion. Thus, if this additional joint histogram disper-
sion can be reduced, artifacts can be reduced. In this section, we introduce
such a scheme called Generalized Partial Volume joint histogram Estimation
(GPVE) (Chen and Varshney 2003). It turns out that PVI is a special case of
the GPVE scheme.
Before introducing the GPVE algorithm, let us first rewrite the PVI algo-
rithm, proposed by Maes et al. (1997), for the 2D case. With reference to
Fig. 3.11, let F and R be the floating image and the reference image respec-
tively, and Ta be the transformation characterized by the parameter set a that
is applied to the grid points of F. Assume that T_α maps the grid point (x_i, x_j) in
image F onto the point (y_i + Δ_i, y_j + Δ_j) in the image R, where (y_i, y_j) is a grid
point in R and 0 ≤ Δ_i, Δ_j < 1. y1, y2, y3, and y4 are the grid points of the
reference image R that are closest to the transformed grid point T_α(x), which splits
the cell defined by the grid points y1, y2, y3, and y4 into four sub-cells having areas
w1, w2, w3, and w4 with the constraint $\sum_i w_i(T_\alpha(x)) = 1$. Now, let us express
the original PVI algorithm described in (3.13) in terms of a kernel function f.
Let f be a triangular function defined by
$$f(t) = \begin{cases} 1-t & \text{if } 0 \le t \le 1 \\ 1+t & \text{if } -1 \le t < 0 \\ 0 & \text{otherwise,} \end{cases} \qquad (3.15)$$
with

$$\sum_{k=-\infty}^{\infty} f(t+k) = 1\,; \qquad (3.16)$$

then for each grid point x = (x1, x2) ∈ X in the image F, the joint histogram h
is updated in the following manner:

$$h\big(F(x),\, R(y_i+k,\, y_j+l)\big) \mathrel{+}= f(\Delta_i - k)\, f(\Delta_j - l) \quad \text{for all integers } k, l\,. \qquad (3.17)$$
Fig. 3.12a-c. B-spline functions of different orders. a First order. b Second order. c Third
order
The order of the B-spline kernel function determines how many neighboring grid points will be involved in the joint histogram
updating procedure. For more details on B-spline functions, interested readers
are referred to Unser et al. (1993a,b).
In the GPVE scheme, the kernel functions can be of different types along
different directions. That is, we can rewrite (3.19) as
$$h\big(F(x),\, R(y_i+k,\, y_j+l)\big) \mathrel{+}= f_1(\Delta_i - k)\, f_2(\Delta_j - l)\,, \qquad (3.20)$$
where f1 and f2 can be different kernels. For example, if we know that the
artifacts are to appear in the y-direction only, we can choose f1 as the 1st order
B-spline function but may choose f2 as the 3rd order B-spline function.
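A GPVE update with a quadratic (2nd order) B-spline kernel might look as follows (a minimal sketch under the partition-of-unity assumption stated above; names and conventions are ours):

import numpy as np

def bspline2(t):
    """Quadratic B-spline kernel; support (-1.5, 1.5), sums to one over shifts."""
    t = abs(t)
    if t < 0.5:
        return 0.75 - t ** 2
    if t < 1.5:
        return 0.5 * (1.5 - t) ** 2
    return 0.0

def gpve_update(h, f_val, r_img, yi, yj, kernel=bspline2, reach=2):
    """Update joint histogram h for one transformed grid point, per (3.20)."""
    i0, j0 = int(np.floor(yi)), int(np.floor(yj))
    di, dj = yi - i0, yj - j0
    for k in range(-reach, reach + 1):
        for l in range(-reach, reach + 1):
            w = kernel(di - k) * kernel(dj - l)
            if w > 0 and 0 <= i0 + k < r_img.shape[0] \
                    and 0 <= j0 + l < r_img.shape[1]:
                h[f_val, r_img[i0 + k, j0 + l]] += w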
Fig. 3.13 shows the grid points in R (shown as filled dots) that are involved in updating
the joint histogram in the 2D case using the 1st, 2nd, and 3rd order B-splines
as the kernel functions. In each case, the transformed grid point of F appears
at the center of each plot. Figure 3.13a shows the case when the transformed
grid point of F is coincident with a grid point of R and Fig. 3.13b shows the
case when the transformed grid point of F does not coincide with a grid point
in R and is surrounded by four grid points in R. We observe that one to four
entries of the joint histogram are involved in updating each pixel in F if the PVI
algorithm (or the 1st order GPVE) is used. This is evident from Fig. 3.13a,b. In
Fig. 3.13a, only the central pixel is involved in updating whereas in Fig. 3.13b
all the four grid points surrounding the point marked by "*" are involved in
updating. Similarly, nine and four grid points will be involved in updating
in Fig. 3.13a and Fig. 3.13b respectively when the 2nd order GPVE is employed.
The number of grid points involved in updating is determined by the size
of the support of the kernel function, which is shown as the shaded area in
Fig. 3.13a,b. Each side of the shaded region is four times the sample spacing
for the case of 3rd order GPVE. In this case, 9 to 16 grid points are involved in
updating as seen in Fig. 3.13. The ratios of the maximum number to minimum
number of updated entries are 4, 2.25, and 1.78 when using the 1st, 2nd,
and 3rd order GPVE respectively. The reduction in the values of these ratios
gives GPVE the ability to reduce the artifacts. Intuitively, the ratio needs to be
one to remove the artifacts completely because different numbers of updated
entries introduce different amounts of dispersion in the joint histogram, and
therefore influence the mutual information measure differently. However, in
Fig. 3.13a,b. Grid points corresponding to R that are involved in updating the joint histogram
in the 2D case. a When the transformed grid point is coincident with a grid point in R. b When
the transformed grid point is surrounded by grid points in R
Fig. 3.14a,b. MI registration functions for the image pair of Fig. 3.10a, obtained using a the
2nd order and b the 3rd order GPVE algorithms. In each case, the x axis represents the
displacement in pixels
many cases, artifacts can hardly be seen when either the 2nd or the 3rd order
GPVE is used. Figure 3.14a,b shows the MI similarity measure as a function of
vertical displacement using the same image data used to produce Fig. 3.lOa.
Figure 3.14a is obtained using the 2nd order GPVE algorithm, and Fig. 3.14b
results from the 3rd order GPVE algorithm. Clearly, artifacts can hardly be
seen in either case. It is shown in Chen and Varshney (2003) that, for the medical
brain MR to CT image registration application, the use of a higher order GPVE
algorithm not only reduces artifacts visually, but also improves registration
accuracy when the latter is affected by the artifact pattern resulting from the PVI
method.
3.6
Optimization Issues in the Maximization of MI
According to the mutual information criterion for image registration, one
needs to find the pose parameter set that results in the global maximum
of the registration function as shown in Fig. 3.1 earlier. The existing local
optimization algorithms may result in just a local maximum rather than a global
maximum while the existing global optimization algorithms are very time
consuming and lack an effective termination criterion. Therefore, almost all
the intensity based registration algorithms that claim to be automatic run the
risk of producing inaccurate registration results. Thus, to develop a reliable
fully automated registration algorithm, a robust global optimizer is desirable.
In general, existing global optimization algorithms such as genetic algo-
rithms (Michalewicz 1996) and simulated annealing (Farsaii and Sablauer
1998) are considered to be robust only when the process runs long enough
such that it converges to the desired global optimum. In other words, if an
optimization algorithm terminates too early, it is very likely that the global
optimum has not been reached yet. On the other hand, if a strict termination
criterion is adopted, despite having achieved the global optimum, the search
will not end until the termination criterion is satisfied resulting in poor effi-
ciency. One way to build a robust yet efficient global optimizer is to determine
whether a local optimum of a function, once found, is actually the global opti-
mum. If there is a way to distinguish the global optimum of a function from its
local optimum, then a robust yet efficient global optimizer is achievable. Fig-
ure 3.15 shows such a global optimization scheme. It is efficient because a local
optimizer is employed to accelerate convergence, and once the global optimum
is reached, the search process terminates immediately. It is also robust because
the process will not terminate until the global optimum is reached.
Generally, it is very difficult, if not impossible, to determine whether or not
the optimum found for a function is global without evaluating the function
completely. However, for the global image registration problem (a single
transformation is sufficient for registering the entire images), there exists a heuristic test
to determine whether the optimum found is global or not (Chen 2001b).
The heuristic test algorithm may be described with the help of Fig. 3.16a that
shows a function with several local maxima. The goal is to find the position
of the global maximum. As mentioned earlier, to identify the global optimum
of a function without completely evaluating the whole function is very hard, if
not impossible. However, if a second function like the one (thick line) shown in
Fig. 3.16b is available, then it is possible to identify the global maximum of the
function represented by the thin line. If we observe Fig. 3.16b carefully, we can
see the unique relationship of the two functions: their global maxima appear
at the same location. Using this property, it is expedient to identify the global
maximum of the function represented by the thin line. Fig. 3.16c illustrates this
approach. For example, if we use a local optimizer and find Point 1 in Fig. 3.16c
as an optimum and would like to know whether it is a global maximum or just
a local maximum, we can use the position of Point 1 as the initial position and
use a local optimizer to find a local maximum of the second function shown by
the thick line. In this case, the local optimizer will result in Point 3 as the local
maximum. Now we can compare the positions of Point 1 and Point 3. Since they
are different in this case, we can conclude that Point 1 is just a local maximum
Fig. 3.16a-c. An illustration of the heuristic test to identify the global maximum. For a col-
ored version of this figure, see the end of the book
rather than the global maximum. In the same manner, we can identify Point 2 as
the global maximum of the function shown as the thin line curve in Fig. 3.16c.
From this example, we see that it is possible to identify the global optimum of
a function if multiple uncorrelated functions whose global optima occur at the
same location in the parameter space are available. The question now is: how
do we obtain the multiple uncorrelated functions? In general, it is difficult.
However, it is possible to find such functions for image registration problems.
Let us consider the example shown in Figs. 3.17 and 3.18. Figure 3.17a,b shows
a pair of Radarsat SAR and IRS PAN images to be registered as shown earlier in
Fig. 3.3a,b. Figure 3.17c shows the MI similarity measure as a function of rotation
angle after registration. If we partition the SAR image into two sub-images as
shown in Fig. 3.18a,b and use each of them as the floating image, we attain two
more MI registration functions as shown in Fig. 3.18c,d. Observing Fig. 3.17c
together with Fig. 3.18c,d, we find that their global maxima occur at the same
position (i.e. at about −22°).
The basic idea of this approach is the following: if two images are geometri-
cally aligned through a global transformation T, then any corresponding por-
tions of the two images (sub-images) are also geometrically aligned through T.
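The test itself may be sketched as follows (a minimal SciPy version; mi_sub is assumed to evaluate the MI registration function of a sub-image pair, and the optimizer and tolerance are our choices):

import numpy as np
from scipy.optimize import minimize

def is_global_optimum(theta, mi_sub, tol=1e-2):
    """Accept a local maximum theta of the full-image MI function as global
    if a local search on the sub-image MI function, started from theta,
    converges to (nearly) the same position."""
    res = minimize(lambda t: -mi_sub(t), np.asarray(theta, float),
                   method='Nelder-Mead')
    return np.linalg.norm(res.x - np.asarray(theta, float)) < tol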
Fig. 3.17a-c. A pair of a Radarsat SAR (histogram equalized for display purposes) and b IRS
PAN images; c the MI registration function along the rotation axis. The rotation angle along
the abscissa is shown in degrees
Fig. 3.18a-d. a and b Partitioned SAR images; c and d the corresponding MI registration
functions along the rotation axis. In c and d, both rotation angles are represented in degrees
3.7
Summary
Most of the existing similarity measures for intensity based image registration
problems depend on certain specific relationships between the intensities of
the images to be registered. For example, mean squared difference assumes
an identical relationship and cross correlation assumes a linear relationship.
If these assumptions are not strictly satisfied, registration accuracy may be
affected significantly. One example to illustrate this will be given in Chap. 7.
In this chapter, we have introduced the use of mutual information as a simi-
larity measure for image registration. It does not assume a specific relationship
between the intensity values of the images involved and hence is a very general
similarity measure, one that has been adopted for many different registration
applications involving different types of imaging sensors.
Tasks involved in MI based registration include joint histogram estimation
and global optimization of the MI similarity measure. These tasks have been
discussed in detail in this chapter. A phenomenon called interpolation-induced
artifacts was also discussed. A new joint histogram estimation scheme called
generalized partial volume joint histogram estimation was described. It elimi-
nates or reduces the severity of this phenomenon. To build a robust yet efficient
global optimizer, a heuristic test to distinguish the global optimum from the
local ones was presented. Image registration examples using the techniques
introduced in this chapter will be provided in Chap. 7.
References
Chen H (2002) Mutual information based image registration with applications. PhD disser-
tation, Syracuse University
Chen H, Varshney PK (2000a) A pyramid approach for multimodality image registration
based on mutual information. Proceedings of 3rd international conference on informa-
tion fusion 1, pp MoD3 9-15
Chen H, Varshney PK (2002) Registration of multimodal brain images: some experimental
results. Proceedings of SPIE Conference on Sensor Fusion: Architectures, Algorithms,
and Applications 6, Orlando, FL, 4731, pp 122-133
Chen H, Varshney PK (2001a) Automatic two stage IR and MMW image registration al-
gorithm for concealed weapon detection. IEE Proceedings - Vision, Image and Signal
Processing 148(4): 209-216
Chen H, Varshney PK (2001b) A cooperative search algorithm for mutual information based
image registration. Proceedings of SPIE Conference on Sensor Fusion: Architectures,
Algorithms, and Applications 5, Orlando, FL, 4385, pp 117-128
Chen H, Varshney PK (2003) Mutual information based CT-MR brain image registration
using generalized partial volume joint histogram estimation. IEEE Transactions on
Medical Imaging 22(9): 1111-1119
Chen H, Varshney PK, Arora MK (2003a) Mutual information based image registration for
remote sensing data. International Journal of Remote Sensing 24(18): 3701-3706
Chen H, Varshney PK, Arora MK (2003b) Automated registration of multi-temporal remote
sensing images using mutual information. IEEE Transactions on Geoscience and Remote
Sensing 41(11): 2445-2454
Collignon A, Maes F, Delaere D, Vandermeulen D, Suetens P, Marchal G (1995) Automated
multimodality medical image registration using information theory. Proceedings of 14th
International Conference on Information Processing in Medical Imaging (IPMI'95), Ile
de Berder, France, pp. 263-274.
Cover TM, Thomas JA (1991) Elements of information theory, John Wiley and Sons, New
York.
Farsaii B, Sablauer A (1998) Global cost optimization in image registration using Simulated
Annealing. Proceedings of SPIE conference on mathematical modeling and estimation
techniques in computer vision 3457, San Diego, California, pp 117-125
Holden M, Hill DLG, Denton ERE, Jarosz JM, Cox TCS, Rohlfing T, Goodey J, Hawkes
DJ (2000) Voxel similarity measures for 3-D serial MR brain image registration. IEEE
Transactions on Medical Imaging 19(2): 94-102.
Keys RG (1981) Cubic convolution interpolation for digital image processing. IEEE Trans-
actions on Acoustics, Speech and Signal Processing 29(6): 1153-1160
Lehmann TM, Gonner C, Spitzer K (1999) Survey: interpolation methods in medical image
processing. IEEE Transactions on Medical Imaging 18: 1049-1075
Maes F, Collignon A, Vandermeulen D, Marchal G, Suetens P (1997) Multimodality image
registration by maximization of mutual information. IEEE Transactions on Medical
Imaging 16(2): 187-197
Michalewicz Z (1996) Genetic algorithms + data structures = evolution programs, 3rd and
extended edition. Springer, New York.
Pluim JPW, Maintz JBA, Viergever MA (2000) Interpolation artifacts in mutual information-
based image registration. Computer vision and image understanding 77: 211-232
Studholme C, Hill DLG, Hawkes DJ (1997) Automated three-dimensional registration of
magnetic resonance and positron emission tomography brain images by multiresolution
optimization of voxel similarity measures. Medical Physics 24(1): 25-35
Unser M, Aldroubi A, Eden M (1993) B-spline signal processing: Part I-theory. IEEE Trans-
actions on Signal Processing 41(2): 821-833
Unser M, Aldroubi A, Eden M (1993) B-spline signal processing: Part II-efficient design.
IEEE Transactions on Signal Processing 41(2): 834-848
Viola P, Wells III WM (1995) Alignment by maximization of mutual information. Proceed-
ings of 5th International Conference on Computer Vision, Cambridge, MA, pp. 16-23
Wells WM, Viola P, Atsumi H, Nakajima S (1996) Multi-modal volume registration by
maximization of mutual information. Medical Image Analysis 1(1): 35-51
West J, et al (1997) Comparison and evaluation of retrospective intermodality brain image
registration techniques. Journal of Computer Assisted Tomography 21(4): 554-566
CHAPTER 4
Independent Component Analysis
Stefan A. Robila
4.1
Introduction
4.2
Concept of ICA
Consider a cocktail party in a room in which several persons are talking to each
other. When they talk at the same time, their voices cannot be distinguished
even with the help of several microphones. This is because the recorded sound
signals will be mixtures of the source sound signals (voices). The exact manner
in which the voices have been mixed is unknown and may depend on the
distance from the microphone and other factors. We are, thus, faced with the problem of recovering the unknown source signals from their observed mixtures. Collecting the n observed signals in a random vector x and the m source signals in a random vector s, and modeling the mixing by an n × m matrix A, we can write:

$x = As$ .    (4.1)
Generally, this problem is called Blind Source Separation (BSS). Each com-
ponent of s corresponds to a source (thus there are m sources). The term blind
indicates that little or no information on the mixing matrix A, or the source
signals is available (Hyvarinen et al. 2001). The number of possible solutions
for this problem can be infinite, i. e., for a given x there is an infinite number
of pairs (s,A) that satisfy (4.1).
Consider now the situation when the components of the source s are assumed
to be statistically independent meaning thereby that the probability density
function of s, p(s) can be expressed as:
$p(s) = \prod_{i=1}^{m} p(s_i)$ .    (4.2)
In this case, the problem is called the independent component problem, and
a solution to this problem is called an Independent Component Analysis (ICA)
solution (Comon 1994). In addition to the independence assumption, in or-
der to provide a unique solution, ICA has the following restrictions (Hyvarinen
et al. 2001):
1. All the components of s have non-Gaussian distribution
2. The number of sources is smaller than or equal to the number of observations (m ≤ n)
3. Only low noise is permitted
The above conditions are required to ensure the existence and the unique-
ness of the ICA solution. The second restriction states that there should be
enough observations available in order to be able to recover the sources. It
is similar to the condition in linear algebra, where the number of equations
needs to be at least equal to the number of variables. If this condition is not
satisfied, the solution obtained for ICA is not unique (Lee 1998). For simplicity,
in most applications, it is assumed that m = n and that the mixing matrix A
is invertible. In this case, once the matrix A is computed, its inverse can be
used to retrieve the independent components. Thus, in the rest of this chapter,
the matrix A will be assumed to be square. Extension of ICA to non-square
matrix cases has been proposed in the literature (Comon 1994; Girolami
2000; Chang et al. 2002).
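As a small illustration of this setup, the following Python sketch mixes two synthetic sources with a known invertible matrix A and recovers estimates of them; the FastICA implementation from scikit-learn is used here purely as a stand-in for the algorithms developed later in this chapter, and the signals and mixing matrix are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
t = np.linspace(0.0, 8.0, 2000)
s = np.c_[np.sign(np.sin(3.0 * t)),        # a square wave
          rng.laplace(size=t.size)]        # Laplacian noise (super-Gaussian)
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                 # "unknown" invertible mixing matrix
x = s @ A.T                                # observations: x = A s, per sample

u = FastICA(n_components=2, random_state=0).fit_transform(x)
# The estimates match the sources only up to order and scale:
print(np.round(np.abs(np.corrcoef(u.T, s.T)[:2, 2:]), 2))
```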
The restriction regarding non-Gaussian components also contributes to the
uniqueness of the solution. If non-Gaussian components are involved, the ICA solution is unique up to scaling and permutation of the components. A classical measure of the non-Gaussianity of a zero-mean random variable x is the kurtosis:

$K(x) = E\{x^4\} - 3 \left( E\{x^2\} \right)^2$ .    (4.3)
Fig. 4.1. Probability density functions of Gaussian, super-Gaussian and sub-Gaussian distributions
Fig. 4.2a-d. Example of kurtosis instability. a Gaussian distributed random variable (mean = 0, standard deviation = 1 and number of observations = 100,000; sample kurtosis 0.0022). b A perturbation of 500 observations is added on the right tail (sample kurtosis 2.5639). c, d Area enlarged of the two graphs (a and b respectively) showing that the perturbation is almost unnoticeable in the histogram
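A few lines of Python reproduce the instability illustrated in Fig. 4.2 (the function name kurtosis and the exact perturbation range are illustrative choices; the sample sizes follow the figure):

```python
import numpy as np

def kurtosis(v):
    # Sample version of (4.3) for centered data.
    v = v - v.mean()
    return np.mean(v ** 4) - 3.0 * np.mean(v ** 2) ** 2

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100_000)
print(round(kurtosis(x), 4))                     # ~ 0 for a Gaussian

x_perturbed = np.concatenate([x, rng.uniform(5.0, 6.0, 500)])
print(round(kurtosis(x_perturbed), 4))           # jumps far from 0
```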
4.3
ICA Algorithms
We proceed to solve the ICA problem stated in the previous section. The first
step is to decorrelate and sphere the observed data, thereby eliminating the
first and second order statistics (i. e. mean and standard deviation). The next
step is to determine the matrix A such that its inverse allows the recovery of
the independent components. A is usually obtained through gradient based
techniques that optimize a certain cost function c(·):
$\Delta A = \frac{\partial c(A, x)}{\partial A}$ .    (4.4)
The cost function is designed such that its optimization corresponds to
achieving independence. One direct approach for the solution of the ICA
problem is to consider c(·) as the mutual information of the components of
A⁻¹x. This method is described in detail in Sect. 4.3.2. A second approach is
based on the observation that projecting the data x into components, which are
decorrelated and are as non-Gaussian as possible, also leads to independence
(see Sect. 4.3.3). Both methods can be shown to be equivalent, and the choice of
one over the other, or the modification of either of them, depends on the specific
problem or the preference of the user.
4.3.1
Preprocessing using PCA
Whitening (decorrelation and sphering) of the data is usually performed before the application of ICA as a means to reduce the effect of first and second order statistics. In decorrelation, the goal is to find a linear transform W of the observed data,

$y = Wx$ ,    (4.5)

such that the covariance matrix

$\Sigma_y = E\{(y - E\{y\})(y - E\{y\})^T\}$    (4.6)

is diagonal. The vector of expected values E{y} and the covariance matrix Σ_y can be expressed in terms of the vector of expected values and the covariance matrix for x:

$E\{y\} = E\{Wx\} = W E\{x\}$ ,    (4.7)

$\Sigma_y = W \Sigma_x W^T$ .    (4.8)

In PCA, W is chosen as E^T, the transpose of the matrix of orthonormal eigenvectors of Σ_x, so that

$\Sigma_y = E^T \Sigma_x E = \Lambda$    (4.9)

is the diagonal matrix of eigenvalues of Σ_x. Both PCA and ICA
minimize the dependence between the components of the data. In the case of
PCA, dependence minimization is achieved when the covariance between the
components is zero, whereas ICA considers the components to be independent
when the probability density function of the vector can be expressed as the
product of the probability density functions of the components (see (4.2)).
When the data is Gaussian, decorrelation is equivalent to independence, and
thus PCA can be considered to be equivalent to ICA. However, when the data
is non-Gaussian, PCA does not fully achieve independence. This is because, in
this case, dependence in the data needs to be characterized through third and
fourth order statistical measures, which is not the case with PCA that depends
on second order statistics only.
Nevertheless, decorrelation prior to ICA processing allows ICA algorithms to
achieve faster convergence. In addition, the data is also sphered. This transform
is given by (Hyvarinen et al. 2001):

$y = E \Lambda^{-1/2} E^T x$ ,    (4.11)

where E and Λ are the eigenvector and eigenvalue matrices of Σ_x defined above.
Once the data is transformed, we are ready to proceed with the ICA solution.
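A minimal sketch of this whitening step, assuming the data are stored with one signal per row and centered before transformation (the covariance values below are arbitrary), follows the notation above:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.5], [1.5, 1.0]], size=5000).T

x = x - x.mean(axis=1, keepdims=True)      # center the data
lam, E = np.linalg.eigh(np.cov(x))         # Sigma_x = E Lambda E^T
W = E @ np.diag(lam ** -0.5) @ E.T         # whitening transform of (4.11)
y = W @ x

print(np.round(np.cov(y), 3))              # approximately the identity matrix
```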
4.3.2
Information Minimization Solution for ICA
The mutual information of the components u₁, …, uₘ of a random vector u is defined as:

$I(u_1, \ldots, u_m) = E\left\{ \log \frac{p(u)}{\prod_{i=1}^{m} p(u_i)} \right\}$ .    (4.12)
The mutual information is the most natural way of assessing statistical inde-
pendence of random variables (Hyvarinen et al. 2001). When the components
are independent, the ratio inside the logarithm in (4.12) reduces to one, and
thus the mutual information becomes zero. It should be pointed out that 1(·) is
lower bounded by zero. It can also be shown that when the mutual information
is zero, the components of u are independent (Lee 1998). This property of the
mutual information is used here to achieve independence of components.
The mutual information can also be defined based on the entropy. For
a random variable u, the entropy of u is defined as:

$H(u) = -E\{\log(p(u))\}$ .    (4.13)
The solution of the ICA problem can be obtained by minimizing the mutual
information and making it as close to zero as possible. Thus, we define the
minimal mutual information (MMI) problem as follows. For an n dimensional
random vector x, find a pair (u, W), where u is an m dimensional random
vector, and W is an m x n matrix, such that:
$u = Wx$    (4.16)
and
$I(u_1, \ldots, u_m) = \min \{ I(v_1, \ldots, v_m) \mid v = Vx \}$ ,    (4.17)

where V is an m × n matrix.
The MMI problem can be used to obtain the solution for the ICA problem.
From (4.1), if (s, A) is the solution to the ICA problem, under the assumption
that A is invertible, we have:

$s = A^{-1} x$ ,    (4.18)

where the components of s are independent. The MMI problem defined by (4.16)
is identical to the ICA problem if s and A⁻¹ are set equal to u and W, respec-
tively. Since the components of s are independent, the mutual information is
zero, i.e.:

$I(s_1, \ldots, s_m) = 0$ .    (4.19)

In this case, s satisfies the second condition for the MMI solution (4.17).
Hence, the ICA solution (s, A⁻¹) can be obtained by solving the MMI problem.
The equivalence between the ICA and the MMI problems allows the develop-
ment of a practical procedure to solve the ICA problem. In other words, the
ICA solution is found by minimizing the mutual information. If MI reaches
the minimum achievable value of zero, complete independence of the compo-
nents is achieved. Otherwise, the components are as independent as possible.
This will occur in situations where complete independence is not achievable
(Hyvarinen et al. 2001; Lee 1998).
Let us now develop an algorithm for the solution of the MMI problem. The
goal is to determine the gradient of the mutual information with respect to the
elements of W. Once the gradient is computed, it is used as the iterative step
for updating the elements of W in the gradient-based optimization algorithm:
$W = W + \Delta W = W - \frac{\partial I(u_1, \ldots, u_m)}{\partial W}$ .    (4.20)
In this case, the role of the cost function c(·) in (4.4) is played by the
mutual information I. In order to compute the gradient, expand the mutual
information as (Papoulis 1991):
$I(u_1, \ldots, u_n) = -H(u) + \sum_{i=1}^{n} H(u_i)$ .    (4.21)

For the linear transform u = Wx, the probability density function and the entropy of u satisfy

$p(u) = \frac{p(x)}{|\det W|}$ ,    (4.22)

$H(u) = H(x) + \log(\det W)$ .    (4.23)
Using (4.22) and (4.23), express the first term of the mutual information
given in (4.21) using x and W:
$I(u_1, \ldots, u_n) = E\{\log(p(x))\} - \log(\det W) - \sum_{i=1}^{n} E\{\log(p(u_i))\}$ .    (4.24)
Since the first term E{log(p(x))} does not involve W, we will analyze the
two remaining terms separately:

$\frac{\partial I(u_1, \ldots, u_n)}{\partial W} = -\frac{\partial \log(\det W)}{\partial W} - \sum_{i=1}^{n} \frac{\partial E\{\log(p(u_i))\}}{\partial W}$ .    (4.25)

The first term becomes (Lee 1998):

$\frac{\partial \log(\det W)}{\partial W} = (W^T)^{-1}$ .    (4.26)
The derivative of uᵢ with respect to the (j, k) element of W (i.e. w_{j,k}) can be
computed as follows:

$\frac{\partial u_i}{\partial w_{j,k}} = \frac{\partial \left( \sum_{q=1}^{n} w_{i,q} x_q \right)}{\partial w_{j,k}} = \begin{cases} x_k & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$ .    (4.28)
Now, define a family of nonlinear functions gi(·) such that they approximate
the probability density function for each Ui. Using the above result, the second
term in (4.25) becomes:
$\sum_{i=1}^{n} \frac{\partial E\{\log(p(u_i))\}}{\partial W} \approx \sum_{i=1}^{n} E\left\{ \frac{1}{g_i(u_i)} \frac{\partial g_i(u_i)}{\partial u_i} \frac{\partial u_i}{\partial W} \right\} = E\left\{ \frac{1}{g(u)} \frac{\partial g(u)}{\partial u} x^T \right\}$ ,    (4.29)
where g(u) denotes (g1 (ud, ... ,gn(u n)). Finally, compute an approximation of
the gradient of I(·) with respect to W and obtain the update step:

$\Delta W = (W^T)^{-1} + E\left\{ \frac{1}{g(u)} \frac{\partial g(u)}{\partial u} x^T \right\}$ .    (4.30)
In the form given in (4.30), a matrix inversion is required. This may result
in slow processing. To speed up, the update step is multiplied by the 'natural
gradient' W^T W (Lee 1998):

$\Delta W = \left[ (W^T)^{-1} + E\{h(u) x^T\} \right] W^T W = W + E\{h(u) u^T\} W$ ,    (4.31)
where $h(u) = \frac{1}{g(u)} \frac{\partial g(u)}{\partial u}$.
It can be proved that multiplication with the natural gradient not only
preserves the direction of the gradient but also speeds up the convergence
process (Lee 1998). The MMI algorithm for ICA will repeatedly perform an
update of the matrix W:
$W = W + \Delta W$ .    (4.32)
The MMI algorithm converges very close to an ICA solution. This is very impor-
tant since the use of a single nonlinearity implies that the estimation of the
probability density functions does not have to be very precise. The success of
using only one function can be explained by the fact that a probability density
function mismatch may still lead to a solution that is sufficient for our pur-
poses because its corresponding independent sources are scaled versions of
the initial independent sources. It is, however, important to indicate that not
all functions are equal. For example, a super-Gaussian-like function may not
model a sub-Gaussian source correctly. Therefore, remedial approaches that
involve prediction of the nature of the sources along with a switch between
sub-Gaussian and super-Gaussian functions have been proposed (Lee 1998).
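The following sketch implements a natural-gradient update loop in the spirit of (4.31)-(4.32). It assumes centered, standardized observations and super-Gaussian sources, for which h(u) = −tanh(u) is a commonly used fixed surrogate for g'(u)/g(u); the learning rate, iteration count, and test data are arbitrary illustrative choices, not values from this chapter.

```python
import numpy as np

def natural_gradient_ica(x, lr=0.05, n_iter=500, seed=0):
    # x: (n, k) array of centered, standardized observations.
    n, k = x.shape
    W = np.eye(n) + 0.1 * np.random.default_rng(seed).normal(size=(n, n))
    for _ in range(n_iter):
        u = W @ x                              # current component estimates
        h = -np.tanh(u)                        # surrogate for g'(u)/g(u)
        W = W + lr * (np.eye(n) + (h @ u.T) / k) @ W   # (4.31)-(4.32)
    return W

rng = np.random.default_rng(3)
s = rng.laplace(size=(2, 5000))                # super-Gaussian sources
A = np.array([[1.0, 0.6], [0.3, 1.0]])
x = A @ s                                      # mixed observations
x = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
W = natural_gradient_ica(x)
print(np.round(W @ A, 2))                      # ~ a scaled permutation matrix
```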
Algorithms, other than the MMI algorithm have also been proposed. One
such algorithm is the Information Maximization ICA (Infomax) algorithm
presented in Bell and Sejnowski (1995). While the update formula is identical
to the one generated through MMI, the setting for the Infomax algorithm is
slightly different. Following the transformation through W, the random vector
u is further processed through a set of nonlinear functions fᵢ(·) (see Fig. 4.3).
If these functions are invertible, the independence of the components of y is
equivalent to the independence of the components of u.
To understand how the nonlinearity can be used, it is interesting to look at
the mutual information for the components of y (Bell and Sejnowski 1995):
$I(y_1, \ldots, y_n) = -H(y) + \sum_{i=1}^{n} H(y_i) = -H(y) - \sum_{i=1}^{n} E\{\log(p(y_i))\}$ .
Fig. 4.3. Infomax algorithm for ICA solution. Following the multiplication with W, the data is further transformed through nonlinear functions fᵢ(·). Independence of the components of u is equivalent to the independence of the components of y. The Infomax algorithm proceeds to modify W such that the components of y become independent
If, for all i, the derivative of fᵢ(uᵢ) with respect to uᵢ matches the probability
density function of uᵢ, i.e.

$p(u_i) = \frac{\partial f_i(u_i)}{\partial u_i}$ ,    (4.38)

then the resulting update step for W is

$\Delta W = W + E\{h(u) u^T\} W$ ,    (4.40)
which is identical to the one in (4.31), with the observation that the gᵢ(·) are the
derivatives of the fᵢ(·).
4.3.3
ICA Solution through Non-Gaussianity Maximization
A different approach for the solution of the ICA problem is to find a linear
transform such that the resulting components are uncorrelated and as non-
Gaussian as possible (Hyvarinen et al. 2001).
According to the central limit theorem (Hyvarinen et al. 2001), the distri-
bution of any linear mixture of non-Gaussian independent components tends
to be closer to the Gaussian distribution than the distributions of the initial
components. In the effort to recover the independent components from the
observed vector x, we are trying to find the transform W such that:
$u = Wx = WAs$ ,    (4.43)

where W = (w₁, w₂, w₃, …, wₘ)^T, and the components of u are estimates of the
original independent components (s).
There are several methods for quantifying the non-Gaussianity. One of them
is based on the negentropy, i.e. the distance between the probability density
function of the component and the Gaussian probability density function
(Comon 1994). For a random vector u the negentropy is defined as:

$J(u) = H(u_G) - H(u)$ ,    (4.44)

where u_G is a Gaussian random vector with the same covariance matrix as u.
In the first case, the negentropy is approximated using a nonquadratic function G
with derivative g, which leads to an update step of the form

$\Delta w = \left[ E\{G(u)\} - E\{G(u_G)\} \right] E\{x\, g(w^T x)\}$ .    (4.46)

In the second case, the negentropy is approximated using skewness and
kurtosis:

$J(x) = \frac{1}{12} E\{x^3\}^2 + \frac{1}{48} K(x)^2$ .    (4.47)
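As an illustration of non-Gaussianity maximization, the following one-unit fixed-point sketch (in the style of the FastICA iteration of Hyvarinen et al. 2001, with G(u) = log cosh u, so g(u) = tanh u) extracts a single component from whitened data; it is a sketch under those assumptions, not the exact algorithm of this section, and the synthetic mixing below is an arbitrary rotation.

```python
import numpy as np

def one_unit_fastica(x, n_iter=200, seed=0):
    # x: (n, k) whitened data; G = log cosh, so g = tanh and g' = 1 - tanh^2.
    n, k = x.shape
    w = np.random.default_rng(seed).normal(size=n)
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        u = w @ x
        w_new = (x * np.tanh(u)).mean(axis=1) - (1.0 - np.tanh(u) ** 2).mean() * w
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1.0) < 1e-9:   # converged up to sign
            return w_new
        w = w_new
    return w

rng = np.random.default_rng(4)
s = rng.laplace(size=(2, 5000))
s /= s.std(axis=1, keepdims=True)              # unit-variance sources
th = 0.7
R = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
x = R @ s                                      # orthogonal mixing keeps x white
print(np.round(one_unit_fastica(x), 3))        # ~ a column of R, up to sign
```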
4.4
Application of ICA to Hyperspectral Imagery
4.4.1
Feature Extraction Based Model
Let us construct a vector space of size equal to the number of spectral bands.
A pixel vector in the hyperspectral image is a point in such a space, with each
coordinate given by the corresponding intensity value (Richards and Jia 1999).
The visualization of the image cube plotted in the vector space provides useful
information related to the data. In the two-dimensional case, a grouping of the
points along a line segment signifies that the bands are highly correlated. This
is usually the case with adjacent bands in a hyperspectral image cube since
the adjacency of corresponding spectral bandwidths does not yield a major
change in the reflectance of the objects (see Fig. 4.4a). On the other hand, for
spectral bands with spectral wavelength ranges that are very distant, the plot
of corresponding bands will have scattered points. This is the situation shown
in Fig. 4.4b where the two bands belong to different spectral ranges (visible
Fig. 4.4a,b. Scatter plots of pixel values for two pairs of spectral bands: a two adjacent bands (points grouped along a line segment, indicating high correlation), b two bands from distant spectral ranges (scattered points)
Fig. 4.5. Model of an n-band multispectral/hyperspectral data set as a random vector. Each
band is associated with a component of the random vector. Each realization of the random
vector corresponds to a pixel vector from the image cube
4.4.2
Linear Mixture Model Based Model
A second model can be drawn from the linear mixture model (LMM). In
LMM, each pixel vector is assumed to be a linear combination of a finite
set of endmembers. For a specific endmember, the contribution (abundance)
corresponding to each pixel vector in the hyperspectral image can be seen as
an image band itself. The relationship between LMM and ICA is achieved by
considering the columns in the ICA mixing matrix to represent endmembers
as described in the linear mixture model, and each independent component as
the abundance of an endmember (Chang et al. 2002).
A precise equivalence between LMM and ICA is not possible. In LMM, the
abundances of the endmembers are not required to be independent. However,
since in LMM we are looking for the most representative pixel vectors, it
seems natural to modify the model by assuming that the abundance of one
endmember in a specific pixel does not provide any information regarding the
abundance of other endmembers for that pixel. Another major difference is
the presence of noise in the LMM, which is not considered in ICA. However, we
may include noise in ICA as Gaussian noise, with its abundance assumed to be
independent of those of the endmembers. This satisfies one of the restrictions
of the ICA model, namely that the model is allowed to have one of the components as
Gaussian.
In the linear mixture model, for each image pixel, the corresponding end-
member abundances need to be positive and should sum to one. Neither of the
conditions may be satisfied by the leA generated components. The positivity
of the abundances can be achieved by simple transformations that preserve
independence. For example, for each independent component v, the following
transformation can be used:
$\tilde{v} = \frac{v - \min_v}{\max_v - \min_v}$ ,    (4.49)

where min_v and max_v correspond to the minimum and maximum values of
the random variable v.
The additivity constraint is more difficult to satisfy. A solution suggested in
(Healy and Kuan 2002) lets ICA recover all but one of the endmembers. The
abundance of the remaining endmember can be determined by subtracting
the other abundances from a constant. Unfortunately, in this case, the ICA
model is no longer valid since we allow the existence of a component that is
not independent from the others. Alternatively, it has been suggested that the
criterion be modified by requiring the sum of abundances to be less than one.
This is justified by the fact that ground inclination as well as change in viewing
angles lead to differences in scaling across all bands (Parra et al. 2000).
The two models presented above have close relationships if we consider
the endmember abundances from LMM to correspond to the features from
the feature based model. From this analogy, we can also associate each ICA
derived independent component with a class present in the image. We may also
note that, since the classes are not always pure (in the sense that they are formed
by more than one material, i.e. mixed), there may be more than one feature
associated with a class, one for each of the composing materials.
4.4.3
An ICA Algorithm for Hyperspectral Image Processing
ICA (compute the unmixing transform for x₁: W₂)
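A hedged sketch of such a processing chain is given below, assuming the hyperspectral cube is held as a NumPy array of shape (rows, cols, bands); scikit-learn's FastICA (which internally whitens the data) stands in for the algorithm of this section, and each resulting component is rescaled to [0, 1] following (4.49). The placeholder cube is random data for illustration only.

```python
import numpy as np
from sklearn.decomposition import FastICA

def ica_components(cube, n_components):
    rows, cols, bands = cube.shape
    x = cube.reshape(-1, bands).astype(float)  # one pixel vector per row
    ica = FastICA(n_components=n_components, random_state=0, max_iter=500)
    u = ica.fit_transform(x)                   # estimated abundances
    u = (u - u.min(axis=0)) / (u.max(axis=0) - u.min(axis=0))  # (4.49)
    return u.reshape(rows, cols, n_components) # one component image per feature

cube = np.random.default_rng(5).random((64, 64, 20))  # placeholder data cube
components = ica_components(cube, n_components=4)
print(components.shape, float(components.min()), float(components.max()))
```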
Fig. 4.7. One band of the AVIRIS image with four locations indicated for further analysis of
pixel vectors (a) road, (b) soybean, (c) grass, (d) hay
Classes: Background, Alfalfa, Corn-notill, Corn-min, Corn, Grass/Pasture, Grass/Trees, Grass/pasture-mowed, Hay-windrowed, Oats, Soybeans-notill, Soybeans-min, Soybeans-clean, Wheat, Woods, Buildings/Grass/Tree/Roads, Stone/Steel Towers
Fig. 4.8. Reference data (ground truth) corresponding to the AVIRIS image in Fig. 4.7. For a colored version of this figure, see the end of the book
Fig. 4.9a-d. Resulting ICA components. Notice that features corresponding to information related to various classes in the scene (such as a road, b soybean, c grass, d hay) are concentrated in different components
information related to the classes is concentrated in only a few of the bands.
Figure 4.9a-d displays four of the resulting ICA components.
Next, we analyzed four different pixel locations in the image (indicated by
arrows in Fig. 4.7a), corresponding to four of the sixteen classes road, soybean,
grass and hay. For each location, we have plotted the pixel vectors for the data
produced by ICA. For comparison, the pixel vectors from the PCA-produced
data have also been plotted (Fig. 4.10). There are significant differences between
the plots. In PCA, the pixel vectors seem to lack a clear organization, i.e., none
of the features seem to have information related to the type of class. In contrast,
in ICA, we notice that information belonging to one class has been concentrated
in only one or very few components.
Further visual inspection reveals that most of the pixel vectors from the
image cube display close to zero values and only one or very few peaks. This
behavior resembles that of the vectors of abundances for the pixel location.
However, since the current algorithm is designed to recover the same number
of endmembers as the number of bands, most of these endmembers will not
contribute to the image and will be associated with noise.
We note that the range of the abundances varies widely and is not restricted
to the interval [0,1], as required by the linear mixture model. Scalar multiplica-
tion, as well as shifting may solve this problem. Unfortunately, there is no clear
indication as to how the scaling may be done when dealing with components
that show up reversed. This aspect was also mentioned in general as affecting
ICA (Hyvarinen et al. 2001). The fact that the LMM conditions cannot always
be satisfied turns out to be a limitation of ICA as a linear unmixing tool.
4.5
Summary
In this chapter, we have provided the theoretical background of ICA and shown its
utility as an important data processing tool for hyperspectral imaging. From
the point of view of the linear mixture model, the ability of ICA to recover an
approximation of the endmember abundances is the key indicator that ICA is
a powerful tool for hyperspectral image processing. However, the independent
components are identifiable only up to scaling and shifting with scalars. This
indicates that, in fact, the endmembers as well as their abundances cannot be
determined exactly. Moreover, the number of endmembers to be determined
is fixed by the number of spectral bands available. Since there are usually
hundreds of spectral bands, it is improbable that there will be that many
distinct endmembers. This is also indicated by the result of the experiments
where we obtained only a few significant components, the rest being mostly
noise. Given this fact, one may try to find only the components that contain
useful information. Unfortunately, there is no clear method for selecting the
bands that contain useful information.
The feature-based model provides a better understanding of the usefulness
of ICA in the context of hyperspectral imagery. If we consider ICA as a feature
extraction method, the resulting features will be statistically independent.
Fig. 4.10a-h. Pixel vector plots (spectral profiles) for the classes road, soybean, grass and hay derived from ICA (a, c, e, and g) and PCA (b, d, f, and h)
References

CHAPTER 5
Support Vector Machines
Mahesh Pal, Pakorn Watanachaturaporn
5.1
Introduction
1996; Mather 1999), decision tree (Richards and Jia 1999; Pal 2002) and neu-
ral network (Haykin 1999; Tso and Mather 2001) algorithms. The latter two
fall in the category of machine learning algorithms. Neural network classifiers
are sometimes touted as substitutes for the conventional MLC algorithm in
the remote sensing community. The preferred neural network classifier is the
feed-forward multi-layer perceptron learnt with a back-propagation algorithm
(see Sect. 2.6.3 of Chap. 2). However, even though neural networks have been
successful in classifying complex data sets, they are slow during the training
phase. A number of studies have also reported that neural network classi-
fiers have problems in setting various parameters during training. Moreover,
these may also have limitations in classifying hyperspectral datasets since the
complexity of the network architecture increases manifold. Nearest-neighbor
algorithms are sensitive to the presence of irrelevant parameters in the dataset
such as noise in a remote sensing image. In case of decision tree classifiers, as
the dimensionality of the data increases, class structure becomes dependent
on a combination of features thereby making it difficult for the classifier to
perform well (Pal 2002).
Recently, the Support Vector Machine (SVM), another machine learning
algorithm, has been proposed that may overcome the limitations of afore-
mentioned non-parametric algorithms. SVMs, first introduced by Boser et al.
(1992) and discussed in more detail by Vapnik (1995, 1998), have their roots in
statistical learning theory (Vapnik 1999) whose goal is to create a mathemati-
cal framework for learning from input training samples with known identity
and predict the outcome of data points with unknown identity. This results in
two important theories. The first theory is called empirical risk minimization
(ERM) where the aim is to minimize the learning or training error. The second
theory is called structural risk minimization (SRM), which is aimed at mini-
mizing the upper bound on the expected error over the whole dataset. SVMs
are based on the SRM theory while neural networks are based on ERM theory.
An SVM is basically a linear learning machine based on the principle of opti-
mal separation of classes. The aim is to find a linear separating hyperplane that
separates classes of interest. The hyperplane is a plane in a multidimensional
space and is also called a decision surface or an optimal separating hyperplane
or an optimal margin hyperplane. The linear separating hyperplane is placed
between classes in such a way that it satisfies two conditions. First, all the data
vectors that belong to the same class are placed on the same side of the hyper-
plane. Second, the distance or margin between the closest data vectors in both
the classes is maximized (Vapnik and Chervonenkis 1974; Vapnik 1982). In
other words, the optimum hyperplane is the one that provides the maximum
margin between the two classes. For each class, the data vectors forming the
boundary of classes are located on supporting hyperplanes - the term used
in the theory of convex sets. Thus, these data vectors are called the Support
Vectors (Scholkopf 1997). It is noteworthy that the data vectors located along
the class boundary are the most significant ones for SVMs.
Many times, a linear separating hyperplane is not able to classify input data
without error. Under such circumstances, the data are transformed to a higher
dimensional space using a non-linear transformation that spreads the data
apart such that a linear separating hyperplane may be found. But, due to very
large dimensionality of the feature space, it is not practical to compute the inner
product of two transformed data vectors (see Sect. 5.3.3). This may, however, be
achieved by using a kernel trick instead of explicitly computing transformations
in the feature space. Kernel trick is played by substituting a kernel function in
place of the inner product of two transformed data vectors. The use of kernel
trick reduces the computational effort by a significant amount.
In this chapter, several theoretical aspects of support vector machines will be
described. In Sect. 5.2, statistical learning theory is briefly reviewed. Section 5.3
describes the construction of SVMs for the binary classification problem for
three different cases: linearly separable case, linearly non-separable case, and
non-linear case. Section 5.4 explains the extension of the binary classification
problem to the multiclass problem. A discussion on various optimization
methods used in the construction ofSVMs is presented in Sect. 5.5. A summary
of the chapter is presented in Sect. 5.6. Though, all the discussion here is
directed towards the classification problem, it is equally applicable for solving
regression problems.
5.2
Statistical Learning Theory
The function f_α is called the hypothesis. The set {f_α(x) : α ∈ Λ} is called the
hypothesis space (Osuna et al. 1997), and L(y, f_α(x)) is the loss or discrepancy
between the response y of the supervisor or teacher to a given input x and the
response f_α(x) provided by the learning machine. In other words, the expected
risk is a measure of the performance of a decision rule that assigns the class label
y to an input data vector x. However, evaluation of the expected risk is difficult,
since the cumulative distribution function P (x,y) is unknown and thus one
may not be able to evaluate the integral in (5.1). The only known information
is contained in the training samples. Therefore, a stochastic approximation of
integral in (5.1) is desired that can be computed empirically by the finite sum

$R_{emp}(\alpha) = \frac{1}{k} \sum_{i=1}^{k} L(y_i, f_\alpha(x_i))$ ,    (5.2)

which is known as the empirical risk. The value R_emp(α) is a fixed number for
a given α and a particular training data set. Next we discuss the minimization
of risk functions.
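A tiny numeric illustration of the empirical risk (5.2) under a 0/1 loss (the hypothesis f_alpha and the data below are arbitrary choices):

```python
import numpy as np

y_true = np.array([1, -1, 1, 1, -1])
x = np.array([0.8, 0.3, -1.2, 2.0, -0.5])

def f_alpha(v):
    return np.sign(v)                        # one fixed hypothesis f_alpha

loss = (f_alpha(x) != y_true).astype(float)  # 0/1 loss L(y, f_alpha(x))
r_emp = loss.mean()                          # empirical risk (5.2): here 0.4
print(r_emp)
```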
5.2.1
Empirical Risk Minimization
The empirical risk is different from the expected risk in two ways (Haykin
1999):
1. it does not depend on the unknown cumulative distribution function
2. it can be minimized with respect to the parameter α
Based on the law of large numbers (Gray and Davisson 1986), the empirical
mean of a random variable converges to its expected value if the size of the
training samples is infinitely large. This remark justifies the use of the empirical
risk R_emp(α) instead of the risk function R(α). However, convergence of the
empirical mean of the random variable to its expected value does not imply
that the value of α that minimizes the empirical risk will also minimize the
risk function R(α). If convergence of the minimum of the empirical risk to
the minimum of the expected risk does not occur, this principle of empirical
risk minimization is said to be inconsistent. In this case, even though the
empirical risk is minimized, the expected risk may be high. In other words,
a small error rate of a learning machine on the training samples does not
necessarily guarantee high generalization ability (i. e. the ability to work well
on unseen data). This situation is commonly referred to as overfitting. Vapnik
and Chervonenkis (1971, 1991) have shown that consistency occurs if and
only if convergence in probability of the empirical risk to the expected risk
is substituted by uniform convergence in probability. Note that convergence in
probability of R(α) means that for any ε > 0 and for any η > 0, there exists
a number k₀ = k₀(ε, η) such that for any k > k₀, the inequality R(α_k) − R(α₀) <
ε holds true with a probability of at least 1 − η (Vapnik 1999). ε is a small number
close to zero, and η is referred to as the level of significance - similar to the α
value in statistics. Uniform convergence in probability is defined as

$\lim_{k \to \infty} P\left\{ \sup_{\alpha \in \Lambda} \left| R(\alpha) - R_{emp}(\alpha) \right| > \varepsilon \right\} = 0$ ,    (5.3)
where 'sup A' is the supremum of a nonempty set A, defined as the
smallest scalar x such that x ≥ y for all y ∈ A. Uniform convergence is
a necessary and sufficient condition for the consistency of the principle of
empirical risk minimization.
Vapnik and Chervonenkis (1971, 1979) also showed that the necessary and
sufficient condition for consistency amounts to the finiteness of the VC-dimension
of the hypothesis space. The VC-dimension (named after its originators Vapnik
and Chervonenkis) is a measure of the capacity of a set of classification func-
tions or the complexity of the hypothesis space. The VC-dimension, generally
denoted by h, is an integer that represents the largest number of data points
that can be separated by a set of functions fa in all possible ways. For example,
for a binary classification problem, the VC-dimension is the maximum number
of points k that can be separated into two classes without error in all possible
2^k ways. The proof of consistency of the empirical risk minimization (ERM)
can be found in Vapnik (1995, 1998).
The theory of uniform convergence in probability also provides a bound on
the deviation of empirical risk from the expected risk given by
$R(\alpha) \leq R_{emp}(\alpha) + \Phi\left( \frac{h}{k}, \frac{-\log(\eta)}{k} \right)$ ,    (5.4)

where k is the number of training samples, and the confidence term Φ is defined
as

$\Phi\left( \frac{h}{k}, \frac{-\log(\eta)}{k} \right) = \sqrt{ \frac{h \left( \log \frac{2k}{h} + 1 \right) - \log \frac{\eta}{4}}{k} }$ .    (5.5)
5.2.2
Structural Risk Minimization
The principle of structural risk minimization (SRM) is based on the two terms
in (5.4): the first is the empirical risk, and the second is the confidence
term, which depends on the VC-dimension of the set of functions. SRM min-
imizes the expected risk function with respect to both the empirical risk and
the VC-dimension. To achieve this aim, a nested structure of hypothesis spaces
is introduced by dividing the entire class of functions into nested subsets

$H_1 \subset H_2 \subset \cdots \subset H_n \subset \cdots$ .    (5.6)
The symbol ⊂ indicates "is contained in." Each hypothesis space has the
property that h(n) ≤ h(n + 1), where h(n) is the VC-dimension of the set Hₙ.
This implies that the VC-dimension of each hypothesis space is finite. The
principle of SRM can be mathematically represented as
$\min_{n} \left[ R_{emp}(\alpha) + \Phi\left( \frac{h(n)}{k}, \frac{-\log(\eta)}{k} \right) \right]$ .    (5.7)
5.3
Design of Support Vector Machines
An SVM is based on SRM to achieve the goal of minimizing the bound on the
VC-dimension and the empirical risk at the same time. An SVM is constructed
by finding a linear separating hyperplane to separate classes of interest. The
linear separating hyperplane is placed between classes such that the data
belonging to the same class are placed on the same side of the hyperplane and
the distance between the closest data vectors in both the classes is maximized.
In this case, called the linearly separable case, the empirical risk is set to zero,
and the bound on the VC-dimension is minimized by maximizing the distance
between the closest data vectors of class 1 and class 2. When the classes in
the dataset are mixed (i. e. erroneous or noisy data), these cannot be separated
by a linear separating hyperplane. This case is known as the linearly non-
separable case. Bennett and Mangasarian (1992) and Cortes and Vapnik (1995)
introduced slack variables and a regularization parameter to compensate for
the noisy data. Thus, in a linearly non-separable case, the empirical risk is
controlled by the slack variables and the regularization parameter. The VC-
dimension is minimized in the same fashion as in the linearly separable case.
Often, a linear separating hyperplane is not able to classify input data (either
noiseless or noisy), but a non-linear separating hyperplane can. This has been
referred to as the nonlinear case (Boser et al. 1992). In this case, the input
data are transformed into a higher dimensional space that spreads the data
out such that a linearly separable hyperplane can be obtained. They also
suggested the use of kernel functions as transformation functions to reduce
the computational cost. The design of SVMs for the three cases is described
next.
5.3.1
Linearly Separable Case
The linearly separable case is the simplest one for which to design a support vector machine.
Consider a binary classification problem under the assumption that data can
be separated into two classes using a linear separating hyperplane (Fig. 5.1).
Consider k training samples obtained from the two classes, represented by
(x₁, y₁), …, (x_k, y_k), where xᵢ ∈ ℝᴺ is an N-dimensional observed data vector
with each sample belonging to either of the two classes labeled by y ∈ {−1, +1}.
These training samples are said to be linearly separable if there exists an N-
dimensional vector w that determines the orientation of a discriminating plane
and a scalar b that determines the offset of this plane from the origin such that

$w \cdot x_i + b \geq +1 \quad \text{for } y_i = +1$ ,    (5.8)

$w \cdot x_i + b \leq -1 \quad \text{for } y_i = -1$ .    (5.9)

Points lying on the optimal hyperplane satisfy the equation $w \cdot x + b = 0$.
Fig. 5.1. A linear separating hyperplane for the linearly separable data sets. Dashed lines pass through the support vectors
Conditions (5.8) and (5.9) can be combined as

$y_i (w \cdot x_i + b) - 1 \geq 0$ .    (5.10)
The decision rule for the linearly separable case can be defined by a set of
classifiers (or decision functions) as

$f(x) = \mathrm{sign}(w \cdot x + b)$ .    (5.11)

The distance of a data vector x from the separating hyperplane is

$D(x; w, b) = \frac{|w \cdot x + b|}{\|w\|_2}$ ,    (5.12)

where |·| is the absolute value function, and ‖·‖₂ is the 2-norm.
Let γ be the value of the margin between the two separating planes. To maximize
the margin, we express the value of γ as

$\gamma = \frac{w \cdot x + b + 1}{\|w\|_2} - \frac{w \cdot x + b - 1}{\|w\|_2} = \frac{2}{\|w\|_2}$ .    (5.13)
The maximization of (5.13) is equivalent to the minimization of ‖w‖₂/2. Thus,
the objective function Φ(w) may be written as

$\Phi(w) = \frac{1}{2} w^T w$ .    (5.14)
The constrained optimization problem is solved by introducing Lagrange
multipliers λᵢ, one per training sample, and forming the primal Lagrangian

$L(w, b, \Lambda) = \frac{1}{2} w^T w - \sum_{i=1}^{k} \lambda_i \left[ y_i (w \cdot x_i + b) - 1 \right]$ ,    (5.15)

which is minimized with respect to w and b, and maximized with respect to
the multipliers, subject to

$\lambda_i \geq 0 \quad \text{for } i = 1, \ldots, k$ .    (5.18)

At the saddle point,

$\frac{\partial L(w, b, \Lambda)}{\partial w} = 0$ ,    (5.19)

$\frac{\partial L(w, b, \Lambda)}{\partial b} = 0$ .    (5.20)
After differentiating and rearranging (5.19) and (5.20), the optimality con-
ditions become
$w = \sum_{i=1}^{k} \lambda_i y_i x_i$ ,    (5.21)

$\sum_{i=1}^{k} \lambda_i y_i = 0$ .    (5.22)
From (5.21), the weight vector w is obtained from the Lagrange multipliers
corresponding to the k training samples.
Substituting (5.21) and (5.22) into (5.15), the dual optimization problem
becomes

$\max_{\Lambda} L(\Lambda) = \sum_{i=1}^{k} \lambda_i - \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \lambda_i \lambda_j y_i y_j (x_i \cdot x_j)$ ,    (5.23)

subject to

$\sum_{i=1}^{k} \lambda_i y_i = 0$    (5.24)

and

$\lambda_i \geq 0 \quad \text{for } i = 1, \ldots, k$ .    (5.25)
The optimal weight vector and bias follow from the solution Λ⁰ of this dual problem:

$w^0 = \sum_{i=1}^{k} \lambda_i^0 y_i x_i$ ,    (5.26)

$b^0 = -\frac{1}{2} w^0 \cdot (x_{+1} + x_{-1})$ ,    (5.27)

where x₊₁ and x₋₁ are the support vectors of class labels +1 and −1 respectively.
The following decision rule is then applied to classify the data vector into
two classes + 1 and -1:
$f(x) = \mathrm{sign}\left( \sum_{\text{support vectors}} y_i \lambda_i^0 (x_i \cdot x) + b^0 \right)$ .    (5.28)
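As a hedged illustration of (5.21)-(5.28), the following sketch uses scikit-learn's SVC as the QP solver for the dual; for a linear kernel, dual_coef_ stores yᵢλᵢ for the support vectors, so multiplying it by the support vectors recovers w as in (5.21). A very large C approximates the hard-margin, linearly separable case; the data are arbitrary synthetic clusters.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
x = np.vstack([rng.normal(2.0, 0.5, size=(20, 2)),
               rng.normal(-2.0, 0.5, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(x, y)   # large C ~ hard margin
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()  # w = sum_i lambda_i y_i x_i
b = clf.intercept_[0]
print(clf.support_vectors_)                   # the support vectors
print(np.all(np.sign(x @ w + b) == y))        # decision rule (5.28) holds
```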
Theorem 5.1. Let vectors x ∈ X belong to a sphere of radius R. Then the set
of Δ-margin separating hyperplanes has the VC-dimension h bounded by the
inequality

$h \leq \min\left( \left\lceil \frac{R^2}{\Delta^2} \right\rceil, N \right) + 1$ ,

where ⌈E⌉ denotes the ceiling function that rounds the element E to the nearest
integer greater than or equal to the value E, Δ is the margin of separation, and
N is the dimensionality of the input space.
This theorem shows that the VC-dimension of the set of hyperplanes is equal
to N + 1. However, it can be less than N + 1 if the margin of separation is large.
In case of an SVM, an SRM nested structure of hypothesis spaces can be
described in terms of the separating hyperplanes as A₁ ⊂ A₂ ⊂ ⋯ ⊂ Aₙ ⊂ ⋯
with

$A_n = \left\{ f(x) = \mathrm{sign}(w \cdot x + b) : \|w\|_2 \leq a_n \right\}$ ,    (5.29)

$a_1 \leq a_2 \leq \cdots \leq a_n \leq \cdots$ ,    (5.30)

where each a_k is a constant.
After a nested structure is created, the expected risk is then minimized by
minimizing the empirical risk and the confidence term simultaneously. The
empirical risk is automatically minimized by setting it to zero from the re-
quirement that data belonging to the same class are placed on the same side
without error. Since the empirical risk of every hypothesis space is set to zero,
the expected risk depends on the confidence term alone. However, the confi-
dence term depends on the VC-dimension and the number of training samples.
Let the number of training samples be a constant value, then the confidence
term relies on the VC-dimension only. The smallest value of VC-dimension
is determined from the largest value of the margin (from Theorem 5.1). The
margin is equal to 2/llw112 (i. e. the smallest value of the vector w also produces
the smallest value of VC-dimension). Therefore, the classifier that produces
the smallest expected risk is identified as the one that minimizes the vector w.
This clearly shows that the construction of an SVM completely complies with
the principle of SRM.
5.3.2
Linearly Non-Separable Case
It is true that the linearly separable case is an ideal case to understand the
concept of support vector machines. All data are assumed to be separable
into two classes with a linear separating hyperplane. However, in practice,
this assumption is rarely met due to noise or mixture of classes during the
selection of training data. In other words, it is not possible in practice to
create a linear separating hyperplane to separate classes of interest without
any misclassification error for a given training data set (see Fig. 5.2). Thus, the
classes are not linearly separable. This problem can be tackled by using a soft
margin classifier introduced by Bennett and Mangasarian (1992) and Cortes
and Vapnik (1995). Soft margin classification relaxes the requirement that every
data point belonging to the same class must be located on the same side of
a linear separating hyperplane. It introduces slack variables ξᵢ ≥ 0, i = 1, …, k,
to take into account the noise or error in the dataset due to misclassification.
Fig. 5.2. Illustration of the linearly non-separable case
$y_i (w \cdot x_i + b) - 1 + \xi_i \geq 0$ .    (5.31)
In the objective function, a new term called the penalty value, 0 < C < ∞,
is added. The penalty value is a form of regularization parameter and defines
the trade-off between the number of noisy training samples and the classifier
complexity. It is usually selected by trial and error.
It can be shown that the optimization problem for the linearly non-separable
case becomes
(5.32)
Yi (w . Xi + b) - 1 + {ii ~ 0 (5.33)
and
The penalty value may have a significant effect on the performance of the
resulting support vector machines (see experimental results in Chap. 10).
From (5.32), it can also be seen that when C → 0, the minimization problem is
not affected by the misclassifications even though ξᵢ > 0. The linear separating
hyperplane will be located at the midpoint of the two classes with the largest
possible separation. When C > 0, the minimization problem is affected by ξᵢ.
When C → ∞, the values of ξᵢ approach zero and the minimization problem
reduces to the linearly separable case.
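A short sketch of this trade-off (with arbitrary overlapping synthetic classes) shows how the number of support vectors shrinks as C grows:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(1.0, 1.2, size=(100, 2)),
               rng.normal(-1.0, 1.2, size=(100, 2))])
y = np.array([1] * 100 + [-1] * 100)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(x, y)
    print(C, clf.n_support_.sum())   # fewer support vectors as C grows
```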
From the point of view of statistical learning theory, it can be shown that the
minimization of the VC-dimension is achieved by minimizing the first term
of (5.32) similar to the linearly separable case. Unlike the linearly separable
case, minimization of the empirical risk in this case is achieved by minimizing
the second term of (5.32). Thus, here also, the principle of structural risk
minimization is satisfied.
The constrained optimization problem given by (5.32) to (5.34) is again
solved using Lagrange multipliers. The primal Lagrangian for this case can be
written as
$L(w, b, \Lambda, M, \xi) = \frac{1}{2} w^T w + C \sum_{i=1}^{k} \xi_i - \sum_{i=1}^{k} \lambda_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{k} \mu_i \xi_i$ ,    (5.35)

where λᵢ ≥ 0 and μᵢ ≥ 0 are the Lagrange multipliers. The terms μᵢ are
introduced to enforce positivity of ξᵢ. The constraints for the optimization
of (5.35) are

$y_i (w \cdot x_i + b) - 1 + \xi_i \geq 0$ ,    (5.36)

$\xi_i \geq 0$ ,    (5.37)

$\lambda_i \geq 0$ ,    (5.38)

$\mu_i \geq 0$ ,    (5.39)

$\lambda_i \left\{ y_i (w \cdot x_i + b) - 1 + \xi_i \right\} = 0$ ,    (5.40)

$\mu_i \xi_i = 0$ ,    (5.41)
where i = 1, ... , k. Equations (5.36) and (5.37) are the constraints as given
by (5.33) and (5.34), (5.38) and (5.39) are obtained from the definition of the
Lagrangian method, and the necessary conditions for the Lagrangian method
are represented by (5.40) and (5.41).
By differentiating (5.35) with respect to w, b, and ξᵢ and setting the derivatives
to zero, we obtain

$\frac{\partial L(w, b, \Lambda, M, \xi)}{\partial w} = w - \sum_{i=1}^{k} \lambda_i y_i x_i = 0$ ,    (5.42)

$\frac{\partial L(w, b, \Lambda, M, \xi)}{\partial b} = -\sum_{i=1}^{k} \lambda_i y_i = 0$ ,    (5.43)

$\frac{\partial L(w, b, \Lambda, M, \xi)}{\partial \xi_i} = C - \lambda_i - \mu_i = 0$ .    (5.44)

Substituting these optimality conditions back into (5.35) yields the dual optimization problem

$\max_{\Lambda} L(\Lambda) = \sum_{i=1}^{k} \lambda_i - \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \lambda_i \lambda_j y_i y_j (x_i \cdot x_j)$ ,    (5.45)

subject to

$\sum_{i=1}^{k} \lambda_i y_i = 0$    (5.46)

and

$C \geq \lambda_i \geq 0 \quad \text{for } i = 1, \ldots, k$ .    (5.47)
It can thus be seen that the objective function of the dual optimization
problem for the linearly non-separable case is the same as that of the linearly
separable case except that the Lagrange multipliers are bounded by the penalty
value C.
After obtaining the solution of (5.45), wand b can be found in the same
manner as explained in (5.26) and (5.27) earlier. The decision rule is also the
same as defined in (5.28).
5.3.3
Non-Linear Support Vector Machines
SVM seeks to find a linear separating hyperplane that can separate the classes.
There are instances where a linear hyperplane cannot separate classes with-
out misclassification. However, those classes can be separated by a nonlinear
separating hyperplane. In fact, most of the real-life problems are non-linear
in nature (Minsky and Papert 1969). In this case, data are mapped to a higher
dimensional feature space with a nonlinear transformation function. In the
higher dimensional space, data are spread out, and a linear separating hy-
perplane can be constructed. For example, two classes in the input space of
Fig. 5.3 may not be separated by a linear hyperplane but a nonlinear hyper-
plane can make them separable. This concept is based on Cover's theorem on
the separability of patterns (Cover 1965).
Fig. 5.3. Non-linear case. Mapping nonlinear data to a higher dimensional feature space
where a linear separating hyperplane can be found
Let a nonlinear transformation function φ map the data into a higher di-
mensional space. In other words, φ(x) represents the data x in the higher
dimensional space. The dual optimization problem, then, for a nonlinear case
may be expressed as

$\max_{\Lambda} L(\Lambda) = \sum_{i=1}^{k} \lambda_i - \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \lambda_i \lambda_j y_i y_j \left( \varphi(x_i) \cdot \varphi(x_j) \right)$ .    (5.49)

Instead of computing the inner product of the transformed vectors explicitly,
a kernel function

$K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$    (5.50)

is substituted in its place (the kernel trick). The dual optimization problem
for a nonlinear case can then be expressed as

$\max_{\Lambda} L(\Lambda) = \sum_{i=1}^{k} \lambda_i - \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \lambda_i \lambda_j y_i y_j K(x_i, x_j)$ ,    (5.51)

subject to

$\sum_{i=1}^{k} \lambda_i y_i = 0$    (5.52)

and

$C \geq \lambda_i \geq 0 \quad \text{for } i = 1, \ldots, k$ .    (5.53)

In a manner similar to the other two cases, the dual optimization problem
can be solved by using Lagrange multipliers that maximize (5.51) under the
constraints (5.52) and (5.53). The decision function can be expressed as

$f(x) = \mathrm{sign}\left( \sum_{\text{support vectors}} y_i \lambda_i^0 K(x_i, x) + b^0 \right)$ .    (5.54)
Table 5.1. Examples of kernel functions

Kernel                       K(x, xᵢ)                    Remarks
Linear                       x · xᵢ
Polynomial with degree d     (x · xᵢ + 1)^d              d is a positive integer
Radial Basis Function        exp(−‖x − xᵢ‖²/σ²)          σ is a user-defined value
In fact, every condition of the linearly separable case can be extended to the
nonlinear case with a suitable kernel function. The kernel function for a linearly
separable case will simply be a dot product of two data vectors, K (Xi, Xj) = Xi ·Xj.
Other examples of well-known kernel functions are provided in Table 5.1.
The selection of a suitable kernel function is essential for a particular prob-
lem. For example, the performance of the simple dot product linear kernel
function may deteriorate when decision boundaries between the classes are
non-linear. The performance of sigmoid, polynomial and radial basis kernel
functions may depend on the selection of appropriate values of the user-defined
parameters, which may vary from one dataset to another. An experimental in-
vestigation on the choice of kernel functions for the classification of multi and
hyperspectral datasets has been provided in Chap. 10.
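The kernel trick itself can be checked numerically. For the degree-2 polynomial kernel of Table 5.1 and 2-dimensional input, an explicit feature map φ exists such that (x·z + 1)² = φ(x)·φ(z); the sketch below (the map phi is one standard choice, not from this chapter) verifies the identity, showing that the high-dimensional mapping never has to be formed.

```python
import numpy as np

def phi(v):
    # One explicit feature map whose inner product equals (x . z + 1)^2 in 2-D.
    x1, x2 = v
    return np.array([x1 * x1, x2 * x2,
                     np.sqrt(2.0) * x1 * x2,
                     np.sqrt(2.0) * x1,
                     np.sqrt(2.0) * x2,
                     1.0])

x = np.array([0.5, -1.2])
z = np.array([2.0, 0.3])
print((x @ z + 1.0) ** 2)        # kernel evaluated in the input space
print(phi(x) @ phi(z))           # identical value via the explicit mapping
```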
5.4
SVMs for Multiclass Classification
5.4.1
One Against the Rest Classification
This method is also called winner-take-all classification. Suppose the dataset
is to be classified into M classes. Therefore, M binary SVM classifiers may
be created where each classifier is trained to distinguish one class from the
remaining M - 1 classes. For example, class one binary classifier is designed
to discriminate between class one data vectors and the data vectors of the
remaining classes. Other SVM classifiers are constructed in the same manner.
During the testing or application phase, data vectors are classified by finding
the margin from the linear separating hyperplane (i.e. (5.28) or (5.54) without
the sign function):

$f_j(x) = \sum_{i=1}^{m} y_i \lambda_i^j K(x, x_i) + b^j \quad \text{for } j = 1, \ldots, M$ ,    (5.55)

and a data vector is assigned the label of the class whose classifier produces
the largest value of f_j(x).
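A minimal winner-take-all sketch following (5.55), with three arbitrary synthetic classes, is given below; scikit-learn's SVC provides the binary classifiers and decision_function supplies the signed margins.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
centers = np.array([[0.0, 3.0], [3.0, -2.0], [-3.0, -2.0]])
x = np.vstack([rng.normal(c, 0.8, size=(30, 2)) for c in centers])
y = np.repeat([0, 1, 2], 30)

classifiers = [SVC(kernel="linear").fit(x, np.where(y == m, 1, -1))
               for m in range(3)]                 # one binary SVM per class
margins = np.column_stack([c.decision_function(x) for c in classifiers])
y_pred = margins.argmax(axis=1)                   # winner takes all
print((y_pred == y).mean())
```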
5.4.2
Pairwise Classification
In this method, SVM classifiers for all possible pairs of classes are created
(Knerr et al. 1990; Friedman 1996; Hastie and Tibshirani 1998; Kreßel 1999).
Therefore, for M classes, there will be M(M − 1)/2 binary classifiers. The output
from each classifier in the form of a class label is obtained. The class label that
occurs the most is assigned to that point in the data vector. In case of a tie,
a tie-breaking strategy may be adopted. A common tie-breaking strategy is to
randomly select one of the class labels that are tied.
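A compact sketch of pairwise classification with voting follows (for brevity, ties here are broken by the first maximum rather than randomly; the data are arbitrary synthetic clusters):

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

rng = np.random.default_rng(3)
centers = np.array([[0.0, 3.0], [3.0, -2.0], [-3.0, -2.0]])
x = np.vstack([rng.normal(c, 0.8, size=(30, 2)) for c in centers])
y = np.repeat([0, 1, 2], 30)

pair_clfs = {(a, b): SVC(kernel="linear").fit(np.vstack([x[y == a], x[y == b]]),
                                              np.array([a] * 30 + [b] * 30))
             for a, b in combinations(range(3), 2)}   # M(M-1)/2 classifiers

votes = np.zeros((len(x), 3), dtype=int)
for clf in pair_clfs.values():
    pred = clf.predict(x)
    for m in range(3):
        votes[:, m] += (pred == m)                    # one vote per pair
y_pred = votes.argmax(axis=1)                         # first maximum breaks ties
print((y_pred == y).mean())
```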
5.4.3
Classification based on Decision Directed Acyclic Graph
and Decision Tree Structure
Not"
If M·1 M·l 3 2
tree. The question is how to determine its structure. Use of Euclidean distance
or the Mahalanobis distance as a criterion has been proposed to determine
the decision tree structure. The top node is the most important classifier. The
better the classifier at the higher node, the better will be the overall classification
accuracy. Therefore, the higher nodes are designed to classify a class or some
classes that has/have the farthest distance from the remaining classes.
The input vector x is evaluated starting from the top of the decision tree.
The sign of the value of the decision function determines the path of the input
vector. The process is repeated until a leaf node is reached. The class label
corresponding to the final leaf node is associated with the input vector. It
clearly shows that the number of binary classifiers for this method is less than
the other aforementioned methods since the input is evaluated at most M - 1
times. However, it is important to mention that the construction of the decision
tree is critical to the overall classification performance.
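The evaluation logic can be sketched as follows for M = 3: a candidate list is maintained, each pairwise classifier eliminates one class, and at most M − 1 = 2 evaluations are needed per input. The training data are arbitrary synthetic clusters, and the pairwise classifiers are built as in the previous sketch.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

rng = np.random.default_rng(4)
centers = np.array([[0.0, 3.0], [3.0, -2.0], [-3.0, -2.0]])
x = np.vstack([rng.normal(c, 0.8, size=(30, 2)) for c in centers])
y = np.repeat([0, 1, 2], 30)

pair_clfs = {(a, b): SVC(kernel="linear").fit(np.vstack([x[y == a], x[y == b]]),
                                              np.array([a] * 30 + [b] * 30))
             for a, b in combinations(range(3), 2)}

def ddag_predict(v):
    candidates = [0, 1, 2]
    while len(candidates) > 1:
        a, b = candidates[0], candidates[-1]
        winner = pair_clfs[(a, b)].predict(v.reshape(1, -1))[0]
        candidates.remove(a if winner == b else b)   # each node drops one class
    return candidates[0]

print(np.mean([ddag_predict(v) == label for v, label in zip(x, y)]))
```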
5.4.4
Multiclass Objective Function
Instead of creating many binary classifiers to determine the class labels, this
method attempts to directly solve a multiclass problem (Weston and Watkins
1998, 1999; Lee et al. 2001; Crammer and Singer 2001; Scholkopf and Smola
2002). This is achieved by modifying the binary class objective function and
adding a constraint to it for every class. The modified objective function allows
simultaneous computation of multiclass classification and is given by Weston
and Watkins (1998):
$\min_{w, b, \xi} \left[ \frac{1}{2} \sum_{m=1}^{M} w_m^T w_m + C \sum_{i=1}^{k} \sum_{r \neq y_i} \xi_i^r \right]$ ,    (5.56)

subject to

$w_{y_i} \cdot x_i + b_{y_i} \geq w_r \cdot x_i + b_r + 2 - \xi_i^r$    (5.57)

and

$\xi_i^r \geq 0$ ,    (5.58)

where yᵢ ∈ {1, …, M} are the multiclass labels of the data vectors and r ∈
{1, …, M} \ {yᵢ} are multiclass labels excluding yᵢ.
Lee et al. (2001) and Scholkopf and Smola (2002) showed that the results
from this method and the one-against-the-rest method are similar. However,
in this method, the optimization algorithm has to consider all the support
vectors at the same time. Therefore, although it may be able to handle massive
data sets, the memory requirement and, thus, the computational time may
be very high.
Thus, the choice of a multiclass method depends on the problem at hand.
A user should consider the accuracy requirements, the computational time, the
resources available and the nature of the problem. For example, the multiclass
objective function approach may not be suitable for a problem that contains
a large number of training samples and classes due to the requirement of large
memory and extremely long computational time.
5.5
Optimization Methods
One of the key processing steps in the development of SVM algorithms is to
employ an optimization method to find the support vectors. A variety of opti-
mization methods may be used. Typically, the conventional SVMs have used an
optimizer based on quadratic programming (QP) or linear programming (LP)
methods to solve the optimization problem. Most of the QP algorithms are
based on a conjugate gradient, quasi-Newton or a primal-dual interior-point
This algorithm can process large datasets without large memory requirements.
Hsu and Lin (2002) found that these algorithms work quite well and the results
are similar to any standard QP optimization method. Further, Mangasarian
and Musicant (2000a) proposed the use of mathematical programming to solve
the optimization problem. An 'active set' strategy is used to generate a fast
algorithm that consists of solving a finite number of linear equations of the
order of the dimensionality of the original input space at each step. This method
consists of maximizing the margin between two separating hyperplanes with
respect to both w and b, and using a squared 2-norm of the slack variables in place
of the 1-norm defined in (5.30). Thus, the active set optimization algorithm
requires no specialized quadratic or linear programming software, but a linear
solver to solve an N + 1 by N + 1 matrix, where N is the dimensionality of the
input data space.
Mangasarian and Musicant (2000b) also proposed the Lagrangian SVM
(LSVM) that reformulates the constrained optimization problem as an uncon-
strained optimization problem. The problem is solved through an optimizer
based on a system of linear equalities. Ferris and Munson (2000a, 2000b) pro-
posed interior point and semi-smooth support vector machines. The interior
point SVM uses proximal point modification to the underlying algorithm, the
Sherman-Morrison-Woodbury formula, and the Schur complement to solve
a linear system. In the semi-smooth support vector machine, Ferris and Mun-
son reformulated the optimality conditions as a semi-smooth system using
the Fischer-Burmeister function, applied as a damped Newton method, and
exploited the Sherman-Morrison-Woodbury formula to efficiently solve the
problem. Both methods can solve linear classification problems proficiently.
These optimization methods are representative examples of an area that is still being pursued in ongoing research. Some of the methods will also be investigated in Chap. 10.
5.6
Summary
In this chapter, we have introduced the basic concepts of support vector ma-
chines (SVMs) for classification problems. SVMs originate from the structural
risk minimization concept of statistical learning theory. The basic mathemat-
ical background of SVM was discussed by formulating a binary classification
problem. All three cases were considered: the linearly separable case, the linearly non-separable case, and the nonlinear case. In the nonlinear case, data
are required to be mapped to a higher dimensional feature space through
a kernel function suitably incorporated in the objective function. The binary
classification problem was then extended to multiclass classification and the
associated methods were briefly discussed. A discussion on various optimiza-
tion methods was also provided. This methodology will be employed for the
classification of a set of multi- and hyperspectral data in Chap. 10.
References
Bennett KP, Campbell C (2000) Support vector machines: hype or hallelujah. Special Interest
Group on Knowledge Discovery in Data Mining Explorations 2(2): 1-13
Bennett KP, Mangasarian OL (1992) Robust linear programming discrimination of two
linearly inseparable sets. Optimization Methods and Software 1: 23-34
Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers.
Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory,
Pittsburgh, PA, pp 144-152
Campbell C (2002) Kernel methods: a survey of current techniques. Neurocomputing 48:
63-84
Campbell C, Cristianini N (1998) Simple training algorithms for support vector machines.
Technical Report, Bristol University (http://lara.bris.ac.uk/cig)
Cortes C, Vapnik VN (1995) Support vector networks. Machine Learning 20: 273-297
Courant R, Hilbert D (1970) Methods of mathematical physics, Vols. I and II. Wiley Interscience,
New York
Cover TM (1965) Geometrical and statistical properties of systems of linear inequalities
with applications in pattern recognition. IEEE Transactions on Electronic Computers
EC-14: 326-334.
CPLEX Optimization Inc. (1992) CPLEX user's guide. Incline Village, NV
Crammer K, Singer Y (2001) On the algorithmic implementation of multiclass kernel-based
vector machines. Journal of Machine Learning Research 2: 265-292
Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other
kernel-based learning methods. Cambridge University Press, Cambridge, UK
Ferris MC, Munson TS (2000a) Interior point methods for massive support vector machines. Data Mining Institute Technical Report 00-05, Computer Science Department, University of Wisconsin, Madison, WI
Ferris MC, Munson TS (2000b) Semi-smooth support vector machines. Data Mining Insti-
tute Technical Report 00-09, Computer Science Department, University of Wisconsin,
Madion, WI
Friedman JH (1994) Flexible metric nearest neighbor classification. Technical Report, De-
partment of Statistics, Stanford University
Friedman JH (1996) Another approach to polychotomous classification. Technical Report,
Department of Statistics and Stanford Linear Accelerator Center, Stanford University
Gray RM, Davisson LD (1986) Random processes: a mathematical approach for engineers.
Prentice-Hall, Englewood Cliffs, NJ
Hastie TJ, Tibshirani RJ (1996) Discriminant adaptive nearest neighbor classification. IEEE
Transactions on Pattern Analysis and Machine Intelligence 18(6): 607-615
Hastie TJ, Tibshirani RJ (1998) Classification by pairwise coupling. In: Jordan MI, Kearns
MJ, Solla SA (eds) Advances in neural information processing systems 10, The MIT
Press, Cambridge, MA, pp 507-513
Haykin S (1999) Neural networks: a comprehensive foundation. Prentice Hall, Upper Saddle
River, NJ
Hsu C-W, Lin CJ (2002) A simple decomposition method for support vector machines.
Machine Learning 46: 291-314
Hughes GF (1968) On the mean accuracy of statistical pattern recognizers. IEEE Transactions
on Information Theory 14(1): 55-63
Knerr S, Personnaz L, Dreyfus G (1990) Single-layer learning revisited: A stepwise procedure
for building and training neural networks. In: Neurocomputing: algorithms, architectures and applications, NATO ASI, Springer-Verlag, Berlin
CHAPTER 6

Markov Random Field Models

Teerasit Kasetkasem
6.1
Introduction
For decades, Markov random fields (MRF) have been used by statistical physi-
cists to explain various phenomena occurring among neighboring particles
because of their ability to describe local interactions between them. In Win-
kler (1995) and Bremaud (1999), an MRF model is used to explain why neigh-
boring particles are more likely to rotate in the same direction (clockwise or
counterclockwise) or why intensity values of adjacent pixels of an image are
more likely to be the same than different values. This model is called the Ising
model. There are a large number of problems that can be modeled using the
Ising model and where an MRF model can be used. Basically, an MRF model is
a spatial-domain extension of a temporal Markov chain where an event at the
current time instant depends only on events of a few previous time instants.
In an MRF, the statistical dependence is defined over a neighborhood system, a collection of neighbors, rather than over past events as in the Markov chain model.
It is obvious that this type of spatial dependence is a common phenomenon
in various signal types including images. In general, images are smooth and,
therefore, the intensity values of neighboring pixels are highly dependent on
each other. Because of its highly theoretical and complex nature, and its intensive computational requirements, practical use of MRFs was extremely limited until recently. Owing to dramatic improvements in computer technology, MRF modeling has become feasible for numerous applications, image analysis in particular, because many image properties, such as texture, fit an MRF model well, i.e., the intensity values of neighboring pixels of an image are known to be highly correlated with each other. The Markovian nature of these textural properties has long been recognized in the image processing community, and has been used widely in a variety of applications (e.g., image compression and image noise removal). However, these applications were limited to empirical studies and were not based on a statistical model such as the MRF model.
The pioneering work by Geman and Geman (1984) introduced a statisti-
cal methodology based on an MRF model. Their work inspired a continuous
stream of researchers to employ MRF models for a variety of image analysis
tasks such as image classification and segmentation. In their paper, a noiseless image was assumed to have MRF properties. The noiseless image was disturbed by image noise, and the goal of restoration is to recover the noiseless image from the observed data.
In Sect. 6.2, we establish the equivalence property between MRFs and Gibbs fields. Next, some possible approaches to employ MRF modeling are examined in Sect. 6.3. Here, we also
construct the posterior energy function under the MAP criterion, which is
used as an objective function for optimization problems. Then, several widely
used optimization methods including simulated annealing are introduced and
discussed in Sect. 6.4. Some theoretical results are also presented in this section
to make this chapter self-contained.
6.2
MRF and Gibbs Distribution
6.2.1
Random Field and Neighborhood
6.2.1.1
Random Field
Let $\mathcal{S} = \{s_1, s_2, \ldots, s_M\}$ be a finite set. An element denoted by $s \in \mathcal{S}$ is called a site. Let $\Lambda$ be a finite set called a phase or configuration space. A random field on $\mathcal{S}$ with phase space $\Lambda$ is defined as a collection $X = \{X(s)\}_{s \in \mathcal{S}}$ of random variables $X(s)$ taking values in the phase space $\Lambda$.

A random field can be viewed as a random variable or a random vector taking values in the configuration space $\Lambda^{\mathcal{S}}$. A configuration $x \in \Lambda^{\mathcal{S}}$ is of the form $x = (x(s), s \in \mathcal{S})$, where $x(s) \in \Lambda$ for all $s \in \mathcal{S}$. We note that, in image processing applications, $\mathcal{S}$ is normally a subset of $\mathbb{Z}^2$ and $\Lambda$ represents the different intensity values of the image, where $\mathbb{Z}$ is the set of all integers. Our particular interest in this chapter is in the Markov random field, which is characterized by local interactions. As a result, we need to introduce a neighborhood system on the sites.
6.2.1.2
Neighborhood
A neighborhood system on $\mathcal{S}$ is a family $N = \{N_s\}_{s \in \mathcal{S}}$ of subsets $N_s \subset \mathcal{S}$ such that $s \notin N_s$ and $t \in N_s$ if and only if $s \in N_t$. The subset $N_s$ is called the neighborhood of the site $s$. The pair $(\mathcal{S}, N)$ is called a graph, or a topology. The boundary of $A \subset \mathcal{S}$ is the set $\partial A = \left(\bigcup_{s \in A} N_s\right) \setminus A$, where $B \setminus A$ denotes the complement of $A$ in $B$.
6.2.1.3
Markov Random Field
A random field $X$ is called a Markov random field (MRF) with respect to the neighborhood system $N$ if, for all sites $s \in \mathcal{S}$, the random variables $X(s)$ and $X(\mathcal{S}\setminus \bar{N}_s)$ are independent given $X(N_s)$, where $\bar{N}_s$ is defined as $N_s \cup \{s\}$.

The above definition can also be written in mathematical form as

$$\pi^s(x) = \Pr\left(X(s) = x(s) \mid X(\mathcal{S}\setminus s) = x(\mathcal{S}\setminus s)\right) = \Pr\left(X(s) = x(s) \mid X(N_s) = x(N_s)\right) \qquad (6.2)$$

and is called the local characteristic of the MRF at the site $s$; it maps $\Lambda$ onto the interval $[0, 1]$. The family $\{\pi^s\}_{s \in \mathcal{S}}$ is called the local specification of the MRF. Note that the property in (6.1) is clearly an extension of the general concept of Markov processes.
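To make the notion of a local characteristic concrete, the following sketch evaluates the conditional probability at one site of a binary Ising-type MRF on a 2-D lattice with a 4-neighborhood; the parameter beta and the function name are illustrative assumptions.

```python
import numpy as np

def ising_local_characteristic(x, s, beta=1.0):
    """Pr(X(s) = +1 | values of the 4-neighborhood of s) for an
    Ising-type MRF (illustrative sketch). x : 2-D array in {-1, +1}."""
    r, c = s
    nbrs = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    field = sum(x[i, j] for i, j in nbrs
                if 0 <= i < x.shape[0] and 0 <= j < x.shape[1])
    # Pr(X(s) = a | neighbors) is proportional to exp(beta * a * field)
    return np.exp(beta * field) / (np.exp(beta * field) + np.exp(-beta * field))
```

The probability depends only on the neighboring sites, exactly as required by the definition above.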
6.2.1.4
Positivity Condition
A random field $X$ with distribution $\pi$ is said to satisfy the positivity condition if

$$\pi(x_1, \ldots, x_K) > 0 \quad \text{whenever} \quad \pi_j(x_j) > 0 \text{ for all } j\,, \qquad (6.3)$$

for all $x_1, x_2, \ldots, x_K \in \Lambda$, where $\pi_j$ is the marginal distribution corresponding to the site $j$.
6.2.2
Cliques, Potential and Gibbs Distributions
The MRF model characterizes the spatial dependence among neighboring sites.
However, a direct implementation of (6.1) is not simple because the probabilities can take any values. As a result, we introduce the Gibbs distribution in this section. The notion of a Gibbs distribution comes from thermodynamics and statistical physics, where it is used to explain the phenomenon of spin directions of particles below a critical temperature. Here, we discuss this distribution
6.2.2.1
Clique
Any singleton $\{s\}$ is a clique. A subset $C \subset \mathcal{S}$ with more than one element is called a clique of the graph $(\mathcal{S}, N)$ if and only if any two distinct sites of $C$ are mutual neighbors. A clique $C$ is called maximal if, for any site $s \notin C$, $C \cup \{s\}$ is not a clique.
For the sake of clarification, we provide examples of cliques for two neigh-
borhood systems: 4-neighborhood (Fig. 6.1a) and 8-neighborhood (Fig. 6.1b).
The three cliques for the 4-neighborhood system as shown in Fig. 6.1a are:
the singleton, the horizontal pair and the vertical pair. Similarly, there are ten
cliques for the 8-neighborhood system: one singleton, pixel pairs with different
orientations, various combinations of three pixels and the group of four pixels
(see Fig. 6.1b). In this example, cliques C2 and C3 are the maximal cliques for the 4-neighborhood system, and the clique C10 is the maximal clique for the 8-neighborhood system.
6.2.2.2
Gibbs Potential and Gibbs Distribution

A Gibbs potential on $\Lambda^{\mathcal{S}}$, relative to the neighborhood system $N$, is a collection $\{V_C\}_{C \subset \mathcal{S}}$ of functions $V_C : \Lambda^{\mathcal{S}} \to \mathbb{R}$ such that

i) $V_C \equiv 0$ if $C$ is not a clique,
ii) for all $x, x' \in \Lambda^{\mathcal{S}}$ and all $C \subset \mathcal{S}$, $V_C(x) = V_C(x')$ whenever $x(C) = x'(C)$.

The Gibbs distribution associated with the potential $\{V_C\}_{C \subset \mathcal{S}}$ and temperature $T$ is then defined as

$$\pi_T(x) = \frac{1}{Z_T}\exp\left(-\frac{E(x)}{T}\right)\,, \qquad E(x) = \sum_{C \subset \mathcal{S}} V_C(x)\,,$$

where $E(x)$ is called the energy function and $Z_T$ is the normalizing constant (partition function).
In this section, we have introduced and defined the Gibbs distribution whose
probability density function depends on configurations of neighboring sites.
It is clear that both MRF and Gibbs distribution are related.
There are many similarities between a Gibbs field and a Markov random
field. A Gibbs field is defined over the neighborhood system N through the
potential function and cliques. Likewise, an MRF, by definition, is defined for
the same neighborhood system by means of the local characteristics. Hence, in
the next theorem, we state that the potential function that leads to a Gibbs field
gives rise to a local characteristic. The mathematical proof of the following
theorem is beyond the scope of this book and can be found in Winkler (1995).
Without loss of generality, we shall assume T = 1 throughout this section for
notational convenience.
Theorem 6.1 (Gibbs fields are MRFs): Suppose $X$ is a random field with distribution $\pi$ whose energy $E(x)$ is derived from a Gibbs potential $\{V_C\}_{C \subset \mathcal{S}}$ relative to the neighborhood system $N$. Then $X$ is Markovian relative to the same neighborhood system $N$. Moreover, its local specification is given by the formula

$$\pi^s(x) = \frac{\exp\left\{-\sum_{C \ni s} V_C(x)\right\}}{\sum_{\lambda \in \Lambda} \exp\left\{-\sum_{C \ni s} V_C\left(\lambda, x(\mathcal{S}\setminus s)\right)\right\}}\,, \qquad (6.7)$$
Fig. 6.1a,b. Cliques for a a 4-neighborhood system, b an 8-neighborhood system
where $\sum_{C \ni s}$ denotes the sum over those sets $C$ that contain the site $s$. We write $V_C\left(\lambda, x(\mathcal{S}\setminus s)\right)$ to represent $V_C(x')$, where $x'(s) = \lambda$ and $x'(t) = x(t)$ for all $t \in \mathcal{S}\setminus s$.
Theorem 6.1 has stated the first part of the relationship between MRFs and
Gibbs fields. We shall continue with the converse part, which will be formally
introduced in Theorem 6.2. Again, the proof of this theorem can be found in
Winkler (1995).
6.3
MRF Modeling in Remote Sensing Applications
So far, we have established the fundamental definitions and relationship of
MRFs and Gibbs fields without linking these models to any remote sensing
application, which is the main emphasis of this book. Hence, this section
examines several approaches for the implementation of MRF modeling in
remote sensing applications. There are a number of problems where the MRF
models are applicable. However, we limit the scope of this section to a discussion
on the application of MRF models for image classification of remotely sensed
imagery. In the later chapters, some specific remote sensing problems, namely
sub-pixel mapping for hyperspectral imagery (see Chap. 11), and image change
detection and fusion (see Chap. 12), where MRF models can be applied, are
dealt in great detail. Here, we first refer to Geman and Geman's paper, where
MRF models have been applied for image restoration (Geman and Geman
1984). Image classification problem may be formulated as an image restoration
problem. In the image restoration problem described in Geman and Geman
(1984), the intensity value of an image pixel is disturbed by image noise that
results in the mapping of the intensity value to a random variable. Likewise,
in an image classification problem such as land cover classification, one land cover class attribute corresponds to a group of independent random vectors that follow an identical probability density function. In this model, we assume that, for a given land cover class, the intensity values of two different pixels (sites) are statistically independent, i.e.,

$$\Pr\left(Y(\mathcal{S}) = y(\mathcal{S}) \mid X(\mathcal{S}) = x(\mathcal{S})\right) = \prod_{s \in \mathcal{S}} p\left(y(s) \mid x(s)\right)\,,$$

where $Y(\mathcal{S})$ and $X(\mathcal{S})$ are the observed remote sensing image and the corresponding land cover map, respectively, and $y(s)$ and $x(s)$ are the intensity values (vector) of a site $s$ in the observed remote sensing image and the attribute of the land cover class at the corresponding site, respectively. Often, the conditional probability density function (PDF) of the intensity value given a land cover class is assumed to follow a Gaussian distribution. Hence, we have

$$p\left(y(s) \mid x(s)\right) = \frac{1}{(2\pi)^{K/2}\left|\Sigma_{x(s)}\right|^{1/2}}\exp\left(-\frac{1}{2}\left(y(s)-\mu_{x(s)}\right)^{T}\Sigma_{x(s)}^{-1}\left(y(s)-\mu_{x(s)}\right)\right)\,,$$

where $\mu_{x(s)}$ and $\Sigma_{x(s)}$ denote the mean vector and the covariance matrix of the class $x(s)$, and $K$ is the number of spectral bands. Combining the Gibbs prior energy with this data term, the posterior energy becomes

$$E_{\text{post}} = \sum_{C \subset \mathcal{S}} V_C(X) + \sum_{s \in \mathcal{S}} E_{\text{MLE}}\left(y(s) \mid x(s)\right)\,, \qquad (6.14)$$

where $E_{\text{MLE}}\left(y(s) \mid x(s)\right) = -\ln p\left(y(s) \mid x(s)\right)$.
The term in (6.14) is called the posterior energy function. As in the case of $E_{\text{MLE}}$, higher and lower values of $E_{\text{post}}$ correspond to lower and higher values of the posterior probability, respectively. By choosing the configuration $X(\mathcal{S})$ that minimizes $E_{\text{post}}$, we actually maximize the posterior probability. In other words, the maximum a posteriori (MAP) criterion is used. Under this criterion, the probability of error, i.e., that of choosing an incorrectly classified map, is minimized. We note that there are other optimization criteria, such as the MPM (maximization of the posterior marginals) criterion developed in Marroquin et al. (1987), that can be employed. However, the MAP criterion is designed to optimize a statistical quantity, namely the probability of error. Hence, the MAP criterion is chosen as the optimization criterion for all the MRF-based image analysis problems in this book.
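To make (6.14) concrete, the sketch below evaluates the posterior energy of a candidate label image under an Ising-style pair-potential prior and a single-band Gaussian data term; all function and parameter names are illustrative assumptions rather than the book's implementation.

```python
import numpy as np

def posterior_energy(x, y, means, variances, beta=1.0):
    """Posterior energy in the spirit of (6.14): Gibbs prior term plus
    negative Gaussian log-likelihood data term (illustrative sketch).

    x : 2-D int array of class labels; y : 2-D array of intensities;
    means, variances : 1-D arrays indexed by class label.
    """
    mu, var = means[x], variances[x]
    # Data term: sum of E_MLE(y(s) | x(s)) = -ln p(y(s) | x(s)) over sites
    e_mle = 0.5 * np.sum((y - mu) ** 2 / var + np.log(2 * np.pi * var))
    # Prior term: pair potentials over horizontal and vertical cliques,
    # penalizing label disagreement between neighboring sites
    e_prior = beta * (np.sum(x[:, 1:] != x[:, :-1])
                      + np.sum(x[1:, :] != x[:-1, :]))
    return e_prior + e_mle
```

Minimizing this quantity over all label images x is exactly the MAP search problem addressed by the optimization algorithms of the next section.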
Although the MAP criterion is meaningful and powerful in solving various
image analysis problems, it does not provide any systematic method to find
its solutions. If Epost is differentiable and convex (i. e. it has only one saddle
point), any gradient-based optimization algorithm can be used to search for
the MAP solutions. Unfortunately, this is not the case in general. Hence, other
optimization techniques that are capable of handling non-convex functions
must be employed. In the next section, we introduce several optimization
algorithms for finding the MAP solutions.
6.4
Optimization Algorithms
In the previous two sections, we have established the statistical model of the MRF and its equivalent form, the Gibbs field. This model is used to describe the spatial properties of an image. Depending upon the problem at hand, these images can be classified images (Solberg et al. 1996), noiseless images (Geman and Geman 1984), or change images (Bruzzone and Prieto 2000; Kasetkasem and Varshney 2002). Then, based on the MAP criterion, the best (optimum) image is chosen. If the image model is simple, the optimum solution can be determined by using simple exhaustive search or gradient search approaches. However, since the MRF model is complex (i.e., its marginal distribution is non-concave), the optimum solution under the MAP criterion can no longer be obtained by just a gradient search approach. Furthermore, for most cases, the image space is too large for exhaustive search methods to handle in an efficient manner. For example, there are more than $2^{4000}$ possible binary images (i.e., the intensity value of a pixel can either be 0 or 1) of size 64 × 64. Hence, the need for more efficient optimization algorithms is obvious. In this section, we discuss several optimization algorithms that are widely used for solving MRF model based image analysis optimization problems. These include the simulated annealing (SA), Metropolis, and iterated conditional modes (ICM) algorithms.
Fig. 6.2. Flowchart of the simulated annealing algorithm
6.4.1
Simulated Annealing

Proposition: i) The Gibbs distribution $\pi_T$ converges, as $T \to 0$, to the uniform distribution over the set of global minima of $E$:

$$\lim_{T \to 0} \pi_T(x) = \begin{cases} \dfrac{1}{\|X_m\|}\,, & x \in X_m\\[6pt] 0\,, & x \notin X_m \end{cases} \qquad (6.15)$$

where $X_m$ denotes the set of all minima of $E$, and $\|a\|$ denotes the cardinality of a set $a$.

ii) For all $T' < T'' < \varepsilon$ and some $\varepsilon > 0$, if $x \in X_m$, then $\pi_{T'} > \pi_{T''}$, and if $x \notin X_m$, then $\pi_{T'} < \pi_{T''}$.
Proof: Let $E_m$ denote the minimum value of $E$. Then

$$\pi_T(x) = \frac{\exp\left(-\frac{1}{T}E(x)\right)}{\sum_{\lambda \in \Lambda^{\mathcal{S}}} \exp\left(-\frac{1}{T}E(\lambda)\right)} \qquad (6.16)$$

$$= \frac{\exp\left(-\frac{1}{T}\left(E(x) - E_m\right)\right)}{\sum_{\lambda \in X_m} \exp\left(-\frac{1}{T}\left(E(\lambda) - E_m\right)\right) + \sum_{\lambda \notin X_m} \exp\left(-\frac{1}{T}\left(E(\lambda) - E_m\right)\right)}\,.$$

Dividing the numerator and the denominator by $\exp\left(-\frac{1}{T}\left(E(x) - E_m\right)\right)$ and writing $a(y) = E(y) - E(x)$, we obtain

$$\pi_T(x) = \frac{1}{\left\|\{y : E(y) = E(x)\}\right\| + \sum\limits_{y:\,a(y) < 0} \exp\left(-\frac{1}{T}a(y)\right) + \sum\limits_{y:\,a(y) > 0} \exp\left(-\frac{1}{T}a(y)\right)}\,.$$

To establish Part ii) for $x \notin X_m$, note that $\pi_{T'} < \pi_{T''}$ for $T' < T'' < \varepsilon$ is equivalent to $\frac{\mathrm{d}\pi_T}{\mathrm{d}T} > 0$ for all $T \leq \varepsilon$. Differentiating,

$$\frac{\mathrm{d}\pi_T}{\mathrm{d}T} = -\frac{\sum\limits_{y:\,a(y) < 0} \frac{a(y)}{T^2}\exp\left(-\frac{1}{T}a(y)\right) + \sum\limits_{y:\,a(y) > 0} \frac{a(y)}{T^2}\exp\left(-\frac{1}{T}a(y)\right)}{\left[\left\|\{y : E(y) = E(x)\}\right\| + \sum\limits_{y:\,a(y) < 0} \exp\left(-\frac{1}{T}a(y)\right) + \sum\limits_{y:\,a(y) > 0} \exp\left(-\frac{1}{T}a(y)\right)\right]^2}\,.$$

Since the denominator of the above equation is always positive, we only have to consider the numerator. The second term tends to zero, and the magnitude of the first term tends to infinity as $T \to 0$. Therefore, there must exist some $\varepsilon$ such that $\frac{\mathrm{d}\pi_T}{\mathrm{d}T} > 0$ for all $T \leq \varepsilon$.

For $x \in X_m$, the distribution is

$$\pi_T(x) = \frac{\exp\left(-\frac{1}{T}\left(E(x) - E_m\right)\right)}{\|X_m\| + \sum_{\lambda \notin X_m} \exp\left(-\frac{1}{T}\left(E(\lambda) - E_m\right)\right)} = \frac{1}{\|X_m\| + \sum_{\lambda \notin X_m} \exp\left(-\frac{1}{T}\left(E(\lambda) - E_m\right)\right)}\,.$$

Since $E(\lambda) - E_m > 0$ for $\lambda \notin X_m$, each term of the sum decreases as $T$ decreases, so that

$$\pi_{T'}(x) > \pi_{T''}(x)$$

for $0 < T' < T'' < \varepsilon$ and all $x \in X_m$. Q.E.D.
From the above proposition, we observe that the Gibbs distribution at zero temperature is uniformly distributed among the global optima. If the zero temperature stage can be reached from arbitrary starting points (configurations), the optimum solution under the MAP criterion can also be obtained regardless of the initial configurations. The procedure described in Fig. 6.2 attempts to produce an inhomogeneous Markov chain of images (X) that eventually converges to the limiting distribution in (6.15).

To successfully approach the result given by (6.15), we first need to determine the visiting scheme defined on $\mathcal{S} = \{1, 2, \ldots, M\}$. Different visiting schemes may result in different rates of convergence of the SA algorithm. For practical reasons, the row-wise visiting scheme may be employed due to its simplicity. However, in the literature (e.g., Bremaud 1999), more random schemes, which may result in faster convergence rates than the row-wise visiting scheme, have also been used. In addition, the "cooling schedule", which is a decreasing sequence of positive values T(n) that eventually becomes zero, needs to be determined. To guarantee convergence, the cooling sequence must satisfy

$$T(n) \geq \frac{M\Delta}{\ln(n)}\,, \qquad (6.19)$$

i.e., it must decrease to zero no faster than logarithmically; here $M$ is the number of sites and $\Delta$ denotes the maximum local oscillation of the energy function.
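A minimal sketch of the SA procedure of Fig. 6.2 is given below, assuming a helper E_local(x, s, v) that returns the change in posterior energy when site s is set to value v; the logarithmic schedule follows the form of (6.19), with the constant c left as a user choice. Labels are binary here for simplicity.

```python
import numpy as np

def simulated_annealing(E_local, x0, n_sweeps=100, c=3.0, rng=None):
    """Simulated annealing with a row-wise visiting scheme and a
    logarithmic cooling schedule (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    for n in range(n_sweeps):
        T = c / np.log(n + 2)                  # cooling schedule
        for s in np.ndindex(x.shape):          # row-wise visiting scheme
            v = 1 - x[s]                       # candidate flip (binary labels)
            dE = E_local(x, s, v)
            # sample from the local Gibbs distribution at temperature T
            if rng.random() < 1.0 / (1.0 + np.exp(dE / T)):
                x[s] = v
    return x
```

As T decreases, flips that increase the energy become increasingly unlikely, so the chain gradually freezes into a low-energy configuration.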
Example
We present a simple example to illustrate the implementation of the SA algo-
rithm. In this example, we consider a Gibbs field that consists of two sites $\{1, 2\}$ whose configurations can be either $-1$ or $1$. The Gibbs distribution associated with this system is given by

$$\pi(x_1, x_2) = \frac{1}{Z}\exp\left(-I(x_1, x_2)\right)\,, \qquad (6.20)$$

where

$$I(a, b) = \begin{cases} -1\,, & \text{if } a = b\\ \phantom{-}1\,, & \text{if } a \neq b \end{cases} \qquad (6.21)$$

and $Z = \sum_{x_2 \in \{-1,1\}}\sum_{x_1 \in \{-1,1\}} \exp\left(-I(x_1, x_2)\right) = 6.17$. The prior probabilities associated with each configuration can be computed from (6.20) and are given in Table 6.1.

Furthermore, let us assume that the realization of the configuration is $\{1, 1\}$, but, because of independent, identically distributed Gaussian noise with zero mean and unit variance, we actually obtain the real numbers $\{y_1, y_2\} = \{0.5, -0.15\}$. The
MAP detector is chosen here to estimate the configuration from the observed data $\{y_1, y_2\}$. The goal of the MAP detector is to select $\{x_1, x_2\}$ such that the a posteriori probability is maximum, i.e.,

$$\{\hat{x}_1, \hat{x}_2\} = \arg\max_{x_1, x_2} \Pr\left(x_1, x_2 \mid y_1, y_2\right)\,. \qquad (6.22)$$
The MAP detector is optimum under the minimum probability of error criterion (Van Trees 1968; Varshney 1997). The posterior probability in this example is given by

$$\Pr\left(x_1, x_2 \mid y_1, y_2\right) \propto \exp\left(-I(x_1, x_2)\right)\exp\left(-\frac{(y_1 - x_1)^2}{2}\right)\exp\left(-\frac{(y_2 - x_2)^2}{2}\right)\,.$$

Under the MAP criterion, it is equivalent to choose $\{x_1, x_2\}$ that minimizes the corresponding Gibbs (posterior) energy. The energies associated with all possible configurations are listed in Table 6.2.

Table 6.2. The Gibbs energy associated with all possible configurations

          x2 = -1    x2 = 1
x1 = -1   0.4864     1.4863
x1 = 1    2.7862    -0.2138
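The two-site example can be reproduced with a few lines of code. The sketch below enumerates the four configurations, computes Z and the prior probabilities of (6.20), and evaluates the posterior energies under the unit-variance Gaussian data terms described in the text; variable names are illustrative.

```python
import numpy as np

I = lambda a, b: -1.0 if a == b else 1.0            # pair potential (6.21)
configs = [(x1, x2) for x1 in (-1, 1) for x2 in (-1, 1)]

Z = sum(np.exp(-I(x1, x2)) for x1, x2 in configs)   # partition constant
print(f"Z = {Z:.2f}")                                # approximately 6.17

y1, y2 = 0.5, -0.15                                  # noisy observations
for x1, x2 in configs:
    prior = np.exp(-I(x1, x2)) / Z
    e_post = I(x1, x2) + 0.5 * ((y1 - x1) ** 2 + (y2 - x2) ** 2)
    print(f"x = ({x1:+d}, {x2:+d}): prior = {prior:.3f}, E_post = {e_post:.4f}")
```

The minimum posterior energy is attained at {1, 1}, so the MAP detector recovers the true configuration despite the noise.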
Fig. 6.3a,b. Sample runs of the SA algorithm for the two-site example, plotted against the iteration number: a site 1, b site 2
6.4.2
Metropolis Algorithm
Fig. 6.4. Flowchart of the Metropolis algorithm: starting from an initial image, a new configuration $x_{\text{new}}$ is randomly proposed at the current site, $T$ is reduced using a predetermined schedule, and the algorithm moves to a new site ($h = h + 1$)
can take many different values. However, since we use the proposal matrix G to propose a new configuration at a given site, there is a possibility that some configurations may not be reachable after completion of a sweep (a complete update of all pixels). This causes the transition matrix to have zero
elements which may jeopardize the positivity conditions required for conver-
gence of the induced Markov chain. As a result, the sufficient conditions for
the Metropolis algorithm must be modified to make the positivity condition
valid. The following definition provides this sufficient condition for a pro-
posal matrix that will guarantee the positivity conditions for the transition
matrix.
Definition: A Markov kernel $G$ on $\Lambda^{\mathcal{S}}$ is called irreducible if, for all $x, y \in \Lambda^{\mathcal{S}}$, there exists a chain $x = u_0, u_1, \ldots, u_{a(x,y)} = y$ in $\Lambda^{\mathcal{S}}$ such that $G(u_{j-1}, u_j) > 0$ for $1 \leq j \leq a(x,y) < \infty$. The corresponding homogeneous Markov chain is also called irreducible. Next, we shall call $y \in \Lambda^{\mathcal{S}}$ a neighbor of $x \in \Lambda^{\mathcal{S}}$ if $G(x, y) > 0$.
The cooling schedule for the Metropolis algorithm is different from that of the SA algorithm because, in the SA algorithm, any possible configuration can be reached after one sweep of the entire image, while several sweeps may be required in the Metropolis algorithm. The optimum cooling schedule is of the form

$$T(n) \geq \frac{c}{\ln(n)}\,,$$

for a sufficiently large constant $c$.
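One sweep of the Metropolis algorithm can be sketched as follows, assuming a uniform proposal kernel G over the remaining configurations at each site and a helper E_delta(x, s, v) returning the energy change; the names are illustrative assumptions.

```python
import numpy as np

def metropolis_sweep(x, E_delta, T, values=(0, 1), rng=None):
    """One Metropolis sweep: propose a new value at every site and
    accept it with probability min(1, exp(-dE / T)) (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    for s in np.ndindex(x.shape):
        v = rng.choice([u for u in values if u != x[s]])   # uniform proposal
        dE = E_delta(x, s, v)
        if dE <= 0 or rng.random() < np.exp(-dE / T):      # acceptance rule
            x[s] = v
    return x
```

Because a proposed value may be rejected, several sweeps can be needed before every configuration becomes reachable, which is the reason for the slower cooling schedule noted above.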
6.4.3
Iterated Conditional Modes Algorithm
For many remote sensing applications, we often deal with images of very large
size (e.g. data from hyperspectral sensors or high resolution images). These
data are too large for a global optimization algorithm to handle in an efficient
manner, even with highly efficient algorithms such as the SA and Metropo-
lis algorithms. In such cases, it is preferable to deal with a simpler and less
computationally intensive optimization algorithm that may only guarantee
local optima rather than a more accurate and complex global optimization
algorithm (such as the SA and Metropolis algorithms). Among suboptimum
algorithms, the iterated conditional modes (ICM) algorithm, proposed by Besag (1986), has received a great deal of attention because of its simplicity and fast convergence rate. In this algorithm, the MAP equation is still the objective function to be optimized. However, unlike the previous two algorithms, where an image model can take any valid form (having a finite PDF), the ICM algorithm only permits the prior probability of an image to be the product of the local characteristics of all sites, i.e.,

$$\Pr(X) = \prod_{s \in \mathcal{S}} \Pr\left(X(s) \mid X(\mathcal{S}\setminus s)\right)\,. \qquad (6.25)$$
The above assumption does not fit the conventional MRF models described in Sects. 6.2 and 6.3. Nevertheless, it still considers the local interactions among adjacent pixels through a statistical model while allowing the optimization process to be completed at a faster rate, because it deals with individual local characteristics rather than the Gibbs distribution of an entire image. In addition, Besag pointed out that an image analysis algorithm based on the conventional MRF model may not always produce an accurate result, because the conventional MRF model tends to favor single color cases (e.g., an entire image composed of only one intensity value) of X over the more realistic multiple color cases (multiple land cover class attributes) of X. Single or multiple color cases of X do not have a significant influence on the prior probability under Besag's model as long as the spatial dependence is characterized correctly. As a result, the model in (6.25) may be more suitable for the multiple color cases.
There are several approaches to find the optimum solutions of the MAP
problem under the assumption given in (6.25). The simplest one is to visit
one pixel (e.g. s) at a time, and try to replace the configuration (e.g. x(s)) of
this pixel with the configuration that maximizes the posterior probability. As
mentioned above, computation of the posterior probability can be obtained
easily under (6.25) than the conventional MRF model because, for one update,
we need to determine only the summation over all the possible configurations
of a site rather than the summation of all the possible configurations of the
entire image. Furthermore, the convergence of this method is guaranteed as
long as the posterior probability is well defined, since the posterior probability
always increases as the number of iterations increases, and is bounded by one.
The above procedure is summarized in Fig. 6.5. Another approach is to update
the configuration of a site by only considering its local characteristic, i. e.,
where the subscripts new and old indicate configurations of the current and
previous iterations, respectively. For this second approach, the convergence
may not always be achieved since a higher posterior probability may not be
attained after one complete update. Moreover, based on the same reason,
the configurations may oscillate between two or more values. However, this
approach allows the entire image to be updated at the same time because the
update at a site has no effect on other sites.
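The first of the two approaches above can be sketched as follows, assuming a helper posterior_local(x, s, v) that returns a value proportional to the posterior probability of label v at site s given the data and the current labels of the neighbors; the names are illustrative.

```python
import numpy as np

def icm(x0, posterior_local, labels, max_iter=10):
    """Iterated conditional modes: visit one pixel at a time and assign
    the label maximizing the local posterior (illustrative sketch)."""
    x = x0.copy()
    for _ in range(max_iter):
        changed = False
        for s in np.ndindex(x.shape):
            best = max(labels, key=lambda v: posterior_local(x, s, v))
            if best != x[s]:
                x[s] = best
                changed = True
        if not changed:       # posterior can no longer increase
            break
    return x
```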
Fig. 6.5. Flowchart of the ICM algorithm: starting from an initial image, find a visiting scheme $\{s_1, s_2, \ldots\}$, set $h = 1$, and update sites until no further change occurs
6.5
Summary
In this chapter, we have introduced the concept of Markov random field models. Here, statistical correlations among neighboring pixels or sites are quantified through Gibbs energy functions (e.g., the Ising model). Low and high energy values are associated with similarity and dissimilarity, respectively, between the configurations (intensity values) of neighboring sites. The exponential of the negative energy function defines the Gibbs distribution, which is equivalent to the MRF model. Based on the MRF models, the maximum a posteriori (MAP) criterion was selected to solve image analysis problems. Here, the most likely image given the observed image was chosen as the optimum solution. The simulated annealing and Metropolis algorithms were proposed as suitable choices for finding the optimum solution under the MAP criterion, because image spaces are generally very large and the a posteriori probability is highly non-concave, so exhaustive search algorithms or gradient-based approaches cannot handle the problem efficiently. Both algorithms generate a sequence of random images that converges to the global optima. Convergence usually occurs after hundreds of iterations. To reduce computational time, a suboptimum approach, namely the ICM algorithm, was also introduced in this chapter. The ICM algorithm looks for the closest local maximum of the a posteriori probability.
References
Besag J (1986) On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society
B 48(3): 259-302
Bremaud P (1999) Markov chains: Gibbs fields, Monte Carlo simulation, and queues. Springer-
Verlag, New York
Bruzzone L, Prieto DF (2000) Automatic analysis of the difference image for unsupervised
change detection. IEEE Transactions on Geoscience and Remote Sensing 38(3):1171-
1182
Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence
PAMI-6(6): 721-741
Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applica-
tions. Biometrika 57(1): 97-109
Kasetkasem T, Varshney PK (2002) An image change detection algorithm based on Markov
random field models. IEEE Transactions on Geoscience and Remote Sensing 40(8): 1815-
1823
Marroquin J, Mitter S, Poggio T (1987) Probabilistic solution of ill-posed problems in
computer vision. Journal of the American Statistical Association 82: 76-89
Nikolova M (1999) Markovian reconstruction using a GNC approach. IEEE Transactions on
Image Processing 8(9): 1204-1220
Solberg AHS, Taxt T, Jain AK (1996) A Markov random field model for classification of
multisource satellite imagery. IEEE Transactions on Geoscience and Remote Sensing
34: 100-113
Van Trees HL (1968) Detection, estimation and modulation theory. Wiley, New York
Varshney PK (1997) Distributed detection and data fusion. Springer Verlag, New York
Winkler G (1995) Image analysis, random fields and dynamic Monte Carlo methods. Springer-
Verlag, New York
Part III
Applications
CHAPTER 7

MI Based Registration of Multi-Sensor and Multi-Temporal Images

Hua-mei Chen, Pramod K. Varshney

7.1
Introduction
Fig. 7.1a,b. Multi-temporal remote sensing images. a Landsat TM band 1 image taken in 1997. b Landsat TM band 1 image taken in 1998. The two crosses denote points with the same UTM coordinates in each image. The noticeable shift demonstrates the presence of registration errors in systematically corrected data
7.2
Registration Consistency
In the absence of reference data, registration consistency (Holden et al. 2000)
may be used as a measure to evaluate the performance of different intensity-
based image registration algorithms. Let TA,B (TB,A) be the transformation
obtained using image A (B) as the floating image and image B (A) as the
reference image, the registration consistency (dp) of TA,B and TB,A can be
defined as
where the composition $T_{A,B} \circ T_{B,A}$ represents the transformation that applies $T_{B,A}$ first and then $T_{A,B}$. The overlap region of images A and B is denoted by $I_{A,B}$. The discrete domains of images A and B are $I_A$ and $I_B$, respectively, and $N_A$ and $N_B$ are the numbers of pixels of image A and image B within the overlap region. The registration consistency defined in (7.1a) specifies the mean distance of the two mapped points $T_{A,B}(p)$ and $T_{B,A}^{-1}(p)$, where $p$ is a pixel in image A. Similarly, (7.1b) represents the mean distance of the two mapped points $T_{B,A}(p)$ and $T_{A,B}^{-1}(p)$, where $p$ is a pixel in image B. In general, it is expected that the two values from (7.1a) and (7.1b) will be practically the same. Similarly,
a three-date registration consistency can be defined as
$$(d_P)_3 = \frac{1}{2N_A}\sum_{(x,y) \in I_A \cap I_{A,B,C}}\left\|(x, y) - T_{C,A}\circ T_{B,C}\circ T_{A,B}(x, y)\right\| \qquad (7.3)$$
$$(d_P)_3 = \left(\left\|V_{A,B} + V_{B,C} + V_{C,A}\right\| + \left\|V_{A,C} + V_{C,B} + V_{B,A}\right\|\right)/2\,. \qquad (7.4)$$
Though the registration consistency defined above may not be treated as a measure of registration accuracy, it is one means to quantitatively assess the quality and reliability of the registration method employed, because a reliable registration algorithm should produce a consistent result no matter which image serves as the floating image and which serves as the reference image.
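For affine transformations represented as 3 × 3 homogeneous matrices, the mean-distance form of (7.1a) can be computed directly, as in the sketch below; the representation and names are illustrative assumptions.

```python
import numpy as np

def registration_consistency(T_ab, T_ba, points):
    """Mean distance between T_AB(p) and T_BA^{-1}(p) over pixels p of
    image A in the overlap region (illustrative sketch).

    T_ab, T_ba : 3x3 homogeneous 2-D transforms; points : (N, 2) array.
    """
    ph = np.hstack([points, np.ones((len(points), 1))])   # homogeneous coords
    fwd = (T_ab @ ph.T).T[:, :2]
    inv = (np.linalg.inv(T_ba) @ ph.T).T[:, :2]
    return float(np.mean(np.linalg.norm(fwd - inv, axis=1)))
```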
7.3
Multi-Sensor Registration
7.3.1
Registration of Images Having a Large Difference in Spatial Resolution
Fig. 7.2a,b. Images used in the experiment. a Digital aerial photograph (red band) from Eastman Kodak and b airborne HyMap image (band 9)
It is not feasible to use the low resolution image, the HyMap image, as the floating image, because the resulting MI registration function is very rough, which makes the optimization barely possible.
Figure 7.3 shows the 2D MI registration function using i) HyMap image
(low resolution) as the floating image and ii) digital aerial photograph (high
resolution) as the floating image. Clearly, the registration function shown in
Fig. 7.3a is very rough and it is extremely difficult to find the global optimum.
On the other hand, the registration function shown in Fig. 7.3b is very smooth
and it is much easier to find the global optimum. For this reason, we use the aerial photograph as the floating image in our experiment. The multi-scale optimization procedure developed for the registration of these two images is described next.
First, an image pyramid is constructed for the aerial photograph using Haar
wavelet decomposition. This is done by repeating the following procedure at
each resolution level of the image pyramid: convolving the image with an
averaging filter of size 2 x 2 pixels followed by subsampling (taking every other
pixel) along both vertical and horizontal directions to generate the image
of the next (lower) resolution level. Many algorithms exist to generate an image pyramid (Burt and Adelson 1983; Burt 1984; Toet 1992). Here we used Haar wavelet decomposition to accomplish this task because of its simplicity.

Fig. 7.3a,b. 2D registration function using a the HyMap image and b the digital aerial photograph as the floating image. The displacements are shown in meters
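The pyramid construction described above amounts to a few lines of code; the sketch below assumes image dimensions divisible by two at every level.

```python
import numpy as np

def haar_pyramid(image, levels=6):
    """Image pyramid by 2x2 averaging followed by subsampling, i.e. the
    Haar approximation band at each level (illustrative sketch)."""
    pyramid = [image.astype(float)]
    for _ in range(levels - 1):
        im = pyramid[-1]
        # average 2x2 blocks, then keep every other pixel in each direction
        im = 0.25 * (im[0::2, 0::2] + im[1::2, 0::2]
                     + im[0::2, 1::2] + im[1::2, 1::2])
        pyramid.append(im)
    return pyramid    # pyramid[0] is level 0 (the original resolution)
```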
For the experiment, an image pyramid of 6 levels (level 0 to level 5, level 0 being the original resolution) was constructed. Table 7.1 lists the image size and the corresponding spatial resolution at each level. To apply the multi-scale optimization strategy, we start with the registration of the HyMap image and the aerial photograph at level 5. The aerial photograph at level 5 serves as the floating image for the reason mentioned previously. We denote the transformation parameter set obtained at this level (i.e., scale, rotation, x-displacement (in pixels), y-displacement (in pixels)) as $[s_5, r_5, dx_5, dy_5]$; then the parameter set $[s_5/2, r_5, dx_5, dy_5]$ is used as the initial search point to optimize the MI similarity measure between the HyMap image (reference image) and the aerial photograph at level 4 (floating image). The reason for halving the scaling factor is that the resolution of the digital aerial photograph is doubled when we move from level 5 to level 4; as a result, the scaling factor between the photograph at level 4 and the HyMap image is decreased by the same factor of 2. The above procedure is repeated until the digital aerial photograph at level 0 (original resolution) is reached. Figure 7.4 summarizes the multi-scale optimization procedure stated above.
In Fig. 7.4, $\{F_i\}_{i=0}^{L-1}$ forms an image pyramid of $L$ levels, and the image at each level is used in consecutive iterations, from $i = L - 1$ to $i = 0$, as the floating image.

Fig. 7.4. The multi-scale optimization procedure: at each level, MI is maximized using a local optimizer with the floating image $F_i$, and the resulting parameters $[s_{i-1}, r_{i-1}, dx_{i-1}, dy_{i-1}]$ initialize the search at the next level
Table 7.2. Numerical results for HyMap and digital aerial photograph registration

Level (i)    dx (m)        dy (m)
5            -71.1938      -119.5488
4            -71.0765      -119.6574
3            -71.1267      -119.4965
2            -71.1059      -119.5197
1            -71.1327      -119.5900
0            -71.1482      -119.4564
Fig. 7.5a-c. Registration result. a Floating image. b Resampled (registered) reference image. c Superimposed image
From Table 7.2, the registration results at the different levels are almost the same. The possible reason for this is that, in this experiment, only the floating image is used to construct the image pyramid for multi-scale optimization. In this manner, the resolution of the reference
image is fixed throughout the entire optimization process. However, if both
images have approximately the same resolution and both are used to construct
the image pyramids for multi-scale optimization, we expect improved registration accuracy as we move to higher resolution levels, as shown in Chen and Varshney (2000). Nevertheless, the major advantage of this multi-scale op-
timization approach is its computational efficiency. Significant speedups by
using the multi-scale optimization strategy were reported in Maes et al. (1999)
and Pluim et al. (2001). This is due to the fact that the initial search point at
each resolution level (except for at the lowest resolution level) is in the vicinity
of the optimum at that resolution level. This accelerates the convergence rate of
the local optimizer in use. In this experiment, the MI measure was computed
through the PVI algorithm. In the next section, performances of different
joint histogram estimation methods are compared using a pair of multi-sensor
images having similar spatial resolution.
7.3.2
Registration of Images Having Similar Spatial Resolutions
Fig. 7.6a,b. Remote sensing images used in the experiment. a IRS PAN image. b Radarsat SAR image
The first three methods, nearest neighbor, linear, and cubic convolution interpolation, belong to the two-step joint histogram estimation category (see Sect. 3.3.1), in which an intermediate resampled image is produced and then employed to compute mutual information. The fourth method, PVI, falls in the one-step joint histogram estimation category. Since, for multi-sensor registration, images are acquired from different types of sensors placed on different platforms, the spatial resolutions are seldom exactly the same. Therefore, the phenomenon of interpolation-induced artifacts (see Sect. 3.4) is not an issue, and hence the GPVE algorithm is not compared in this experiment. For each method, the simplex search procedure (Nelder and Mead 1965) is used to find the global optimum. Since the simplex search procedure is just a local optimizer, there are instances when it may fail to find the position of the global optimum. In this situation (observed by visual inspection), we may change the initial search points until the correct optimum is obtained, or adopt the hybrid global optimization procedure as illustrated in the previous section.
Table 7.3 lists the registration results for the IRS PAN and Radarsat SAR im-
ages shown in Fig. 7.6 using different joint histogram estimation methods. The
transformation parameters ($T_{A,B}$) representing rotation (in degrees), vertical
and horizontal displacements (in meters), and the registration consistencies
obtained using various interpolation algorithms are shown in this table as well.
From Table 7.3, by comparing the transformation parameters, it can be
observed that the registration results from different interpolation algorithms
are very close to each other. However, the partial volume interpolation method obtained the most consistent results, as observed from the registration consistency shown in the last column of the table. Notice that the nearest neighbor interpolation algorithm yields performance comparable to the other algorithms.

Table 7.3. Results from the registration of IRS PAN and Radarsat SAR images. T denotes the transformation parameters (rotation (degrees), vertical displacement (m), and horizontal displacement (m))
Hence, nearest neighbor interpolation may be suitable for joint histogram es-
timation for MI based multi-sensor registration because of its computational
efficiency.
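For reference, once the floating image has been resampled onto the grid of the reference image (the two-step scheme), mutual information follows directly from the joint histogram; the fixed bin count in the sketch below is an arbitrary choice.

```python
import numpy as np

def mutual_information(a, b, bins=64):
    """MI of two overlapping images from their joint histogram
    (illustrative sketch; b is assumed already resampled onto a's grid)."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab = hist / hist.sum()                  # joint distribution
    p_a = p_ab.sum(axis=1, keepdims=True)     # marginal of a
    p_b = p_ab.sum(axis=0, keepdims=True)     # marginal of b
    nz = p_ab > 0                             # avoid log(0)
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))
```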
The experimental results presented in this section have demonstrated the
ability of an intensity based image registration technique using MI as a sim-
ilarity measure for the multi-sensor registration problem. Next, we consider
the use of the MI based registration technique for multi-temporal registration
problems.
7.4
Multi-Temporal Registration
Accurate registration of multi-temporal remote sensing images is quite essen-
tial for various change detection applications based on the processing of data collected at different times (see Chap. 2). A variety of change detection algorithms based on techniques such as image differencing (Castelli et al. 1998), principal component analysis (Mas 1999), change vector analysis (Lambin and Strahler 1994), Markov random fields (Kasetkasem and Varshney 2002) and neural networks (Liu and Lathrop 2002) may be used. A fundamental requirement for the success of all these algorithms is, however, accurate registration of the images taken at different times. For example, a registration accuracy of less than one-fifth of a pixel is required to achieve a change detection error of less than 10% (Dai and Khorram 1998).
Conventionally, mean square difference (MSD) and normalized cross-corre-
lation (NCC) are used as similarity measures for multi-temporal image reg-
istration problems (Brown 1992). However, when using these similarity measures, the images are assumed to be radiometrically corrected. In some cases, even though accurate radiometric correction has been applied, registration results
using MSD or NCC as similarity measures may still not be reliable. This oc-
curs, for example, when the scene undergoes a significant amount of change
between two times. This can be illustrated by registering two small portions
Fig. 7.7a,b. Two small portions of Landsat TM band 1 images taken in a Year 1995 and b Year
1998, respectively. The white box in image b indicates the corresponding sub-set of image a
(128 x 128 pixels and 228 x 228 pixels) of Landsat TM band 1 images taken in
years 1995 and 1998 (see Fig. 7.7). The images are systematically corrected, and
thus contain no radiometric errors. From this figure, a significant change be-
tween the intensity values of the two images can be observed which is probably
due to the change in crop types in the region. The 2D registration functions
obtained by using MSD and Nee as similarity measures, to register images in
Fig. 7.7, are shown in Fig. 7.8a,b. Ifboth MSD and Nee were suitable similarity
measures in this case, we would expect a global minimum in Fig. 7.8a and
Fig. 7.8a,b. Two-dimensional registration functions while registering the images shown in Fig. 7.7 using a MSD and b NCC as similarity measures
a global maximum in Fig. 7.8b, both at the position (0, 0). Instead, the optimum occurs at (20, −14) for MSD and at (20, −12) for NCC. Thus, these two similarity measures fail to register the two images accurately. In order to overcome this potential difficulty pertaining to the MSD and NCC similarity measures, a more robust similarity measure is required. Since MI has been successful as a similarity measure for many multi-sensor registration problems, it is natural to
Figure 7.9 illustrates the 2D registration function using MI as the similarity
measure for the registration of images shown in Fig. 7.7. It is clear from Fig. 7.9
that the maximum occurs at the position (0,0), which demonstrates that MI
is able to successfully register the two images. This was not the case with the use of either NCC or MSD as the similarity measure. Now, we evaluate the performance of mutual information (MI) as the similarity measure for multi-temporal remote sensing image registration.
Three multi-temporal image data sets are used in our experiments. These
are Landsat TM band 1 images of three different dates (1995, 1997 and 1998),
IRS PAN images of two different dates (1997 and 1998), and Radarsat SAR
images of two different dates (1997 and 1998). Figure 7.10 shows the images
used in the experiments. The images obtained from the source were already
systematically corrected. Since the images from the same sensor have the
same spatial resolution, interpolation artifacts are expected to be present in
these experiments. The header files associated with each image indicate that
there is no rotational difference in Landsat TM images, a slight rotational
difference of 0.02° in IRS PAN images and a significant amount of rotational
difference (i. e. 18.57°) in the SAR images. Therefore, it is anticipated that
interpolation-induced artifacts may be more pronounced while registering
multi-temporal Landsat TM and IRS PAN images than while registering multi-
temporal Radarsat SAR images. This is because, in the case of SAR image registration, the significant rotational difference means that, when the two images are nearly registered, the grid points of the two images rarely coincide.
We evaluate the performance of four joint histogram estimation algorithms
to compute mutual information. These are linear interpolation; the first order GPVE algorithm, which is actually the PVI algorithm; and the second and third order GPVE algorithms.

Fig. 7.9. 2D registration function while registering the images shown in Fig. 7.7 using MI as the similarity measure

Fig. 7.10a-g. Landsat TM band 1 images of a Year 1995, b Year 1997 and c Year 1998. IRS PAN images of d Year 1997, e Year 1998, and Radarsat SAR images of f Year 1997 and g Year 1998

Again, the simplex search procedure is used to find the
global optimum in all cases. When the global optimum position is not found,
we change the initial search points until the correct optimum is obtained.
Tables 7.4 to 7.9 show all the experimental results obtained while registering different pairs of multi-temporal remote sensing images. The first date in each pair serves as the floating image and the second date as the reference image. Since Landsat TM images are available for three dates, three sets of registration are performed. Tables 7.4 to 7.6 show the registration results in the form of the estimated transformation parameters and the corresponding registration consistencies.
Table 7.4. Registration results for Landsat TM 1995 and 1997 images. T denotes the trans-
formation parameters (vertical and horizontal displacements)
Table 7.5. Registration results for Landsat TM 1995 and 1998 images. T denotes the trans-
formation parameters (vertical and horizontal displacements)
Table 7.6. Registration results for Landsat TM 1997 and 1998 images. T denotes the trans-
formation parameters (vertical and horizontal displacements)
Table 7.7. Three-date registration consistency for registration of Landsat TM 1995, 1997 and 1998 images

Method       (d_P)_3
Linear       0.1187
1st order    0.0008
2nd order    0.0289
3rd order    0.0291
MSD          0.2314
NCC          0.2023
Table 7.8. Registration results for IRS PAN 1997 and 1998 images. T denotes the transfor-
mation parameters (rotation angle, vertical and horizontal displacements)
Table 7.9. Registration results for Radarsat SAR 1997 and 1998 images. T denotes the
transformation parameters (vertical and horizontal displacements)
Comparing the results of the four joint histogram estimation algorithms in Tables 7.4 to 7.7, we notice that the PVI algorithm results in almost perfect registration consistency. However, if we plot the registration function corresponding to the PVI algorithm for a pair of images to be registered, we can clearly observe interpolation-induced artifacts. One such illustration is provided in Fig. 7.11a and 7.11b for the registration of Landsat TM images of 1995 and 1997. This explains the perfect registration consistency of the PVI algorithm: it is a consequence of the artifact pattern. Therefore, we have good reason to question the registration accuracy achieved by PVI in this case. In contrast, the second and higher order GPVE do
Fig. 7.11a-d. 1D registration functions resulting from PVI (a and b) and 2nd order GPVE (c and d) in registering Landsat TM images of 1995 and 1997
not result in any artifact patterns (see Fig. 7.11c and 7.11d for the second order as an example). Thus, higher order GPVE algorithms clearly have an advantage over the linear and PVI algorithms, since the resulting registration function is very smooth.
Similar conclusions can be drawn from the registration results for the IRS PAN images, which are reported in Table 7.8. In the case of the registration of the Radarsat SAR images (Table 7.9), we did not observe any artifact patterns when either the linear interpolation or the PVI algorithm was used. This is because of the rotational difference between the two images. In this case, linear interpolation again results in relatively poor registration consistency, while PVI and the higher order GPVEs have similar performance. Thus, joint histogram estimation using higher order GPVE algorithms produces more reliable registration results than using the linear or PVI algorithms.
Further, in order to evaluate the performance of our MI based registra-
tion algorithm, the registration results of the 3rd order GPVE algorithm are
compared with those obtained from image registration using MSD and NCC
as similarity measures. Linear interpolation is used when implementing the algorithms based upon these two similarity measures. The registration results
obtained via MSD and NCC similarity measures are shown in the last two rows
of Table 7.4 to 7.9. From our experiments, it is hard to determine which sim-
ilarity measure results in better registration accuracy due to a lack of ground
data. However, on the basis of registration consistency it can be clearly seen
that MI based registration implemented through higher order GPVE algo-
rithms outperforms the registration obtained with MSD and NCC as similarity
measures.
7.S
Summary
In this chapter, we applied the MI based registration technique introduced in Chap. 3 to multi-sensor and multi-temporal registration. Multi-sensor registration was performed for two sets of remote sensing data: images from two sensors having a large difference in spatial resolution, and images from two sensors having similar spatial resolutions. For the former case, a multi-scale optimization strategy was introduced to speed up the whole process. For the latter case, four joint histogram estimation algorithms were used to compute the MI measure: nearest neighbor interpolation, linear interpolation, cubic convolution interpolation, and PVI. Registration consistency was used to evaluate registration performance. Our experiments show that PVI produced the most consistent results and, surprisingly, that nearest neighbor interpolation outperformed linear interpolation and cubic interpolation in most cases. Since the sizes of remote sensing images are often very large and nearest neighbor interpolation is computationally the most efficient, it seems reasonable to adopt this algorithm when registering images of large sizes using an MI based approach. It should be noted that this is appropriate only when the two images involved have different spatial resolutions, which is true in most multi-sensor registration applications; otherwise, interpolation-induced artifacts have to be taken into account.
For multi-temporal registration, the images to be registered often have the same spatial resolution, and artifacts are likely to be present. To overcome this problem, higher order GPVE was used to implement the MI based registration technique. Although a precise evaluation of registration accuracy is not possible without accurate ground data, we have shown that MI based registration implemented through the higher order GPVE algorithm results in better registration consistency than registration performed using MSD and NCC as the similarity measures.
References
Brown LG (1992) A survey of image registration techniques. ACM Computing Surveys 24:
325-376
Burt PJ, Adelson E (1983) The Laplacian pyramid as a compact image code. IEEE Transac-
tions on Communications, Com-31(4): 532-540
Burt PJ (1984) The pyramid as a structure for efficient computation. In: Rosenfeld A (ed) Multiresolution image processing and analysis. Springer-Verlag, pp 6-35
Castelli V, Elvidge CD, Li CS, Turek JJ (1998) Classification-based change detection: Theory
and applications to the NALC data set. In: Lunetta RS, Elvidge CD (eds), Remote sensing
change detection: environmental monitoring methods and applications, Ann Arbor
Press, Michigan, pp 53-74
Chen H, Varshney PK (2000) A pyramid approach for multimodality image registration based on mutual information. Proceedings of the 3rd International Conference on Information Fusion, 1, pp MoD3 9-15
Chen H, Varshney PK, Arora MK (2003a) Mutual information based image registration for
remote sensing data. International Journal of Remote Sensing 24(18): 3701-3706
Chen H, Varshney PK, Arora MK (2003b) Automated registration of multi-temporal remote
sensing images using mutual information. IEEE Transactions on Geoscience and Remote
Sensing 41(11): 2445-2454
Dai X, Khorram S (1999) A feature-based image registration algorithm using improved
chain-code representation combined with invariant moments. IEEE Transactions on
Geoscience and Remote Sensing 37(9): 2351-2362
Dai X, Khorram S (1998) The effects of image misregistration on the accuracy of remotely
sensed change detection. IEEE Transactions on Geoscience and Remote Sensing 36:
1566-1577
Holden M, Hill DLG, Denton ERE, Jarosz JM, Cox TCS, Rohlfing T, Goodey J, Hawkes
DJ (2000) Voxel similarity measures for 3-D serial MR brain image registration. IEEE
Transactions on Medical Imaging 19: 94-102
Kasetkasem T, Varshney PK (2002) An image change detection algorithm based on Markov
random field models. IEEE Transactions on Geoscience and Remote Sensing 40: 1815-
1823
Lambin EF, Strahler AH (1994) Indicators of land-cover change for change-vector analysis
in multitemporal space at coarse spatial scales. International Journal of Remote Sensing
15: 2099-2119
Liu X, Lathrop RG (2002) Urban change detection based on an artificial neural network.
International Journal of Remote Sensing 23: 2513-2518
Maes F, Vandermeulen D, Suetens P (1999) Comparative evaluation of multiresolution
optimization strategies for multimodality image registration of mutual information.
Medical Image Analysis 3(4): 373-386
Mas JF (1999) Monitoring land-cover changes: a comparison of change detection techniques.
International Journal of Remote Sensing 20: 139-152
Nelder JA, Mead R (1965) A simplex method for function minimization. The Computer
Journal 7: 308-313
Pluim JPW, Maintz JBA, Viergever MA (1998) A multiscale approach to mutual information
matching. Proceedings of SPIE Conference on Image Processing, San Diego, California,
3338, pp 1334-1344
Pluim JPW, Maintz JBA, Viergever MA (2001) Mutual information matching in multireso-
lution contexts. Image Vision and Computing 19(1-2): 45-52
Thevenaz P, Unser M (2000) Optimization of mutual information for multiresolution image
registration. IEEE Transactions on Image Processing 9(12): 2083-2089
Toet A (1992) Multiscale contrast enhancement with application to image fusion. Optical
Engineering 31(5): 1026-1031
Toth CK, Schenk T (1992) Feature-based matching for automatic image registration. ITC
Journal 1: 40-46
CHAPTER 8
8.1
Introduction
based algorithm that, unlike the original one, allows the extraction of fewer sources than the number of observations. This algorithm is known as the undercomplete ICA-FE algorithm (UICA-FE). There is a clear need for such an algo-
rithm in the context of hyperspectral imagery, where the number of spectral
bands available far exceeds the number of independent classes contained in
the image. Designing a method that produces only these useful components
will constitute a major improvement.
In both approaches, the number of independent components to be generated
is computed based on the eigenvalue information produced in the PCA step. It
corresponds to the number of highest eigenvalues that make up approximately
99% of the variance. This criterion is frequently used in PCA for reducing data
dimensionality.
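As a concrete illustration of this criterion, the short sketch below (Python/NumPy; the function name and interface are ours, not the authors') counts how many leading eigenvalues of the band covariance matrix are needed to reach a given fraction of the total variance:

import numpy as np

def num_components_for_variance(X, threshold=0.99):
    # X: (pixels, bands) array; returns the number of top eigenvalues
    # of the band covariance matrix that make up `threshold` of the
    # total variance.
    cov = np.cov(X, rowvar=False)             # band-by-band covariance
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigenvalues, largest first
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)

For the AVIRIS data used in Sect. 8.5, this 99% criterion retains ten components.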
This chapter is organized as follows. In Sect. 8.2, we provide a brief assess-
ment of the advantages of ICA over PCA when used for feature extraction.
In Sects. 8.3 and 8.4, we present the two ICA based feature extraction algorithms.
Section 8.5 contains an assessment of the efficiency of the two methods based
on a practical experiment that uses hyperspectral data from the AVIRIS sensor.
The performance metrics used are: comparative execution times, mutual in-
formation and visual inspection of the extracted features (bands) as well as the
accuracy of unsupervised classification. The results obtained by the use of PCA
are also presented. We note that the usual quantitative measures (distance be-
tween the means, distance between the distributions, etc.) have not been used
for assessing feature extraction algorithms as they employ information from
training data sets. In the case of unsupervised processing, this information is
not available.
Since, in the context of hyperspectral data, the term feature is synonymous
with the term band, and since in ICA the components are associated with
features, all three terms will be used interchangeably throughout this
chapter.
8.2
PCA vs ICA for Feature Extraction
may not be error-free, so the quality of the training data may be poor in that
it may not accurately characterize the classes. This would decrease the quality
of both feature extraction and classification (Swain and King 1973).
Alternatively, when no prior information about the classes is available,
one can perform unsupervised feature extraction. In this case, the statistics
and distance measures between classes cannot be computed or estimated and
the main goal of feature extraction shifts to reduction of data redundancy.
The narrow bandwidths associated with the spectral bands of hyperspectral
data lead to correlation between the adjacent bands resulting in a relatively
high level of redundancy in the data (Richards and Jia 1999). Based on this
observation, one can simply proceed to perform feature selection by analyzing
the correlation matrix and selecting only a few bands from each group of
highly correlated bands. A better approach is to transform the data such
that the resulting features are decorrelated. The variance of the individual
components is considered to be an indicator of information content; large
values suggest high levels of information and low values indicate the presence
of mostly noise. Based on this, only the features with high variance are selected
for further processing (Richards and Jia 1999).
Both PCA and ICA are multivariate methods that, given a random vector,
proceed in such a way that the resulting components have increased class
separability. In the case of PCA, the separability is achieved through decorrelation,
whereas in ICA it is achieved via independence (Lee 1998).
When considering hyperspectral data, each band corresponds to a com-
ponent of random pixel vectors and constitutes a feature of the data. In this
context, the image cube resulting after PCA processing has the bands decor-
related and sorted according to their variance. This indicates that most of the
information can be retrieved from the first few features and that most of the last
features have close to zero variance (indicating a lack of information). Based
on this observation, PCA is frequently used for feature extraction by dropping
the lowest variance components (Richards and Jia 1999).
Decorrelation based feature extraction has been observed to be less efficient
when dealing with small classes (i.e. classes having small spatial extents) in
the image. In other words, small classes are sparsely represented in the data
(i.e. small classes contain very few pixels). Due to their size, these classes tend
to have little influence on the band variance leading to the possibility of being
discarded in the lower variance bands. In the context of target detection, loss
of information regarding small targets (that correspond to small classes in the
image) affects the accuracy of the feature extraction algorithms (Achalakul
and Taylor 2000; Tu et al. 2001). Therefore, there is no guarantee that band
reduction using PCA will correctly preserve the entire information content of
the image cube (Richards and Jia 1999; Tu et al. 2001).
The PCA model assumes that each original band is a linear mixture of
unknown uncorrelated bands, and proceeds to recover them such that their
variance is maximized. In ICA, the assumption is that each original band is
a linear mixture of unknown independent bands (Robila et al. 2000). The goal
is to find the unmixing matrix and to perform the inversion, in order to recover
the independent bands (Robila and Varshney 2002). In the case of ICA based
8.3
Independent Component Analysis Based
Feature Extraction Algorithm (ICA-FE)
When applied to hyperspectral imagery, ICA produces only a few independent
components that contain important information. Most of the other components
are mainly associated with either noise or artifacts introduced by the sensor
or the data acquisition conditions. In the ICA-FE algorithm, first the data
dimensionality is reduced using PCA and then ICA is applied. This is because
only a small number of components are relevant for further processing by
ICA, and these can be recovered from the first few components of the PCA
processed data. The underlying assumption is that these components
contribute significantly to the variance, and are thus pushed into the highest
eigenvalue PCA components (Swain and Davis 1978). In this case, reduction
of the number of bands after PCA should not significantly affect the recovery
of the classes. We can, therefore, proceed to apply the ICA algorithm to the
first few principal components. The result will have the bands (components)
as independent as possible.
The steps involved in the independent component analysis feature extraction
algorithm (ICA-FE) are shown in Fig. 8.1. First, PCA is applied to the n-band
data for dimensionality reduction on the basis of eigenvalues. Following the
computation of the eigenvalues and eigenvectors for the covariance matrix of
x, the number of bands to be retained is determined by taking the highest
eigenvalues that make up a pre-specified percentage of the sum of all the
eigenvalues.
The corresponding principal components can be directly obtained by trans-
forming the data through the eigenvectors associated with the selected eigen-
values. The ICA step described in Chap. 4 (see Sect. 4.3.2) is then applied. The
algorithm results in m independent features (bands).
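A minimal sketch of this two-stage pipeline is given below, with scikit-learn's PCA and FastICA standing in for the PCA step and for the ICA step of Chap. 4 (the chapter's own ICA implementation differs, and all names here are ours):

import numpy as np
from sklearn.decomposition import PCA, FastICA

def ica_fe(X, variance=0.99):
    # X: (pixels, n_bands) hyperspectral cube flattened spatially.
    pca = PCA(n_components=variance)     # keep enough PCs for 99% variance
    X_reduced = pca.fit_transform(X)     # drops the lowest variance bands
    ica = FastICA(n_components=X_reduced.shape[1])
    return ica.fit_transform(X_reduced)  # m bands, as independent as possible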
The ICA-FE algorithm is very fast compared to the direct application of ICA
to the full data. In both cases, PCA is used as the preprocessing step. When
PCA is employed in conjunction with band reduction, the complexity of an ICA
iteration is reduced from O(n²p) to O(m²p), where n is the number of original
bands, m is the number of resulting PCA components, and p is the number of
pixel vectors.
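To get a feel for these savings, take the setting of the experiment in Sect. 8.5, where n = 186 original bands are reduced to m = 10 components. The per-iteration cost ratio is then

O(n²p) / O(m²p) = (n/m)² = (186/10)² ≈ 346,

i.e., each ICA iteration on the reduced data is cheaper by more than two orders of magnitude (a back-of-the-envelope figure of ours, not the authors').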
Another difference between the ICA-FE algorithm and ICA applied on the
full data set is the fact that we no longer need to perform band selection on
the results obtained by ICA. This may, however, be a major drawback. Since
we are using only the high variance bands, we assume that all the independent
components contribute to them. In the case of a low variance independent
component, it is possible that its contribution is relegated to a lower
Fig. 8.1. Independent component analysis feature extraction (ICA-FE) algorithm [flowchart:
hyperspectral data x (n bands) → Principal Component Analysis → x′ (n uncorrelated bands;
m bands to be extracted) → drop the lowest n−m variance bands → x″ (m uncorrelated
bands) → Independent Component Analysis → u (m independent components)]
variance PCA band and the algorithm may fail to recover it. In that case, the
PCA step may need to be modified such that all the principal components
with nonzero variance are selected for processing by ICA. However, since the
hyperspectral data contain noise, zero variance bands are seldom found. The
ICA-FE algorithm will then reduce to the application of ICA on the original
data and will not provide any improvement in computational speed. These
issues are further discussed when an example is considered in Sect. 8.5.
8.4
Undercomplete Independent Component Analysis Based
Feature Extraction Algorithm (UICA-FE)
Sometimes the hyperspectral images may contain independent components
with low variance. For example, in the automatic target detection problem (see
Sect. 2.9 of Chap. 2), a small target in the image is generally characterized by
a limited number of pixels that will not contribute significantly to the overall
u = Wx , (8.1)

p(u) = p(x) / |det W| . (8.2)

In the case of m < n, expressing the pdf of the random vector u using the
conditional pdfs, we have,

p(u) = ∫ p(u|x) p(x) dx . (8.3)

Assume now that the output random vector u is perturbed by additive inde-
pendent Gaussian noise and consider the limit when the variance of the noise
goes to zero. In this case, we can use the multivariate Gaussian distribution to
express the conditional pdf as (Shriki et al. 2001):

p(u|x) = lim_{σ²→0} (2πσ²)^(−m/2) exp( −‖u − Wx‖² / (2σ²) ) . (8.4)

u = Wx₀ . (8.5)
p(u|x) = δ(x − x₀) / √(det(W Wᵀ)) . (8.9)

The limit corresponds to a Dirac delta function in x around x₀. Putting this
result into (8.3), for u in the image of Wx, we obtain:

p(u) = ∫ [ 1 / √(det(W Wᵀ)) ] δ(x − x₀) p(x) dx = p(x₀) / √(det(W Wᵀ)) . (8.10)

Note that in the case when W is a square matrix, i.e., when m = n, we get
back the relationship in (8.2), since det(W) = det(Wᵀ) implies √(det(W Wᵀ)) = |det W|.
For the log-determinant contribution to ∂E{log p(u)}/∂W,

(1/2) · [ 1 / det(W Wᵀ) ] · ∂ det(W Wᵀ)/∂W = [ 1 / det(W Wᵀ) ] adj(W Wᵀ) W = (W Wᵀ)⁻¹ W . (8.15)
ΔW = (W Wᵀ)⁻¹ W + E{ g(u) xᵀ } , (8.16)

where g(u) is the vector formed as (g(u₁), ..., g(uₘ)), and g(·) is a nonlinear
function that can be closely approximated by:

(8.17)
Note that we can no longer use the natural gradient to eliminate the inverse
of the matrix. The steps of the complete algorithm are presented in Fig. 8.2. It
starts by randomly initializing W and then proceeds to compute the update
step described in (8.16). In the algorithm described in the figure, k is an update
coefficient (smaller than one) that controls the convergence speed. W_old and
W_new correspond to the matrix W prior to and after the update step. The
algorithm stops when the mutual information does not change significantly
(based on a prespecified threshold). The result consists of m independent
components.
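One possible shape of this loop is sketched below (Python/NumPy). It assumes the update rule as reconstructed in (8.16), takes g(u) = −tanh(u) as the nonlinearity, and uses a crude correlation statistic in place of the mutual information stopping test; none of these choices should be read as the authors' exact implementation:

import numpy as np

def uica(X, m, k=0.01, eps=1e-5, max_iter=500, seed=None):
    # X: (n, p) zero-mean data; m: number of components to extract.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = 0.1 * rng.standard_normal((m, n))   # random initialization of W
    prev = np.inf
    for _ in range(max_iter):
        U = W @ X
        # assumed gradient (8.16): log-det term plus score-function term
        dW = np.linalg.inv(W @ W.T) @ W + (-np.tanh(U)) @ X.T / p
        W = W + k * dW                      # W_new = W_old + k * dW
        # surrogate convergence statistic (stands in for mutual information)
        stat = np.abs(np.corrcoef(W @ X)).sum()
        if abs(prev - stat) < eps:
            break
        prev = stat
    return W @ X                            # m independent components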
Fig. 8.2. Flowchart of the UICA update: starting from the n-dimensional observed data x,
W is randomly initialized and repeatedly updated as W_new = W_old + kΔW until convergence,
yielding u = Wx, the m-dimensional independent components
Fig. 8.3. Undercomplete independent component analysis based feature extraction (UICA-FE)
algorithm [flowchart: hyperspectral data x (n bands) → Principal Component Analysis →
undercomplete ICA → u (m independent components)]
Fig. 8.4a-c. Schematic description of the two ICA based feature extraction algorithms along
with the original PCA algorithm. a PCA, b ICA-FE, c UICA-FE
In the case of UICA-FE (Fig. 8.4c), the band elimination step is replaced by the direct ICA
computation. Even though this step is time consuming, it has the potential of
extracting more informative features than PCA for further processing.
8.5
Experimental Results
To assess the effectiveness of the proposed algorithms, we have conducted
several experiments using AVIRIS data (see Sect. 4.4.3 of Chap. 4 for details on this
dataset) (Fig. 8.5).
Several of the original bands were eliminated (due to sensor malfunctioning,
water absorption, and artifacts not related to the scene), reducing the data to 186
bands. Then, we applied the ICA-FE and UICA-FE algorithms and compared
the results with those produced by PCA.
After applying PCA, it was found that the variance of the first ten compo-
nents contributes over 99% of the cumulative variance of all the components.
Therefore, only ten components were further used in the ICA-FE algorithm.
Figure 8.6a-j displays the 10 components obtained from the PCA algorithm.
A concentration of the class information can be noticed in the first four com-
ponents (Fig. 8.6a-d). However, a number of classes can also be discerned in
lower ranked components. For example, the classes trees and grass (showing
dark in the left and lower part of Fig. 8.6e) and grass pasture (showing bright in
the middle left part of Fig. 8.6g) are clearly identifiable in the fifth through seventh
components.
Figure 8.7 displays the results of the ICA-FE algorithm when run on the
10 PCA derived components. A clear separation of the classes present in the
image is noticeable, with several of the classes being projected into different
components. The class soybean is projected into the first component (bright in the
top area of Fig. 8.7a), the roads and the stone-steel towers are separated in
the third component (dark in Fig. 8.7c), and the corn has been projected into the
fourth component (bright, right hand side in Fig. 8.7d). It is also interesting
to point out that both the wheat and grass pasture show up together (dark in
the lower left side of Fig. 8.7b), indicating that not enough information is available
to separate them. It may be mentioned that both the ICA and PCA gener-
ated components are orthogonal. Therefore, when the dataset contains several
Fig. 8.6a-j. First 10 components with the highest variance for data processed with PCA

Fig. 8.7a-j. Components produced by the ICA-FE algorithm. a soybean; b wheat and grass
pasture; c roads and towers; d corn; e trees; g pasture
to differentiate between the classes wheat and grass pasture, which was not
the case with ICA-FE. The wheat was projected dark in the lower side of
Fig. 8.8f, while the grass pasture has been projected bright in the left side of
Fig. 8.8c.
We employ mutual information as a quantitative measure of how well the
classes have been separated among various components. For an m dimensional
random vector u, the mutual information of its components is defined as

I(u) = ∑_{i=1}^{m} H(uᵢ) − H(u) ,

where H(·) denotes the (differential) entropy; I(u) is nonnegative and vanishes
exactly when the components are mutually independent.
Fig. 8.9. Graph of mutual information for PCA, ICA and UICA when applied on the AVIRIS
hyperspectral data [plot: mutual information on the vertical axis (0 to 0.09); the PCA curve
lies highest, the ICA curve below it, and the UICA curve lowest]
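Estimating the full m-dimensional mutual information directly is difficult. A common practical surrogate, sketched below (our construction, not the chapter's), is the average histogram-based mutual information over all pairs of components, which likewise tends to zero as the components become independent:

import numpy as np

def pairwise_mutual_information(U, bins=64):
    # U: (m, p) array of components; returns the average pairwise MI.
    m = U.shape[0]
    total, pairs = 0.0, 0
    for i in range(m):
        for j in range(i + 1, m):
            pxy, _, _ = np.histogram2d(U[i], U[j], bins=bins)
            pxy = pxy / pxy.sum()                 # joint histogram
            px, py = pxy.sum(axis=1), pxy.sum(axis=0)
            nz = pxy > 0                          # avoid log(0)
            total += np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
            pairs += 1
    return total / pairs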
of change needed to continue the clustering). In this context, and given the
available reference data, the PCA derived components provided a classification
accuracy of 43%, ICA-FE an accuracy of 53% and UICA-FE an accuracy of
58%, indicating the superior performance of our ICA based algorithms. There are
several potential reasons for these low values of classification accuracy. First,
the reference data contained large areas that were not classified, including
the road and railroad segments. Additionally, several of the crops were further
divided into various sub-classes (such as soybeans - notill/min/clean). This finer
classification is useful in other experiments, but these classes could not always
be clearly distinguished when only unsupervised classification is employed.
Another set of experiments using the same algorithms was performed on
HYDICE data (Robila and Varshney 2003). In that case, due to a lack of reference
data, the evaluation was based only on visual inspection or was coupled with
target detection (Robila and Varshney 2003; Robila 2002). The results were
consistent with those provided by the above experiment and, therefore, are not
described here.
8.6
Summary
In this chapter, we have presented two ICA-based algorithms for feature extrac-
tion. The first (ICA-FE) algorithm uses PCA to reduce the number of bands and
then proceeds to transform the data such that the features are as independent
as possible. The second (UICA-FE) algorithm produces the same number of
components, relying on PCA only for determining the number of components
and for their decorrelation. In the UICA-FE algorithm, the data are projected into
the lower dimensional space directly through the ICA transform. Basically, both al-
gorithms have the same goal: independence of the features to be extracted.
However, by using the information from all the components and not only the
ones selected by PCA reduction, UICA-FE is able to increase the separability
of the classes in the derived independent features. This was validated through
experiments on hyperspectral data. The UICA-FE derived features display an
increased class separation as compared to ICA-FE. Thus, both ICA-FE and
UICA-FE provide attractive approaches for unsupervised feature extraction
from hyperspectral data.
References
Achalakul T, Taylor S (2000) A concurrent spectral-screening PCT algorithm for remote
sensing applications. Information Fusion 1(2): 89-97
Hyvärinen A, Karhunen J, Oja E (2001) Independent component analysis. John Wiley and
Sons, New York
Lee TW (1998) Independent component analysis: theory and applications. Kluwer Academic
Publishers, Boston
Rencher AC (1995) Methods of multivariate analysis. John Wiley and Sons, New York
Richards JA, Jia X (1999) Remote sensing digital image analysis: an introduction. Springer-
Verlag, Berlin
Robila SA (2002) Independent component analysis feature extraction for hyperspectral
images, PhD Thesis, Syracuse University, Syracuse, NY
Robila SA, Varshney PK (2002) Target detection in hyperspectral images based on indepen-
dent component analysis. Proceedings of SPIE Automatic Target Recognition XII, 4726,
pp 173-182
Robila SA, Varshney PK (2003) Further results in the use of independent component analysis
for target detection in hyperspectral images. Proceedings of SPIE Automatic Target
Recognition XIII, 5094, pp 186-195
Robila SA, Haaland P, Achalakul T, Taylor S (2000) Exploring independent component
analysis for remote sensing. Proceedings of the Workshop on Multi/Hyperspectral Sen-
sors, Measurements, Modeling and Simulation, Redstone Arsenal, Alabama, U.S. Army
Aviation and Missile Command, CD
Shriki O, Sompolinsky H, Lee DD (2001) An information maximization approach to over-
complete and recurrent representations. In: Leen TK, Dietterich TG, Tresp V (eds)
Advances in Neural Information Processing Systems 13, pp 612-618
Swain PH, King RC (1973) Two effective feature selection criteria for multispectral re-
mote sensing. Proceedings of First International Conference on Pattern Recognition,
November 1973, pp 536-540
Swain PH, Davis SM (eds) (1978) Remote sensing: the quantitative approach, McGraw Hill,
New York
Tu TM, Huang PS, Chen PY (2001) Blind separation of spectral signatures in hyperspectral
imagery. IEEE Proceedings on Vision, Image and Signal Processing 148(4): 217-226
CHAPTER 9
Hyperspectral Classification
Using ICA Based Mixture Model
Chintan A. Shah
9.1
Introduction
order statistics. Chang et al. (2002) have considered an application of ICA
to linear spectral mixture analysis, referred to as ICA-based linear spectral
random mixture analysis (LSRMA). They model an image pixel as a random
source resulting from a random composition of multiple spectral signatures
of distinct materials (or classes) in the image. Their experimental results have
demonstrated that the proposed LSRMA method is an effective unsupervised
approach for feature extraction and hence for classification and target detec-
tion problems. However, in these research efforts, the application of ICA has
been limited to being a feature extraction algorithm.
It is important to note that it may be inappropriate to apply a classification
or a target detection algorithm based on second order statistics to the features
extracted by ICA. This is because such algorithms cannot exploit the higher
order statistics on which ICA relies to enhance the information content of the
extracted features.
The work presented in this chapter is the application of a relatively new ap-
proach, the ICA mixture model (ICAMM) algorithm (Lee et al. 2000), derived
from ICA, for an unsupervised classification of non-Gaussian classes from re-
mote sensing data. So far, to the best of our knowledge, the ICAMM algorithm
has been employed for unsupervised classification problems in other applica-
tions such as speech signals (Lee et al. 2000), learning efficient codes of images
(Lee and Lewicki 2002) and blind signal separation in teleconferencing (Bae et
al. 2000) but not for the classification of remote sensing data.
In the proposed approach, we model each pixel of a hyperspectral image as
a random vector of intensity values that can be described by the ICA mixture
model. Unlike the approach adopted by Robila and Varshney (2002a,b), Chang
et al. (2002) and many others who have employed the ICA model for feature
extraction, we employ a mixture of K ICA models to explain the data generation
process and propose an ICAMM algorithm for unsupervised classification. Here, K is
the prespecified number of spectral classes.
The ICAMM algorithm views the observed hyperspectral data as a mixture
of several mutually exclusive classes. Each of these classes is described by a lin-
ear combination of independent components with non-Gaussian (leptokurtic
or platykurtic) probability density functions (see Sect. 4.2 in Chap. 4). The
ICAMM algorithm finds independent components and the mixing matrix for
each class using an extended information-maximization learning algorithm
and computes the class membership probabilities for each pixel. The pixel is
allocated to the class with the highest posterior class probability to produce
a classification map.
This chapter is organized in the following manner. In Sect. 9.2, we formu-
late the ICAMM algorithm for unsupervised classification of non-Gaussian
classes. The steps involved in the ICAMM algorithm are also provided in this
section. In Sect. 9.3, we describe the proposed experimental methodology.
The performance of the algorithm is evaluated by conducting experiments
on a hyperspectral dataset in Sect. 9.4. Finally, we provide a summary of the
experimental results in Sect. 9.5.
9.2
Independent Component Analysis Mixture Model (ICAMM) - Theory
Given only sensor observations that are assumed to be linear mixtures of the un-
observed, statistically independent source signals, the problem of blind source
separation is to recover these independent source signals. The term "blind"
indicates that both the source signals and the way the signals were mixed are
unknown. ICA provides a solution to the blind source separation problem (see
Sect. 4.1 in Chap. 4). The goal of ICA is to perform a linear transformation
of the observed sensor signals, such that the resulting transformed signals
are as statistically independent from each other as possible. When compared
to correlation-based transformations such as PCA, ICA not only decorrelates
the sensor observations composed of mixed signals (in terms of second order
statistics), but also reduces the higher order statistical dependencies between
them. Chapter 4 provided a detailed description of ICA, where various ap-
proaches for the implementation of ICA were discussed. In this chapter, we
will employ an extended version of the information maximization (infomax)
algorithm known as the extended infomax algorithm. A detailed description
of this algorithm can be found in Lee et al. (1999).
Consider a random vector x_t, whose elements are the intensity values x_{i,t}
of the pixel t in the spectral band i of a hyperspectral image, such that x_t =
[x_{1,t}, ..., x_{N,t}]′, where [·]′ denotes the transpose. Assume that the pixel vectors
(for t = 1 to T) are obtained probabilistically from the set of classes {ω_1, ..., ω_K}.
Class ω_j is selected with prior class probability P(ω_j), so that the probability
density function for x_t may be expressed by the mixture density model as
(Duda et al. 2000),

P(x_t | Θ) = ∑_{j=1}^{K} p(x_t | ω_j, θ_j) P(ω_j) , (9.1)

where Θ = (θ_1, ..., θ_K) are the K class parameter vectors. The class-conditional
densities p(x_t | ω_j, θ_j) are referred to as class-component densities, and the
prior class probabilities P(ω_j) are the mixing parameters. Let X = {x_1, ..., x_T}
be a set of T unlabeled pixel vectors, drawn independently from the mixture
model in (9.1). The likelihood of these observed samples is, by definition, the
joint density, given by,

P(X | Θ) = ∏_{t=1}^{T} P(x_t | Θ) . (9.2)
the ICA model (Lee et al. 1999) with the addition of an N-dimensional bias
vector b_j as,

x_t = A_j s_{j,t} + b_j . (9.3)

In (9.3), the vector x_t corresponds to the N-dimensional random vector with its
dimensionality reduced by an appropriate feature extraction technique such
that N = M, where M is the number of unobserved independent sources.
A_j is an N × N, full rank, real-valued mixing matrix. Additionally, the vector
s_{j,t} = [s_{j,1,t}, ..., s_{j,N,t}]′ for class j in (9.3) corresponds to the random vector
of N independent, unobserved, real-valued source signals s = [s_1, ..., s_N]′ as
defined for the ICA model (Lee et al. 1999) (see Chap. 4 for more details).
Since the hyperspectral data X are modeled as a mixture of K classes, and each
of these K classes is described by a linear combination of N independent,
non-Gaussian sources, the generative model in (9.3) depicts one of the K ways
of generating the random pixel vector x_t. Thus, depending on the values of
the class parameters θ_j = {A_j, b_j} and the unobserved independent sources
s_{j,t} corresponding to each class j, there are K ways of viewing x_t (i.e. x_t =
A_1 s_{1,t} + b_1, ..., x_t = A_j s_{j,t} + b_j, ..., x_t = A_K s_{K,t} + b_K). Additionally, it is significant
to note that all the assumptions and restrictions (see Sect. 4.2) necessary for
the identifiability of the ICA solution are implicitly inherited by the ICAMM
algorithm as well.

Expressing (9.3) as

s_{j,t} = W_j (x_t − b_j) , (9.4)

we get further insight into the generative model. Here we subtract the bias for
class j from the random pixel vector x_t, which is then linearly transformed by
the unmixing matrix W_j = A_j⁻¹ to obtain the unobserved independent sources
s_{j,t}.
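Once the class-component densities are available, evaluating (9.1) and (9.2) is direct; the sketch below (NumPy; the interface is ours) returns the log of the likelihood in (9.2) for numerical convenience:

import numpy as np

def mixture_log_likelihood(class_cond, priors):
    # class_cond: (T, K) array of p(x_t | w_j, theta_j);
    # priors: (K,) array of P(w_j).
    p_xt = class_cond @ priors        # mixture density of (9.1), per pixel
    return np.sum(np.log(p_xt))       # log of the product in (9.2)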
9.2.1
ICAMM Classification Algorithm
Step 1:
2. Randomly initialize the mixing matrix A_j and the bias vector b_j for each
class j, where j = 1 to K.
For t = 1 to T do
Step 3: Compute the log of the class-component density. From (9.3), the class-
component density can be expressed as (Papoulis 1991),

p(x_t | ω_j, θ_j) = p(s_{j,t}) / |det(A_j)| . (9.5)

Thus,

log [ p(x_t | ω_j, θ_j) ] = log [ p(s_{j,t}) ] − log |det(A_j)| . (9.6)

The log of the prior probability p(s_{j,t}) in (9.6) can be approximated as (Lee et al.
2000),

log [ p(s_{j,t}) ] ∝ − ∑_{i=1}^{M} ( φ_{j,i} log [ cosh(s_{j,i,t}) ] + s²_{j,i,t} / 2 ) , (9.7)

where φ_{j,i} is defined as the sign of the kurtosis of the ith independent component
corresponding to class j.
Step 4: Compute the posterior class probability for each pixel vector x_t using
the Bayes theorem,

P(ω_j | x_t, Θ) = [ p(x_t | ω_j, θ_j) P(ω_j) ] / [ ∑_{k=1}^{K} p(x_t | ω_k, θ_k) P(ω_k) ] . (9.8)
Step 5: Adapt A_j for each class j using gradient ascent, where the gradient
is approximated using an extended information-maximization learning rule
(Lee et al. 1999),

ΔA_j ∝ P(ω_j | x_t, Θ) A_j ( I − Φ_j tanh(s_{j,t}) s′_{j,t} − s_{j,t} s′_{j,t} ) , (9.10)

where Φ_j = diag(φ_{j,1}, ..., φ_{j,N}) and I is the N × N identity matrix.
Step 6: Update the bias b_j by using the pixel vectors from r = 1 to t, where t is
the index of the current pixel vector being processed,

b_j = [ ∑_{r=1}^{t} x_r P(ω_j | x_r, Θ) ] / [ ∑_{r=1}^{t} P(ω_j | x_r, Θ) ] . (9.11)
End For
Step 9: Use the Bayes decision rule to determine the class allocation for each pixel
vector,

x_t ∈ ω_j if P(ω_j | x_t, Θ) ≥ P(ω_k | x_t, Θ) for all k = 1, ..., K . (9.13)
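The sketch below (NumPy; the names, and the source density as reconstructed in (9.7), are our assumptions) implements the evaluation part of the algorithm (steps 3, 4 and 9) for fixed parameters; the adaptation of A_j and b_j in steps 5 and 6 is omitted:

import numpy as np

def icamm_posteriors(X, A, b, priors, phi):
    # X: (T, N) pixel vectors; A: (K, N, N) mixing matrices;
    # b: (K, N) bias vectors; priors: (K,); phi: (K, N) kurtosis signs.
    T, K = X.shape[0], A.shape[0]
    log_pc = np.empty((T, K))
    for j in range(K):
        W = np.linalg.inv(A[j])                  # unmixing matrix W_j
        S = (X - b[j]) @ W.T                     # sources s_{j,t}, as in (9.4)
        log_ps = -np.sum(phi[j] * np.log(np.cosh(S)) + S**2 / 2, axis=1)  # (9.7)
        log_pc[:, j] = log_ps - np.log(abs(np.linalg.det(A[j])))          # (9.6)
    log_post = log_pc + np.log(priors)           # Bayes numerator, step 4
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)      # P(w_j | x_t, Theta)
    return post, post.argmax(axis=1)             # step 9: argmax allocation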
9.3
Experimental Methodology
The steps involved in our experimental methodology for classification of hy-
perspectral data using the ICAMM algorithm are depicted in Fig. 9.1. First, we
remove the water absorption bands and some noisy bands as observed from
visual inspection of the dataset. Next, we perform preprocessing of the data, also
known as feature extraction, to reduce the dimensionality further in order to
satisfy the N = M assumption for implementing the ICAMM algorithm.
Fig. 9.1. Proposed experimental methodology for unsupervised classification of hyperspec-
tral data [flowchart: hyperspectral data → eliminate water absorption and noisy bands →
feature extraction → feature ranking (by variance or projection order criterion) → feature
selection → unsupervised classification]
9.3.1
Feature Extraction Techniques
Due to a large number of narrow spectral bands and their contiguous nature,
there is a significant degree of redundancy in the data acquired by hyperspec-
tral sensors (Landgrebe 2002). In classification studies involving data of high
spectral dimension, it is desirable to select an optimum subset of bands, in
order to avoid the Hughes phenomenon (Hughes 1968) and parameter esti-
mation problems due to interband correlation, as well as to reduce computa-
tional requirements (Landgrebe 2002; Shaw and Manolakis 2002; Tadjudin and
Landgrebe 2001; Richards and Jia 1999). This gives rise to the need to develop
algorithms for reducing the data volume significantly. From the perspective of
statistical pattern recognition, feature extraction refers to a process, whereby
a data space is transformed into a feature space, in which the original data
9.3.2
Feature Ranking
9.3.3
Feature Selection
Once the features are ranked in the order of significance, we perform the
task of feature selection, i.e., we determine the optimal number of features
to be retained, which leads to a high compression ratio and at the same time
minimizes the reconstruction error. This is indicated by the feature selection
stage in Fig. 9.1, where we select the first M features from the features extracted
by each of the feature extraction techniques. Recall our discussion in Sect. 9.2,
pertaining to the assumption about the restriction on the number of sources
(M) and the number of sensor observations (N) in the ICAMM algorithm. We
have made an assumption that the number of sources is equal to the number
of sensor observations (N = M). This simplification is justified due to the fact
that if N > M, the dimensions of the sensor observation vector Xt can always
be reduced so that N = M. Such a reduction in dimensionality can be achieved
by all of the preprocessing or feature extraction techniques discussed above.
However, a detailed review of these feature extraction techniques reveals that,
other than PCA, none of the above mentioned feature extraction techniques
have a well-established theory on estimating the optimal number of ranked
features to be retained. In the case of PCA, the criterion for feature selection
can be stated as follows: the sum of the variances of the retained principal components
should exceed 99% of the total data variance. Hence, we consider PCA as
a technique for estimating the intrinsic dimensionality of the hyperspectral
data. PCA, though optimal in the mean square sense, suffers from various
limitations (Duda et al. 2000), and therefore, there is a need to investigate
the performance of other feature extraction techniques for hyperspectral data.
Thus, in addition to PCA, we consider SPCA, OSP and PP as feature extraction
techniques.
9.3.4
Unsupervised Classification
9.4
Experimental Results and Analysis
To comparatively evaluate the classification performance, we utilize the overall
accuracy measure, computed as the percentage of correctly classified
pixels among all the pixels. The overall accuracy obtained by the ICAMM
algorithm varies from run to run. This is due to the random initialization of
the mixing matrix and the bias vectors for the classes. Hence, we present the
results averaged over 100 runs of the ICAMM algorithm. We also provide fur-
ther insight into the ICAMM derived classification obtained from one run by
critically analyzing the classification map and its corresponding error matrix.
Individual class accuracies have been assessed through producer's and user's
accuracy measures (see Sect. 2.6.5 of Chap. 2 for details on accuracy measures).
Data from the AVIRIS sensor, has been used for this experiment (see
Sect. 4.4.3 of Chap. 4 for details on this dataset). For computational efficiency,
a sub-image of size 80 × 40 pixels has been selected here. Some of the bands
(centered at 0.48 μm, 1.59 μm, and 2.16 μm) are depicted in Fig. 9.2a-c. The
Fig. 9.2a-e. Three images (80 × 40 pixels) centered at a 0.48 μm, b 1.59 μm, and c 2.16 μm.
These images are a subset of the data acquired by the AVIRIS sensor. d Corresponding refer-
ence map for this dataset consisting of four classes: background, corn-notill, grass/trees and
soybeans-min. e Mean intensity value of each class plotted as a function of the wavelength
in μm. For a colored version of this figure, see the end of the book
Table 9.1. Number of reference pixels for each class

Background       594
Corn-notill      640
Grass/Trees      383
Soybeans-min    1583
Total           3200
crop canopies had approximately 5% cover, the rest being soil covered with the
residue of the previous year's crop. Given this low canopy cover, the variation in
spectral response due to (i) soil type variations and (ii) the varying amount
of residue from the last season's crop may have a much greater influence upon the
net pixel spectral response than the variation in the type of class. The other
two classes, background and grass/trees, are distinguishable from
each other as well as from the other two classes in most wavelength regions.
Figure 9.3a-d shows the first six features produced by PCA,
SPCA, OSP and PP respectively. The first 10 principal components obtained
from PCA account for 99.67% of the total data variance, and thus have been used
for classification by the ICAMM algorithm, with N = M = 10. Similarly,
the features extracted by the remaining three techniques have been ranked
based on their corresponding ranking criteria. We select the first 10 features
extracted by each of these feature extraction techniques.
The first six features obtained by SPCA correspond to the first two principal
components obtained by performing the PCA transformation on each block of
correlated bands. It can be seen that the first features obtained by PCA and
SPCA extract almost the same information. We investigate the cause of this
similarity. Since PCA has the significant property that it concentrates the data
variability to the maximum extent possible into the first few components, and
since the variability of the data is scale-dependent, PCA is sensitive to the scaling of
the data to which it is applied.
For example, if the intensity values of one or some of the hyperspectral
bands are arbitrarily doubled, their contribution to the variance of the dataset
will be increased fourfold, and they will therefore be found to contribute more
to the earlier eigenvalues and eigenvectors. With this remark, it is of interest
to carefully observe the first feature in Fig. 9.3a,b as well as the image of
the spectral band centered at 0.48 μm (Fig. 9.2a). These three images appear
almost identical, and hence we conclude that it is one or some of the bands
in the first block of highly correlated bands employed for SPCA which have
contributed the most to the first feature obtained by PCA performed on the
entire data.
The improvement provided by SPCA over PCA, in enhancing interclass sep-
arability as determined from visual inspection, can be comprehended through
the fact that the third, fourth, fifth and the sixth features generated by PCA do
not exhibit high data variability, while all the six features (shown in Fig. 9.3b)
Fig. 9.3a-d. First six features extracted by a PCA, b SPCA, c OSP, and d PP applied on the
AVIRIS data. The features extracted by PCA and SPCA are ranked in decreasing order of
variance. For OSP and PP, the features are ranked in the order of projections
information pertaining to this class is split into several features. The first three
features from PP (Fig. 9.3d) reveal that, though the information pertaining to
the classes background and grass/trees has been significantly enhanced, PP does not
capture enough information to distinguish the remaining two classes
(i.e. corn-notill and soybeans-min). However, it is interesting to note that the
fourth feature generated by PP exhibits a significant amount of information
when compared with the fourth feature generated by each of PCA, SPCA and
OSP. A remark applicable to each of these feature extraction techniques is that
they show excellent performance in extracting the information correspond-
ing to the classes background and grass/trees, but provide little variability
between the remaining two classes. This observation confirms the earlier dis-
cussion, where we had noted that corn-notill and soybeans-min have almost
equal mean spectral responses in many of the bands, meaning that these two
classes are not linearly separable.
The accuracies of the classifications produced by the ICAMM and the K-means
algorithms are presented in Table 9.2. It can be seen from this table that the
SPCA technique leads to the highest mean overall accuracy (i.e. 61.4%) as
compared to those achieved by the other feature extraction techniques (i.e.
PCA, OSP and PP). The minimum overall accuracy obtained by SPCA is also
higher than the others. The maximum overall accuracy (i.e. 76.4%) has been
achieved by OSP, but it is only marginally higher than the maximum
overall accuracy achieved by SPCA.
On comparing the classification accuracy of the ICAMM algorithm with
that obtained by the K-means algorithm, it can be seen from Table 9.2 that
the mean overall classification accuracies obtained by the ICAMM algorithm
for all the feature extraction techniques are significantly higher than those
obtained from the K-means classification algorithm. In particular, for the
dataset preprocessed by SPCA, the minimum (53.4%), maximum (75.9%) and
mean (61.4%) overall accuracies attained by the ICAMM algorithm far exceed
the overall accuracy produced by the K-means algorithm (i.e. 48.3%).
We further analyze the performances of the ICAMM and the K-means clas-
sification algorithm through their classification maps and the corresponding
error matrices. In Fig. 9.4a-d and Fig. 9.4e-h, we present the classification
Table 9.2. Comparison of overall classification accuracy (%) achieved by the ICAMM and
K-means algorithms applied on the AVIRIS dataset
Fig. 9.4a-h. Classification maps obtained by the ICAMM classification algorithm applied on
data preprocessed by a PCA (Overall Accuracy = 68.6%), b SPCA (Overall Accuracy = 71.7%),
c OSP (Overall Accuracy = 73.1%), and d PP (Overall Accuracy = 61.6%). Classification maps
obtained by the K-means classifier applied on data preprocessed by e PCA (Overall Accuracy
= 50.5%), f SPCA (Overall Accuracy = 48.3%), g OSP (Overall Accuracy = 51.7%), and h PP
(Overall Accuracy = 52.5%)
maps obtained by the ICAMM and the K-means algorithms applied on the
AVIRIS data preprocessed by each of the feature extraction techniques. In
Table 9.3 to Table 9.6, we present the error matrices for the classifications ob-
tained by performing the ICAMM and K-means algorithms on the features ex-
tracted by each of PCA, SPCA, OSP and PP respectively.
Though there is no optimal measure for selecting a subset of features from
those obtained by OSP, it performs considerably better than PCA and PP in
terms of the minimum, mean, and maximum overall classification accuracies.
PCA detects all the four classes and its mean overall accuracy is comparable
to that of OSP. However, its classification map, as shown in Fig. 9.4a, reveals a 'noisy'
classification, leading to lower overall accuracies. In the case of PP, once again
there is no optimal measure for determining the number of features to be
selected, and a mean overall accuracy of 56.7% is obtained. Figure 9.4d shows
its poor performance in classifying corn-notill, a major portion of which is
classified as soybeans-min. This was expected from the earlier analysis, which
showed that the features produced by PP failed to capture information pertinent
to these two classes.
On comparing the classification maps obtained by the ICAMM algorithm,
via all the four feature extraction techniques, with those obtained by the K-
Table 9.3. Error matrix for the classification of data preprocessed by PCA. a ICAMM
classification algorithm and b K-means classification algorithm

a
                             Reference
                     C1    C2    C3    C4   Total
Classification C1   321    23     3    63    410
Map            C2    29   489     0   291    809
               C3   201    53   380   223    857
               C4    43    75     0  1006   1124
Total               594   640   383  1583   3200

b
                             Reference
                     C1    C2    C3    C4   Total
Classification C1    43     0     0   331    374
Map            C2   129   154    21   176    480
               C3   293     1   362    19    675
               C4   129   485     0  1057   1671
Total               594   640   383  1583   3200

Producer's Accuracy (%)   7.2  24.1  94.5  66.8
User's Accuracy (%)      11.5  32.1  53.6  63.3
Overall Accuracy (%) = 50.5

C1 - Background, C2 - Corn-notill, C3 - Grass/Trees, C4 - Soybeans-min
Table 9.4. Error matrix for the classification of data preprocessed by SPCA. a ICAMM
classification algorithm and b K-means classification algorithm

a
                             Reference
                     C1    C2    C3    C4   Total
Classification C1   350    60     1    52    463
Map            C2    35   521     0   390    946
               C3   154    18   382    99    653
               C4    55    41     0  1042   1138
Total               594   640   383  1583   3200

b
                             Reference
                     C1    C2    C3    C4   Total
Classification C1    42     0     0   288    330
Map            C2   129   162    10   271    572
               C3   311     1   373    55    740
               C4   112   477     0   969   1558
Total               594   640   383  1583   3200
error matrix (Table 9.3b), where 485 pixels out of the total 640 pixels belonging
to the class corn-notill have been classified as soybeans-min. This misclassifica-
tion caused by the K-means algorithm is compensated by significantly higher
accuracies (both producer's and user's) for the class soybeans-min. The lowest
producer's and user's accuracies for the K-means algorithm are obtained for
the class background (7.2% and 11.5% respectively). In the case of the K-means
algorithm applied on the data preprocessed by SPCA (Table 9.4b), similar
conclusions can be drawn about the accuracies of the class background. These
accuracies are as low as 7.1% (producer's) and 12.7% (user's). However, the
accuracies obtained by the K-means algorithm for the class background are rel-
atively higher for the OSP and PP preprocessed data. For the SPCA preprocessed
data, ICAMM exhibits very high accuracy (99.7%) in classifying grass/trees.
Thus, based on the classification accuracies (Table 9.2), the classification
maps (Fig. 9.4) and the corresponding error matrices (Table 9.3 to Table 9.6),
Table 9.5. Error matrix for the classification of data preprocessed by OSP. a ICAMM clas-
sification algorithm and b K-means classification algorithm
we conclude that the AVIRIS data preprocessed by SPCA and classified by the
ICAMM algorithm, exhibits an improved overall classification performance.
Additionally, these results clearly show that the ICAMM algorithm outper-
forms the conventional K-means algorithm for the classification of AVIRIS
data considered in this experiment.
9.5
Summary
The primary issue that motivated this research is the limitation of Gaussian
mixture model based classification algorithms. The underlying Gaussian dis-
tribution assumption is limited in the sense that the Gaussian mixture model
exploits only second order statistics of the observed data to estimate the poste-
Table 9.6. Error matrix for the classification of data preprocessed by PP. a ICAMM classifi-
cation algorithm and b K-means classification algorithm

a
                             Reference
                     C1    C2    C3    C4   Total
Classification C1   275     8    26    71    380
Map            C2    27   191     0   192    410
               C3   211    47   357   173    788
               C4    81   394     0  1147   1622
Total               594   640   383  1583   3200
rior densities. Quantifying the higher order statistics of each class, rather than
just estimating the mean and covariance, in order to better fit the data to a parametric
class probability density function, seems desirable.
In this chapter, we have described a novel approach, derived from ICA, for
unsupervised classification of non-Gaussian classes in hyperspectral remote
sensing imagery. This approach models class distributions with non-Gaussian
densities, formulating the ICA mixture model (ICAMM). We demonstrated
the successful application of the ICAMM algorithm in classifying hyperspec-
tral remote sensing data. In particular, we employed the AVIRIS dataset for
our experiment. Four feature extraction techniques - Principal Component
Analysis (PCA), Segmented Principal Component Analysis (SPCA), Orthogo-
nal Subspace Projection (OSP) and Projection Pursuit (PP) were employed as
a preprocessing step to reduce the dimensionality of the hyperspectral data.
For the AVIRIS dataset preprocessed by each of the four feature extraction
techniques, the ICAMM classification algorithm produced significantly higher
accuracy as compared to the K-means algorithm. Similar results have been
obtained for the data acquired by another hyperspectral sensor - HYDICE and
have been reported in Shah (2003).
References
Bae U-M, Lee T-W, Lee S-Y (2000) Blind signal separation in teleconferencing using ICA
mixture model. Electronic Letters 36: 680-682
Chang C-I, Chiang S-S, Smith JA, Ginsberg IW (2002) Linear spectral random mixture anal-
ysis for hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing
40(2): 375-392.
Comon P (1994) Independent component analysis, a new concept? Signal Processing 36:
287-314.
Duda RO, Hart PE, Stork DG (2000) Pattern classification, 2nd edition. John Wiley and Sons,
New York
Harsanyi JC, Chang C-I (1994) Hyperspectral image classification and dimensionality reduc-
tion: An orthogonal subspace projection approach. IEEE Transactions on Geoscience
and Remote Sensing 32: 779-785
Hughes GF (1968) On the mean accuracy of statistical pattern recognition. IEEE Transactions
on Information Theory 14: 55-63
Hyvarinen A, Oja E (2000) Independent component analysis: algorithms and applications.
Neural Networks 13: 411-430
Ifarraguerri A, Chang C-I (2000) Unsupervised hyperspectral image analysis with projection
pursuit. IEEE Transactions on Geoscience and Remote Sensing 38: 2529-2538.
Landgrebe D (2002) Hyperspectral image data analysis. IEEE Signal Processing Magazine
19: 17-28
Lee T-W, Girolami M, Sejnowski TJ (1999) Independent component analysis using an ex-
tended infomax algorithm for mixed sub-Gaussian and super-Gaussian sources. Neural
Computation 11: 417-441
Lee T-W, Lewicki MS, Sejnowski TJ (2000) ICA mixture models for unsupervised classifica-
tion of non-Gaussian classes and automatic context switching in blind signal separation.
IEEE Transactions on Pattern Analysis and Machine Intelligence 22: 1078-1089
Lee T-W, Lewicki MS (2002) Unsupervised image classification, segmentation, and enhance-
ment using ICA mixture models. IEEE Transactions on Image Processing 11: 270-279
Papoulis A (1991) Probability, random variables, and stochastic processes, 3rd edition.
McGraw Hill, New York
Richards JA, Jia X (1999a) Remote sensing digital image analysis: an introduction. Springer-
Verlag, Berlin
Richards JA, Jia X (1999b) Segmented principal components transformation for efficient
hyperspectral remote sensing image display and classification. IEEE Transactions on
Geoscience and Remote Sensing 37: 538-542
Robila SA (2002) Independent component analysis feature extraction for hyperspectral
images, PhD Thesis, Syracuse University, Syracuse, NY
Robila SA, Varshney PK (2002a) Target detection in hyperspectral images based on inde-
pendent component analysis. Proceedings of SPIE Automatic Target Recognition XII,
4726, pp 173-182
Robila SA, Varshney PK (2002b) A fast source separation algorithm for hyperspectral image
processing. Proceedings of IEEE International Geoscience and Remote Sensing Sympo-
sium (IGARSS), pp 3516-3518
Shah CA (2003) ICA mixture model algorithm for unsupervised classification of
multi/hyperspectral imagery. M.S. Thesis, Syracuse University, Syracuse, NY
Shah CA, Arora MK, Varshney PK (2004) Unsupervised classification of hyperspectral data:
an ICA mixture model based approach. International Journal of Remote Sensing 25:
481-487
Shaw G, Manolakis D (2002) Signal processing for hyperspectral image exploitation. IEEE
Signal Processing Magazine 19: 12-16
Stark H, Woods JW (1994) Probability, random processes, and estimation theory for engi-
neers, 2nd edition. Prentice Hall, Upper Saddle River, NJ
Tadjudin S, Landgrebe DA (2001) Robust parameter estimation for mixture model. IEEE
Transactions on Geoscience and Remote Sensing 38: 439-445
CHAPTER 10
Support Vector Machines for Classification
of Multi- and Hyperspectral Data
Pakorn Watanachaturaporn, Manoj K. Arora

10.1
Introduction
the structure of an SVM is less complex even with the high dimensional data.
Neural networks, for example, have a very complex structure for processing
high dimensional data. Finally, as compared to another recent nonparametric
classifier namely the decision tree classifier, an SVM does not require the gen-
eration of rules that heavily depend on the knowledge from experts. This is
crucial to achieve high classification accuracy.
Classification of remote sensing data using SVMs has been introduced re-
cently. Gualtieri and Cromp (1998) and Gualtieri et al. (1999) used SVMs to
classify an AVIRIS image. The results were compared with those obtained
from 'the extraction and classification of homogeneous objects' (ECHO) clas-
sifier and the Euclidean classifier (Tadjudin and Landgrebe, 1998). Accuracies
of 87.3%, 82.9%, and 48.2% were obtained from the SVM, ECHO, and Euclidean
classifiers respectively. Zhu and Blumberg (2002) classified an ASTER
image using SVMs. They reported the results of a classification experiment
with an SVM using polynomial kernels of different degrees and the radial ba-
sis function (RBF) kernel. The overall accuracies obtained were in the range of
87% to 91%. Melgani and Bruzzone (2002) used an SVM to classify an AVIRIS
image and compared that with K-Nearest Neighbors (K-NN), and RBF neu-
ral network (RBF-NN) classifiers, and obtained accuracies of 93.42%, 83.94%,
and 86.99% from the SVM, K-NN, and RBF classifiers respectively. Huang et al.
(2002) intensively investigated the accuracy obtained from SVM classifiers on
a Landsat TM image. They reported an accuracy of 75.62% from the SVM,
74.02% from the back propagation neural network, 73.31% from the decision
tree, and 71.76% from the maximum likelihood classifiers. Shah et al. (2003)
reported the accuracies attained by SVM, MLC, and back-propagation neural
network classifiers when applied on an AVIRIS image with and without the use
of supervised feature extraction techniques. They used the LSVM (see Chap. 5)
with a linear kernel, polynomial kernels of different degrees, an RBF kernel, and
a sigmoid kernel. They showed that accuracies of 90.9% to 97.2% were obtained
from SVM classifiers, while the maximum likelihood and neural network clas-
sifiers could achieve accuracies of only 59.4% and 61.9% respectively on the full
dimensionality dataset. They also showed that the classification results from
the full dimensionality dataset were better than the classification of features
obtained using the discriminant analysis feature extraction (DAFE) and the
decision boundary feature extraction (DBFE) techniques (Lee and Landgrebe
1993).
From these limited studies, it can be seen that in all the cases SVM derived
classifications of both multi- and hyperspectral datasets produced the highest
accuracy. However, there are many issues, which need to be further investigated
before SVM classifiers can be implemented at the operational level.
The aim of this chapter is to understand the application of SVMs and the
behavior of associated parameters for the classification of multi and hyper-
spectral remote sensing data. In the next section, the details of parameters
considered are provided. Section 10.3 describes the remote sensing data used
to produce SVM classification. The experimental set up, results and their analyses
are presented in Sect. 10.4. A summary is provided in Sect. 10.5.
10.2
Parameters Affecting SVM Based Classification
In practice, the training data do not comprise only pure pixels. Some of
them are mixed pixels, due either to noise present in the training data or to
the mixing of classes during the selection of training data caused by mislabeling,
thereby introducing classification errors. A penalty parameter is introduced to
handle this error. The penalty parameter, also called the C value, depends
on the training data and the type of kernel. In general, it is found by trial and
error.
A kernel function determines the characteristic of an SVM. When the classes
in the input data are separable using a linear hyperplane, a linear kernel is used.
Similarly, a nonlinear kernel function may be used to separate data when the
classes are nonlinearly separable. However, it is very difficult to determine
which non-linear kernel function is suitable for a given application. In prac-
tice, a user will apply standard kernel functions such as a linear, polynomial,
RBF (also known as the Gaussian kernel), or sigmoid kernel. If these standard
kernel functions are unable to perform adequately, a user may have to design
a problem-specific kernel function. Moreover, the choice of the kernel further
brings in another issue that a user has to consider: determining the values
of the parameters for a given kernel function. These parameters may be called
hyperparameters. For example, in the case of an RBF kernel, the hyperparameter
that needs to be determined is σ (see Sect. 5.3.3).
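For reference, the RBF (Gaussian) kernel and its hyperparameter σ can be written as the small function below (an illustrative sketch following the common definition; see Sect. 5.3.3 for the exact form used in this book):

import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))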
Thus, finding the parameters or hyperparameters that are able to accurately
predict the unknown data for a given application requires a method of model
selection or parameter search. The unknown data, in this case, refers to testing
samples or the image to be classified during the allocation stage. It is also a fact
that a model that achieves high training accuracy may not guarantee high
accuracy for the unknown data. Therefore, to establish that the model works
well on the whole dataset (i. e. the model has high generalization capacity),
a common strategy is to split the data into two parts - a training set, and
a validation set or a testing set. An SVM is trained on the training set and is
evaluated for its accuracy on the testing set. Another model selection method
is called k-fold cross-validation. In this method, the training data are divided
into k subsets of equal size. An SVM is trained on the k - 1 subsets of data
and is tested on the remaining one subset. Training and testing are performed
on the input data k times with different k - 1 subsets for training and the
remaining one subset for testing. In other words, the testing subset is used to
assess the accuracy of the model trained by the k-1 subsets. After repeating the
above process k times, the overall accuracy measured by the cross-validation
method is the percentage of the number of correctly classified data for all k
testing subsets. The model that produces the highest overall accuracy is used
to classify data during the allocation stage.
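A bare-bones version of this procedure might look as follows (Python/NumPy; train_fn and predict_fn are hypothetical placeholders for whatever SVM implementation is in use):

import numpy as np

def k_fold_accuracy(train_fn, predict_fn, X, y, k=5, seed=0):
    # Split the training data into k subsets; train on k-1 of them and
    # test on the remaining one, rotating k times.
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    correct = 0
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])
        correct += np.sum(predict_fn(model, X[test]) == y[test])
    return correct / len(X)   # percentage correct over all k testing subsets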
A straightforward way to find a judicious combination of the penalty value
and the hyperparameters of kernel functions is to use a grid-search method.
In the grid-search method, various combinations of the penalty value and
the hyperparameters are formed and their effect on the accuracy is assessed
by using either a validation set or k-fold cross-validation. The model that
gives the best accuracy is selected. Applying the grid-search method is very
time consuming because it acts on every combination of the penalty value
and hyperparameters. To reduce the model selection time, one may perform
a coarse grid-search first. For example, the penalty value may take values a factor
of 10² apart, such as 10⁻⁵, 10⁻³, ..., 10⁵. Then, after identifying the best classification
accuracy region obtained from the coarse grid-search, a finer grid-search may
be conducted over that region only. For instance, the best accuracy from the
coarse grid search might be between the penalty values of 10¹ and 10³. The finer
grid search might then be performed over penalty values with finer resolution (e.g.
10¹, 10¹·¹, ..., 10³). The final model is trained using the whole training set with
the hyperparameters identified by the grid-search method.
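The coarse-to-fine search just described can be sketched as follows (Python/NumPy; evaluate is a hypothetical function returning, say, the cross-validated accuracy for a given penalty value):

import numpy as np

def grid_search_penalty(evaluate):
    coarse = 10.0 ** np.arange(-5, 6, 2)               # 10^-5, 10^-3, ..., 10^5
    best = max(coarse, key=evaluate)                   # coarse pass
    fine = best * 10.0 ** np.arange(-1.0, 1.05, 0.1)   # finer resolution around it
    return max(fine, key=evaluate)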
Another issue related to SVM classifiers is the selection of appropriate multi-
class method (see Sect. 5.4). SVMs were initially developed to perform binary
classification, though applications of binary classification are very limited.
More practical applications involve multiclass classification. For example, in
remote sensing, land cover classification is generally a multiclass problem.
A number of methods have been proposed to employ SVMs to produce mul-
ticlass classification. Most of the methods generate multiclass classifications
from a number of binary SVM classifiers. All the methods have their own merits
and demerits. For example, the one against the rest approach gives an advan-
tage in terms of simplicity and requires fewer binary classifiers. However, many
have argued that in this method the number of training samples of each class for
each binary classifier becomes significantly imbalanced. To solve this problem,
methods such as a re-sampling method have been proposed. In the re-sampling
method, pseudo training data are created by copying existing training data and
adding/subtracting noise. Alternatively, a new training area is identified for
a class and a number of training data are selected randomly within that area.
The latter approach may help in improving the accuracy in some cases. How-
ever, it results in increased training time since the number of training data
is increased due to the use of the pseudo training data. Both the approaches
are simpler than the pairwise and directed acyclic graph (DAG) approaches.
However, experiments show that the training time of the one against the rest
approach, both with and without balanced training data, is longer and its accuracy is
lower. On the contrary, the pairwise and DAG approaches require more binary
classifiers, but they take less training time and give more accurate results.
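The difference in the number of binary classifiers is easy to make concrete. For L classes, one against the rest trains L binary problems, whereas the pairwise and DAG approaches train one per unordered pair of classes; a tiny illustrative helper:

    def n_binary_classifiers(L):
        # One against the rest: one classifier per class.
        one_vs_rest = L
        # Pairwise and DAG: one classifier per pair of classes.
        pairwise = L * (L - 1) // 2
        return one_vs_rest, pairwise

    # For the six land cover classes used later in this chapter:
    print(n_binary_classifiers(6))   # prints (6, 15)

Although the pairwise and DAG approaches need 15 rather than 6 classifiers here, each binary problem involves only the samples of its two classes, which is one reason their training can still be faster.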
10.3
Remote Sensing Images
10.3.1
Multispectral Image
A UTM rectified Landsat 7 ETM+ multispectral image acquired in 8 bands
has been used. The image was acquired in 1999 and covers an urban region of
Syracuse, NY. Only 7 bands were considered; the panchromatic band was excluded
due to its finer spatial resolution than that of the remaining bands.
Approximately 10% of the area in the image selected is covered by water
whereas the remaining 90% covers a combination of built up area, small veg-
etation, and trees. The size of the image is 445 x 595 pixels, which has been
resampled to 25 m at the source (see Fig. 10.1).
A large number of pure pixels were extracted from the Landsat ETM+ image
for the six classes of interest, namely water, highways/runways, grassland,
Table 10.1. Total pure pixels, training, and testing pixels from the Landsat ETM+ image
Fig. 10.1. Color composite from the Landsat 7 ETM+ multispectral image. For a colored
version of this figure, see the end of the book
commercial areas, trees and residential areas. UTM rectified IKONOS MSS
and PAN images at 4 m and 1 m spatial resolution, respectively, and previous
knowledge about the study area were used as reference data (ground data)
to determine the class identity of the pure pixels. From the database of pure
pixels, training and testing datasets consisting of 200 randomly selected
pixels of each class were generated (Table 10.1).
10.3.2
Hyperspectral Image
The hyperspectral image used here comes from the AVIRIS sensor (see Sect. 4.4.3
of Chap. 4 for more details on this dataset).
In our experiments, similar to other studies, twenty water absorption
bands, numbered [104-108], [150-163], and 220, were removed from the
original image. In addition, fifteen noisy bands, [1-3], 103, [109-112], [148-149],
[164-165], and [217-219], as observed from visual inspection, were also
discarded. A pure pixel database was created from which a number of training
and testing pixels for each class were randomly selected (Table 10.2).
Table 10.2. Total number of pure pixels, training, and testing pixels from the AVIRIS image
Class                      Total    Training    Testing
Alfalfa                       13           7          6
Corn-notill                  234         117        117
Corn-min                     140          70         70
Corn                          60          30         30
Grass/pasture                 94          47         47
Grass/trees                  116          58         58
Grass/pasture-mowed           18           9          9
Hay-windrowed                160          80         80
Oats                          26          13         13
Soy-notill                   231         115        116
Soy-mintill                  291         146        145
Soy-clean                     91          45         46
Wheat                         49          25         24
Woods                        208         104        104
Bldg-grass-trees-drives       81          40         41
Stone-steel towers            19          10          9
10.4
SVM Based Classification Experiments
This section illustrates the use of SVMs to produce land cover classification
from multi and hyperspectral remote sensing data described in the previous
section. The effect of a number of factors on the accuracy of classification
produced by the SVM classifier has been investigated. These factors are: the
selection of the multiclass method, the choice of the optimizer, and the type of
kernel function. As a result, a number of SVM classifications have been performed.
The accuracy of all the classifications has been assessed using the most widely
used measure, namely the overall accuracy obtained from the error matrix (see
Sect. 2.6.5 of Chap. 2).
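Overall accuracy is simply the number of correctly classified testing samples (the diagonal of the error matrix) divided by the total number of testing samples; a minimal Python sketch with a hypothetical three-class matrix:

    import numpy as np

    def overall_accuracy(error_matrix):
        # Rows: reference classes; columns: assigned classes.
        m = np.asarray(error_matrix, dtype=float)
        return np.trace(m) / m.sum()

    # Hypothetical error matrix for three classes.
    print(overall_accuracy([[50, 2, 1],
                            [3, 45, 2],
                            [0, 4, 43]]))   # about 0.92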
10.4.1
Multiclass Classification
To perform multiclass classification, four methods have been used to examine
the effect of each on classification accuracy. These methods are one against the
rest, one against the rest with a balanced number of training data, pairwise clas-
sification, and directed acyclic graph (DAG). A number of classifications using
the Lagrangian Support Vector Machine and the linear kernel for different val-
ues of C have been produced. The variations in classification accuracy obtained
by applying each multiclass method are analyzed with the help of the plots shown
in Fig. 10.2 and Fig. 10.3 for the multi- and hyperspectral datasets, respectively.
Fig. 10.2. Overall accuracy of multiclass approaches applied on the multispectral image
Fig. 10.3. Overall accuracy of multiclass approaches applied on the hyperspectral image
It can be seen that for both the datasets, the pairwise and DAG methods re-
sult in significantly higher accuracies than the other two methods. Both these
methods show a similar trend with accuracy reaching 95%, even for a small
value of C. The other two methods could attain an accuracy of only 70%. In
all the methods, as the value of C is increased, the accuracy increases but gets
saturated at a certain value of C in case of multispectral data. However, in the
hyperspectral dataset, the accuracy drops after attaining a maximum value for
each method. This shows that there is an optimum value of C irrespective of
the multiclass method used for the classification of a certain dataset. Thus, the
choice of C is highly data dependent.
10.4.2
Choice of Optimizer
Fig. 10.4. Overall accuracy resulting from different optimizers for the multispectral image
Fig. 10.5. Training time when using different optimizers on the multispectral image
Fig. 10.6. Testing time when using different optimizers on the multispectral image
and 10^3 (Fig. 10.7). For larger values of C, while the accuracy of the CD method
remains the same, it drops for LSVM.
In terms of training and testing times (Fig. 10.8 and Fig. 10.9), both the
optimizers take very little time and are very close to each other up to a C value
of 5, where the accuracy reaches its maximum. For values of C greater than 5,
the LSVM takes a longer time than the CD method. But this becomes immaterial
since the maximum accuracy has already been achieved at C = 5.
Fig. 10.7. Overall accuracy of different optimizers for the hyperspectral image
Fig. 10.8. Training time when using different optimizers on the hyperspectral image
Fig. 10.9. Testing time when using different optimizers on the hyperspectral image
10.4.3
Effect of Kernel Functions
The choice of a proper kernel function also plays an important role in SVM
based classification. Four types of kernels, namely the linear kernel, the poly-
nomial kernel, the RBF kernel, and the sigmoid kernel have been investigated.
Six polynomial kernels with degrees varying from 2 to 7 have been used. Thus,
the effect of 9 kernel functions on the accuracy of classification implemented
using the LSVM optimizer and the pairwise multiclass method for different
values of the penalty parameter has been assessed. Classification of both multi-
and hyperspectral datasets has been performed. The variation in the overall
accuracy of multispectral classification over different C values is presented in
the form of plots drawn for different types of kernel functions (Fig. 10.10 to
Fig. 10.14).
It can be seen from these plots that the use of all the kernels, except the
sigmoid function, results in an accuracy of more than 90%. The sigmoid ker-
nel could attain a maximum accuracy of 79% at C = 0.5, which gradually
dropped with any increase in C. This shows a relatively poor performance
of this kernel with respect to others. In contrast, all other kernels showed
an increase in accuracy as C increased and reached their maximum after
which accuracies remained constant in general. This shows that there is an
optimum C value for a particular kernel. Focusing on the performance of
only polynomial kernels, it can be seen that the polynomials with degrees 2
to 4 produced very low accuracies (of the order of 20%) when C was very
small. But the accuracy shoots up with a small increase in C. On the other
hand, the initial accuracy obtained from polynomials with higher degrees
(from 5 to 7) is high (of the order of 70% to 80%) at small C values. Also,
Fig. 10.10. SVM performance using the linear kernel applied on the multispectral image
Fig. 10.11. SVM performance using the polynomial kernels of degrees 2 to 4 applied on the
multispectral image
Fig. 10.12. SVM performance using the polynomial kernels of degrees 5 to 7 applied on the
multispectral image
Fig. 10.13. SVM performance using the Radial Basis Function kernel applied on the multi-
spectral image
the maximum. This shows that there is a limiting C value for all the kernel
functions.
In the case of hyperspectral data (see Fig. 10.15 to Fig. 10.19), generally,
a similar trend has been observed except for a couple of observations worth
mentioning. Although all the kernels achieved a maximum accuracy of more
than 90%, this maximum occurs at different C values for each kernel. For
Fig. 10.14. SVM performance using the sigmoid kernel applied on the multispectral image
Fig. 10.15. SVM performance using the linear kernel applied on the hyperspectral image
Fig. 10.16. SVM performance using the polynomial kernels of degrees 2 to 4 applied on the
hyperspectral image
Fig. 10.17. SVM performance using the polynomial kernels of degrees 5 to 7 applied on the
hyperspectral image
with any further increase in the C value unlike the accuracy achieved with
multispectral data which gets stabilized after the maximum has been achieved.
For the RBF and sigmoid kernels, the maximum occurs at very high values
of C. But, since an increase in C is directly proportional to the training time
required for LSVM optimization methods (as observed in the previous section),
any kernel function that can produce the highest accuracy at lower C values is preferable.
Fig. 10.18. SVM performance using the RBF kernel applied on the hyperspectral image
Fig. 10.19. SVM performance using the sigmoid kernel applied on the hyperspectral image
10.5
Summary
In this chapter, we have considered the application of SVMs for the classification
of multi and hyperspectral remote sensing datasets. We also investigated the
effect of various parameters on the accuracy of SVM based classification. It
is clear from the results that with an appropriate selection of the multiclass
method, optimizer and kernel function, accuracy of the order of 95% can be
obtained for the classification of both multi- and hyperspectral images.
The penalty value (the C value) has an important bearing on the performance
of any SVM classifier. However, this may have to be selected by trial and error
for a given dataset.
References
Byun H, Lee SW (2003) A survey on pattern recognition applications of support vector
machines. International Journal of Pattern Recognition and Artificial Intelligence 17:
459-486
Chang CC, Lin CJ (2002) LIBSVM: a library for support vector machines
(http://www.csie.ntu.edu.tw/~cjlin/libsvm)
Gualtieri JA, Cromp RF (1998) Support vector machines for hyperspectral remote sens-
ing classification. In: Merisko RJ (ed), Proceeding of the SPIE, 27th AIPR Workshop,
Advances in Computer Assisted Recognition, 3584, pp 221-232
Gualtieri JA, Chettri SR, Cromp RF, Johnson LF (1999) Support vector machine classifiers
as applied to AVIRIS data. Proceedings of Summaries of the Eighth JPL Airborne Earth
Science Workshop (ftp://popo.jpl.nasa.gov/pub/docs/workshops/99_docs/31.pdf)
Huang C, Davis LS, Townshend JR (2002) An assessment of support vector machines for
land cover classification. International Journal of Remote Sensing 23(4): 725-749
Joachims T (2002) The SVMlight package (http://svmlight.joachims.org or ftp://ftp-ai.cs.uni-
dortmund.de)
Keerthi SS (2002) Efficient tuning of SVM hyperparameters using radius/margin bound and
iterative algorithms. IEEE Transactions on Neural Networks 13: 1225-1229
Lee C, Landgrebe DA (1993) Feature Extraction and Classification Algorithms for High
Dimensional Data. Technical Report, TR-EE 93-1, School of Electrical Engineering,
Purdue University
Mangasarian OL, Musicant D (2000) Lagrangian support vector machines, Technical Report
(0006), Data Mining Institute, Computer Science Department, University of Wisconsin,
Madison, Wisconsin (ftp://ftp.cs.wisc.edu/pub/dmi/tech-report/0006.ps)
Melgani F, Bruzzone L (2002) Support vector machines for classification of hyperspectral
remote-sensing images. International Geoscience and Remote Sensing Symposium,
IGARSS'02, CD.
Shah CA, Watanachaturaporn P, Arora MK, Varshney PK (2003) Some recent results on hy-
perspectral image classification. Proceedings of IEEE Workshop on Advances in Tech-
niques for Analysis of Remotely Sensed Data, NASA Goddard Space Flight Center,
Greenbelt, MD, CD
Tadjudin S, Landgrebe DA (1998) Classification of High Dimensional Data with Limited
Training Samples. PhD thesis, School of Electrical Engineering and Computer Science,
Purdue University.
11.1
Introduction
that considers this spatial distribution within and between pixels in order to
produce maps at sub-pixel scale. This process is called super-resolution map-
ping (Tatem et al. 2002) or sub-pixel mapping (Verhoeye and Wulf 2002) to
distinguish it from sub-pixel classification. Thus, a sub-pixel map is a map
that is derived at an improved spatial resolution finer than the size of the pixel
of the coarse resolution image being classified. Tatem et al. (2002) provide an
excellent review on this subject. A range of algorithms based on knowledge-
based procedures (Schneider, 1993), Hopfield neural networks (Tatem et al.
2002) and linear optimization methods (Verhoeye and Wulf 2002) have been
proposed for sub-pixel mapping. Knowledge based procedures depend on the
accurate identification of boundary features that divide the mixed pixels into
pure components at improved resolution. The drawbacks of this technique
are:
Similarly, for Hopfield neural network and linear optimization based meth-
ods, the availability of an accurate sub-pixel classification derived from some
other techniques is a pre-requisite. Thus, the accuracy of the resulting sub-pixel
map is limited by the accuracy of the sub-pixel classification technique used.
Moreover, in these algorithms, the spatial dependence within and between
pixels is incorporated only after the fraction images from a sub-pixel classi-
fication technique are obtained. In contrast, the Markov random field (MRF)
model based algorithm, proposed here, neither relies on the availability of ac-
curate boundary features nor on sub-pixel classification produced from other
techniques. Under an MRF model, the intensity values of pixels in a particular
spatial structure (i. e. neighborhood) are allowed to have higher probability
(i. e. weight) than others. For instance, in a remotely sensed land cover classi-
fication, the spatial structure is usually in the form of homogenous regions of
land cover classes. As a result, an MRF model assigns higher weights to these
regions than to the isolated pixels thereby accounting for spatial dependence
in the dataset.
The aim of this chapter is to introduce an MRF model based approach for
obtaining a sub-pixel map from hyperspectral images. The approach is based
on an optimization algorithm whereby raw coarse resolution images are first
used to generate an initial sub-pixel classification, which is then iteratively
refined to accurately characterize the spatial dependence between the class
proportions of the neighboring pixels. Thus, spatial relations within and be-
tween pixels are considered throughout the process of generating the sub-pixel
map. Therefore, the proposed approach may be more suitable for sub-pixel
mapping as the MRF models can describe the spatial dependence in a more
accurate manner than algorithms proposed in Verhoeye and Wulf (2002) and
Tatem et al. (2002).
This chapter is organized as follows. In Sect. 11.2, the theoretical concept of
MRF models for sub-pixel mapping is presented. The details of an MRF based
11.2
MRF Model for Sub-pixel Mapping
Let Y be the observed coarse spatial resolution image having M x N pixels and
X be the fine resolution sub-pixel map (SPM) having aM x aN pixels where a is
the scale factor of the SPM. This means that a particular pixel in the observed
coarse resolution image contains a^2 pixels of the SPM. We assume that the pixels
in the fine resolution image are pure and that mixed pixels can only occur in the
observed coarse resolution image. Thus, more than one class can occupy a pixel
in a coarse resolution image. In general, a can be any positive real number but,
for simplicity, here it is assumed to be a positive integer. Also, let S
and T denote the sets of all sites (i.e. pixels) belonging to the observed image
and the SPM, respectively. Thus, the number of sites belonging to T will be a^2
times the number of sites in the set S. Let T_j = {t_1, ..., t_{a^2}} represent the set
of all the pixels in T that correspond to the same area as the pixel s_j in S. The
observed coarse resolution multi or hyperspectral image is usually represented
in vector form so that Y(s_j) \in R^K for the pixel s_j, where R denotes the set of
real numbers (e.g. intensity values) and K is the number of spectral bands. As
stated earlier, each pixel in the SPM is assumed pure. That is, its configuration
x(t) (i.e. attribute) denotes one and only one class. Hence, x(t) \in {1, ..., L}
can only take an integer value corresponding to the class at a pixel t in the
actual scene, where L is the number of classes. There can be L^{a^2 MN} dissimilar
sub-pixel maps X(T) \in {1, ..., L}^T, each having a different class allocation in
at least one pixel.
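The bookkeeping between the two grids can be made concrete with simple index arithmetic; a small hypothetical Python helper (not part of the algorithm itself):

    def subpixel_block(j, N, a):
        # Coarse pixel j (row-major index in an M x N image) corresponds
        # to an a x a block of sub-pixel (row, col) coordinates T_j in
        # the aM x aN sub-pixel map.
        r, c = divmod(j, N)
        return [(a * r + dr, a * c + dc)
                for dr in range(a) for dc in range(a)]

    # Coarse pixel 5 of a 4 x 4 image with scale factor a = 2 maps to
    # the four sub-pixels (2, 2), (2, 3), (3, 2), (3, 3).
    print(subpixel_block(5, 4, 2))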
It is further assumed that the SPM has the MRF property, i. e., the condi-
tional probability of a configuration (i. e. intensity value) of a pixel given the
configurations of the entire image excluding the pixel of interest is equal to
the conditional probability of the configuration of that pixel given the con-
figurations of its neighboring pixels. This can mathematically be represented
as

Pr( x(t) | x(T - {t}) ) = Pr( x(t) | x(N_t) ) ,   (11.1)

where T - {t} is the set of all the pixels in T excluding the pixel t, and N_t is
the set of pixels in the neighborhood of pixel t. For example, in the context of
classification of remotely sensed images, this property implies that the same
class is more likely to occur in connected regions than at isolated pixels. Hence,
the conditional probability density functions (pdfs) in (11.1) have a higher
value if the configuration of a pixel t is similar to the configurations of its
neighboring pixels than in the cases when it is not. From Winkler (1995) and
Bremaud (1999), the marginal PDF of X takes the form of a Gibbs distribution,
i.e.,

Pr( X = x ) = (1/Z) exp[ -\sum_{C} V_C(x) ] ,

where V_C is the Gibbs potential function defined over a clique C and Z is the
normalizing constant.
This model is also applied here to describe the SPM since, in general, the
distribution of classes is similar to the phenomenon described above (i. e.
classes occupying neighboring pixels are likely to be the same).
Assume a fine resolution remote sensing image with spatial resolution equal
to that of the SPM, such that each pixel of the image corresponds to only one class,
and that each class has a normal distribution. The pdfs, with mean vector \mu_l and
covariance matrix \Sigma_l for each class, are given by

Pr( z | x = l ) = \frac{1}{(2\pi)^{K/2} \sqrt{|det(\Sigma_l)|}} exp[ -\frac{1}{2} (z - \mu_l)' \Sigma_l^{-1} (z - \mu_l) ] ,

where z is the observed vector of the fine resolution image and x' denotes the
transpose of x.
As discussed earlier, a^2 pixels in the SPM correspond to one pixel in the
observed coarse resolution image. Hence, the pdf of an observed vector y(s)
in the coarse resolution image is assumed normally distributed with the mean
vector and covariance matrix given by

\mu(s) = \sum_{l=1}^{L} b_l(s) \mu_l   (11.5)

and

\Sigma(s) = \sum_{l=1}^{L} b_l(s) \Sigma_l ,   (11.6)

respectively, where \mu_l and \Sigma_l are the mean vector and covariance matrix of each
class, and b_l(s) is the proportion of class l present in T_s such that \sum_{l=1}^{L} b_l(s) = 1. Note
here that both the mean and the covariance of the observed image are functions
of the pixel s. Hence, the conditional PDF of the observed coarse resolution image
can be written as

Pr( Y | X ) = \prod_{s \in S} Pr( y(s) | b(s) )
            = \prod_{s \in S} \frac{1}{(2\pi)^{K/2} \sqrt{|det(\Sigma(s))|}} exp[ -\frac{1}{2} (y(s) - \mu(s))' \Sigma(s)^{-1} (y(s) - \mu(s)) ] .
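To make (11.5) and (11.6) concrete, the per-pixel statistics of a mixed coarse pixel are simply proportion-weighted sums of the class statistics; a short Python sketch with hypothetical two-class values:

    import numpy as np

    def mixed_pixel_stats(b, means, covs):
        # b: class proportions b_l(s), summing to one
        # means: class mean vectors mu_l; covs: class covariances Sigma_l
        mu_s = sum(bl * mu for bl, mu in zip(b, means))       # (11.5)
        sigma_s = sum(bl * cov for bl, cov in zip(b, covs))   # (11.6)
        return mu_s, sigma_s

    # Hypothetical two classes observed in K = 2 bands.
    means = [np.array([0.2, 0.6]), np.array([0.8, 0.3])]
    covs = [0.01 * np.eye(2), 0.02 * np.eye(2)]
    mu, sigma = mixed_pixel_stats([0.7, 0.3], means, covs)
    print(mu, np.diag(sigma))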
11.3
Optimum Sub-pixel Mapping Classifier
The algorithm based on the maximum a posteriori probability (MAP) criterion
selects the most likely SPM among all possible SPMs given the observed image.
The MAP criterion is expressed as (Trees 1968; Varshney 1997)

X̂ = arg max_X Pr( X | Y ) ,   (11.8)

where Pr(X|Y) is the posterior probability of the SPM when the coarse reso-
lution observed image is available. By using the definition of the conditional
pdf, (11.8) can be rewritten as
Fig. 11.1. Initialization phase of the algorithm: the principal components of the image are
treated as the observed image, from which the initial SPM is generated
SPM. A similar procedure is adopted for each fraction image and the pixels in
the set T_j are then randomly labeled with all the classes. This is referred to as
the initial SPM. It is expected that the initial SPM will have a large number of
isolated pixels since their neighboring pixels may belong to different classes.
The initial SPM is used as the starting point in the iteration phase of the search
process. An appropriate starting point (which is often difficult to obtain) may
result in a faster convergence rate of the SA algorithm.
The initial SPM together with the observed data and its estimated parameters
are the inputs to the iteration phase of the algorithm (Fig. 11.2). Since the
SA algorithm is used for the optimization of (11.13), a visiting scheme to
determine the order of the pixels in the SPM whose configurations (i.e. class
attributes) are updated needs to be defined. As stated in Chap. 6, different
visiting schemes may result in different rates of convergence (Winkler 1995).
Here, a row-wise visiting scheme is used. The temperature T, which controls
the randomness of the optimization algorithm (Winkler 1995; Bremaud 1999),
is set to a predetermined value T_0, and the counter, h, is set to zero. In the
next step, the configurations of pixels are updated. Without loss of generality,
let us assume that the pixel t_1 is currently being updated. Then, the value of
the energy function, E_post(X|y), from (11.12), associated with each possible
configuration at the pixel t_1 is determined as given in (11.14).
Fig. 11.2. Iterative phase of the optimization algorithm to generate the SPM
where \sum_{C \ni t_1} denotes the summation over all the cliques to which t_1 belongs,
s_{t_1} is the site in the observed image corresponding to t_1, and \Gamma is due to
the remaining terms in (11.12) that are independent of the configuration of t_1.
Note that depending upon the neighborhood system considered in a particular
problem, different types of cliques could be employed. Since \Gamma is independent
of x(t_1), a list of probabilities associated with all possible configuration values
at t_1 can be determined by ignoring \Gamma,

P[ x(t_1) ] = \frac{1}{Z_1} exp( -[ \sum_{C \ni t_1} V_C(X)
            + \frac{1}{2} (y(s_{t_1}) - \mu(s_{t_1}))' \Sigma(s_{t_1})^{-1} (y(s_{t_1}) - \mu(s_{t_1}))
            + \frac{1}{2} log|det(\Sigma(s_{t_1}))| ] ) ,   (11.15)

where Z_1 is the normalizing constant such that the summation of (11.15) over all
possible values of x(t_1) is equal to one. A new configuration of pixel t_1 is thus
generated based on (11.15). It follows from this equation that a configuration
corresponding to a lower energy value has a higher likelihood of being generated
than one with a higher energy value. Next, we repeat the same process
for each pixel until all pixels are updated to new configurations. Then, the
algorithm increases the counter by one (i.e., h = h + 1) and determines a new
temperature value such that the temperature decreases at the rate of 1/log(h + 1).
In other words,

T = T_0 / log(h + 1) .   (11.16)
The updating process is repeated with the new temperature value until the
counter exceeds some predetermined value. Gradually, the number of isolated
pixels in the SPM is reduced because the contextual information present in
the Gibbs potential function V_C forces the SA algorithm to iteratively generate
a new SPM that is closer to the solution of the MAP criterion in (11.13), which
is the desired optimum SPM. From the resulting optimum SPM, a contextually
refined version of the fraction images at coarse resolution may also be generated
as a byproduct, thereby producing a sub-pixel classification with higher accuracy
than that of the initial sub-pixel classification.
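The iteration phase can be summarized as a Gibbs-sampling sweep under the logarithmic cooling schedule of (11.16). The sketch below is a simplification, not the authors' implementation: energy_at stands in for the full posterior energy of (11.14), and the energies are divided by the temperature, as in standard simulated annealing, so that randomness decreases with T.

    import numpy as np

    def sa_sweeps(spm, n_classes, energy_at, T0=4.0, n_sweeps=50, seed=0):
        # spm: 2-D array of class labels (the current sub-pixel map)
        # energy_at(spm, i, j, c): posterior energy if pixel (i, j)
        # were assigned class c
        rng = np.random.default_rng(seed)
        for h in range(n_sweeps):
            # Cooling per (11.16); the +2 offset avoids log(1) = 0 on
            # the first sweep (the counter h starts at zero in the text).
            T = T0 / np.log(h + 2)
            for i in range(spm.shape[0]):          # row-wise visiting
                for j in range(spm.shape[1]):
                    e = np.array([energy_at(spm, i, j, c)
                                  for c in range(n_classes)])
                    p = np.exp(-(e - e.min()) / T)  # stabilized exponent
                    p /= p.sum()                    # normalization Z_1
                    spm[i, j] = rng.choice(n_classes, p=p)
        return spm

Subtracting e.min() before exponentiation does not change the sampled distribution but avoids numerical underflow for large energies.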
11.4
Experimental Results
In this section, we present two examples illustrating the use of the MRF model
based algorithm to produce a sub-pixel map from multispectral and hyper-
spectral data respectively.
11.4.1
Experiment 1: Sub-pixel Mapping from Multispectral Data
Fig. 11.3a,b. IKONOS images of a portion of the Syracuse University campus. a False color
composite (Blue: 0.45-0.52 µm, Green: 0.52-0.60 µm, Red: 0.76-0.90 µm) of the multi-
spectral image at 4 m resolution. b Panchromatic image at 1 m resolution. For a colored
version of this figure, see the end of the book
Fig. 11.4. Crisp reference image prepared from IKONOS PAN image at 1 m spatial resolution
of the Syracuse University campus
Fig. 11.5. Fraction reference images (4 m spatial resolution) of six land cover classes: grass,
roof1, tree, shadow, road and roof2. These images have been generated from the crisp
reference image shown in Fig. 11.4
Due to the availability of the crisp reference image of the SPM, the MPLE
algorithm is employed for estimation of the parameters of the Gibbs poten-
tial function. The MPLE algorithm selects the sets of Gibbs parameters that
maximize the product of all the local characteristic functions of X(T), i.e.,
Here, for the four clique types, the associated parameters, \beta_2, \beta_3, \beta_4 and \beta_5, are
obtained as 1.1, 1.1, 0.3 and 0.3, respectively. However, if the crisp reference
image is not available, we can choose some appropriate values for \beta, such as
\beta = [1 1 1 1], to make the SPM smooth (or having more connected regions). It
is obvious that this approach is quite heuristic, but it is based on the common
knowledge that a land cover map is generally composed of connected regions
rather than isolated points.
The estimated parameters have been used in the initialization phase to
generate the sub-pixel classification or fraction images (Fig. 11.6) from the coarse
resolution image. The initial SPM at 1 m resolution is obtained from these
fraction images (Fig. 11.7). A visual comparison of this SPM with the crisp
reference map (Fig. 11.4) sufficiently demonstrates the poor quality of the MLE
derived initial SPM.
Fig. 11.6. Initial sub-pixel classification (fraction images) of six land cover classes: grass,
roof1, tree, shadow, road and roof2
Fig. 11.8. Resulting sub-pixel map derived from the MRF model
Fig. 11.9. Resulting MRF model based sub-pixel classification for grass, roof1, tree, shadow,
road and roof2
Table 11.2. Overall accuracies of MLE and MRF derived sub-pixel classification
11.4.2
Experiment 2: Sub-pixel Mapping from Hyperspectral Data
Fig. 11.10. False color composite of the HyMap image (Blue: 0.6491 µm, Green: 0.5572 µm,
Red: 0.4645 µm). For a colored version of this figure, see the end of the book
Fig. 11.11. First six principal components generated from 126 bands of the HyMap data
The MRF model is then used to produce a sub-pixel map at a spatial resolu-
tion of 0.85 m. Accordingly, the scale factor has been kept as 8 for the generation
of a resampled reference map at 0.85 m spatial resolution (Fig. 11.15). Here,
we define the resampled reference map as the one generated by degrading the
highest resolution map (i. e. 0.15 m) to the corresponding spatial resolution
according to the value of the scale factor. The class attribute of a pixel in the
resampled reference map is determined by applying the majority rule to the
corresponding area in the original fine resolution map. Since the Gibbs potential
functions represent the spatial dependence of class attributes for an SPM at
a given scale factor, there is a unique set of parameters associated with the Gibbs
potential functions for a given scaled reference map. In other words, the parameters
of the Gibbs potential functions are estimated directly from the reference map.
Again, the Ising model with clique types C_2, C_3, C_4 and C_5 given in (11.17) has
been applied.
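The majority rule used to build the resampled reference map amounts to block-wise voting over a × a windows; a minimal hypothetical Python sketch (a = 8 in this experiment):

    import numpy as np

    def resample_majority(fine_map, a):
        # Assign each a x a block of the fine resolution label map the
        # class that occupies the most pixels in that block.
        rows, cols = fine_map.shape
        out = np.zeros((rows // a, cols // a), dtype=fine_map.dtype)
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                block = fine_map[i * a:(i + 1) * a, j * a:(j + 1) * a]
                labels, counts = np.unique(block, return_counts=True)
                out[i, j] = labels[np.argmax(counts)]
        return out

    # Example: degrade a random 64 x 64 label map by a factor of 8.
    coarse = resample_majority(np.random.randint(0, 6, (64, 64)), 8)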
Similar to the previous experiment, the initial MLE-based SPM is deter-
mined in the initialization step of the algorithm and is shown in Fig. 11.16.
The initial SPM then forms the input to the iteration phase of the proposed
Fig. 11.12. True composite of the digital aerial photograph at 0.15 m spatial resolution, used
as the reference image. For a colored version of this figure, see the end of the book
Fig. 11.13. The corresponding land cover map produced from the aerial photograph, used
as the reference image. For a colored version of this figure, see the end of the book
Table 11.3. Number of pure pixels for different classes used as training data (Note that for
the classes water and road, all the pixels containing at least 60% proportion of the respective
class have been considered as pure)
Water         10      47
Tree        4724    7054
Bare Soil    132    1349
Road          10     482
Grass        926    3319
Roof          26     275
Fig. 11.14. Reference fraction images of the six land cover classes at 6.8 m resolution: water,
tree, bare soil, road, grass, and roof
Fig. 11.15. The resampled ground reference map at 0.85 m resolution. For a colored version
of this figure, see the end of the book
algorithm to obtain the resulting SPM shown in Fig. 11.17. We observe that
the MRF-based algorithm has produced a more connected map than the ini-
tial SPM. For instance, the class road in the initial SPM has a large number
of speckle-like noise pixels whereas the resulting SPM represents this class
as more connected and smooth, and compares well with the reference im-
age. Similar observations can also be made for the classes grass and tree. The
accuracy of the sub-pixel maps (both MLE derived and MRF derived SPMs) is
also determined using the Kappa coefficient obtained from the error matrix
generated from 10000 testing samples and is provided in Table 11.4. The 95%
confidence intervals for the Kappa coefficient for both the initial and final SPMs
are also given. Clearly, the intervals do not overlap, indicating that the proposed
MRF model based sub-pixel map is significantly better than the initial MLE
derived sub-pixel map.
Fig. 11.16. Initial SPM at 0.85 m resolution derived from MLE. For a colored version of this
figure, see the end of the book
Fig. 11.17. Resulting SPM at 0.85 m resolution derived from the MRF model. For a colored
version of this figure, see the end of the book
11.5
Summary
In this chapter, a new sub-pixel mapping algorithm based on the MRF model
is proposed. It is assumed that a sub-pixel map has MRF properties, i.e., two
adjacent pixels are more likely to belong to the same class than to different classes.
By employing this property of the model, the proposed MRF based algorithm
is able to correctly classify a large number of misclassified pixels, which often
appear as isolated pixels. In our experimental investigations, the efficacy of the
proposed algorithm has been tested on both multi and hyperspectral datasets
obtained from IKONOS and HyMap sensors respectively. The results show that
for both the datasets, a significant improvement in the accuracy of the sub-
pixel map over that produced from the conventional and most widely used
maximum likelihood estimation algorithm can be achieved. The approach has
been able to successfully reduce a significant number of isolated pixels thereby
producing a more connected and smooth land cover map.
References
Aplin P, Atkinson PM (2001) Sub-pixel land cover mapping for per-field classification.
International Journal of Remote Sensing 22(14): 2853-2858
Atkinson PM (1997) Mapping sub-pixel boundaries from remotely sensed images. Innova-
tion in GIS 4: 166-180
Bezdek JC, Ehrlich R, Full W (1984) FCM: the fuzzy c-means clustering algorithm. Comput-
ers and Geosciences 10: 191-203
Binaghi E, Brivio PA, Ghezzi P, Rampini A (1999) A fuzzy set-based accuracy assessment of
soft classification. Pattern Recognition Letters 20: 935-948
Bremaud P (1999) Markov chains Gibbs field, Monte Carlo simulation and queues. Springer
Verlag, New York
Brown M, Lewis HG, Gunn SR (2000) Linear spectral mixture models and support vector
machines for remote sensing. IEEE Transactions on Geoscience and Remote Sensing 38:
2346-2360
Foody GM (1996) Approaches for the production and evaluation of fuzzy land cover clas-
sifications from remotely sensed data. International Journal of Remote Sensing 17:
1317-1340
Foody GM (1998) Sharpening fuzzy classification output to refine the representation of sub-
pixel land cover distribution. International Journal of Remote Sensing 19(13): 2593-2599
Foody GM (2000a) Mapping land cover from remotely sensed data with a softened feed-
forward neural network. Journal of Intelligent and Robotic Systems 29: 433-449
12.1
Introduction
The basic theory of Markov random fields was presented in Chap. 6. In this
chapter, we employ this modeling paradigm for two image processing tasks
applicable to remote sensing. The objectives of the two tasks are:
1. To investigate the use of Markov random field (MRF) models for image
change detection applications
2. To develop an image fusion algorithm based on MRF models.
As mentioned in Chap. 2, image change detection is a basic image analysis
tool frequently used in many remote sensing applications (such as environ-
mental monitoring) to quantify temporal information. An MRF based image
change detection algorithm is presented in the first part of this chapter. The
image fusion algorithm is primarily intended to improve, enhance and high-
light certain features of interest in remote sensing images for extracting useful
information. In this chapter, an MRF based image fusion algorithm is applied
to fuse the images from a hyperspectral sensor with a multispectral image
thereby merging a fine spectral resolution image with a fine spatial resolu-
tion image. Examples are used to illustrate the performance of the proposed
algorithms.
12.2
Image Change Detection using an MRF model
The ability to detect changes that quantify temporal effects using multitem-
poral imagery provides a fundamental image analysis tool in many diverse
applications (see Sect. 2.7 of Chap. 2). Due to the large amount of available
data and extensive computational requirements, there is a need to develop
efficient change detection algorithms that automatically compare two images
taken from the same area at different times to detect changes. Usually, in the
comparison process (Singh 1989; Richards and Jia 1999; Lunetta and Elvidge
1999; Rignot and van Zyl 1993), differences between two corresponding pix-
els belonging to the same location for an image pair are determined on the
statistical models for the given images as well as for the change image is cru-
cial. Here, we assume that the given images are obtained by the summation
of noiseless images (NIM) and noises in the image. Both the NIMs and the
change image are assumed to have MRF properties. Furthermore, we assume
that configurations (intensity values) of the NIMs are the same for the un-
changed sites. In addition, configurations of the changed sites from one NIM
are independent of configurations of the changed sites from the other NIM
when the configurations of the unchanged sites are given. Based on the above
assumptions, the a posteriori probability is determined and the MAP criterion
is used to select the optimum change image.
Due to the size of the search space, the solution of the MAP detection
problem cannot be obtained directly. As a result, a stochastic search method
such as the simulated annealing (SA) algorithm (see Sect. 6.4.1 of Chap. 6
and Geman and Geman 1984) is employed. Here, the SA algorithm generates
a random sequence of change images in which a new configuration depends
only on the previous change image and observed images by using the Gibbs
sampling procedure (Geman and Geman 1984; Winkler 1995; Bremaud 1999;
Begas 1986). The randomness of the new change image gradually decreases as
the number of iterations increases. Eventually, this sequence of change images
converges to the solution of the MAP detector.
12.2.1
Image Change Detection (ICD) Algorithm
The ICD algorithm uses the MAP detector whose structure is based on statis-
tical knowledge. Therefore, statistical models for the given images as well as
for the change image (CI) are required.
where \Theta = {q_0, q_1, ..., q_{f-1}} is the phase space corresponding to the noisy observed
images and W_i(S) is a vector of additive Gaussian noise with mean zero and
covariance matrix \sigma^2 I. Here, I is an identity matrix of size M × M, where
M = |S|.
Obviously, there are a total of K = 2^M possible change images (CIs) that
may occur between any pair of NIMs (including the no-change event). Let
P(X) = (1/Z_X) exp[ -\sum_{C \subset S} V_C(X) ]   (12.2)

and

P(H_k) = (1/Z_H) exp[ -\sum_{C \subset S} U_C(H_k) ] ,   (12.3)

where Z_X = \sum_{X \in \Lambda^S} exp[ -\sum_{C \subset S} V_C(X) ] and
Z_H = \sum_{k=0}^{K-1} exp[ -\sum_{C \subset S} U_C(H_k) ] ,
respectively. Since the elements of W_i(S) are additive Gaussian noises which
are independent and identically distributed, the conditional probability density
of Y_i(S) given X_i(S) is given by

P( Y_i = y_i | X_i = x_i ) = (2\pi\sigma^2)^{-M/2} exp[ -\frac{1}{2\sigma^2} || y_i(S) - x_i(S) ||^2 ] ,   i = 0, 1, ..., N .   (12.4)
Given H_k, we can partition the set S into two subsets S^{CH_k} and S^{NCH_k} such that
S^{CH_k} \cap S^{NCH_k} = \emptyset and S^{CH_k} \cup S^{NCH_k} = S, where S^{CH_k} and S^{NCH_k} contain
all changed and unchanged sites, respectively. We further assume that the
configurations of changed sites between a pair of NIMs are independent given
the configurations of unchanged sites, i.e.,
where X_i(S^{CH_k}) and X_i(S^{NCH_k}) are the configurations of changed sites and
unchanged sites of X_i, respectively. Here, to keep the analysis tractable, we
have made a simplifying assumption that the pixel intensities of the changed
sites in a pair of images are statistically independent given the intensities of
the unchanged sites. A justification for this assumption is that changes are not
predictable based on the current knowledge. Note that for unchanged sites,
configurations of the same site in a pair of images must be equal. This equal
configuration value is denoted by x_{ij}^{NCH_k}. For notational convenience, let us
denote X_i(S^{CH_k}) and X_i(S^{NCH_k}) by X_i^{CH_k} and X_i^{NCH_k}, respectively. The joint
PDF of the pair is given by
P(X_i, X_j | H_k) = P{ X_i^{CH_k} = x_i^{CH_k}, X_i^{NCH_k} = x_{ij}^{NCH_k},
                      X_j^{CH_k} = x_j^{CH_k}, X_j^{NCH_k} = x_{ij}^{NCH_k} }
                 = P{ X_i^{CH_k} = x_i^{CH_k} | X_i^{NCH_k} = x_{ij}^{NCH_k} }
                 × P{ X_j^{CH_k} = x_j^{CH_k} | X_j^{NCH_k} = x_{ij}^{NCH_k} }
                 × P{ X_i^{NCH_k} = X_j^{NCH_k} = x_{ij}^{NCH_k} } .   (12.6)
Since X_i and X_j correspond to the same scene, we assume that they are statis-
tically identical. By summing over all possible configurations of the changed
sites, the joint PDF can be expressed as

P(X_i, X_j | H_k) = P{ X_i^{CH_k} = x_i^{CH_k} | X_i^{NCH_k} = x_{ij}^{NCH_k} }
                 × P{ X_j^{CH_k} = x_j^{CH_k} | X_j^{NCH_k} = x_{ij}^{NCH_k} }
                 × \sum_{\lambda \in \Lambda^{CH}} P( X^{CH_k} = \lambda, X^{NCH_k} = x_{ij}^{NCH_k} ) .   (12.7)
By using the notation \sum_{C \ni s} to denote the sum over the sets C that contain site s,
the term \sum_{C \subset S} V_C(X_i) can be decomposed as in (12.8),
where the boundaries of the changed region, denoted by \partial x^{CH_k}, are contained
within the changed region. Substituting (12.9) into (12.8), we obtain
P(X_i, X_j | H_k) = (1/Z_X) exp[ -E_k(X_i, X_j) ] ,   (12.10)
284 12: Teerasit Kasetkasem, Pramod K. Varshney
where the joint image energy function E_k(X_i, X_j) associated with H_k and the
normalizing constant are given by (12.11) and (12.12), respectively. Based on
these assumptions, we design the optimum
detector in the next section. Here, we have considered the case of discrete-
valued MRFs. The derivation in the case of continuous-valued fields (used in
the examples) is analogous.
12.2.2
Optimum Detector
Ĥ = arg max_{H_l} [ \sum_{X_i, X_j \in \Lambda^S} P(Y_i | X_i) P(Y_j | X_j) P(X_i, X_j | H_l) P(H_l) ] ,   (12.14)
where P(y | x) and P(X_i, X_j | H_l) denote P(Y = y | X = x) and P(X_i = x_i, X_j = x_j | H_l),
respectively. Substituting (12.3), (12.4) and (12.10) into (12.14) and taking the
constant term out, we obtain
where d(y_i, y_j) is given by (12.16) and

E_post(H_l) = \sum_{C \subset S} U_C(H_l) - d(y_i, y_j) .   (12.17)
12.3
Illustrative Examples of Image Change Detection
Simulated data and real multispectral remote sensing images are used to illus-
trate and evaluate the performance of the image change detection algorithm.
In these examples, we consider only five clique types, C_1, C_2, C_3, C_4, C_5, associ-
ated with the singleton, vertical pairs, horizontal pairs, left-diagonal pairs, and
right-diagonal pairs, respectively. Furthermore, we assume that the image poten-
tial functions are translation invariant. The Gibbs energy function of the images
is assumed to be given by (12.19), in terms of \beta = [\beta_1, \beta_2, \beta_3, \beta_4, \beta_5]^T,
which is the NIM potential vector associated with the clique types C_1, ..., C_5, re-
spectively. We observe that the NIM potential vector is in a quadratic form.
The quadratic assumption has been widely used to model images in numerous
problems (e.g. Hazel 2000), and is called the Gaussian MRF (GMRF) model.
GMRF models are suitable for describing smooth images with a large num-
ber of intensity values. Furthermore, using the GMRF model, (12.19) can be
solved more easily due to the fact that the summation over configuration space
in (12.16) changes to infinite integration in the product space. Hence, the
analytical solution for (12.18) can be derived.
Using the GMRF model, we can rewrite (12.19) as (Hazel 2000)

\sum_{C \subset S} V_C(x) = \frac{1}{2} x(S)^T [\Sigma^{-1}] x(S) ,   (12.20)

where the element (s_a, s_b) of the M × M matrix [\Sigma^{-1}] is given by

[\Sigma^{-1}](s_a, s_b) = \beta_1 + \sum_{i=2}^{5} 2\beta_i ,   if s_a = s_b ;
                      = -\beta_i ,   if {s_a, s_b} \in C_i ;
                      = 0 ,   otherwise .   (12.21)
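For illustration, the precision matrix [\Sigma^{-1}] of (12.21) can be assembled explicitly for a small lattice; a hypothetical Python sketch in which the pair clique types C_2 to C_5 are taken as vertical, horizontal, left-diagonal and right-diagonal neighbors and the \beta values are placeholders:

    import numpy as np

    def gmrf_precision(rows, cols, beta):
        # beta = [beta1, ..., beta5]: singleton, vertical, horizontal,
        # left-diagonal and right-diagonal clique parameters.
        offsets = [(1, 0), (0, 1), (1, -1), (1, 1)]   # C2..C5
        M = rows * cols
        P = np.zeros((M, M))
        idx = lambda r, c: r * cols + c
        # Diagonal entries: beta1 + sum of 2*beta_i, per (12.21).
        np.fill_diagonal(P, beta[0] + 2 * sum(beta[1:]))
        for r in range(rows):
            for c in range(cols):
                for (dr, dc), b in zip(offsets, beta[1:]):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        P[idx(r, c), idx(rr, cc)] = -b
                        P[idx(rr, cc), idx(r, c)] = -b
        return P

    # A 7 x 7 subimage, as used in the experiments below: a 49 x 49 matrix.
    P = gmrf_precision(7, 7, [1.0, 0.5, 0.5, 0.2, 0.2])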
Similarly, we define the Gibbs energy function of a CI over the clique system
[C_2, C_3, C_4, C_5] as in (12.22), where the CI potential vector associated with
these clique types is given by (12.23), and I(a, b) = -1 if a = b and I(a, b) = 1
otherwise. The optimum detector has been derived in Kasetkasem (2002).
In order to obtain the result in a reasonable time, NIMs can be divided into
subimages of much smaller size, (7 x 7) pixels in our case, due to the intensive
computation required for the inversion of large matrices (e. g. the matrix size
is 4096 × 4096 for an image of size (64 × 64) pixels). The suboptimum approach
of considering one subimage at a time sacrifices optimality for computational
efficiency. However, the statistical correlation among sites is concentrated
in regions that are only a few sites apart. Therefore, our approximation is
reasonable and yields satisfactory results in the examples.
The optimal algorithm used here involves matrix inversion and multipli-
cation, both of which have computational complexity O(n^3) for an image
of size n × n. During each iteration, the quantity L_b given in (12.23) is computed
at least 2n^2 times. As a result, the total complexity of our optimal algorithm is
O(n^5). However, when we employ the suboptimum algorithm, the number of
operations for each subimage is fixed. Consequently, the overall complexity
reduces to O(n^2).
12.3.1
Example 1: Synthetic Data
Two noiseless simulated images of size (128 x 128) pixels are shown in Fig. 12.1.
The corresponding CI is shown in Fig. 12.2, where black and white regions
denote change and no change respectively. We observe that changes occur
in the square region from pixel coordinates (60,60) to (120,120). To test our
proposed ICD algorithm, these simulated images are disturbed with additive
Gaussian noise with zero mean and unit variance. An SA algorithm, with initial
temperature T_0 = 2, is employed for optimization. For the difference image, the
ratio of average image intensity power to noise power is 3.4 dB. For both the
noiseless images, the average signal power is about 3.9 dB. Next, our proposed
ICD algorithm is employed, and the results are shown in Fig. 12.3a-d after 0, 28,
63 and 140 sweeps, respectively. After 0 sweeps, an image differencing technique
is used, and the resulting CI is extremely poor. The situation improves as more
sweeps are completed. Significant improvement can be seen when we compare
Fig. 12.1a,b. Two noiseless simulated images
Fig. 12.2. Ideal change image (used as ground truth) produced from two noiseless images
in Fig. 12.1
Fig. 12.3a-d. Results of the MRF based algorithm after a 0 sweeps, b 28 sweeps, c 63 sweeps,
and d 140 sweeps
Fig. 12.3a,b and Fig. 12.3b,c. However, very little improvement is seen in
Fig. 12.3d when we compare it with Fig. 12.3c. In order to evaluate the accuracy of
our algorithm, we introduce three performance measures, average detection,
false alarm and error rates. The average detection rate is defined as the total
number of detected changed sites divided by the total number of changed sites.
The average false alarm rate is defined as the total number of unchanged sites
that are declared changed sites divided by the total number of unchanged sites.
Similarly, the average error rate is given by the total number of misclassified
sites (changed sites that are declared unchanged and vice versa) divided by the
total number of sites in the image. We plot the average detection, false alarm
and error rates in Fig. 12.4 for a quantitative comparison. We observe rapid
improvement (in accuracy) from 0 to 60 sweeps, with no further improvement
after around 80 sweeps. The detection rate increases from about 60 percent
to more than 90 percent and the average error rate decreases from around 35
percent to less than 5 percent.
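The three measures are direct ratios over the change maps; a minimal Python sketch with boolean arrays (True marking a changed site):

    import numpy as np

    def change_detection_rates(detected, truth):
        # Detection rate: detected changed sites / all changed sites.
        detection = (detected & truth).sum() / truth.sum()
        # False alarm rate: unchanged sites declared changed / all unchanged.
        false_alarm = (detected & ~truth).sum() / (~truth).sum()
        # Error rate: all misclassified sites / all sites.
        error = (detected != truth).sum() / truth.size
        return detection, false_alarm, error

    # Hypothetical 2 x 3 maps.
    det = np.array([[True, False, True], [False, True, False]])
    ref = np.array([[True, False, False], [False, True, False]])
    print(change_detection_rates(det, ref))   # (1.0, 0.25, about 0.17)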
We also compare the performance of our algorithm with a variation of Bruzzone
and Prieto's algorithm (abbreviated here as VBPA) described in Bruzzone
and Prieto (2000) and apply it to the same dataset. In their algorithm, the two
original images are replaced by their difference image as the input to the image
change detection algorithm. Similar to our approach, the spatial information
is characterized through an MRF model and the objective of their algorithm
is to determine the most likely change image given the difference image of
images at two different times. However, in Bruzzone and Prieto (2000), the
expectation maximization (EM) algorithm is used for the estimation of means
and variances associated with changed and unchanged regions, respectively.
However, in this experiment, since the ground truth is available to us, we calcu-
late these statistics beforehand. Furthermore, we substitute the ICM algorithm
with the SA algorithm as the optimization algorithm since the SA algorithm is
optimum.
The resulting change image after 500 iterations from VBPA is shown in
Fig. 12.5a while Fig. 12.5b illustrates the change image presented in Fig. 12.3d
for comparison purposes. The error rate of VBPA is 7% while our algorithm
achieves an error rate of less than 1.5%. This is because VBPA uses the difference
image instead of the original images.
Fig. 12.4. Performance of the MRF based ICD algorithm in terms of average detection rate,
average false alarm rate and average error rate
Fig. 12.5a,b. Change images: a VBPA (after 500 iterations); b the MRF based ICD algorithm
(from Fig. 12.3d)
Some critical information may be lost
when two images are transformed into one image through the subtraction
operation. In particular, the noise power in the difference image is roughly
the sum of the noise power levels of individual images. Hence, the SNR of
the difference image is much lower than the original images. Our algorithm,
on the other hand, depends mainly on the observed images rather than the
difference image, which makes it more robust in a low SNR environment. The
number of floating point operations for each iteration associated with VBPA is
4.6 × 10^5 when using MATLAB, while for our algorithm it is 3.1 × 10^10. VBPA
has the complexity O(n^2) because the number of instructions per iteration is
independent of the image size. Hence, our algorithm consumes more compu-
tational resources to achieve performance gain and is, therefore, more suitable
for off-line applications than real-time ones.
12.3.2
Example 2: Multispectral Remote Sensing Data
In this example, we apply our ICD algorithm to two real Landsat TM images of
San Francisco Bay taken on April 6th, 1983 and April 11th, 1988 (Fig. 12.6a,b).
These images represent the false color composites generated from the images
acquired in bands 4, 5 and 6, and are available at
http://sfbay.wr.usgs.gov/access/change_detect/Satellite_Images_1.html.
For simplicity, we will apply the ICD
algorithm only on images of the same bands acquired at two times to determine
changes. In other words, we will separately find CIs for red, green, and blue
band spectra, respectively.
Since we do not have any prior knowledge about the Gibbs potential and its
parameters for both NIMs and CI, assumptions must be made and a parameter
estimation method must be employed. Since the intensity values of NIMs can
range from 0 to 255, we assume that they can be modeled as Gaussian MRF as in
Hazel (2000), i. e., the Gibbs potential is in a quadratic form. Furthermore, we
choose the maximum pseudo likelihood estimation (MPLE) method described
in Lakshmanan and Derin (1989) to estimate the Gibbs parameters. Here, we
estimate unknown parameters after every ten complete updates of the CI.
Fig. 12.6a,b. Landsat TM false color composites of San Francisco Bay acquired on a April 6th,
1983; b April 11th, 1988. For a colored version of this figure, see the end of the book
Fig. 12.7a-f. Change Images: MRF based ICD algorithm on the left and image differencing
on the right: a,b red spectrum; c,d green spectrum; and e,f blue spectrum
The results of changed sites for individual spectra are displayed in Fig. 12.7a,b,
12.7c,d and 12.7e,f, respectively. Figure 12.7a,c and e are determined by the
MRF based ICD algorithm while Fig. 12.7b,d and f result from image differenc-
ing. By carefully comparing results within each color spectrum, we conclude
that our ICD algorithm detects changes that are more connected than those
found using image differencing.
12.4
Image Fusion using an MRF model
The idea of combining information from different sensors, both of the same
type and of different types, to extract or emphasize certain features
has been intensively investigated over the past two decades (Varshney 1997).
However, most of the investigations have been limited to radar, sonar and
signal detection applications (Chair and Varshney 1986). Image fusion has
gained quite a bit of popularity among researchers and practitioners involved
in image processing and data fusion. The increasing use of multiple imaging
sensors in the fields of remote sensing (Daniel and Willsky 1997), medical
imaging (Hill et al. 1994) and automated machine vision (Reed and Hutchinson
1996) has motivated numerous researchers to consider the idea of fusing image
data. Consequently, image fusion techniques have emerged for combining
multisensor and/or multiresolution images to form a single composite image.
The final composite image is expected to provide more complete information
content or better quality than any individual source image.
The simplest image fusion algorithm is the image averaging method (Petrovic
and Xydeas 2000), in which the fused image is a pixel-by-pixel average of
two or more raw images. This technique is very robust to image noise when
the raw images are taken by the same type of sensor under identical environments
(Petrovic and Xydeas 2000). However, if the images are taken by different
types of sensors, the image averaging method may result in a loss of contrast
information, since a bright region in an image obtained by one type of
sensor may correspond to a dark region in images taken by other types of
sensors. Furthermore, the image averaging method only utilizes information
contained within individual pixels, even though images are known to have
high correlations among neighboring pixels; a minimal sketch of this baseline is given below.
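As a point of reference, here is a minimal sketch of the averaging baseline in NumPy; the function name and the 8-bit clipping are illustrative choices, not part of the method described in the text:

```python
import numpy as np

def average_fusion(images):
    """Pixel-by-pixel average of two or more co-registered raw images.

    `images` is a sequence of 2-D arrays of identical shape. Averaging
    suppresses independent sensor noise, but can wash out contrast when
    the sources come from different sensing modalities.
    """
    stack = np.stack([img.astype(np.float64) for img in images])
    fused = stack.mean(axis=0)
    # Clip back to the 8-bit phase space used throughout this chapter.
    return np.clip(np.rint(fused), 0, 255).astype(np.uint8)
```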
To utilize this neighborhood information, Toet (1990) developed a hierarchical image fusion algorithm based on the pyramid
transformation in which a source image is decomposed successively into a set
of component patterns. Each component pattern corresponds to the representation
of the source image at a different level of coarseness. Two or more
images are fused through component patterns by some feature selection/fusion
procedures to obtain the composite pyramid. The feature selection procedure
must be designed in such a manner that it enhances the information of in-
terest while removing irrelevant information from the fused image. Then, the
fused image is regenerated from the composite pyramid. Burt and
Kolczynski (1993) developed a similar algorithm using a gradient pyramid
transformation. Uner et al. (1997) developed an image fusion algorithm for
concealed weapon detection applications.
Fig. 12.8a,b. IKONOS images of the Carrier Dome (Syracuse University campus): a red
spectrum image and b panchromatic image
12.4.1
Image Fusion Algorithm
In this section, we develop an image fusion algorithm based on an MRF model
for spatial enhancement. The algorithm employs the MAP criterion to pick the
"best" fused image from given observed images. The Metropolis optimization
algorithm is used to search for the solution of the MAP equation.
12.4.1.1
Image Model
In this problem, we are interested in fusing two images coming from different
sensing modalities and with different resolutions. The image from the first
modality is assumed to have low resolution while the image from the second
modality has high resolution. The goal is to obtain an enhanced image in the
first modality at the same resolution as the image in the second modality (high
resolution). Here, let $\mathcal{S}$ be the set of sites (pixels) $s$, and $\Lambda = \{0, 1, \ldots, L-1\}$ be
the phase space. Note that $L$ is the number of intensity values in the image
(for example, 256 for an 8-bit gray-scale image). Furthermore, let $X(\mathcal{S}) \in \Lambda^{\mathcal{S}}$
denote the high-resolution image (HI) vector, or the enhanced image vector
of the first modality (e.g., the multispectral image), whose element $x(s) \in \Lambda$
is a configuration (intensity value) of a site (pixel) $s$ in the HI. We assume
that $X(\mathcal{S})$ satisfies the MRF properties with Gibbs potential function $V_C(x)$,
i.e.,

$$P(X = x) = \frac{1}{Z} \exp\Big(-\sum_{C \in \mathcal{C}} V_C(x)\Big)\,, \qquad (12.24)$$

where $Z$ is the normalizing constant and $\mathcal{C}$ is the set of cliques.
A quadratic (Gaussian MRF) form is chosen to characterize the Gibbs potential
function, since natural images usually have a smooth texture. The observed
image $Y_1(\mathcal{S})$ of the first modality is modeled as a noisy, low-pass filtered
version of the HI: letting $z(s)$ denote the value at site $s$ after a low-pass
filter $F$ is applied to $x$, the observations at different sites are conditionally
independent. Hence, we have

$$P(Y_1 \mid X) = \prod_{s \in \mathcal{S}} P\big(y_1(s) \mid z(s), F\big)\,. \qquad (12.28)$$
Next, let $Y_2(\mathcal{S}) \in \Lambda^{\mathcal{S}}$ be the observed image in the second modality (e.g., the
high spatial resolution image). Its observations at sites $s_i$ and $s_j$ are statistically
independent given the associated HI $X(\mathcal{S})$. Hence, we have

$$P\big(Y_2(\mathcal{S}) \mid X(\mathcal{S})\big) = \prod_{s \in \mathcal{S}} P\big(y_2(s) \mid x(s)\big)\,. \qquad (12.29)$$
12.4.1.2
Optimum Image Fusion
The maximum a posteriori (MAP) criterion used for solving the above problem
is expressed as

$$\hat{x} = \arg\max_{x}\, P(X = x \mid Y_1 = y_1, Y_2 = y_2) \qquad (12.31)$$

$$= \arg\max_{x}\, \frac{P(Y_1 = y_1, Y_2 = y_2 \mid X = x)\, P(X = x)}{P(Y_1 = y_1, Y_2 = y_2)}\,. \qquad (12.32)$$

Since $P(Y_1 = y_1, Y_2 = y_2)$ is independent of $x$, it can be omitted and the above
equation reduces to

$$\hat{x} = \arg\max_{x} \left[ \Big( \prod_{s \in \mathcal{S}} P\big(y_2(s) \mid x(s)\big) \Big)\, P(Y_1 = y_1 \mid X = x)\, P(X = x) \right], \qquad (12.33)$$

where, by (12.24),

$$P(X = x) \propto \exp\Big(-\sum_{C \in \mathcal{C}} V_C(x)\Big)\,. \qquad (12.34)$$

Taking the negative logarithm, (12.33)-(12.34) can be rewritten as the minimization
of the posterior energy

$$U(x) = \sum_{s \in \mathcal{S}} E_1\big(y_1(s) \mid z(s), F\big) + \sum_{s \in \mathcal{S}} E_2\big(y_2(s) \mid x(s)\big) + \sum_{C \in \mathcal{C}} V_C(x)\,, \qquad (12.35)$$

where

$$E_1\big(y_1(s) \mid z(s), F\big) = -\log\big[P\big(y_1(s) \mid z(s), F\big)\big]\,, \quad E_2\big(y_2(s) \mid x(s)\big) = -\log\big[P\big(y_2(s) \mid x(s)\big)\big]\,,$$

so that the MAP estimate becomes

$$\hat{x} = \arg\min_{x} U(x)\,. \qquad (12.36)$$
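As an illustration of (12.35), the sketch below evaluates $U(x)$ for a candidate HI under simple stand-in choices: Gaussian negative log-likelihoods for $E_1$ and $E_2$, and a quadratic (Gaussian MRF) pairwise potential for $V_C$. The variances `s1`, `s2`, the smoothness weight `beta`, and the `blur` callable standing in for the filter $F$ are all assumptions made for the sake of the example:

```python
import numpy as np

def posterior_energy(x, y1_up, y2, blur, s1=1.0, s2=1.0, beta=0.1):
    """Evaluate U(x) of (12.35) for a candidate high-resolution image x.

    x, y2: high-resolution arrays; y1_up: low-resolution observation
    upsampled to the grid of x; blur: callable applying the low-pass
    filter F. Gaussian likelihoods and a quadratic (Gaussian MRF)
    pairwise potential are stand-in modeling choices.
    """
    z = blur(x)                                   # filtered HI, cf. the E1 term
    e1 = ((y1_up - z) ** 2 / (2 * s1)).sum()      # -log P(y1 | z, F) up to a constant
    e2 = ((y2 - x) ** 2 / (2 * s2)).sum()         # -log P(y2 | x) up to a constant
    # Quadratic potentials over horizontal and vertical neighbor cliques.
    vc = beta * (np.diff(x, axis=0) ** 2).sum() + beta * (np.diff(x, axis=1) ** 2).sum()
    return e1 + e2 + vc
```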
12.4.1.3
Proposed Algorithm
The main objective of this algorithm is to find the HI that is a spatially enhanced
version of the image of the first modality. The solution of (12.36) yields the
most likely HI for given images of the first and second modalities. However,
a direct search for this solution is infeasible due to the enormous number of
possibilities of HIs. In order to find solutions of (12.36) in a reasonable time, we
adopt the Metropolis algorithm (described in Chap. 6) to expedite the search
process. Unlike an exhaustive search technique where posterior probabilities
associated with all possible HIs are calculated and the HI having the maximum
posterior probability is selected, the Metropolis algorithm generates a new HI
through a random number generator whose outcome depends on the current
HI and observed images. Here, a new configuration is randomly proposed and
is accepted with the probability given by
$$\alpha(x_{\text{old}}, x_{\text{new}}) = \min\left\{1,\ \exp\left(-\frac{U(x_{\text{new}}) - U(x_{\text{old}})}{T(n)}\right)\right\}\,, \qquad (12.37)$$

where $\alpha(x_{\text{old}}, x_{\text{new}})$ is the acceptance probability for the current state $x_{\text{old}}$ and
the proposed state $x_{\text{new}}$, $T(n)$ is the temperature, and $n$ is the iteration number.
From (12.37), if a proposed configuration corresponds to a lower energy, it
will be accepted with probability one. However, if a proposed configuration
corresponds to a higher energy, it will be accepted with a probability determined
by the difference between the energies of $x_{\text{old}}$ and $x_{\text{new}}$, and by the temperature.
In the early stages of the Metropolis algorithm, the temperature is set high
to ensure that the resulting HI can escape from any local optimum points.
However, in later stages of optimization, the temperature should be low so
that a single solution is obtained. The rate at which this randomness decreases
(i. e. decrease in T) must be carried out properly to ensure the convergence of
the induced Markov chain. Chapter 6 discusses the properties of T(n) that are
restated here:

1) $\lim_{n \to \infty} T(n) = 0$,

2) $T(n) \geq \Delta r / \log(n)$,  (12.38)

where $\Delta$ is the maximum value of the absolute energy change when a pixel is
updated and $r$ is a constant defined in Chap. 6.
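The following sketch shows how the acceptance rule (12.37) and the cooling schedule (12.38) fit together. The `propose` and `energy` callables are placeholders for a single-pixel perturbation and the posterior energy $U$; this is an illustration of the procedure, not the authors' implementation:

```python
import math
import random

def temperature(n, delta, r):
    """Cooling schedule T(n) = delta*r / log(n), cf. (12.38); max() keeps log positive."""
    return delta * r / math.log(max(n, 2))

def metropolis_step(x_old, propose, energy, n, delta, r):
    """One Metropolis update: propose a configuration and accept it per (12.37)."""
    x_new = propose(x_old)
    d_u = energy(x_new) - energy(x_old)
    t = temperature(n, delta, r)
    # Lower energy is always accepted; higher energy with prob exp(-dU/T).
    if d_u <= 0 or random.random() < math.exp(-d_u / t):
        return x_new
    return x_old
```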
Theoretically speaking, regardless of the initial HI, the induced Markov
chain eventually converges to the solution of (12.36) if necessary conditions
given in (12.38) are satisfied. However, this convergence may require an infinite
number of iterations, which is not feasible in practice. To actually implement
this algorithm, the Metropolis algorithm must be terminated after a certain
number of iterations. As a result, we must carefully select the initial HI that
is relatively close to the desired solution so that the algorithm terminates in
a reasonable time. Here, we pick the initial HI to be the average of intensity
values of pixels in Y 1 and the maximum likelihood estimate (MLE) of HI based
on Y2, i.e.,
$$x_{\text{initial}} = \frac{1}{2}\big(y_1 + \text{MLE}(y_2)\big)\,, \qquad (12.39)$$
where MLE(·) denotes the MLE of HI based on observation Y 2. The main
reason for employing this procedure is the fact that the MLE of HI for a given
Y 2 usually has high resolution, but the important information cannot be clearly
seen whereas Y 1 has high contrast and the vital information is clearly seen,
but its resolution is very poor. By using the above procedure, we expect the
initial HI to have high contrast making the vital information more visible while
the background and texture information of the high-resolution image is also
present. To use the MLE, the conditional probability of the HI given Y2 must
be estimated. Here, we employ the joint histogram of $Y_2$ and $Y_1$. Hence, we
have

$$P_{Y_2 \mid X}(a \mid b) = \frac{\text{number of pixels that have } y_2 = a \text{ and } y_1 = b}{\text{number of pixels that have } y_1 = b}\,. \qquad (12.40)$$
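As an illustration of (12.39) and (12.40), the sketch below builds the joint histogram of the two co-registered 8-bit images and forms the initial HI. Upsampling $y_1$ to the grid of $y_2$ beforehand, and the function names, are assumptions made for this example:

```python
import numpy as np

def mle_from_joint_histogram(y2, y1, levels=256):
    """MLE of the HI from y2 via the joint histogram of y2 and y1, cf. (12.40).

    y2, y1: co-registered integer images with values in [0, levels).
    Returns, per pixel, the value b that maximizes P(Y2 = y2(s) | b).
    """
    joint, _, _ = np.histogram2d(y2.ravel(), y1.ravel(), bins=levels,
                                 range=[[0, levels], [0, levels]])
    cond = joint / np.maximum(joint.sum(axis=0), 1)  # P(Y2 = a | Y1 = b)
    best_b = cond.argmax(axis=1)                     # most likely b per observed a
    return best_b[y2]

def initial_hi(y1_up, y2):
    """Initial HI as the average of upsampled y1 and MLE(y2), cf. (12.39)."""
    return 0.5 * (y1_up.astype(np.float64)
                  + mle_from_joint_histogram(y2, y1_up).astype(np.float64))
```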
Terminating after a finite number of iterations does not strictly satisfy the
necessary conditions for the convergence of the Metropolis algorithm, but after
a large number of iterations the temperature change per iteration is very small,
because the inverse of the log function (for example, $1/\log(400) \approx 0.167$ and
$1/\log(500) \approx 0.161$) is used to determine the temperature in (12.38). As a result,
over a short period of time, an induced Markov chain under the Metropolis
algorithm (when parameters are fixed) after a large number of iterations has
similar properties as homogeneous Markov chains. Hence, the average of the
resulting HIs should provide a reasonable estimate of the most likely HI for
the given observations.
In addition, (12.36) depends on several unknown parameters such as the
noise variance and the low-pass filter F as described earlier. Therefore, we
couple parameter estimation with the Metropolis algorithm so that parame-
ter estimation and the search for a solution of (12.36) can be accomplished
simultaneously. Here also, the maximum likelihood estimate (MLE) is chosen
because of its simplicity in implementation. The use of the MLE can sometimes
severely affect the convergence of the induced Markov chain in (12.37) since, in
many cases, the difference between parameters before and after estimation is
too large and may cause the induced Markov chain to diverge. To deal with this
problem, we limit the parameter change to decay at the rate $n^{-1}$. With this rate, parameters
are permitted to change significantly in the early stages of the search whereas,
in the later stages, these can only be changed by small amounts to allow the
induced Markov chain to converge. More details of this algorithm have been
presented in Chap. 6.
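A minimal sketch of this damping, with `theta_old` and `theta_mle` as hypothetical parameter vectors (e.g., the noise variance and filter coefficients):

```python
import numpy as np

def damped_update(theta_old, theta_mle, n):
    """Blend the fresh MLE into the current parameters at rate 1/n.

    Large corrections are allowed early in the search; later the step
    shrinks, so the induced Markov chain can still converge.
    """
    step = 1.0 / max(n, 1)
    return (1.0 - step) * np.asarray(theta_old) + step * np.asarray(theta_mle)
```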
12.5
Illustrative Examples of Image Fusion
We test our algorithm on two remote sensing data sets, multispectral and
hyperspectral, in the two examples that follow. Visual inspection is used to
evaluate the performance of the fused products in both examples.
12.5.1
Example 1: Multispectral Image Fusion
The IKONOS data set for the Syracuse University campus is used to investigate
the effectiveness of our image fusion algorithm for resolution merging, i.e.,
producing a fused image from images of two different resolutions. Four
multispectral images in the red, green, blue, and near infrared (NIR) color spectra
and one panchromatic (PAN) image are used in this example. Figures 12.9
and 12.10 display the false color composite of the NIR, red and green spectra of
the multispectral image, and the PAN image, respectively. Clearly, the PAN image
has sharper feature boundaries than the multispectral image, specifically for
road features.
We first apply principal component analysis (PCA) to the 4-band multispec-
tral image to transform the multispectral image into four uncorrelated images.
Then, the image corresponding to the highest eigenvalue (highest power) is
Fig. 12.9. False color composite of the IKONOS multispectral image of NIR, red and green
color spectra of Syracuse University area. For a colored version of this figure, see the end of
the book
Fig. 12.11a-d. Image bands after PCA transformation corresponding to eigenvalues of
a 2537, b 497, c 14, d 3.16
selected for fusion with the PAN image. The main reason for this procedure is
to reduce the computational time since four times the amount of computation
is needed to fuse all four color spectra to the PAN image. Furthermore, since
the principal component corresponding to the highest eigenvalue contains
most of the information of the four band multispectral image, we expect that
very little improvement can be achieved by fusing the rest of the components
with the PAN image. Figure 12.11a-d shows the image bands after PCA transformation
corresponding to eigenvalues 2537, 497, 14 and 3.16, respectively.
As expected, higher variation in the data can be observed in the components
corresponding to the higher eigenvalues, while lower variation corresponds to the
lower eigenvalues.
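The band-selection step can be sketched as follows, assuming the multispectral image is stored as an (H, W, 4) array; the eigendecomposition of the band covariance matrix yields the principal components, and the component with the largest eigenvalue is retained for fusion:

```python
import numpy as np

def first_principal_component(ms_image):
    """Project a multiband image onto its highest-variance principal component.

    ms_image: array of shape (H, W, B), e.g. B = 4 IKONOS bands.
    Returns the (H, W) first-PC image and the eigenvalues in descending order.
    """
    h, w, b = ms_image.shape
    pixels = ms_image.reshape(-1, b).astype(np.float64)
    pixels -= pixels.mean(axis=0)            # center each band
    cov = np.cov(pixels, rowvar=False)       # B x B band covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    order = eigvals.argsort()[::-1]
    pc1 = pixels @ eigvecs[:, order[0]]      # highest-power component
    return pc1.reshape(h, w), eigvals[order]
```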
The resulting fused image after 500 iterations with a 5 x 5 pixel filter (F
defined earlier) is shown in Fig. 12.12. Here, the same false color composite
is used for display purposes. Clearly, the fused image appears to be sharper
and clearer as compared to the original multispectral image. Furthermore, we
also observe that more details in certain areas of the image are visible, e. g.,
Fig. 12.12. The false color composite of the resulting fused image assuming a filter size of
5 x 5 pixels. For a colored version of this figure, see the end of the book
Fig. 12.13a,b. False color composite of the quad area of the Syracuse University campus. a fused
image, b original image. For a colored version of this figure, see the end of the book
Fig.12.14a,b. False color composite of the street next to Schine student center. a fused image,
b original image. For a colored version of this figure, see the end of the book
textures on the roofs of the buildings and cars on the roads. Figures 12.13a,b
and 12.14a,b compare the fused image (a) with the original multispectral
image (b) for the quad area of the Syracuse University campus and for the
street next to the Schine student center, respectively. We observe that small
objects, such as cars and the roofs of buildings, are more visible and separable
in the fused image than in the original image.
12.5.2
Example 2: Hyperspectral Image Fusion
Fig. 12.15. Color composite of hyperspectral image (2.04660 µm, 1.16690 µm, and
0.64910 µm) acquired by the HyMap sensor. For a colored version of this figure, see the
end of the book
Fig. 12.16. Digital aerial photograph of the same scene as shown in Fig. 12.15
Fig. 12.17a-d. First four principal components corresponding to eigenvalues of a $4.40 \times 10^6$,
b $9.57 \times 10^5$, c $4.35 \times 10^4$, d $1.37 \times 10^4$
Fig. 12.18. The false color composite of the resulting fused image assuming a filter size
of 5 x 5. For a colored version of this figure, see the end of the book
12.6
Summary
In this chapter, the application of MRF models to image change detection and
image fusion was demonstrated. The ultimate goal of both problems is the
same: to select the best solution for the problem under consideration within
a statistical framework based on the observed data. Different optimization
algorithms were employed depending upon the requirements of each
application.
In image change detection, the simulated annealing algorithm was used to
find the optimum solution. Here, we assumed that the change image and the
noiseless image have MRF properties with different energy functions. Further-
more, we assumed that configurations of changed pixels from different images
given configurations of unchanged pixels are statistically independent. Based
on this model, the MAP equation was developed. The simulated annealing
algorithm was employed to search for the optimum. The results showed that
our algorithm performed fairly well in a noisy environment.
For the image fusion problem, the Gaussian MRF model was employed.
The high-resolution remote sensing image was assumed to have pixel-to-pixel
relation with the fused image, i. e., the observed configurations of any two
pixels in the high resolution image were statistically independent when the
fused image was given. Furthermore, the observed low resolution image was
assumed to be the filtered version of the fused image. The Metropolis algorithm
was employed to search for the optimum solution due to the large number of
possible gray levels (256 in gray-scaled image.) The fused images obtained
from our algorithm dearly showed improvement in the interpretation quality
of the features to be mapped from both multispectral and hyperspectral remote
sensing images.
References
Besag J (1986) On the statistical analysis of dirty pictures. Journal of the Royal Statistical
Society B 48(3): 259-302
Bremaud P (1999) Markov chains: Gibbs fields, Monte Carlo simulation, and queues.
Springer-Verlag, New York
Bruzzone L, Prieto DF (2000) Automatic analysis of the difference image for unsupervised
change detection. IEEE Transactions on Geoscience and Remote Sensing 38(3): 1171-
1182
Burt PJ, Kolczynski RJ (1993) Enhanced image capture through fusion. Proceedings of the
4th International Conference on Computer Vision, pp 173-182
Chair Z, Varshney PK (1986) Optimal data fusion in multiple sensor detection systems. IEEE
Transactions on Aerospace and Electrical Systems AES-22: 98-101
Daniel MM, Willsky AS (1997) A multiresolution methodology for signal-level fusion and
data assimilation with applications to remote sensing. Proceedings IEEE 85(1): 164-180
Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence
PAMI-6(6): 721-741
Hazel GG (2000) Multivariate Gaussian MRF for multispectral scene segmentation and
anomaly detection. IEEE Transactions on Geoscience and Remote Sensing 39(3): 1199-
1211
Hill D, Edwards P, Hawkes D (1994) Fusing medical images. Image Processing 6(2): 22-24
Kasetkasem T (2002) Image analysis methods based on Markov random field models. PhD
Thesis, Syracuse University, Syracuse, NY
Lakshmanan S, Derin H (1989) Simultaneous parameter estimation and segmentation of
Gibbs random fields using simulated annealing. IEEE Transactions on Pattern Analysis
and Machine Intelligence 11(8): 799-813
Lunetta RS, Elvidge CD (eds) (1999) Remote Sensing Change Detection. Taylor and Francis,
London, UK
Perez P, Heitz F (1996) Restriction of a Markov random field on a graph and multiresolution
statistical image modeling. IEEE Transactions on Information Theory 42(1): 180-190
Petrovic VS, Xydeas CS (2000) Objective pixel-level image fusion performance measure.
Proceedings of the SPIE, pp 89-98
Reed JM, Hutchinson S (1996) Image fusion and subpixel parameter estimation for auto-
mated optical inspection of electronic components. IEEE Transactions on Industrial
Electronics 43: 346-354
Richards JA, Jia X (1999) Remote sensing digital image analysis: an introduction. Springer-
Verlag, Berlin
Rignot EJM, van Zyl JJ (1993) Change detection techniques for ERS-1 SAR data. IEEE
Transactions on Geoscience and Remote Sensing 31(4): 896-906
Singh A (1989) Digital change detection techniques using remotely sensed data. International
Journal of Remote Sensing 10(6): 989-1003
Toet A (1990) Hierarchical image fusion. Machine Vision and Applications 3(1): 1-11
Uner MK, Ramac LC, Varshney PK, Alford M (1997) Concealed weapon detection: an image
fusion approach. Proceedings of SPIE 2942, pp 123-132
Van Trees HL (1968) Detection, estimation and modulation theory. Wiley, New York
Varshney PK (1997) Distributed detection and data fusion. Springer-Verlag, New York
Wiemker R (1997) An iterative spectral-spatial Bayesian labeling approach for unsupervised
robust change detection on remotely sensed multispectral imagery. Proceedings of the
7th International Conference on Computer Analysis of Images and Patterns, pp 263-270
Winkler G (1995) Image analysis, random fields and dynamic Monte Carlo methods.
Springer-Verlag, New York
Color Plate I
Fig. 1.1a-d. Observations of the same area of mixed species subtropical woodlands near
Injune, central Queensland, Australia, observed using a stereo color aerial photography,
b CASI, c HyMap, and d Hyperion data at spatial resolutions of < 1 m, 1 m, 2.8 m and 30 m,
respectively
Color Plate II
Fig. 1.2a,b. Example of detection of aquatic vegetation using airborne imaging spectrometers:
partially submerged kelp beds (bright red) as observed using a 2 m CASI and b 20 m
AVIRIS data
Fig. 1.3a,b. a Color composite (837 nm, 713 nm and 446 nm in RGB) of fourteen
1 km x ~15 km strips of CASI data acquired over the West Alligator River mangroves, Kakadu
National Park, Australia by BaliAIMS (Adelaide). The data were acquired at 1 m spatial
resolution in the visible and NIR wavelength (446 nm-838 nm) region. b Full resolution
image of the west bank near the river mouth with main species communities indicated
(Lucas et al. in preparation)
Color Plate III
Fig. 2.10a-c. a Input color image. b 3D scatter plot of R, G and B components. c Pseudo color
map of unsupervised classification
Color Plate IV
Fig. 3.16. An illustration of the heuristic test to identify the global maximum
Legend for Fig. 4.8: Background; Alfalfa; Corn-notill; Corn-min; Corn; Grass/pasture-mowed;
Hay-windrowed; Oats; Soybean-notill; Soybean-min; Soybean-clean; Wheat; Woods;
Buildings/Grass/Trees/Roads; Stone/Steel Towers
Fig. 4.8. Reference data (ground truth) corresponding to AVIRIS image in Fig. 4.7
Color Plate V
Fig. 9.2a-e. Three images (80 x 40 pixels) centered at a 0.48 µm, b 1.59 µm, and c 2.16 µm.
These images are a subset of the data acquired by the AVIRIS sensor. d Corresponding reference
map for this dataset consisting of four classes: background, corn-notill, grass/trees and
soybeans-min. e Mean intensity value of each class plotted as a function of the wavelength
in µm
Fig. 11.3a,b. IKONOS images of a portion of the Syracuse University campus. a False
color composite (Blue: 0.45-0.52 µm, Green: 0.52-0.60 µm, Red: 0.76-0.90 µm) of the
multispectral image at 4 m resolution. b Panchromatic image at 1 m resolution
Color Plate VI
Fig. 10.1. Color composite from the Landsat 7 ETM+ multispectral image
Fig. 11.10. False color composite of the HyMap image (Blue: 0.6491 µm, Green: 0.5572 µm,
Red: 0.4645 µm)
Fig. 11.12. True color composite of the digital aerial photograph at 0.15 m spatial resolution,
used as the reference image
Color Plate VII
Fig. 11.13. The corresponding land cover map produced from the aerial photograph, used
as the reference image
Fig. 11.15. The resampled ground reference map at 0.85 m resolution
Fig. 11.16. Initial SPM at 0.85 m resolution derived from MLE
Fig. 11.17. Resulting SPM at 0.85 m resolution derived from MRF model
Fig. 12.6a,b. Landsat TM false color composites of San Francisco Bay acquired in a April 6,
1983; b April 11, 1988
Color Plate VIII
Fig. 12.9. False color composite of the IKONOS multispectral image of NIR, red and green
color spectra of the Syracuse University area
Fig. 12.12. The false color composite of the resulting fused image assuming a filter size
of 5 x 5 pixels
Fig. 12.13a,b. False color composite of the quad area of the Syracuse University campus.
a fused image, b original image
Fig. 12.14a,b. False color composite of the street next to the Schine student center. a fused
image, b original image
Fig. 12.15. Color composite of hyperspectral image (2.04660 µm, 1.16690 µm, and
0.64910 µm) acquired by the HyMap sensor
Fig. 12.18. The false color composite of the resulting fused image assuming a filter size
of 5 x 5
Index