Patch-based within-object classification

Patch-based Within-Object Classification∗
Jania Aghajanian1 , Jonathan Warrell1 , Simon J.D. Prince1 , Peng Li1 , Jennifer L. Rohn2 , Buzz Baum2
1
Department of Computer Science, University College London
2
MRC Laboratory For Molecular Cell Biology, University College London
1
{j.aghajanian, j.warrell, s.prince, p.li}@cs.ucl.ac.uk 2 {j.rohn, b.baum}@ucl.ac.uk
Abstract
Advances in object detection have made it possible to

collect large databases of certain objects. In this paper we
exploit these datasets for within-object classification. For
example, we classify gender in face images, pose in pedes-
trian images and phenotype in cell images. Previous work
has mainly targeted the above tasks individually using ob-
ject specific representations. Here, we propose a general
Bayesian framework for within-object classification. Im-
ages are represented as a regular grid of non-overlapping
patches. In training, these patches are approximated by a
predefined library. In inference, the choice of approximat-
ing patch determines the classification decision. We pro-
pose a Bayesian framework in which we marginalize over
the patch frequency parameters to provide a posterior prob-
ability for the class. We test our algorithm on several chal-
lenging “real world” databases.
1. Introduction
Figure 1. We address the problem of within-object classification
Recent advances in computer vision have allowed us to
on images captured in uncontrolled environments, where large
reliably detect objects with limited variation in structure within-class variations are present. For example, we classify (a)
such as faces, pedestrians and cars in real time. A typi- gender in face images, (b) phenotype in cell images and (c) pose
cal approach is to use sliding window object detectors such in pedestrian images. These examples were all correctly classified.
as the work of Viola and Jones [19] and [14, 10, 8]. Sliding
window object detectors consider small image windows at
Within-object classification has widespread applications
all locations and scales and perform a binary detection for
including targeted advertising, consumer analysis, and med-
each. The output of a sliding window object detector is a
ical image analysis. Examples include biological classifica-
bounding box around the object of interest.
tion, where we might automatically screen cell cultures for
The success of these techniques allows us to collect large
diseases and gender classification which could be used as a
databases of such objects and it would be useful to subse-
preprocessing step in face recognition.
quently describe their characteristics (attributes). For exam-
A large body of research has investigated the choice of
ple we might classify gender in face images or phenotype
the learning algorithm for particular within-object classifi-
in cell images. This “within-object” classification task has
cation tasks including neural networks [9, 3], support vector
quite different characteristics to other forms of object recog-
machines [17, 16] and adaboost [18, 1]. Most current meth-
nition. All of the examples have a great deal in common and
ods use tailor-made representations specific to the object of
we aim to classify quite subtle differences (see figure 1).
interest. For example Brunelli and Poggio [3], extract ge-
∗ J.A. and S.P. acknowledge the support of the EPSRC ref: ometric features from faces such as pupil to eyebrow ratio,
EP/E013309/1. B.B. acknowledges the support of the AICR ref: 05-341. eyebrow thickness and nose width as input to a neural net-
1125
2009 IEEE 12th International Conference on Computer Vision (ICCV)

Authorized licensed use limited to: UNED- Universidad Estatal a Distancia. Downloaded on August 12,2024 at 18:46:49 UTC from IEEE Xplore. Restrictions apply.
978-1-4244-4419-9/09/$25.00 ©2009 IEEE
Figure 2. Inference. (a) A test image Y is decomposed into a regular patch grid. (b) A large library L is used to approximate each test
image patch. (c) The choice of library patch provides information about the class. (d) Parameters θ associated with each class are used to
interpret these patch choices and (e) used in a Bayesian framework to calculate a posterior over classes.
work to perform gender detection. Saatci and Town use Ac- rately and provides independent information about the class
tive Shape Models [17] to represent faces for gender classi- label. At the core of our algorithm is a predefined library of
fication. Similarly 2D contours and stick figures have been object instances. The library can be considered as a palette
used to represent human body for motion analysis [13] and from which image patches can be taken. We exploit the
action recognition [6] . Domain specific features are also relationship between the patches in the test image and the
used in cell screening. These include aspects like the size, patches in the library to determine the class. Our algorithm
perimeter and convexity of cells [11] as well as the size and can be understood in terms of either inference or generation
shape of the nuclei [4]. and we will describe each in turn.
These techniques have several disadvantages. First, ob- In inference (see figure 2), the test image patch is approx-
ject specific representations cannot be applied to other prob- imated by a patch from the library L. The particular library
lems without major alteration: most techniques have only patch chosen can be thought of as having a different affinity
been applied to a single class. Second, most methods do not with each class label. These affinities are learned during a
exploit the large amounts of available training data (there training period and are embodied in a set of parameters θ as-
are some exceptions, e.g. [12]). Instead they have mostly sociated with each class. The relative affinity of the chosen
been investigated using small databases some of which con- library patch for each class is used to determine a posterior
tain images that are not typical of the real environment. For probability over classes.
example, in gender classification, the FERET database is Alternatively, we can think about generation from this
often used, although it does not contain the variations in model. For example, consider the generative process for
pose, illumination, occlusion and background clutter seen the top-left patch of a test image. The true class label in-
in figure 1. It has been shown that performance of most duces a probability distribution over all the patches in the
methods drops sharply when tested on images captured in library based on the learned parameters for that class. We
uncontrolled environments [15]. choose a particular patch using this probability distribution
In this paper we propose a Bayesian framework and add independent Gaussian noise at each pixel to create
for within-object classification that exploits very large the observed data. In inference we are inverting this genera-
databases of objects and can be used for disparate object tive process using Bayes’ rule to establish which class label
classes. We build a non-parametric generative model that was most likely to be responsible for the observed data.
describes the test image with patches from a library of im- 2.1. Inference
ages of the same object. All the domain specific information
is held in this library: we use one set of images of the ob- Consider the task of assigning a class label C to a test
ject to help classify others. We test our algorithm on large image, where there are K possible classes so C ∈ {1 . . . K}.
real-world databases of faces, pedestrians and human cells. The test image Y is represented as a non-overlapping grid
In section 2 we describe the proposed method. Data col- of patches Y = [y1 ...yP ]. The model will be trained from
lection and parameter selection is described in sections 3.1 I training examples Xc from each of the K classes. Each
and 3.2. In sections 3.3-3.6 we present classification ex- training example is also represented as a non-overlapping
periments using face, cell and pedestrian images. Method grid of patches of the same size as the test data. We denote
comparison and summary is presented in sections 3.7 and 4. the pth patch from the ith training example of the cth class
by xicp (see figure 3a).
2. Methods We also have a library L of images that are not in the
training or test set and would normally contain examples
Our approach breaks the test image into a non- of all classes. We will consider the library as a collection
overlapping regular grid of patches. Each is treated sepa- of patches Ll where l ∈ {1..N } indexes the N possible
1126
Figure 3. (a) The model is trained from I training examples from each of the K classes. Each training image is represented as a non-
overlapping grid of patches denoted by p. (b) We also have a library L of images that are not in the training or test set and contains
examples of all classes. The library is considered as a collection of patches Ll where l ∈ {1..N } indexes the N possible sites. (c) The
parameter θcpl represents the tendency for the patch from library site l to be picked when considering patch p of an example image of class
c. (d) Graphical model representing our method.
sites from which we can take library patches (see figure where Ll is the vectorized pixel data from site l of the li-
3b). These patches are the same size as those in the test brary L. We define the likelihood to be
and training images but may be taken from anywhere in the
P r(yp |θcp• ) = P r(l∗ |θcp• ) = θcpl∗ (4)
library (i.e. they are not constrained to come from a non-
overlapping grid). In other words, the sites l denote every From this it can be seen that the parameter θcpl represents
possible pixel position in the library images. the tendency for the patch from library site l to be picked
The output of our algorithm is a posterior probability when considering patch p of an example image of class c.
over class label C. We calculate this using Bayes’ rule This can be visualized as in figure 3c. A graphical model
P relating all of the variables is illustrated in figure 3d.
p=1 P r(yp |C = c, x•cp )P r(C = c)
P r(C = c|Y, X• ) = 2.2. Training
P r(Y)
(1) In this section, we consider how to use the training data
where we have assumed that the test patches yp are inde- x•cp from the pth patch of every image belonging to the cth
pendent. The notation • indicates all of the values that an class to learn a posterior distribution P r(θcp• |x•cp ) over the
index can take, so X• = {X1 ...XK } denotes the training relevant parameters θcp• . In section 2.3 we discuss how to
images from all of the K classes and x•cp denotes the pth use this distribution to calculate the integral in Equation 2.
patch from all I training images from the cth class. We calculate the posterior distribution over the parame-
Although the likelihood in Equation 1 depends on the ters θcp• using a second application of Bayes’ rule:
library, it is not conditioned on the parameters of the model
θ. We take a Bayesian approach and marginalize over the P r(x•cp |θcp• )P r(θcp• )
P r(θcp• |x•cp ) = (5)
model parameters, so the likelihood terms have the form: P r(x•cp )

To simplify notation, we describe this process for just
P r(yp |C = c, x•cp ) = P r(yp |θcp• )P r(θcp• |x•cp )dθcp• one of the P regular patches and one of the K classes and
(2) drop the indices c and p. Equation 5 now becomes:
where θcp• are all of the parameters associated with the pth P r(x• |θ• )P r(θ• )
patch for the cth class. P r(θ• |x• ) = (6)
P r(x• )
To calculate the likelihood, we first find the index l∗
of the library site that most closely matches the vectorized where x• = x1 . . . xI is all of the training data for this
pixel data from the test patch yp . We are assuming that the patch and this class and θ• = θ1 . . . θN is the vector of N
test patch is a Gaussian corruption of the library patch and parameters associated with each position in the library for
we can find the most likely site to have been responsible this patch and this class.
using maximum a posteriori estimation To calculate the likelihood of the ith training example
xi given the relevant parameters θ• we first find the closest
l∗ = arg max Gyp [Ll ; σ 2 I] (3) matching library patch ˆli where
l
1127
3. Experiments
ˆli = arg max Gx [Ll ; σ 2 I] (7)
i
l 3.1. Databases
The data likelihood is a categorical distribution (one Faces: We harvested a large database2 of images of men
sample from a multinomial) over the library sites so that and women from the web. These were captured in uncon-
P r(xi |θ• ) = P r(ˆli |θ• ) = θl̂i . (8) trolled environments and exhibit wide variation in illumi-
nation, scale, expression and pose as well as partial occlu-
Now consider the entire training data x• . The likelihood sion and background clutter (see figure 1a). Faces were
now takes the form detected using two methods: first, we used a commercial
frontal face detector. Second, we manually labelled two
I
I
N
landmarks. The former method does not localize the faces
P r(x• |θ• ) = P r(xi |θ• ) = θl̂i = (θl )fl (9) accurately, but misses many of the harder non-frontal faces
i=1 i=1 l=1 (it detected about 70% of the faces). The latter method lo-
calizes the images very accurately but includes all examples
where fl is defined as
in the database regardless of their pose or quality.
I
For both methods the images were subsequently trans-
fl = δl̂i =l (10)
formed to a 60x60 template using a Euclidean warp. We
i=1
band-pass filtered the images and weighted the pixels using
and δl̂i =l returns one when the subscripted expression ˆli = l a Gaussian function centered on the image. Each image was
is true and zero otherwise. In other words, fl is the total normalized to have zero mean and unit standard deviation.
number of times the closest matching patch came from li- Cells: The Baum lab RNAi cell phenotype database [2]
brary site l during the training process. contains images of human cancer cells (HeLa-Kyoto) dis-
We also need to define the prior over the parameters θ in playing a large variety of morphological phenotypes af-
Equation 6. We choose a Dirichlet prior as it is conjugate to ter the individual knockdown of approximately 500 genes.
the categorical likelihood so that Of these many morphological changes, we were interested
N specifically in two phenotypes: (i) when the borders of the
Γ( αl )
P r(θ• ) = l (θl )αl −1 (11) cell change significantly in response to the knockdown to
l Γ(αl ) produce a ‘triangular’ phenotype with sharply-edged bor-
l=1
ders, and (ii) when knocking down a gene had no ef-
where Γ denotes a Gamma distribution and {α1 ..αN } are fect on the cell, leaving its phenotypic appearance as non-
the parameters of this Dirichlet distribution. These are triangular/amorphous (‘normal’).
learned from a validation set. Each image contains 3 color channels: W1,W2,W3, each
Substituting the likelihood (Equation 9) and the conju- of which represents a different fluorescent stain. We use the
gate prior term (Equation 11) into Bayes’ rule (Equation 6) W1 channel to find the nuclei: we threshold the image and
we get an expression for the posterior distribution over pa- then use morphological opening to remove noise. We find
rameters which has the form of a Dirichlet distribution: connected regions and take their centroids to represent the
N nucleus position. We place a 60 × 60 pixel bounding box
Γ( (αl + fl ))
P r(θ• |x• ) = l (θl )fl +αl −1 (12) around the center and extract the data from the W2 channel
l Γ(αl + fl ) for classification. Since cells exhibit radial symmetry we
l=1
convert the images to a 60 × 30 polar representation, so
We compute one of these distributions for each of the P that the horizontal coordinate of the new image contains the
patches in the regular grid and for each of the K classes. angle from the nucleus center and the vertical coordinate
represents the distance from the nucleus. This allows us to
2.3. Calculation of Likelihood Integral
easily constrain patches from the library to only match to
Finally, we substitute the posterior distribution over the patches at similar radii without regard for their polar angle.
parameters P r(θcp• |x•cp ) (now resuming use of the indices These radial images were band-pass filtered and normalized
c and p) into Equation 2 and integrate over θcp• to get an to have zero mean and unit standard deviation.
expression for the likelihood1 of observing test data patch Pedestrians: We collected a large database of urban scenes.
yp given that the object class is c : Pedestrians were automatically detected using the method
of [8]. The images were then manually labeled for pose:
fcpl∗ + αl∗
P r(yp |C=c, x•cp ) = (13)
l (fcpl + αl ) 2 The database can be made available upon request. Please email
1 See {http://pvl.cs.ucl.ac.uk/j.aghajanian} for derivation details. {j.aghajanian@cs.ucl.ac.uk}.
1128
Figure 4. (a) To find the best patch size, gender classification is
carried out on the validation set on several patch grid sizes. The
performance peaks at the 10x10 grid resolution. (b) The α param-
eter of the Dirichlet distribution is chosen empirically by testing
the performance of the algorithm on the validation. The perfor-
mance peaks at the value of α = 2.
pedestrians facing front, back, left and right. The bounding

box around each pedestrian was re-scaled to create a 60x120
image. The pedestrian images were band-pass filtered and
normalized as with the face and cell images.
3.2. Experimental Settings Figure 5. Comparison of original images and best approximations
l∗ from library patches.
In this section we use a validation set to investigate the
effect of patch grid resolution and the Dirichlet parameters
{α1 ..αN } for gender classification in the face images. We is intractable. We found experimentally that it was possible
used a training set of 8000 male and 8000 female images, to slightly improve performance by using a soft assignment
a validation set of 400 male and 400 female images, and a of library sites, replacing Equation 10 with
library of 240 images uniformly sampled from both classes.
Figure 4a shows the percentage correct classification as a
Gxi [Ll ; σ 2 I]
function of the patch grid resolution. The results show that fl = N , (14)
2
performance increases as the patch grid gets finer, peaking i j=1 Gxi [Lj ; σ I]
at a 10 × 10 grid (6×6 pixels) and then declining. When the
patches are very small, they are probably not sufficiently and we have done this for all results in the paper. In practice,
informative. For comparison, we also plot results from a we also restrict the possible indices l to a subset correspond-
maximum likelihood approach where we form a point es- ing to a 6×6 pixel window around the current test patch po-
timate of the parameters θcpl . We note that this approach sition in each library image so patches containing eyes are
produces noticeably worse results. only approximated by other patches containing eyes etc.
In figure 5 we verify that 6×6 pixel patches are sufficient 3.3. Gender Classification
by reconstructing real images using the closest patches l∗
from the library. It is still easy to identify the characteristics In this experiment we investigate gender classification
of the images using the approximated versions. for both the manually and automatically detected face
Figure 4b shows the percentage correct classification us- datasets. In each case, we use a training set of 16,000 male
ing 6 × 6 pixel patches as a function of the Dirichlet param- and 16,000 female images. The test set contains 500 male
eters {α1 ..αN } which are constrained to all be the same and 500 female faces and the library is made up of 120 male
value. The results show a significant jump in performance and 120 female images.
when the α value changes from 1 to 2 but then decline. This We achieve an 89% correct recognition rate on the man-
is also a confirmation of the Bayesian inference being ben- ually detected dataset. Figures 6a and b show correctly clas-
eficial, since the maximum likelihood solution can be seen sified female and male examples respectively. Note that the
as a special case of Bayesian inference when α = 1. For images contain large pose variations ranging from -90o to
the rest of the paper we adopt these optimal parameters: we +90o . Figure 6c shows typical examples of male images
use a patch resolution of 6 × 6 and set {α1 ..αN } = 2. misclassified as female. Notice that these images have no
In Equation 7 we defined a hard (MAP) assignment of facial hair and some have long hair. The third person is
training patch xi to library index ˆli . In principle it would pulling a face which was seen more often in female train-
be better to marginalize over possible values of ˆli , but this ing examples. Figure 6d shows typical examples of female
1129
Figure 7. Classification performance was tested per patch. (a)
Sample face image. (b) % correct performance per patch for gen-
der (c) % correct performance per patch for eyewear. (d) Sample
Figure 6. Gender classification performance was 89% on manu-
pedestrian image. (e) % correct performance per patch for clas-
ally detected faces. (a) Correctly classified females. (b) Correctly
sifying pedestrians as facing front vs. back. and (f) % correct
classified males. (c) Males misclassified as female. (d) Females
performance per patch for classifying pedestrians as either facing
misclassified as male. (e) Close up: interesting misclassified case.
front & back vs. facing left & right.
images misclassified as male. Many of these images are par-

or were wearing a cap. Most of the images misclassified
tially obscured or low quality. The fourth image (blown up
as being without glasses (figure 8d) were wearing frameless
in 6e) is particularly interesting. This was tagged as female
reading glasses which are difficult to distinguish.
but we suspect it is a man in a wig! Gender classification
was performed with a similar protocol on automatically de- We compared our results with the performance of 10 hu-
tected images. Here, we achieve 90% correct classification. man subjects on the same test set. Their average perfor-
This dataset shows less variation in head orientation but the mance was 96.79% correct classification. We noted that
position of the face varies more in each image. Examples most of the images misclassified by humans as without
of correctly classified faces were shown in figure 1a. glasses were also wearing frameless reading glasses.
We also tested the classification ability of each patch in- 3.5. Cell Phenotype Classification
dividually. Figure 7b shows the percentage correct classifi-
cation for each patch as a gray level image (the higher the In this section, we apply our algorithm to a second ob-
performance, the lighter the patch). Notice that there are no ject class with completely different properties. We classify
dominant patches with high discriminative power. Instead a human cells from a subset (141 images) of the Baum lab
collective decision is made for classification of gender. RNAi cell phenotype database [2] as being either ‘triangu-
lar’ or ‘normal’. For training 12500 cells from 125 images
3.4. Eyewear Classification were used for each class. The library contained 120 cells
We also investigated the task of determining whether from each class.
people were wearing glasses. The training set for this exper- The first experiment tests the ability of our algorithm to
iment contains 8,000 images with glasses and 8,000 images classify single cells. We used a test set of 500 normal and
without. The library contained 120 images with glasses and 500 triangular cells. The method achieved 70% correct clas-
120 without. The algorithm was tested on 400 images with sification. We note that (i) this is a very difficult task as the
glasses and 400 without. We achieve 84% correct classi- cell images contain significant within class variation (see
fication. Figure 7c shows the percentage correct achieved figure 9) (ii) not all cells in a given image are affected by
based on each patch alone. As expected, there is far more the experimental conditions that cause changes in cell shape
discriminatory power in the top half of the image. We re- so we do not expect perfect performance and (iii) biologists
peated the experiment using only the top half of each image are usually interested in classifying entire images (images)
and achieved 91.2% classification. each of which contains 50-150 cells.
Figure 8a shows images correctly classified as without This motivates our second experiment in which we clas-
glasses despite some images being dark (1,2) or the eye be- sify the entire images. Due to limited amount of data there
ing covered by hair (3-5). Figure 8b shows images correctly were only 16 images (8 from each class) that were not used
classified as wearing glasses, despite the images being very either in training or library. We break these images into 4
bright (4), blurry (5,6), or non-frontal (3). Misclassified im- parts, resulting in 64 subimages which were used in testing.
ages are shown in figures 8c and d. Many of the images mis- We treat each cell within each subimage as providing in-
classified as wearing glasses (figure 8c) have obscured eyes dependent information about the image class. Under these
1130
Figure 8. We achieved 91.2% correct classification of the pres-
ence of eyewear. (a) Correctly classified as without glasses. (b)
Correctly classified as wearing glasses. (c) Without glasses but
misclassified as wearing glasses. (d) Faces with glasses but mis-
classified as not wearing glasses. Figure 9. W2 channels of individual cells that were correctly clas-
sified as (a) normal and (b) triangular. (c) A correctly classified
normal image. (d) A correctly classified triangular image. Inter-
conditions, the algorithm managed to achieve 100% correct estingly, biologists usually classify cells based primarily on the
classification rate by classifying all 64 subimages correctly. W3 channel. Our algorithm seems to be exploiting information
Example classifications for cells are shown in figure 9a-b. that is not particularly salient to human experts.
Subimage classification is shown in Figures 1b and 9c. PP
PP Est.
3.6. Pedestrian Pose Classification True PP Back Front Left Right
PP
Finally, we test our algorithm on classifying pose in Back 77.7% 10.0% 5.6% 6.7%
pedestrian images. In training we use 3000 images of Front 35.6% 53.7% 6.7% 4.0%
pedestrians from each of the four classes (i) facing front, Left 7.3% 6.7% 71.0% 15.0%
(ii) facing back, (iii) facing left and (iv) facing right. We Right 12.0% 8.3% 15.0% 64.7%
use a library of 240 images (60 from each class). We devise
Table 1. Confusion matrix for pedestrian pose classification.
four separate experiments.
In the first experiment we do multi-class classification
using a test set of 1200 images (300 per class). In this ex- In this experiment we achieve 85.3% correct classification.
periment we classify a test image as belonging to one of the We plotted the per patch classification as a grayscale image
four classes. We achieve 67% correct classification overall. in 7f. Unsurprisingly this figure shows that the most dis-
Table 1 shows the confusion matrix where each row shows criminative patches for this task are ones towards the bot-
the true label and each column shows the estimated label. tom of the image: the legs are the most distinctive part of
It is notable that left facing pedestrians are most confused the image to distinguish these classes.
with right facing ones and front facing pedestrians are most
3.7. Comparison to Other Algorithms
confused with back facing ones. Examples of correct and
wrong classifications are shown in Figure 10. For further validation we compare the performance of
In the second experiment we examine binary classifica- our gender classification algorithm with the manually reg-
tion to distinguish only front-facing from back-facing ex- istered dataset to that of support vector machines (SVMs).
amples. We tested on 600 images (300 back, 300 front) and Unfortunately, SVMs were not designed to work with large
we achieve 75% correct classification. This is quite a chal- databases and it is hard to train with the high resolution
lenging task as these two classes are largely distinguishable (60 × 60) images due to the memory requirements. To get
only from the facial area. This is verified when we examine the best out of these methods we have used both (i) the max-
the per patch classification (see figure 7e): patches in the imum feasible number of training images at high resolution
top center of the image are most informative. (4000 images per class) and (ii) a larger training set (16000
In the third experiment we classify test images as ei- images per class) of low resolution images which were sub-
ther facing left or facing right. We achieve 81.2% correct sampled to 21 × 12. This is similar to images used in [16].
classification. Finally we test our algorithm on classifying For the first case (4000 high resolution images per class)
pedestrians as either facing left/right, or facing front/back. a linear SVM and a non-linear SVM with an RBF kernel
1131
discriminate the classes. The algorithm also has a close re-
lationship with bag of words models [5] . The library can
be thought of as a structured set of textons which are used
to quantize the image patches.
In terms of scalability our algorithm is linear with respect
to the size of the library and the training data. For a library
of size m and a training set of size n it scales as O(mn) dur-
ing training and O(m) in testing. The Bayesian formulation
where we marginalize over the parameters guards against
overfitting.
In future work we intend to investigate other visual tasks
such as regression on continuous characteristics (e.g. age),
localization and segmentation using similar methods that
Figure 10. Example results from the pedestrian pose classification exploit a library of patches and a large database of images.
experiment. We show the predicted label for each pedestrian. The
two images marked by a red cross have been misclassified, but the References
remaining images show correctly classified pedestrians. [1] S. Baluja and H. Rowley, “Boosting Sex Identification Perfor-
mance,” IJCV, Vol. 71, pp. 711-119, 2007.
[2] J. Rohn and B. Baum, unpublished data.
achieved 78.8% and 77.8% performance respectively. The [3] R. Brunelli and T. Poggio, “HyperBF Networks for Gender Classi-
SVMs were trained with libsvm and the parameters selected fication,” Image Understanding, pp. 311-314, 1992.
[4] A. Carpenter, T. Jones, M. Lamprecht, C. Clarke, I. Kang, O.
with 3-fold cross validation. When we tested our method Friman, D. Guertin, J. Chang, R. Lindquist, J. Moffat, et al, “Cell-
with only these 4000 training images we achieved 84.6% Profiler: image analysis software for identifying andquantifying cell
which is considerably better than either SVM method. phenotypes,” Genome Biology, Vol. 7, pp. R100, 2006.
For the second case (16000 low resolution images per [5] G. Csurka, C.R. Dance, L. Fan, J. Willamowski and C. Bray, “Vi-
sual categorization with bags of keypoints,” Workshop on Statistical
class) the linear SVM achieved 78.7% performance and the Learning in Computer Vision, ECCV, pp. 1-22, 2004.
non-linear SVM achieved 82.4%. For this dataset we also [6] A. Efros, A. Berg, G. Mori and J. Malik, “Recognizing action at a
tried linear discriminant analysis which achieved a maxi- distance,” ICCV, pp. 726-733, 2003.
mum of 78%. None of these results approach the 89% per- [7] A.A. Efros and W. T. Freeman, “Image quilting for texture synthesis
and transfer,” Proc. SIGGRAPH, pp. 341-346, 2000.
formance achieved by our algorithm. [8] P. Felzenszwalb, D. McAllester and D. Ramanan, “A discrimina-
Finally, we also compared human performance on gen- tively trained, multiscale, deformable part model,” CVPR, pp. 1-8,
der classification. For this purpose 10 subjects were shown 2008.
the same test images as used for our gender classification [9] B. Golomb, D. Lawrence and T. Sejnowski, “Sexnet: A neural net-
work identifies sex from human faces,” NIPS, Vol. 3, pp. 572-577,
experiment. The images were 60 × 60 in size and grayscale 1991.
but did not undergo further preprocessing. The average hu- [10] D. Hoiem, A. Efros and M. Hebert, “Putting Objects in Perspective,”
man performance was 95.6%. Although our best perfor- CVPR, Vol. 2, pp. 3-15, 2006.
[11] T. Jones, A. Carpenter, D. Sabatini and P. Golland, “Methods for
mance is 7% lower than this, we conclude that some of the
high-content, high-throughput image-based cell screening,” MIAAB
test images are genuinely difficult to classify. Workshop on Microscopic Image Analysis, pp. 65-72, 2006.
[12] N. Kumar, P. Belhumeur and S. Nayar, “FaceTracer: A Search En-
4. Summary and Discussion gine for Large Collections of Images with Faces,” ECCV, pp. 340-
353, 2008.
In this paper we have proposed a general Bayesian [13] M. Leung and Y. Yang, “First Sight: A Human Body Outline Label-
framework for classifying within-object characteristics. ing System,” PAMI, pp. 359-377, 1995.
Our algorithm uses a generic patch-based representation [14] S. Li and Z. Zhang, “Floatboost learning and statistical face detec-
therefore it can be used on several object classes without tion,” PAMI, Vol. 26, pp. 1112-1123, 2004.
[15] E. Mäkinen and R. Raisamo, “An experimental comparison of gen-
major alterations. We demonstrate good performance on der classification methods,” Pattern Recognition Letters, Vol. 29,
‘real world’ images of faces, human cells and pedestrians. pp. 1544-1556, 2008.
The algorithm has a close relationship with non- [16] B. Moghaddam and M. Yang, “Learning Gender with Support
parametric synthesis algorithms such as image quilting [7] Faces,” PAMI, pp. 707–711, 2002.
[17] Y. Saatci and C. Town, “Cascaded Classification of Gender and Fa-
where patches from one image are used to model others. cial Expression using Active Appearance Models,” AFGR, Vol. 80,
Our algorithm works on exactly the same principles - all pp. 393-400, 2006.
the knowledge about the object class is embedded in the li- [18] G. Shakhnarovich, P. Viola and B. Moghaddam,“A unified learning
framework for real time face detection andclassification,” AFGR,
brary images. This accounts for why the algorithm works
pp. 14-21, 2002.
so well in different circumstances. If we have enough li- [19] P. Viola and M. Jones, “Rapid Object Detection Using a Boosted
brary images they naturally provide enough information to Cascade of Simple Features,” CVPR,Vol 1, pp. 511-518, 2001.
1132

Patch-based within-object classification

Uploaded by

Patch-based within-object classification

Uploaded by

Patch-based Within-Object Classiﬁcation∗

Advances in object detection have made it possible to

2009 IEEE 12th International Conference on Computer Vision (ICCV)

pedestrians facing front, back, left and right. The bounding

images misclassiﬁed as male. Many of these images are par-

You might also like