Active Pictorial Structures
- Definition
- Gaussian Markov Random Field
- Model
- Cost Function and Optimization
- Fitting Example
- References
- API Documentation
We highly recommend that you render all matplotlib figures inline in the Jupyter notebook for the best menpowidgets experience.
This can be done by running
%matplotlib inline
in a cell. Note that you only have to run it once and not in every rendering cell.
1. Definition
Active Pictorial Structures (APS) is a statistical deformable model of the shape and appearance of a deformable object class.
It is a generative model that takes advantage of the strengths, and overcomes the disadvantages, of both Pictorial Structures (PS) [3] and Active Appearance Models (AAMs) [2].
APS is motivated by the tree-based structure of PS and further expands on it, as it can formulate the relations between parts using any graph structure, not only trees. From AAMs, it borrows the use of the Gauss-Newton algorithm in combination with a statistical shape model. The weighted inverse compositional algorithm with fixed Jacobian and Hessian that is used for optimization is very fast. On this page, we provide a basic mathematical definition of APS. For a more in-depth explanation, please refer to the relevant literature in References, and especially [1].
A shape instance of a deformable object is represented as $\mathbf{s} = [\boldsymbol{\ell}_1^T, \boldsymbol{\ell}_2^T, \ldots, \boldsymbol{\ell}_L^T]^T = [x_1, y_1, \ldots, x_L, y_L]^T$, a $2L \times 1$ vector consisting of the coordinates of $L$ landmark points $\boldsymbol{\ell}_i = [x_i, y_i], ~\forall i = 1, \ldots, L$. Moreover, let us denote by $\mathcal{A}(I, \mathbf{s})$ a patch-based feature extraction function that returns an $M \times 1$ feature vector given an input image $I$ and a shape instance $\mathbf{s}$. The features (e.g. SIFT) are computed on patches that are centered around the landmark points.
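To make the role of $\mathcal{A}(I, \mathbf{s})$ concrete, here is a minimal NumPy sketch (not the implementation used by APS) that crops a fixed-size patch around each landmark and concatenates the per-patch descriptors into a single feature vector. The function name, the raw-intensity default and the border padding are assumptions made only for this example.

```python
import numpy as np

def patch_feature_vector(image, shape, patch_size=16, feature_fn=None):
    """Toy version of the patch-based feature function A(I, s).

    `image` is an (H, W) greyscale array and `shape` an (L, 2) array of
    (row, col) landmark coordinates. `feature_fn` maps a patch to a 1D
    feature vector; raw intensities are used if it is None.
    """
    half = patch_size // 2
    padded = np.pad(image, half, mode='edge')  # keep border patches valid
    per_patch = []
    for r, c in np.round(shape).astype(int):
        patch = padded[r:r + patch_size, c:c + patch_size]
        per_patch.append(patch.ravel() if feature_fn is None else feature_fn(patch))
    # Concatenate the L per-patch descriptors into a single M x 1 vector.
    return np.concatenate(per_patch)
```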
2. Gaussian Markov Random Field
Let us define an undirected graph between the $L$ landmark points of an object as $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_L\}$ is the set of $L$ vertexes and there is an edge $(v_i, v_j) \in E$ for each pair of connected landmark points. Moreover, let us assume that we have a set of random variables $X = \{X_i\}, ~\forall i: v_i \in V$, which represent an abstract feature vector of length $k$ extracted from each vertex $v_i$, i.e. $\mathbf{x}_i, ~\forall i: v_i \in V$. We model the likelihood probability of two random variables that correspond to connected vertexes with a normal distribution
$$p(X_i = \mathbf{x}_i, X_j = \mathbf{x}_j | G) \sim \mathcal{N}(\mathbf{m}_{ij}, \boldsymbol{\Sigma}_{ij}), ~\forall i, j: (v_i, v_j) \in E$$
where $\mathbf{m}_{ij}$ is the $2k \times 1$ mean vector and $\boldsymbol{\Sigma}_{ij}$ is the $2k \times 2k$ covariance matrix. Consequently, the cost of observing a set of feature vectors $\{\mathbf{x}_i\}, ~\forall i: v_i \in V$ can be computed using a Mahalanobis distance per edge, i.e.
$$\sum_{\forall i,j: (v_i, v_j) \in E} \left( \begin{bmatrix} \mathbf{x}_i \\ \mathbf{x}_j \end{bmatrix} - \mathbf{m}_{ij} \right)^T \boldsymbol{\Sigma}_{ij}^{-1} \left( \begin{bmatrix} \mathbf{x}_i \\ \mathbf{x}_j \end{bmatrix} - \mathbf{m}_{ij} \right)$$
In practice, evaluating this cost is expensive because it requires looping over all the edges of the graph. Inference can be much faster if we convert this cost to the equivalent matrix form
$$(\mathbf{x} - \mathbf{m})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \mathbf{m})$$
This is equivalent to modelling the set of random variables $X$ with a Gaussian Markov Random Field (GMRF) [4]. A GMRF is described by an undirected graph, where the vertexes stand for random variables and the edges impose statistical constraints on these random variables. Thus, the GMRF models the set of random variables with a multivariate normal distribution $p(X = \mathbf{x} | G) \sim \mathcal{N}(\mathbf{m}, \boldsymbol{\Sigma})$, where $\mathbf{m} = [\mathbf{m}_1^T, \ldots, \mathbf{m}_L^T]^T = [E(X_1)^T, \ldots, E(X_L)^T]^T$ is the $kL \times 1$ mean vector and $\boldsymbol{\Sigma}$ is the $kL \times kL$ overall covariance matrix. We denote by $\mathbf{Q}$ the block-sparse precision matrix, which is the inverse of the covariance matrix, i.e. $\mathbf{Q} = \boldsymbol{\Sigma}^{-1}$. By applying the GMRF we make the assumption that the random variables satisfy the three Markov properties (pairwise, local and global) and that the blocks of the precision matrix that correspond to disjoint vertexes are zero, i.e. $\mathbf{Q}_{ij} = \mathbf{0}_{k \times k}, ~\forall i, j: (v_i, v_j) \notin E$. By defining $G_i = \{(i-1)k + 1, (i-1)k + 2, \ldots, ik\}$ to be a set of indices for sampling a matrix, we can prove that the structure of the precision matrix is
$$\mathbf{Q} = \begin{cases} \displaystyle\sum_{\forall j: (v_i, v_j) \in E} \boldsymbol{\Sigma}_{ij}^{-1}(G_1, G_1) + \displaystyle\sum_{\forall j: (v_j, v_i) \in E} \boldsymbol{\Sigma}_{ji}^{-1}(G_2, G_2), ~\forall i: v_i \in V & \text{at } (G_i, G_i) \\ \boldsymbol{\Sigma}_{ij}^{-1}(G_1, G_2), ~\forall i, j: (v_i, v_j) \in E & \text{at } (G_i, G_j) \text{ and } (G_j, G_i) \\ \mathbf{0} & \text{elsewhere} \end{cases}$$
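A small NumPy sketch of this construction may help. It assembles the block-sparse $\mathbf{Q}$ from the edgewise covariances of an undirected graph; the function name and input format are my own, and the edgewise covariances are assumed to be given.

```python
import numpy as np

def gmrf_precision_undirected(edges, edge_covs, n_vertices, k):
    """Assemble the kL x kL block-sparse precision matrix Q from the
    2k x 2k edgewise covariances Sigma_ij of an undirected graph.

    `edges` is a list of 0-based (i, j) pairs and `edge_covs[(i, j)]`
    holds the corresponding 2k x 2k covariance matrix.
    """
    Q = np.zeros((k * n_vertices, k * n_vertices))
    for i, j in edges:
        P = np.linalg.inv(edge_covs[(i, j)])   # 2k x 2k edge precision
        Gi = slice(i * k, (i + 1) * k)
        Gj = slice(j * k, (j + 1) * k)
        Q[Gi, Gi] += P[:k, :k]                 # Sigma_ij^{-1}(G1, G1)
        Q[Gj, Gj] += P[k:, k:]                 # Sigma_ij^{-1}(G2, G2)
        Q[Gi, Gj] += P[:k, k:]                 # Sigma_ij^{-1}(G1, G2)
        Q[Gj, Gi] += P[k:, :k]                 # and its transpose
    return Q
```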
Using the same assumptions and given a directed graph (cyclic or acyclic) $G = (V, E)$, where $(v_i, v_j) \in E$ means that $v_i$ is the parent of $v_j$, we can show that
$$(\mathbf{x} - \mathbf{m})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \mathbf{m}) = \sum_{\forall i,j: (v_i, v_j) \in E} (\mathbf{x}_i - \mathbf{x}_j - \mathbf{m}_{ij})^T \boldsymbol{\Sigma}_{ij}^{-1} (\mathbf{x}_i - \mathbf{x}_j - \mathbf{m}_{ij})$$
is true if
$$\mathbf{Q} = \begin{cases} \displaystyle\sum_{\forall j: (v_i, v_j) \in E} \boldsymbol{\Sigma}_{ij}^{-1} + \displaystyle\sum_{\forall j: (v_j, v_i) \in E} \boldsymbol{\Sigma}_{ji}^{-1}, ~\forall i: v_i \in V & \text{at } (G_i, G_i) \\ -\boldsymbol{\Sigma}_{ij}^{-1}, ~\forall i, j: (v_i, v_j) \in E & \text{at } (G_i, G_j) \text{ and } (G_j, G_i) \\ \mathbf{0} & \text{elsewhere} \end{cases}$$
where $\mathbf{m}_{ij} = E(X_i - X_j)$ and $\mathbf{m} = [\mathbf{m}_1^T, \ldots, \mathbf{m}_L^T]^T = [E(X_1)^T, \ldots, E(X_L)^T]^T$. In this case, if $G$ is a tree, then we have a Bayesian network. Please refer to the supplementary material of [1] for detailed proofs of the above.
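The identity above is easy to check numerically. The sketch below (plain NumPy, with my own helper names and randomly generated symmetric positive-definite covariances standing in for real training statistics) builds $\mathbf{Q}$ for a small directed graph and confirms that the matrix form equals the per-edge sum.

```python
import numpy as np

def gmrf_precision_directed(edges, edge_covs, n_vertices, k):
    """Precision matrix Q for a directed graph whose edgewise differences
    x_i - x_j ~ N(m_ij, Sigma_ij); `edge_covs[(i, j)]` is k x k."""
    Q = np.zeros((k * n_vertices, k * n_vertices))
    for i, j in edges:
        P = np.linalg.inv(edge_covs[(i, j)])
        Gi, Gj = slice(i * k, (i + 1) * k), slice(j * k, (j + 1) * k)
        Q[Gi, Gi] += P
        Q[Gj, Gj] += P
        Q[Gi, Gj] -= P
        Q[Gj, Gi] -= P
    return Q

rng = np.random.default_rng(0)
L, k = 5, 2
edges = [(0, 1), (0, 2), (2, 3), (2, 4)]
edge_covs = {}
for e in edges:
    A = rng.normal(size=(k, k))
    edge_covs[e] = A @ A.T + k * np.eye(k)   # random SPD covariance
Q = gmrf_precision_directed(edges, edge_covs, L, k)

x = rng.normal(size=k * L)
m = rng.normal(size=k * L)
block = lambda v, i: v[i * k:(i + 1) * k]
lhs = (x - m) @ Q @ (x - m)
rhs = sum(
    (block(x, i) - block(x, j) - (block(m, i) - block(m, j)))
    @ np.linalg.inv(edge_covs[(i, j)])
    @ (block(x, i) - block(x, j) - (block(m, i) - block(m, j)))
    for i, j in edges)
print(np.allclose(lhs, rhs))  # True
```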
3. Model
An APS [1] is trained using a set of $N$ images $\{I_1, I_2, \ldots, I_N\}$ that are annotated with a set of $L$ landmarks. It consists of the following parts:
Shape Model
The shape model is trained as explained in the Point Distribution Model section. The training shapes $\{\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_N\}$ are first aligned using Generalized Procrustes Analysis and then an orthonormal basis is created using Principal Component Analysis (PCA), which is further augmented with four eigenvectors that represent the similarity transform (scaling, in-plane rotation and translation). This results in
$$\{\bar{\mathbf{s}}, \mathbf{U}\}$$
where $\mathbf{U} \in \mathbb{R}^{2L \times n}$ is the orthonormal basis of $n$ eigenvectors (including the four similarity components) and $\bar{\mathbf{s}} \in \mathbb{R}^{2L \times 1}$ is the mean shape vector. We define the function $\mathcal{S} \in \mathbb{R}^{2L}$, which generates a new shape instance given the shape model's basis, an input shape and the shape parameters vector, as
$$\mathcal{S}(\mathbf{s}, \mathbf{p}) = \mathbf{s} + \mathbf{U}\mathbf{p}$$
where $\mathbf{p} = [p_1, p_2, \ldots, p_n]^T$ is the vector of parameter values. Similarly, we define the set of functions $\mathcal{S}_i \in \mathbb{R}^2, ~\forall i = 1, \ldots, L$, that return the coordinates of the $i$-th landmark of the shape instance as
$$\mathcal{S}_i(\mathbf{s}, \mathbf{p}) = \mathbf{s}_{2i-1, 2i} + \mathbf{U}_{2i-1, 2i} \mathbf{p}, ~\forall i = 1, \ldots, L$$
where $\mathbf{s}_{2i-1, 2i}$ denotes the coordinates' vector of the $i$-th landmark point and $\mathbf{U}_{2i-1, 2i}$ denotes the $2i-1$ and $2i$ rows of the shape subspace $\mathbf{U}$.
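A minimal NumPy sketch of these two functions (the names are my own, and indices are 0-based in the code whereas the text uses 1-based indexing):

```python
import numpy as np

def shape_instance(s_bar, U, p):
    """S(s_bar, p) = s_bar + U p : a 2L x 1 shape vector from the mean
    shape, the orthonormal basis and the shape parameters."""
    return s_bar + U @ p

def landmark_i(s_bar, U, p, i):
    """S_i(s_bar, p): the coordinates of landmark i (0-based), i.e.
    rows 2i and 2i + 1 of the model."""
    rows = slice(2 * i, 2 * i + 2)
    return s_bar[rows] + U[rows] @ p

# Tiny usage example with a random 3-landmark model and n = 2 modes.
rng = np.random.default_rng(0)
L, n = 3, 2
s_bar = rng.normal(size=2 * L)
U, _ = np.linalg.qr(rng.normal(size=(2 * L, n)))   # orthonormal basis
p = np.array([0.5, -1.0])
print(np.allclose(shape_instance(s_bar, U, p)[2:4],
                  landmark_i(s_bar, U, p, 1)))      # True
```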
Another way to build the shape model is by using a GMRF structure. Specifically, given an undirected graph $G^s = (V^s, E^s)$ and assuming that the pairwise locations' vector of two connected landmarks follows a normal distribution, i.e. $[x_i, y_i, x_j, y_j]^T \sim \mathcal{N}(\mathbf{m}_{ij}^s, \boldsymbol{\Sigma}_{ij}^s)$, we formulate a GMRF. Thus, this can be expressed in matrix form, as shown in the GMRF paragraph, as $p(\mathbf{s} | G^s) \sim \mathcal{N}(\bar{\mathbf{s}}, \boldsymbol{\Sigma}^s)$. After obtaining the GMRF precision matrix $\mathbf{Q}^s$, we can invert it and apply PCA on the resulting covariance matrix in order to obtain a linear shape model. Note that, as shown in [1], it is more beneficial to directly apply PCA rather than using a GMRF for the shape model.
Appearance Model
Given an undirected graph $G^a = (V^a, E^a)$ and assuming that the concatenation of the appearance vectors of two connected landmarks follows a normal distribution, we form a GMRF that can be expressed as $p(\mathcal{A}(I, \mathbf{s}) | G^a) \sim \mathcal{N}(\bar{\mathbf{a}}, \boldsymbol{\Sigma}^a)$, where $\bar{\mathbf{a}}$ is the $M \times 1$ mean appearance vector and $\mathbf{Q}^a = (\boldsymbol{\Sigma}^a)^{-1}$ is the $M \times M$ precision matrix that is formulated as shown in the GMRF paragraph. During the training of the appearance model, a low-rank representation of each edgewise covariance matrix $\boldsymbol{\Sigma}_{ij}^a$ is used, keeping the first $m$ singular values of its SVD factorization. Given $\bar{\mathbf{a}}$ and $\mathbf{Q}^a$, the cost of an observed appearance vector $\mathcal{A}(I, \mathbf{s})$ corresponding to a shape instance $\mathbf{s} = \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p})$ in an image $I$ is
$$\left\| \mathcal{A}(I, \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p})) - \bar{\mathbf{a}} \right\|^2_{\mathbf{Q}^a} = \left[ \mathcal{A}(I, \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p})) - \bar{\mathbf{a}} \right]^T \mathbf{Q}^a \left[ \mathcal{A}(I, \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p})) - \bar{\mathbf{a}} \right]$$
Deformation Prior
This is similar to the deformation model of Pictorial Structures [3]. Specifically, we define a directed (cyclic or acyclic) graph between the landmark points $G^d = (V^d, E^d)$ and model the relative locations between the parent and child of each edge with a GMRF. We assume that the relative location between the vertexes of each edge follows a normal distribution $[x_i - x_j, y_i - y_j]^T \sim \mathcal{N}(\mathbf{m}_{ij}^d, \boldsymbol{\Sigma}_{ij}^d), ~\forall i, j: (v_i^d, v_j^d) \in E^d$, and model the overall structure with a GMRF that has a $2L \times 2L$ block-sparse precision matrix $\mathbf{Q}^d$ computed as shown in the GMRF paragraph. Note that the mean relative location vector is the same as the mean shape vector. This spring-like prior model manages to constrain extreme shape configurations that could arise during fitting from a very bad initialization, leading the optimization process towards a better result. Given $\bar{\mathbf{s}}$ and $\mathbf{Q}^d$, the cost of observing a shape instance $\mathbf{s} = \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p})$ is
$$\left\| \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p}) - \bar{\mathbf{s}} \right\|^2_{\mathbf{Q}^d} = \left\| \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p}) - \mathcal{S}(\bar{\mathbf{s}}, \mathbf{0}) \right\|^2_{\mathbf{Q}^d} = \mathcal{S}(\mathbf{0}, \mathbf{p})^T \mathbf{Q}^d \mathcal{S}(\mathbf{0}, \mathbf{p})$$
where we use the properties $\mathcal{S}(\bar{\mathbf{s}}, \mathbf{0}) = \bar{\mathbf{s}} + \mathbf{U}\mathbf{0} = \bar{\mathbf{s}}$ and $\mathcal{S}(\bar{\mathbf{s}}, \mathbf{p}) - \bar{\mathbf{s}} = \bar{\mathbf{s}} + \mathbf{U}\mathbf{p} - \bar{\mathbf{s}} = \mathcal{S}(\mathbf{0}, \mathbf{p})$.
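As a small sketch (assuming $\mathbf{Q}^d$ has already been assembled, e.g. with a helper like the directed-graph one above), the deformation cost can be evaluated either directly or through the precomputed $n \times n$ matrix $\mathbf{U}^T \mathbf{Q}^d \mathbf{U}$, which is exactly the quantity reused by the optimization in the next section; all names here are illustrative.

```python
import numpy as np

def deformation_cost(p, U, Qd):
    """|| S(s_bar, p) - s_bar ||^2_{Q^d} = S(0, p)^T Q^d S(0, p)."""
    Up = U @ p                      # S(0, p) = U p
    return Up @ Qd @ Up

def deformation_cost_fast(p, H_S):
    """Same cost via the precomputed n x n matrix H_S = U^T Q^d U."""
    return p @ H_S @ p

# The two forms agree (random toy model):
rng = np.random.default_rng(0)
L, n = 4, 3
U = rng.normal(size=(2 * L, n))
A = rng.normal(size=(2 * L, 2 * L))
Qd = A @ A.T + 2 * L * np.eye(2 * L)   # stand-in SPD precision matrix
p = rng.normal(size=n)
print(np.allclose(deformation_cost(p, U, Qd),
                  deformation_cost_fast(p, U.T @ Qd @ U)))  # True
```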
4. Cost Function and Optimization
Given a test image $I$, the optimization cost function of APS is
$$\arg\min_{\mathbf{p}} \left\| \mathcal{A}(I, \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p})) - \bar{\mathbf{a}} \right\|^2_{\mathbf{Q}^a} + \lambda \left\| \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p}) - \bar{\mathbf{s}} \right\|^2_{\mathbf{Q}^d}$$
The minimization of this function with respect to the shape parameters $\mathbf{p}$ is performed with a variant of the Gauss-Newton algorithm. The hyper-parameter $\lambda$ controls the weighting between the appearance and deformation costs and, given the generative nature of the model, it can greatly affect the performance. An optimal value of this hyper-parameter can be found using a validation dataset. The optimization procedure can be applied in two different ways, depending on the coordinate system in which the shape parameters are updated: (1) forward and (2) inverse. Additionally, the parameters update could be carried out either in an additive or a compositional manner. The compositional update has the form $\mathcal{S}(\bar{\mathbf{s}}, \mathbf{p}) \leftarrow \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p}) \circ \mathcal{S}(\bar{\mathbf{s}}, \Delta\mathbf{p})^{-1} = \mathcal{S}(\mathcal{S}(\bar{\mathbf{s}}, -\Delta\mathbf{p}), \mathbf{p}) = \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p} - \Delta\mathbf{p})$. Consequently, due to the translational nature of the motion model, the compositional parameters update reduces to a subtraction of the parameters, i.e. $\mathbf{p} \leftarrow \mathbf{p} - \Delta\mathbf{p}$, which is equivalent to the additive update.
Weighted Inverse Compositional with fixed Jacobian and Hessian
By using this compositional update of the parameters and having an initial estimate of $\mathbf{p}$, the cost function of APS is expressed as minimizing
$$\arg\min_{\Delta\mathbf{p}} \left\| \mathcal{A}(I, \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p})) - \bar{\mathbf{a}}(\mathcal{S}(\bar{\mathbf{s}}, \Delta\mathbf{p})) \right\|^2_{\mathbf{Q}^a} + \lambda \left\| \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p}) - \mathcal{S}(\bar{\mathbf{s}}, \Delta\mathbf{p}) \right\|^2_{\mathbf{Q}^d}$$
with respect to $\Delta\mathbf{p}$. With some abuse of notation due to $\bar{\mathbf{a}}$ being a vector, $\bar{\mathbf{a}}(\mathcal{S}(\bar{\mathbf{s}}, \Delta\mathbf{p}))$ can be described as
$$\bar{\mathbf{a}}(\mathcal{S}(\bar{\mathbf{s}}, \Delta\mathbf{p})) = \begin{bmatrix} \mathbf{m}_1^a(\mathcal{S}_1(\bar{\mathbf{s}}, \Delta\mathbf{p})) \\ \vdots \\ \mathbf{m}_L^a(\mathcal{S}_L(\bar{\mathbf{s}}, \Delta\mathbf{p})) \end{bmatrix}$$
where $\mathbf{m}_i^a, ~\forall i = 1, \ldots, L$, is the mean appearance vector per patch. In order to find the solution we need to linearize around $\Delta\mathbf{p} = \mathbf{0}$ as
$$\begin{cases} \bar{\mathbf{a}}(\mathcal{S}(\bar{\mathbf{s}}, \Delta\mathbf{p})) \approx \bar{\mathbf{a}} + \mathbf{J}_{\bar{\mathbf{a}}}|_{\mathbf{p}=\mathbf{0}} \Delta\mathbf{p} \\ \mathcal{S}(\bar{\mathbf{s}}, \Delta\mathbf{p}) \approx \bar{\mathbf{s}} + \mathbf{J}_{\mathcal{S}}|_{\mathbf{p}=\mathbf{0}} \Delta\mathbf{p} \end{cases}$$
where $\mathbf{J}_{\mathcal{S}}|_{\mathbf{p}=\mathbf{0}} = \mathbf{J}_{\mathcal{S}} = \frac{\partial \mathcal{S}}{\partial \mathbf{p}} = \mathbf{U}$ is the $2L \times n$ shape Jacobian and $\mathbf{J}_{\bar{\mathbf{a}}}|_{\mathbf{p}=\mathbf{0}} = \mathbf{J}_{\bar{\mathbf{a}}}$ is the $M \times n$ appearance Jacobian
$$\mathbf{J}_{\bar{\mathbf{a}}} = \nabla\bar{\mathbf{a}} \frac{\partial \mathcal{S}}{\partial \mathbf{p}} = \nabla\bar{\mathbf{a}} \mathbf{U} = \begin{bmatrix} \nabla\mathbf{m}_1^a \mathbf{U}_{1,2} \\ \vdots \\ \nabla\mathbf{m}_L^a \mathbf{U}_{2L-1,2L} \end{bmatrix}$$
where $\mathbf{U}_{2i-1,2i}$ denotes the $2i-1$ and $2i$ row vectors of the basis $\mathbf{U}$. Note that we make an abuse of notation by writing $\nabla\mathbf{m}_i^a$ because $\mathbf{m}_i^a$ is a vector. However, it represents the gradient of the mean patch-based appearance that corresponds to landmark $i$. By substituting, taking the partial derivative with respect to $\Delta\mathbf{p}$, equating it to $\mathbf{0}$ and solving for $\Delta\mathbf{p}$, we get
$$\Delta\mathbf{p} = \mathbf{H}^{-1} \left[ \mathbf{J}_{\bar{\mathbf{a}}}^T \mathbf{Q}^a \left( \mathcal{A}(I, \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p})) - \bar{\mathbf{a}} \right) + \lambda \mathbf{H}_{\mathcal{S}} \mathbf{p} \right]$$
where
$$\left.\begin{aligned} \mathbf{H}_{\bar{\mathbf{a}}} &= \mathbf{J}_{\bar{\mathbf{a}}}^T \mathbf{Q}^a \mathbf{J}_{\bar{\mathbf{a}}} \\ \mathbf{H}_{\mathcal{S}} &= \mathbf{J}_{\mathcal{S}}^T \mathbf{Q}^d \mathbf{J}_{\mathcal{S}} = \mathbf{U}^T \mathbf{Q}^d \mathbf{U} \end{aligned}\right\} \Rightarrow \mathbf{H} = \mathbf{H}_{\bar{\mathbf{a}}} + \lambda \mathbf{H}_{\mathcal{S}}$$
is the combined $n \times n$ Hessian matrix, where we use the property $\mathbf{J}_{\mathcal{S}}^T \mathbf{Q}^d (\mathcal{S}(\bar{\mathbf{s}}, \mathbf{p}) - \bar{\mathbf{s}}) = \mathbf{U}^T \mathbf{Q}^d \mathbf{U} \mathbf{p} = \mathbf{H}_{\mathcal{S}} \mathbf{p}$. Note that $\mathbf{J}_{\bar{\mathbf{a}}}$, $\mathbf{H}_{\bar{\mathbf{a}}}$, $\mathbf{H}_{\mathcal{S}}$ and $\mathbf{H}^{-1}$ can be precomputed. The computational cost per iteration is only $O(nM)$, which is practically a multiplication between an $n \times M$ matrix and an $M \times 1$ vector, leading to a very fast fitting algorithm.
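A schematic NumPy sketch of the two phases of this algorithm under the assumptions above; the function and variable names are illustrative, and the feature extraction $\mathcal{A}(I, \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p}))$ that produces `a_observed` is assumed to happen outside of the shown functions.

```python
import numpy as np

def precompute(J_a_bar, Qa, U, Qd, lam):
    """Fixed quantities of the weighted inverse compositional algorithm."""
    H_a = J_a_bar.T @ Qa @ J_a_bar          # appearance Hessian (n x n)
    H_S = U.T @ Qd @ U                      # deformation Hessian (n x n)
    H_inv = np.linalg.inv(H_a + lam * H_S)  # combined Hessian, inverted once
    JtQa = J_a_bar.T @ Qa                   # n x M, reused at every iteration
    return H_inv, JtQa, H_S

def gauss_newton_step(p, a_observed, a_bar, H_inv, JtQa, H_S, lam):
    """One iteration: solve for Delta p and apply p <- p - Delta p."""
    residual = a_observed - a_bar                      # M x 1
    dp = H_inv @ (JtQa @ residual + lam * (H_S @ p))   # O(nM) per iteration
    return p - dp
```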
Forward
The cost function in this case takes the form of minimizing
$$\arg\min_{\Delta\mathbf{p}} \left\| \mathcal{A}(I, \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p} + \Delta\mathbf{p})) - \bar{\mathbf{a}} \right\|^2_{\mathbf{Q}^a} + \lambda \left\| \mathcal{S}(\mathbf{0}, \mathbf{p} + \Delta\mathbf{p}) \right\|^2_{\mathbf{Q}^d}$$
with respect to $\Delta\mathbf{p}$. In order to find the solution we need to linearize around $\mathbf{p}$, thus using a first-order Taylor expansion at $\mathbf{p} + \Delta\mathbf{p} = \mathbf{p} \Rightarrow \Delta\mathbf{p} = \mathbf{0}$ as
$$\begin{cases} \mathcal{A}(I, \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p} + \Delta\mathbf{p})) \approx \mathcal{A}(I, \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p})) + \mathbf{J}_{\mathcal{A}}|_{\mathbf{p}=\mathbf{p}} \Delta\mathbf{p} \\ \mathcal{S}(\mathbf{0}, \mathbf{p} + \Delta\mathbf{p}) \approx \mathcal{S}(\mathbf{0}, \mathbf{p}) + \mathbf{J}_{\mathcal{S}}|_{\mathbf{p}=\mathbf{p}} \Delta\mathbf{p} \end{cases}$$
where $\mathbf{J}_{\mathcal{S}}|_{\mathbf{p}=\mathbf{p}} = \mathbf{J}_{\mathcal{S}} = \frac{\partial \mathcal{S}}{\partial \mathbf{p}} = \mathbf{U}$ is the $2L \times n$ shape Jacobian and $\mathbf{J}_{\mathcal{A}}|_{\mathbf{p}=\mathbf{p}} = \mathbf{J}_{\mathcal{A}}$ is the $M \times n$ appearance Jacobian
$$\mathbf{J}_{\mathcal{A}} = \nabla\mathcal{A} \frac{\partial \mathcal{S}}{\partial \mathbf{p}} = \nabla\mathcal{A} \mathbf{U} = \begin{bmatrix} \nabla\mathcal{A}(I, \mathcal{S}_1(\bar{\mathbf{s}}, \mathbf{p})) \mathbf{U}_{1,2} \\ \nabla\mathcal{A}(I, \mathcal{S}_2(\bar{\mathbf{s}}, \mathbf{p})) \mathbf{U}_{3,4} \\ \vdots \\ \nabla\mathcal{A}(I, \mathcal{S}_L(\bar{\mathbf{s}}, \mathbf{p})) \mathbf{U}_{2L-1,2L} \end{bmatrix}$$
where $\mathbf{U}_{2i-1,2i}$ denotes the $2i-1$ and $2i$ row vectors of the basis $\mathbf{U}$. By substituting, taking the partial derivative with respect to $\Delta\mathbf{p}$, equating it to $\mathbf{0}$ and solving for $\Delta\mathbf{p}$, we get
$$\Delta\mathbf{p} = -\mathbf{H}^{-1} \left[ \mathbf{J}_{\mathcal{A}}^T \mathbf{Q}^a \left( \mathcal{A}(I, \mathcal{S}(\bar{\mathbf{s}}, \mathbf{p})) - \bar{\mathbf{a}} \right) + \lambda \mathbf{H}_{\mathcal{S}} \mathbf{p} \right]$$
where
$$\left.\begin{aligned} \mathbf{H}_{\mathcal{A}} &= \mathbf{J}_{\mathcal{A}}^T \mathbf{Q}^a \mathbf{J}_{\mathcal{A}} \\ \mathbf{H}_{\mathcal{S}} &= \mathbf{J}_{\mathcal{S}}^T \mathbf{Q}^d \mathbf{J}_{\mathcal{S}} = \mathbf{U}^T \mathbf{Q}^d \mathbf{U} \end{aligned}\right\} \Rightarrow \mathbf{H} = \mathbf{H}_{\mathcal{A}} + \lambda \mathbf{H}_{\mathcal{S}}$$
is the combined $n \times n$ Hessian matrix, taking into account that $\mathbf{J}_{\mathcal{S}}^T \mathbf{Q}^d \mathcal{S}(\mathbf{0}, \mathbf{p}) = \mathbf{U}^T \mathbf{Q}^d \mathbf{U} \mathbf{p} = \mathbf{H}_{\mathcal{S}} \mathbf{p}$. $\mathbf{H}_{\mathcal{S}}$ can be precomputed, but $\mathbf{J}_{\mathcal{A}}$ and $\mathbf{H}^{-1}$ need to be computed at each iteration. Consequently, the total computational cost is $O(nM^2 + nM + n^3)$, which is much higher than the cost of the weighted inverse compositional algorithm with fixed Jacobian and Hessian.
5. Fitting Example
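The fitting example is omitted in this extract; as a rough placeholder, the sketch below shows how an APS could be trained and fitted with menpofit. The class names GenerativeAPS and GaussNewtonAPSFitter and the fit_from_bb call reflect the menpofit.aps module as I understand it and may differ between versions; the paths are placeholders, and menpodetect is only used to obtain an initial bounding box.

```python
import menpo.io as mio
from menpofit.aps import GenerativeAPS, GaussNewtonAPSFitter
from menpodetect import load_dlib_frontal_face_detector

# Load annotated training images (placeholder path).
training_images = list(mio.import_images('/path/to/trainset/', verbose=True))

# Train the model and wrap it in a Gauss-Newton fitter that runs the
# weighted inverse compositional algorithm described above.
aps = GenerativeAPS(training_images, verbose=True)
fitter = GaussNewtonAPSFitter(aps)

# Fit to a test image, initialising from a face detector's bounding box.
image = mio.import_image('/path/to/test_image.jpg')
detector = load_dlib_frontal_face_detector()
bounding_boxes = detector(image)
result = fitter.fit_from_bb(image, bounding_boxes[0], max_iters=20)
print(result)
```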
6. References
[1] E. Antonakos, J. Alabort-i-Medina, and S. Zafeiriou. "Active Pictorial Structures", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[2] I. Matthews and S. Baker. "Active Appearance Models Revisited", International Journal of Computer Vision (IJCV), vol. 60, no. 2, pp. 135-164, 2004.
[3] P. F. Felzenszwalb and D. P. Huttenlocher. "Pictorial Structures for Object Recognition", International Journal of Computer Vision (IJCV), vol. 61, no. 1, pp. 55-79, 2005.
[4] H. Rue and L. Held. "Gaussian Markov Random Fields: Theory and Applications", CRC Press, 2005.