Disparity Estimation

Disparity Estimation by Graph Cut/Belief Propagation for Rectified Image Pair

Abstract:

Disparity estimation is an image labeling problem. It is modeled by Markov Random Field (MRF), and the energy minimization task is solved by some popular global optimization methods, i.e. Graph Cut and Belief Propagation. If we assume the image pair (triple) comes from rectified stereo, then the disparity estimation turns to be a 1-d searching process. Below we give some results by applying a Graph Cut-based method proposed by Boykov, Veksler and Zabih (BVZ) [4-5], its source code can be downloaded from Kolmogorov's web page. Meanwhile, some respective results are given by applying Belief Propagation-based method proposed by Felzenszwalb and Huttenlocher (FH), its source code can be downloaded from Felzenszwalb's web page. More comparison and evaluation of various stereo matching algorithms and source codes can be seen at the Middlebury Stereo Vision Page.

Max-flow and Min-Cut Problem

A flow network G(V, E) is formally defined as a fully connected directed graph where each edge (u,v) in E has a positive capacity c(u,v) >= 0. Two special vertices in a flow network are designated the source s and the sink t respectively. A flow in G is a real-valued function f: VxV --> R that satisfies the following three properties: 1) capacity constraints, for every u,v in V, f(u,v) <= c(u,v), 2) skew symmetry, for every u,v in V, f(u,v) = -f(u,v), 3) flow conservation, for all every u in (V -{s,t}), sum f(u,v) = 0 of every v in V. The value of a flow, |f| is defined as the total flow out of the source in the flow network G.

The max-flow problem is to find the flow of maximum value on a flow network G. A s-t cut or simply cut of a flow network G is a partition of V into S and T = V-S such that s in S and t in T. For a given flow f, the net flow across the cut (S,T) is defined as f(S,T) = sum f(x,y) of every x in S, y in T. Using a similar notation the capacity of a cut (S,T) is defined as c(S,T) = sum c(x,y) of every x in S and y in T. A minimum cut of a flow network is a cut whose capacity is the least over all the s-t cuts of the network.The polynomial algorithms for the single-source single-sink max-flow problem can be divided into two classes, algorithms based on the Ford Fulkerson method [1] and those based on the "Push-Relabel" method [2].

The intuitive idea behind the Ford-Fulkerson method is that starting with zero flow i.e. f(u,v) = 0 for every u,v in V, the flow can be gradually increased by finding a path from s to t along which more flow can be sent. Such a path is called an augmenting path, and once it has been found, the flow can be augmented along this path. The process if repeated, must end after a finite number of iterations after which no augmenting paths between s and t exist anymore. A typical algorithm of this type maintains for a given flow f, the residual graph of G, called G_f whose topology is identical to G but whose edge capacities stores the residual capacity of all the edges, given that there is already some flow in them. The search for an augmenting path at the ith iteration is done on the current residual graph G_f_i. Once an augmenting path is found, the maximum amount of flow that can be sent down it, f_incr must saturate at least one of the edges of this augmented path. The new flow at the end of the iteration will be f_i + f_incr. Dinic algorithm [3] uses breadth-first search to find the shortest paths from s to t on the residual graph.

The intuitive idea behind the Push-Relabel algorithms is to associate a notion of height along with all the nodes in the network. The height of the source and sink are fixed at |V| and 0 respectively while at the start all other vertices are at height 0. The algorithm starts by sending flow down from the source and the amount of flow sent, saturates all the outgoing edges. All intermediate nodes have a buffer or a reservoir that can store excess flow. Nodes with positive excess flow are said to be overflowing nodes. Overflowing nodes try to push the excess flow downhill. However when an overflowing node finds the edges to its neighbours at the same height as itself saturated, it increments its own height, a process which is called "relabeling". This allows it to get rid of the excess flow. The algorithm terminates when none of the nodes in V are overflowing. Often excess flow accumulated in the interior nodes are sent back to the source by relabeling these nodes with height beyond |V|.

The Labeling Problems Modelled with Markov Random Fields (MRFs)

Markov Random Fields is a generative model often used in Image Processing and Computer Vision to solve labeling problems. A Markov Random Field(MRF) consists of three sets, a set S of sites, a neighbourhood system N and a set (also called field) of random variables F. The neighbourhood system N = { N_i | every i in S } where each N_i is a subset of sites of S which form the neighbourhood of site i. The random field F = { F_i | for every i in S } consists of random variables F_i that take on a value f_i from a set of lables L = { l1; l2; ...}. A particular set of labels, often denoted by f ( which can be thought of as the joint event { F1 = f1; F2 = f2; ...} ) is called a configuration of F. The probability of a particular configuration is proportional to the sum of clique potential V_C over all the cliques in N. The clique potential is obtained from prior probabilities of a particular labeling of the sites in the clique C.

Markov Random Fields are used to model labeling problems where an optimal labeling is desired. From a probabilistic perpective, one wishes to estimate the configuration f based on observed data D (could be noisy or incomplete) that maximises the likelihood function, P(D|f). Using Bayes Theorem, this likelihood function can be expressed as an energy function E(f) and the maximum a posterieri (MAP) estimate of f should maximize this energy function.

Different MRF's differ in a choice of the neighbourhood system and the prior probabilities. In vision and image processing, where S, the set of sites often coincides with the set of regular grid of pixels and voxels, 4-neighbourhood or 8-neighbourhood systems on a 2D grid or 6-neighbourhood or 26-neighbourhood systems on a 3D grid are common. Morever, often a labeling with the following properties is desired; it should be locally constant or smooth but should also allow for discontinuities (at region boundaries). When one chooses clique potentials to make the desired labeling piecewise continuous, the resulting MRF is called Generalized Potts Model MRF (GPM-MRF).

The Multi-Labeling Solved by Multi-Way Cut

Boykov et. al. [4] solve various computer vision problems which can be posed as energy minimization problems. Some examples of such problems are image restoration, stereo, multiple view reconstruction, and object segmentation. As we know, these tasks can all be classified as labeling problems where visual constraints are typically contextual and reflect a choice of priors in the neighbourhood systems. Finding the most likely labeling translates to optimizing an energy function. They solve this energy minimization problem by transforming it into a multi-way cut problem on a graph. However the multi-way cut problem is NP-Hard and they propose an approximation algorithm that produces a cut which minimises the original energy function in a strong sense.

In vision and image processing, the labeling problem looks specific since those labels often tend to vary smoothly within the image, except at some kind of region boundaries where discontinuities are allowed. The fact that a particular pixel label depends on the labels of its neighbours allows modeling the optimization problem as a MRF. Boykov et.al. [5] show that finding the most likely labeling for some given data, is equivalent to seeking the MAP (maximum a posteriori) estimate. In order to solve a particular vision problem, a suitable neighbourhood system needs to be chosen and prior probabilities for particular labelings in the neighbourhood system need to be assigned. Different choices for these entities yield different classes of energy functions and hence the term energy model is used to refer to this problem domain specific choice.

Two common energy models are Potts Interaction Energy Model and Linear Interaction Energy Model. Let's describe the Pott Energy Model as an example. The graph G contain two kinds of vertices: p-vertices (pixels or voxels which are the sites in the associated MRF) and l-vertices (which coincide with the labels and will be terminals in the graph cut problem). All the edges present in the neighbourhood system N are edges in G. These edges are called n-links. Edges between the p-vertices and the l-vertices called t-links are added to the graph. The t-links are assigned weights based on the data term while n-links are assigned weights based on the interaction (smoothness) term. While n-links are bi-directional, t-links are uni-directional, leaving the source and entering the sink.

In the multiple label case, the multiway cut should leave each p-vertex connected to one l-vertex. This ensures that every multi-way cut which separates all terminals, must correspond to a valid labeling ie. a configuration of the associated MRF. The minimum cost multiway-cut will minimize the energy function where the severed n-links would correspond to the boundaries of the labeled regions. The approximation algorithm which finds this multiway cut, is called the "alpha-expansion" algorithm [5] (another is called "alpha-beta swap" algorithm).

Graph Cut-based Stereo matching (BVZ)

The problem of recovering an accurate disparity image can be posed as an energy minimization problem using the same MRF framework. Disparity in a stereo pair tends to vary within a fixed range and can be discretely sampled. It also tends to vary smoothly over the image except at depth discontinuities. Energy models like the Pott Energy can incorporate such contextual information within the MRF framework. Occlusion occur when 3D points are visible in only one of the stereo image pairs and are typically found near depth-discontinuities in the scene. Sites in the MRF for this hard problem do not represent image pixels but pair of pixels which can potentially correspond. The set of labels is {0,1} where 0 indicates either of the pixels are occluded and 1 indicates that the pair of pixels are matching. Besides of the usual data and smoothness terms, a new term called the occlusion penalty term is added, which imposes a penalty for making a particular pixel p in the stereo image pair P1 or P2 occluded. Minimizing the energy function is still NP-Hard but an approximate algorithm based on "alpha-expansion" or "alpha-beta-swap" computes a local minimum within a constant factor of the global minimum by solving max-flow on an associated graph.

"Tsukuba".

"Rena".

"Akko&Kayo".

Bayesian Network and Markov Random Field in Graphical Models

State-of-art computer vision algorithms use prior knowledge about the statistics of typical surfaces to infer the three–dimensional (3D) shape of a scene from ambiguous, local image measurements. Probabilistic graphical models provide a powerful, general framework for designing systems like these. In this approach, graphs are used to decompose joint distributions into a set of local constraints and dependencies. Such modular structure provides an intuitive language for expressing domain–specific knowledge, and facilitates transfer of modeling advances to new applications. Once a problem has been formulated using a graphical model, a wide range of efficient algorithms for statistical learning and inference can then be directly applied.

Several different formalisms have been proposed that use graphs to represent probability distributions. For example, directed graphical models, or Bayesian networks, are widely used in artificial intelligence to deduce causal, generative processes. Special cases of interest include hidden Markov models (HMMs) and continuous state space models. Alternatively, undirected graphical models, or Markov random fields (MRFs), provide popular models for the symmetric dependencies arising with spatial or image data. An MRF is a graph in which the nodes represent variables and arcs represents compatibility relations between them.

Belief Propagation

In this column, belief propagation (BP) is one of powerful tools for learning low-level vision problems, such as motion analysis, unwrapping phase images, stereo matching, inferring shape and reflectance from photograph, or extrapolating image detail. The key idea of BP is simplified Bayes Net. It propagates information throughout a graphical model via a series of messages sent between neighboring nodes iteratively.

J. Pearl [6] described local message propagating schemes for inferring the states of the hidden nodes in these types of networks. The algorithm consists of simple local updates that can be executed and are guaranteed to converge to the correct probabilities. The term belief update is used to describe the scheme for computing the function of probabilistic belief. His algorithm is equivalent to schemes proposed independently in a wide range of fields including information theory, signal processing and optimal estimation. It was proved that Pearl’s algorithm would work for the singly connected graphs. In many applications especially image processing applications, however, the networks are loopy.

Pearl’s algorithm was described for directed graphs, but undirected graphs are more useful in our application cases. Every directed graphical model can be transformed into an undirected graphical model before doing inference.There are two types of BP methods for MRFs: max-product and sum-product. The max-product BP algorithm is motivated by finding a labeling with maximum posterior probability, or equivalently with minimum energy. It can be used to approximate the MAP solution to MRF problems. The max-product BP algorithm works by passing messages around the graph defined by the four-connected image grid. The method is iterative, with messages from all nodes being passed in parallel. The sum-product BP algorithm can be used to approximate the posterior probability of each label for each pixel. As with the max-product algorithm, the sum-product method works by passing messages around the graph defined by the neighborhood system N.

Belief Propagation-based Stereo matching (FH)

Many low–level computer vision tasks have a common form: gather local evidence, then propagate that evidence across space. BP provides machinery to solve such spatial inference problems in factor graphs. In the constructed factor graph, each hidden node denotes a pixel or a patch of the image, and each observed node connected to the hidden node has the observed image values (pixels or patches). Stereo matching, in which images captured at two offset locations are used to estimate scene depths, is typical of this type of inference problem. The disparity estimation problem can be modeled with MRFs and then solved by the BP algorithms [7].

Under the assumption of diffuse reflectance, 3D scene locations produce the same intensities in different cameras. Let ds denote a candidate disparity at location (is, js) in the left, or reference, image. In the simplest case, we use a Gaussian noise process of variance to account for any deviations of the right camera intensity, Ir(is+ds, js), at the position of true disparity is+ds, from the intensity at the corresponding position in the left camera, Il(is, js). Then, the likelihood of disparity ds at image site s is defined as a Gaussian of template matching error between Ir(is+ds, js) and Il(is, js). Since many objects have smooth surfaces, it is reasonable to assume that the differences between disparities at neighboring positions are Gaussian distributed with mean zero. However, this fails to account for the very large disparity differences observed at object contours. For this reason, researchers often place truncated Gaussian compatibility functions on the depth differences between neighboring pixels.

Since the BP algorithm is computationally expensive, some algorithmic techniques are proposed in [7] to substantially improve the running time of the loopy BP approach. One of the techniques reduces the complexity of the inference algorithm to be linear rather than quadratic in the number of possible labels for each pixel. Another technique speeds up and reduces the memory requirements of BP on grid graphs. A third technique is a multi-grid method that makes it possible to obtain good results with a small fixed number of message passing iterations, independent of the size of the input images. Taken together these techniques speed up the standard algorithm by several orders of magnitude.

"Tsukuba".

"Rena".

"Akko&Kayo".

References

1. L. Ford and D. Fulkerson, "Flow in Networks", Princeton University Press, 1962.

2. A.V Goldberg and R. E. Tarjan, "A new approach to the maximum-flow problem", Journal of the Association for Computing Machinery, vol 35, no. 4, pp 921-940, Oct 1988.

3. Y. Dinitz, "Algorithm for solution of a problem of maximum flow in networks with power estimation", Soviet Math. Dokl., vol 11, pp 1277-1280, 1970.

4. Y. Boykov and V. Kolmogorov, "An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision", IEEE T-PAMI, vol 26, no.9, pp 1124-1137, Sept 2004.

5. Y. Boykov, O. Veksler and R. Zabih, "Faster approximate energy minimization via graph cuts", IEEE T-PAMI, vol 23, no.11, pp 1222-1239, Nov 2001.

6. J. Pearl, "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference". Morgan Kaufmann Publishers, San Mateo, CA, 1988.

7. Pedro F. Felzenszwalb and Daniel P. Huttenlocher, "Efficient Belief Propagation for Early Vision", International Journal of Computer Vision, Vol. 70, No. 1, October 2006