On Efficient Bayesian Scene Interpretation

Date
2016-02-05
Publisher
Johns Hopkins University
Abstract
Scene understanding, including object recognition, is perhaps the most challenging task in computer vision. Deep convolutional neural networks (CNNs) have received a flurry of interest in the past few years due to their superior performance. However, deep networks are computationally expensive and, without an efficient implementation on high-performance computing systems, are not as practical as older methods. Furthermore, CNNs do not benefit from the selective visual attention and top-down contextual feedback connections found in human vision. The human visual system makes extensive use of contextual information to facilitate and refine object detections; object detection and recognition based only on the intrinsic features of target objects is usually not sufficient for reliable inference. In this thesis, we use a model-based approach to incorporate top-down contextual information, and we analyze scenes in a coarse-to-fine fashion inspired by selective visual attention. In addition to disambiguating object detections, contextual relations between scene entities allow the space of objects and their poses to be searched more efficiently. We present a new approach to efficiently search this space using a Bayesian method called "Entropy Pursuit", in which contextual relations between object instances and other scene entities are incorporated via a prior model. Using the entropy pursuit approach, we collect bits of information about the scene sequentially by greedily selecting the patches whose analysis is most informative in an information-theoretic sense. As a proof of concept, we use the entropy pursuit method for multi-category object recognition in table-setting scenes. We have investigated the possibility of generating a scene interpretation by processing only a fraction of the patches from an input image.
Our results confirm the hypothesis that an accurate interpretation can be identified by processing only a fraction of the patches, provided the right patches are selected in the right order, which in turn saves computation time.
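The greedy selection step described in the abstract can be illustrated with a toy sketch. This is not the thesis implementation: it assumes a small discrete set of scene hypotheses, binary patch outcomes that are conditionally independent given the hypothesis, and a hand-made likelihood table. The names `entropy_pursuit`, `lik`, `observe`, and `budget` are hypothetical and chosen only for illustration. At each step the sketch picks the unprocessed patch with the highest mutual information between its outcome and the scene hypothesis under the current posterior, then applies a Bayes update.

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable, elementwise."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def mutual_information(posterior, lik):
    """I(H; X_j) for every patch j under the current posterior.

    lik[j, h] is the assumed probability P(x_j = 1 | hypothesis h).
    """
    marginal = lik @ posterior                    # P(x_j = 1)
    expected_cond = binary_entropy(lik) @ posterior  # E_h[H(X_j | h)]
    return binary_entropy(marginal) - expected_cond

def entropy_pursuit(prior, lik, observe, budget):
    """Greedily process up to `budget` patches, most informative first."""
    posterior = np.asarray(prior, dtype=float).copy()
    processed = np.zeros(lik.shape[0], dtype=bool)
    order = []
    for _ in range(min(budget, lik.shape[0])):
        mi = mutual_information(posterior, lik)
        mi[processed] = -np.inf                   # never reselect a patch
        j = int(np.argmax(mi))
        x = observe(j)                            # run the (hypothetical) patch test
        likelihood = lik[j] if x == 1 else 1 - lik[j]
        posterior = posterior * likelihood        # Bayes update
        posterior /= posterior.sum()
        processed[j] = True
        order.append(j)
    return posterior, order

# Toy problem: 4 hypotheses, 3 patches. Patch 0 is uninformative.
lik = np.array([
    [0.5, 0.5, 0.5, 0.5],
    [0.9, 0.9, 0.1, 0.1],
    [0.8, 0.2, 0.8, 0.2],
])
prior = np.full(4, 0.25)
true_h = 2
# Noise-free stand-in for a patch classifier: report the likelier outcome.
observe = lambda j: int(lik[j, true_h] > 0.5)

posterior, order = entropy_pursuit(prior, lik, observe, budget=2)
print(order, posterior.round(2))
```

In this toy setting the uninformative patch is never processed, and two of the three patches are enough for the posterior to concentrate on the true hypothesis. In a realistic setting the prior would encode contextual relations between objects, and the patch outcomes would come from a learned classifier rather than a fixed table.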
Keywords
Scene Interpretation, Object Detection, Convolutional Neural Networks, Statistical Inference, Stochastic Approximation, MCMC