Conceptual Description of Visual Scenes from Linguistic Models

Mukerjee, Amitabha, Kshitij Gupta, Siddharth Nautiyal,
Mukesh P. Singh and Neelkanth Mishra

Journal of Image and Vision Computing,
Special Issue on Conceptual Descriptions, v.18 (March 2000)


Abstract

As model-based vision moves towards handling imprecise descriptions like ``the long bench is in front of the tree'', it has to confront questions involving widely variable shapes in unclear positions. Such descriptions may be said to be ``conceptual'' in the sense that they provide a loose set of constraints permitting a range of instantiations for the scene. One validation of a computational system's ability to handle such descriptions is provided by immediate visualization, which tells the user whether the bench is of the right shape and has been positioned correctly. Such a visualization must handle imprecision in shape and spatial pose and, for dynamic vision, in object articulation and motion parameters as well. The visualization task is a concretization: generating an ``instance'' of the scene or action being described.

The principal requirement for concretizing the conceptual model is a large visual database of objects and actions, along with a set of constraints corresponding to default dependencies in the domain. In our work, the resulting set of constraints is combined using multi-dimensional fuzzy functions called continuum fields (potentials). A set of experiments was conducted to determine the parameters of these continuum fields. An instance is generated by identifying minima in the continuum fields involved in generating the shape, position and motion; these minima are then used to create default instantiations of the objects described. The resulting image/animation may be considered the ``most likely'' visualization, and if it matches the linguistic description, the selected continuum fields are a good model for the conceptual content of the linguistic scene model. We present examples of scene reconstruction from conceptual descriptions of urban parks.
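To make the continuum-field idea concrete, the following Python fragment is a minimal sketch, not the implementation described in the paper, of how fuzzy spatial constraints such as ``in front of'' and ``near'' might be encoded as potentials over candidate positions and combined, with the default placement taken at the minimum of the combined field. The field shapes, weights and viewer position used here are illustrative assumptions.

    import numpy as np

    def in_front_of(x, y, ref_x, ref_y, viewer=(0.0, -10.0), spread=2.0):
        # Low potential on the viewer's side of the reference object ("in front of"),
        # rising steeply behind it and away from the viewer-reference axis.
        # (Assumed field shape, for illustration only.)
        vx, vy = ref_x - viewer[0], ref_y - viewer[1]
        norm = np.hypot(vx, vy)
        along = ((x - ref_x) * vx + (y - ref_y) * vy) / norm    # signed offset along the axis
        across = ((x - ref_x) * -vy + (y - ref_y) * vx) / norm  # lateral offset from the axis
        front = np.where(along < 0, (along / spread) ** 2, 4.0 + along ** 2)
        return front + (across / spread) ** 2

    def near(x, y, ref_x, ref_y, ideal=2.5, spread=1.5):
        # Low potential at a preferred separation from the reference object.
        d = np.hypot(x - ref_x, y - ref_y)
        return ((d - ideal) / spread) ** 2

    # Grid of candidate bench positions; the tree is placed at the origin.
    xs, ys = np.meshgrid(np.linspace(-8.0, 8.0, 161), np.linspace(-8.0, 8.0, 161))
    field = in_front_of(xs, ys, 0.0, 0.0) + near(xs, ys, 0.0, 0.0)

    # The "most likely" instantiation sits at the minimum of the combined field.
    i, j = np.unravel_index(np.argmin(field), field.shape)
    print("default bench position: (%.2f, %.2f)" % (xs[i, j], ys[i, j]))

Under these assumed parameters the minimum falls a short distance in front of the tree on the viewer's side, which is the kind of default placement the visualization would render.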


A visual scene reconstruction based on the natural language input ``The man goes to the woman and gives the flower to the woman.'' Given fuzzy geometric models that define a ``man'', a ``woman'' and a ``flower'', the system has to identify possible constraints on the goal position and motion type for ``goes'', generate a feasible trajectory that is reasonably smooth and executable with the person's gait and motion type, obtain arm motions and body stances for both agents during ``give'', determine the type of grasp used by the woman for the flower, and so on. Subsequently, presented with a dynamic visual scene, the system can test whether it matches such a model.
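As an illustration of the trajectory step for ``goes'', the sketch below (a toy example under stated assumptions, not the paper's implementation) picks a default goal position a short standoff distance from the woman and interpolates a smooth ease-in/ease-out path; the standoff distance and easing profile are assumed values.

    import numpy as np

    def goal_near(agent, target, standoff=0.8):
        # Default goal: stop a comfortable distance short of the target person.
        # (The standoff value is an assumption.)
        direction = (target - agent) / np.linalg.norm(target - agent)
        return target - standoff * direction

    def smooth_path(start, goal, steps=50):
        # Smoothstep timing gives zero start/end velocity -- a plausible walking gait.
        t = np.linspace(0.0, 1.0, steps)
        s = 3 * t**2 - 2 * t**3
        return start + np.outer(s, goal - start)

    man = np.array([0.0, 0.0])
    woman = np.array([5.0, 2.0])
    path = smooth_path(man, goal_near(man, woman))
    print(path[0], path[-1])   # starts at the man, ends just in front of the woman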

Full paper: [Gzipped Postscript, 840160 bytes]

