Prof. Mukherjee is on medical leave. If you need to contact him, please contact his brother Mr. Ashis Mukherjee at +919868948570.


amitabha mukerjee


department of computer science and engineering
indian institute of technology, kanpur
kanpur- 208 016, india.
amit [at]

Ph.D. student opportunities


Research highlights: Learning from scratch

The primary interest of our research group is developmental cognition and AI: how an infant or a robot might acquire concept-like structures. In the first months of life, the human baby acquires control of its limbs: it can swat at objects from about 6 weeks, and reach for them soon after, though grasping (of simple ball-like objects) is not mastered until 5 months or so. Our primary hypothesis is that by repeating actions and noting which solutions give good results, one may internalize correlations between parameters, which reduce the degrees of freedom (dimensionality) of the decision space. In our work, we investigate this initial motor learning, which maps vision onto motor tasks, and then consider how such a map leads to a notion of objects. We invoke this paradigm, which reduces dimensions by restricting the search to a manifold in the decision space, in a diverse range of problems: concepts of containment, visual motion planning, associating words, metaphorical transfer, learning mechanics, and the acquisition of syntax.
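The core of this hypothesis, that repeated successful actions concentrate on a low-dimensional surface of the full parameter space, can be illustrated with a toy example. The sketch below is purely illustrative (a hypothetical 2-joint planar arm, not data from our experiments): joint angles that achieve the same reach height become correlated, and a principal-component analysis of the successful trials reveals that they occupy an essentially one-dimensional subset of the two-dimensional joint space.

```python
import numpy as np

# Hypothetical 2-joint planar arm "reaching" so that its fingertip stays at
# a fixed height. The two joint angles are then correlated, and successful
# solutions occupy a 1-D curve inside the 2-D joint space.
rng = np.random.default_rng(0)
L1, L2 = 1.0, 1.0                      # link lengths (arbitrary units)

theta1 = rng.uniform(0.2, 1.2, 2000)   # sample shoulder angles at random
# choose the elbow angle so the fingertip height is ~1.2 (a "good" trial):
# y = L1*sin(t1) + L2*sin(t1 + t2)  =>  solve for t2
target_y = 1.2
s = (target_y - L1 * np.sin(theta1)) / L2
ok = np.abs(s) <= 1.0                  # keep only reachable configurations
theta2 = np.arcsin(s[ok]) - theta1[ok]

X = np.column_stack([theta1[ok], theta2])
X -= X.mean(axis=0)
# eigen-spectrum of the covariance: one dominant direction => ~1-D manifold
eigvals = np.linalg.eigvalsh(np.cov(X.T))[::-1]
explained = eigvals / eigvals.sum()
print("variance explained by first component: %.2f" % explained[0])
```

Noting such correlations is precisely what collapses the decision space: the learner need only search along the dominant direction rather than over all joint combinations.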

Visuo-motor discovery

Expertise is known to involve the acquisition of "chunks", which are compact representations of the input that preserve what is functionally salient. The process is often implicit: performing a task repeatedly, we fuse those aspects of the input that are correlated, so that solutions lie along some low-dimensional surface in the input space. Thus, for the task of visuo-motor learning, we discover such manifolds that correlate visual images with motor poses; the resulting low-dimensional manifold enables planning and representation of both the peripersonal space (body schema) and the extrapersonal space (cognitive map).
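The neighbourhood-stitching step can be sketched in the style of Isomap (a minimal toy version, not our actual pipeline): treat each "image" as a point, connect each to its nearest neighbours, take graph shortest-path lengths as geodesic distances, and embed those with classical MDS to recover the underlying motor parameter.

```python
import numpy as np

# Toy manifold discovery: "images" are points on a 1-D spiral in 2-D; the
# latent parameter t plays the role of the motor pose.
t = np.linspace(0.0, 3.0 * np.pi, 120)               # latent motor parameter
X = np.column_stack([t * np.cos(t), t * np.sin(t)])  # observed "images"

# 1. k-nearest-neighbour graph over the samples
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
k = 6
G = np.full_like(D, np.inf)
for i in range(len(X)):
    nn = np.argsort(D[i])[1:k + 1]
    G[i, nn] = D[i, nn]
    G[nn, i] = D[i, nn]
np.fill_diagonal(G, 0.0)

# 2. geodesic (graph shortest-path) distances by Floyd-Warshall
for m in range(len(X)):
    G = np.minimum(G, G[:, [m]] + G[[m], :])

# 3. classical MDS on the geodesics -> a 1-D manifold coordinate
n = len(X)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (G ** 2) @ J
w, V = np.linalg.eigh(B)
embedding = V[:, -1] * np.sqrt(w[-1])   # top eigenvector = chart coordinate

# the recovered coordinate should be (anti)monotone in the true parameter t
corr = abs(np.corrcoef(embedding, t)[0, 1])
print("correlation with latent parameter: %.2f" % corr)
```

The recovered one-dimensional coordinate tracks the latent pose parameter, even though the algorithm sees only raw sample distances.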


(a) Visuo-motor learning: i. take a sample of images at random poses spanning the motor space; ii. find similar images and compute local tangent neighbourhoods; iii. stitch these into a visuo-motor map, a manifold. (b) Gross motion planning: i. from this map, remove nodes that overlap with the body or an obstacle; ii. identify goal poses as images that capture the target object (white circle) in the palm; iii. plan gross motions within the non-colliding nodes, with fine motions at the contact regions. (c) Unknown robot modeling: the same process, applied to unannotated images of an unknown robot, with no prior information on geometry or kinematics and no camera calibration, models both the robot and its task space. Part of the semantics of the target object (shape and location) is represented as the contact region in this visuo-motor map.
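Once the map exists, gross motion planning reduces to graph search. A minimal sketch (hypothetical toy graph, not our robot data): nodes are sampled poses, edges join visually similar poses, colliding nodes are deleted, and breadth-first search finds a pose sequence to any goal node.

```python
from collections import deque

# Toy visuo-motor map: adjacency between sampled poses (hypothetical).
edges = {
    0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4],
    4: [3], 5: [2, 6], 6: [5, 7], 7: [6],
}
colliding = {3}        # poses whose image overlaps the obstacle: removed
goal_poses = {4, 7}    # images that capture the target object in the palm

def plan(start):
    """BFS over the non-colliding nodes; returns a pose sequence or None."""
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] in goal_poses:
            return path
        for nxt in edges[path[-1]]:
            if nxt not in seen and nxt not in colliding:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None

path = plan(0)
print(path)   # goal 4 is unreachable once node 3 is removed, so 7 is found
```

Note that deleting node 3 blocks one goal entirely, so the planner routes to the alternative goal; no geometric model of the arm is consulted at any point.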

A particular focus of this work is relating concepts acquired from sensory data to language (both at the word-learning and syntax levels). This involves using computer vision to discover patterns in the input, robotics to explore feasible actions in that space, and natural language processing to map these to language.

Spatial structure discovery

As the agent moves, its percepts change. By creating a manifold as in the above case, one may create a non-metric analog of the ambient space. Rotation at a point results in a loop on this manifold (place cells), and looking in the same direction from different parts of space creates a similar set of images (head-direction cells).

The symbol for "in". At its semantic pole is an image schema (a generative classifier): a function with two arguments, based on a distribution over the visual angle. The association of this schema with the linguistic unit "in" is learned from co-occurring language.


Further, a spatial relation with respect to an object A is interpreted as the portion of this visual space occupied by A. If we consider such perceptual signatures for a number of objects, we find that for some of them the object A occupies a full rotation in the visual space. The agent recognizes this cluster as a special case: "containment". Babies show sensitivity to containment from 2 months onwards, and this relationship generalizes (becomes more schematized or abstract) by six months, when they can distinguish the relationship independently of the participating objects.
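The "full rotation" test can be made concrete with a toy geometric version (assumed geometry, not our actual perceptual pipeline): from the trajector's position, look toward boundary points of landmark A; if A's boundary surrounds the trajector, those viewing directions cover the whole circle.

```python
import numpy as np

def angular_coverage(viewpoint, boundary, bins=72):
    """Fraction of the visual circle occupied by the landmark's boundary."""
    d = boundary - viewpoint
    ang = np.arctan2(d[:, 1], d[:, 0])           # direction to each point
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi))
    return np.count_nonzero(hist) / bins

# landmark A: a dense ring of boundary points around the origin
theta = np.linspace(0.0, 2.0 * np.pi, 720, endpoint=False)
ring = np.column_stack([np.cos(theta), np.sin(theta)])

inside = angular_coverage(np.array([0.1, 0.0]), ring)    # trajector in A
outside = angular_coverage(np.array([3.0, 0.0]), ring)   # trajector far away
print(inside, outside)   # full coverage vs. a small angular span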

Subsequently, with increasing sensitivity to the sounds of language and an ability to parse word boundaries (around 10-14 months), the agent may be able to map this concept of containment to a sound (or text), as in the symbol "in".

Mapping to language

However, mapping such a perceptual structure to language is problematic. Different languages carve up the space of spatial relations in different ways. Also, with increasing exposure to language, the mental model of a lexical item (the image schema) itself changes. One of the objectives of this work is to trace this ontogenetic evolution in the image schema of containment relations.

In this work, we demonstrate the process via a computational simulation based on a simple video (modeled on Heider-Simmel). We consider how an early learner may map prior perceptual notions for object names ("ball", "square"), landmarks ("box"), and relations ("in") into one of several languages, based on adult descriptions.

Symbolic unit discovery: The system stabilizes its fragile perceptual schema via interaction with others and mapping to language. It considers co-occurring commentaries by a number of adults. All lexical items uttered in sentences co-occurring with containment situations are candidates for association. We use a mutual information measure for association, and discover that in, into, and inside are the words most strongly associated with full containment. Thus we learn a symbol (in the sense of cognitive grammar) with both phonological and semantic poles. Similar grammars are also learned for Hindi.
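The association step can be sketched as follows (invented mini-corpus; the real system uses many adult commentaries): score each word by pointwise mutual information with containment scenes, and the spatial prepositions emerge at the top.

```python
import math
from collections import Counter

# Invented toy commentaries, each tagged with whether the scene showed
# full containment.
commentaries = [
    ("the circle moves into the box", True),
    ("the square is in the box", True),
    ("the circle is inside the box", True),
    ("the square moves towards the box", False),
    ("the circle is near the square", False),
    ("the square bounces off the box", False),
]

word_count = Counter()
word_contain = Counter()
n_contain = sum(1 for _, c in commentaries if c)
for sent, contained in commentaries:
    for w in set(sent.split()):        # presence per commentary
        word_count[w] += 1
        if contained:
            word_contain[w] += 1

n = len(commentaries)
def pmi(w):
    """log p(w, contain) / (p(w) p(contain)); higher = stronger link."""
    joint = word_contain[w] / n
    if joint == 0:
        return float("-inf")
    return math.log(joint / ((word_count[w] / n) * (n_contain / n)))

ranked = sorted(word_count, key=pmi, reverse=True)
print(ranked[:3])   # the containment prepositions score highest
```

High-frequency function words like "the" score near zero because they occur equally in both kinds of scene; only words selective for containment get a high score.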

The discovered symbol has two argument slots, {tr, lm}, and a function that discriminates whether a given spatial situation between them is [IN] or not. Conversely, on hearing the word "in", the schema can also generate a visualization, by imagining a landmark that subtends the appropriate visual angle.
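This two-way use of a symbol can be sketched as a tiny class (an illustrative API invented here, not the actual implementation): the same semantic pole both classifies a given (tr, lm) configuration and, in the generative direction, imagines a configuration that satisfies it.

```python
import math
import random

class ContainmentSymbol:
    """Sketch of the learned symbol for "in": phonological pole is the word
    form; semantic pole is a classifier over a trajector/landmark pair,
    usable in both directions."""
    phonology = "in"

    def classify(self, tr, lm):
        """Is trajector point tr inside the disc landmark lm = (cx, cy, r)?"""
        (x, y), (cx, cy, r) = tr, lm
        return (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2

    def imagine(self, seed=0):
        """On hearing "in": invent a landmark and a trajector it contains."""
        rng = random.Random(seed)
        lm = (0.0, 0.0, 1.0)                     # imagined disc landmark
        ang = rng.uniform(0.0, 2.0 * math.pi)
        rad = rng.uniform(0.0, 0.9)              # strictly inside the disc
        tr = (rad * math.cos(ang), rad * math.sin(ang))
        return tr, lm

sym = ContainmentSymbol()
tr, lm = sym.imagine()
print(sym.phonology, sym.classify(tr, lm))   # imagined scenes satisfy the schema
```

The point of the sketch is only that discrimination and visualization share one schema; the real semantic pole is a learned distribution over visual angle, not a hard-coded disc test.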

Phrase structure discovered from an untagged corpus, using the ADIOS algorithm [Solan/Edelman:2002].

Symbolic composition: symbolization for "in the box", with one argument slot (the trajector) left free.

Discovering syntactic constructions (symbolic composition): Subsequently, the system starts to look at the patterns of tokens that appear in the same commentary. Several constructions are found to co-occur frequently with containment situations, e.g. "the {circle|big square} moves into the box". These constructions are then associated with containment.

We can now recognize synonyms using this construction: if someone says "the big block goes into the box", then by comparing with the visual image we can recognize "big block" as a synonym for "big square". At this stage, some pronominal anaphora are also discovered as polysemies ("it", "them", "each other").

Meaning enrichment and metaphor: Once such language constructions are known, we can identify them in novel text, without co-occurring perceptual input. This is how most words are learned.

For example, the construction "in the X" has X as the container. In the large Brown corpus, we find a number of tokens appearing in this position, which tells us that these tokens are acting as containers. However, by looking at the object classes these tokens come from, we can identify that many are not instances of direct spatial containment. Hence we gradually extend the meaning to include conventionalized metaphorical extensions; the semantic pole is enriched with containers such as time ("in the nineties") or group ("in the team").
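The slot-filler harvesting step is simple to sketch (invented mini-corpus and a hand-labelled class table standing in for the Brown corpus and its object classes): extract every X from "in the X", then group the fillers by class to separate literal containers from conventionalized extensions.

```python
import re
from collections import Counter

# Invented mini-corpus; the actual work used the Brown corpus.
text = """He put the key in the box. She grew up in the nineties.
He played in the team. The letter lay in the drawer.
They met in the nineties. She kept the coins in the box."""

# harvest the X slot of the construction "in the X"
fillers = Counter(re.findall(r"\bin the (\w+)", text.lower()))

# hypothetical object classes for the fillers we happen to see
classes = {"box": "spatial", "drawer": "spatial",
           "nineties": "time", "team": "group"}
by_class = Counter(classes[w] for w in fillers.elements())
print(fillers.most_common(), by_class)
```

Fillers outside the spatial class are the signal that the schema must be extended metaphorically rather than rejected.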

In terms of building AI systems, this work suggests that scaling up to human-like capabilities may not be possible by hand-coding knowledge; such unsupervised approaches for learning schemas may be more scalable, though they require many more datasets with co-occurring language.

An associated problem I am looking at involves discovering design symbols by exploring design spaces. Here, good solutions lie on manifolds, since many design parameters are inter-related (as the required strength increases, both width and height may go up). This is fundamental to acquiring the tacit knowledge that underlies many types of design decisions.

Select Publications

Visuo-motor learning on manifolds (path)

Semantically-driven language acquisition (containment)

Dynamic attention models

Theory of Mind

Word learning in 3D scenes

Cognitive models in design


Other interests: Hands-on learning in schools





Update: Oct 2009