Computer Vision @ IIT Kanpur

CV-IITK logo

Welcome to the homepage of the Computer Vision Group at IIT Kanpur. We are a group of faculty and students working on exciting problems in Computer Vision and applied Machine Learning, as well as at their intersection with Signal Processing and Robotics. We are primarily interested in the research problems and general directions given below, but are also adaptable and receptive to interesting new problems that may arise.
This webpage is under construction, so keep checking back for more information.

Vision and Language

Progress in visual and textual processing and understanding has traditionally happened in relatively distinct threads. More recently, vision and language methods have been combined towards applications such as image captioning and visual question answering. E.g., the image on the left could be captioned automatically as 'A dog chasing a ball', or a question could be posed about that image, such as 'What color ball is the dog chasing?', with 'white' as a possible answer. We are interested in such problems, where the complementarity of vision and language models is exploited and novel algorithms and problems are designed to address relevant challenges and applications.
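The input/output contract of captioning and VQA can be illustrated with a deliberately naive sketch: real systems use learned vision encoders and language decoders, but the interface is the same. The detections, attributes, and templates below are hypothetical stand-ins for what an object detector would produce.

```python
# Toy illustration of the captioning / VQA framing. All labels and the
# caption template are made-up placeholders, not a real system's output.

# Pretend these came from an object detector run on the image.
detections = [
    {"label": "dog", "attribute": None},
    {"label": "ball", "attribute": "white"},
]

def caption(dets):
    """Fill a fixed template from detected objects (toy captioner)."""
    return f"A {dets[0]['label']} chasing a {dets[1]['label']}"

def answer(dets, question):
    """Answer a 'what color X' question by attribute lookup (toy VQA)."""
    for d in dets:
        if d["label"] in question and d["attribute"]:
            return d["attribute"]
    return "unknown"

print(caption(detections))                                        # A dog chasing a ball
print(answer(detections, "What color ball is the dog chasing?"))  # white
```

Learned models replace both the hand-written template and the attribute lookup, but the mapping from image (plus optional question) to text is the same.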

Face and Human Analysis

Visual data is growing at a very high rate: everyone has a camera in their pocket and an internet connection to share pictures and videos. Most of such human-generated visual data has, in turn, humans as its main subjects. Hence, analysis and understanding of human-centered visual data is an important part of Computer Vision. We are interested in many problems focusing on humans, such as (i) facial analysis: predicting identity, emotions, and intent from faces, (ii) human attribute prediction: the kind of clothes and accessories a person is wearing, (iii) pose estimation, and (iv) action/activity prediction.

Human Behavior Analysis

Our research direction on human behavior analysis lies at the intersection of Computer Vision, Signal Processing, and Machine Learning. Human behavior is inherently multimodal, and hence requires combining information from other modalities (speech or language, for example) with vision. Through the confluence of these techniques, our goal is to provide a quantitative understanding of individual, group, and social human behavior in domains such as media, education, and health.
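One common way to combine modalities is late fusion: each modality produces its own per-class scores, which are then merged by a weighted average. The sketch below is a minimal illustration under assumed inputs; the scores, class names, and weights are invented placeholders, and a real system would obtain them from learned per-modality classifiers.

```python
# Minimal late-fusion sketch: weighted average of per-class score dicts,
# one per modality. All numbers below are illustrative assumptions.

def late_fusion(scores_by_modality, weights):
    """Combine per-modality class scores into one fused score per class."""
    classes = next(iter(scores_by_modality.values())).keys()
    return {
        c: sum(weights[m] * scores_by_modality[m][c] for m in scores_by_modality)
        for c in classes
    }

# Hypothetical behavior scores from three cues.
scores = {
    "vision": {"engaged": 0.7, "distracted": 0.3},
    "speech": {"engaged": 0.6, "distracted": 0.4},
    "language": {"engaged": 0.8, "distracted": 0.2},
}
weights = {"vision": 0.5, "speech": 0.3, "language": 0.2}

fused = late_fusion(scores, weights)
print(max(fused, key=fused.get))  # engaged
```

Late fusion is only one design point; early fusion (combining features before classification) and learned fusion are common alternatives.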

Perception/Vision for Robotics

Today, robots are used to perform challenging tasks that were not possible a few years ago because of limited computational and sensor resources. In order to perform these complex tasks, robots need to sense and understand the environment around them. Depending upon the task at hand, robots are often equipped with different sensors to perceive their environment. Two important categories of perception sensors mounted on a robotic platform are (i) range sensors (3D/2D lidars, radars, sonars, etc.) and (ii) cameras (perspective, stereo, omnidirectional, etc.). With the recent advancements in these sensing technologies, the capabilities of robots to perform difficult tasks have been greatly extended. The Computer Vision Group at IIT Kanpur is interested in research problems related to sensing for robotics applications. One such example is autonomous navigation, where techniques from computer vision are used for robot localization and for obstacle detection and classification.
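A basic building block of range-sensor processing can be sketched as follows: a 2D lidar returns distances at known bearings, which are converted to Cartesian points in the robot frame, and points within some safety radius are flagged as nearby obstacles. The scan values and the 1.0 m threshold here are illustrative assumptions, not parameters of any particular platform.

```python
import math

# Toy 2D lidar sketch: polar returns -> Cartesian points -> nearby-obstacle
# flagging. Scan values and the threshold are made-up for illustration.

def scan_to_points(ranges, angle_min, angle_increment):
    """Convert polar lidar returns to (x, y) points in the robot frame."""
    return [
        (r * math.cos(angle_min + i * angle_increment),
         r * math.sin(angle_min + i * angle_increment))
        for i, r in enumerate(ranges)
    ]

def nearby_obstacles(points, max_dist=1.0):
    """Keep points closer than max_dist metres to the robot."""
    return [p for p in points if math.hypot(p[0], p[1]) < max_dist]

# A made-up 5-beam scan sweeping -30 to +30 degrees.
ranges = [2.5, 0.8, 0.6, 2.0, 3.1]
pts = scan_to_points(ranges, math.radians(-30), math.radians(15))
print(len(nearby_obstacles(pts)))  # 2
```

Real pipelines add clustering, tracking, and classification on top of such point sets, and fuse them with camera data for semantic understanding.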

Assistive Computer Vision

In this research direction, we investigate the scope of computer vision for assisting human beings in day-to-day life, and develop algorithms for a class of related problems. This area is becoming increasingly practical with the popularity of wearable cameras (e.g., Google Glass) and lightweight computing devices (e.g., mobile phones). Our use case centers around a wearable or portable camera capturing the surrounding world as still images or video streams. The goal is to provide appropriate inputs to the human to help in a specific set of tasks. For instance, a visually challenged person uses a wearable camera to perceive the surroundings. The automated understanding of the content of the images/videos is then used as an alternate input to enrich the interaction with the external world. For example, this person can read a text or sign, locate a specific object of interest, anticipate the pose of an object for manipulation, recognize the identity of people nearby, and interpret their facial expressions.