Visual Attention for Object Detection - A Computational Model

-Shubham Tulsiani

Mentor - Prof. Amitabha Mukherjee

ABSTRACT

Object detection is a central problem in Computer Vision. One of the current challenges in this area is to design methods that are both reliable and fast. Many such methods rely on a computationally inexpensive preselection step that limits the search to specific regions of the image. When humans look at an image, they tend to focus on only some parts of it. Inspired by this, a popular approach to the preselection step is to model human Visual Attention and preselect only those regions to which humans are likely to direct their attention.

In this work, we build a computational model of Visual Attention for the specific task of object detection by combining the various cues that guide our attention. We then analyse how well it preselects the regions containing the target object and how well it predicts the areas where humans would look while performing a similar task.


METHODOLOGY IN BRIEF

To develop this model of Visual Attention, we combined cues from both top-down and bottom-up mechanisms that guide attention. The top-down cues we considered were contextual cues and maps based on the features of the target object. We used saliency maps to extract the bottom-up information in the image.
We selected these cues to build the attention map because it is natural to assume that they guide our attention in a search task. Feature cues matter because, when searching for a rope, we tend to focus on long, thin, coiled objects in the visual field.
The relevance of contextual cues can be seen from the fact that, when searching for a car in an image of a road, we tend to ignore the sky.
Saliency is also important in guiding our attention, as we tend to focus more on objects that 'stand out' from their background.

The Combined Model

We obtained saliency, context and feature based maps for our test images. We then normalised these maps, combined them in various ratios, and tested each combination on our dataset to find the ratios that give the best results. We expected that combining these cues would give a better model of Visual Attention than any one of them individually.
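
The normalise-and-combine step can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: it assumes min-max normalisation and a weighted sum, neither of which is specified above, and the names normalise, combine and weight_grid are hypothetical.

```python
import numpy as np

def normalise(m):
    """Rescale a map linearly to [0, 1]; a constant map becomes all zeros."""
    m = np.asarray(m, dtype=float)
    span = m.max() - m.min()
    if span == 0:
        return np.zeros_like(m)
    return (m - m.min()) / span

def combine(saliency, context, feature, weights):
    """Weighted sum of the three normalised maps (assumed combination rule).

    The weights are rescaled to sum to 1, so the result stays in [0, 1].
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    maps = [normalise(saliency), normalise(context), normalise(feature)]
    return sum(wi * mi for wi, mi in zip(w, maps))

def weight_grid(step=0.25):
    """All (w_saliency, w_context, w_feature) triples on a grid that sum to 1."""
    n = int(round(1 / step))
    return [(i / n, j / n, (n - i - j) / n)
            for i in range(n + 1) for j in range(n + 1 - i)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-ins for the three maps of one image.
    sal, ctx, feat = (rng.random((4, 6)) for _ in range(3))
    for w in weight_grid(0.5):
        attention = combine(sal, ctx, feat, w)
        print(w, attention.shape)
```

Each triple returned by weight_grid is one candidate ratio; a search over ratios would score the map produced for each triple on the dataset and keep the best one.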

Testing the Model

Given an image, the model produced an Attention Map in which higher values indicate a higher likelihood of a region being focussed upon. To evaluate our model, we considered the top 10% and 20% of the image, ranked by Attention Map value, and had the following criteria in mind -
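
Selecting the top 10% or 20% of an Attention Map can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, and the bounding-box coverage measure is an assumed example of a check, not necessarily one of the criteria used in this work.

```python
import numpy as np

def top_fraction_mask(attention_map, fraction):
    """Boolean mask marking the pixels with the highest attention values.

    Exactly round(fraction * number_of_pixels) pixels are selected
    (at least one); ties are broken arbitrarily.
    """
    a = np.asarray(attention_map, dtype=float)
    k = max(1, int(round(fraction * a.size)))
    # After partitioning, the last k positions hold the k largest values.
    idx = np.argpartition(a.ravel(), a.size - k)[a.size - k:]
    mask = np.zeros(a.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(a.shape)

def box_coverage(mask, box):
    """Fraction of a (row0, row1, col0, col1) box that the mask covers.

    Assumed example measure; the box stands for a target-object annotation.
    """
    r0, r1, c0, c1 = box
    return float(mask[r0:r1, c0:c1].mean())

if __name__ == "__main__":
    # Synthetic map whose values increase from top-left to bottom-right.
    attention = np.arange(100, dtype=float).reshape(10, 10)
    for fraction in (0.10, 0.20):
        mask = top_fraction_mask(attention, fraction)
        print(fraction, int(mask.sum()), box_coverage(mask, (8, 10, 0, 10)))
```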

A Sample Output

IMPORTANT RESULTS

Links to Project Resources

Key References and Code Sources
