Visual Attention for Object Detection

Visual Attention for Object Detection - A Computational Model

-Shubham Tulsiani

Mentor - Prof. Amitabha Mukherjee

ABSTRACT

Object detection is a central problem in Computer Vision. One of the current challenges in this area is to design detection methods that are both reliable and fast. Many such methods rely on a computationally inexpensive pre-selection step that limits the search to specific regions of the image. When humans look at an image, they tend to focus on only some parts of it. Inspired by this, a popular approach to the pre-selection step is to model human visual attention and preselect only those regions to which humans are likely to direct their attention.

In this work, we build a computational model of Visual Attention for the specific task of object detection by combining the various cues which guide our attention. We then analyse how well it preselects the regions containing the target object and how well it predicts the areas where humans look while performing a similar task.

METHODOLOGY IN BRIEF

To develop this model of Visual Attention, we combined the cues that guide attention through both top-down and bottom-up mechanisms. The top-down cues we considered were contextual cues and maps based on the features of the target object. We used saliency maps to extract the bottom-up information in the image.
We selected these cues to build the attention map because it is natural to assume that they guide our attention in a search task. Feature cues matter because, when searching for a rope, we tend to focus on long, thin, coiled objects in the visual field.
The relevance of contextual cues can be seen from the fact that when searching for a car (given an image of a road), we tend to ignore the sky.
Saliency is also important in guiding our attention, as we tend to focus more on objects which 'stand out' from their background.

The Combined Model

We obtained saliency, context and feature based maps for our test images. We then normalised these maps, combined them in various ratios, and tested each combination on our dataset to find the ratios which give the best results. We expected that combining these cues would give us a better model of Visual Attention than any one of them could individually.
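The combination step described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the text does not say which normalisation or combination rule was used, so the sketch assumes min-max normalisation to [0, 1] and a weighted sum, with maps stored as lists of rows. The function names `normalise` and `combine` and the example weights are hypothetical.

```python
def normalise(cue_map):
    """Rescale a 2-D map (list of rows) linearly to [0, 1].

    Assumption: min-max scaling; the report does not name its normalisation.
    """
    values = [v for row in cue_map for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:  # a flat map carries no information
        return [[0.0 for _ in row] for row in cue_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in cue_map]


def combine(maps, weights):
    """Weighted sum of normalised maps; weights are rescaled to sum to 1.

    Assumption: a linear combination; `weights` plays the role of the
    'ratios' mentioned in the text.
    """
    total = sum(weights)
    normed = [normalise(m) for m in maps]
    rows, cols = len(normed[0]), len(normed[0][0])
    return [
        [sum(w / total * m[r][c] for m, w in zip(normed, weights))
         for c in range(cols)]
        for r in range(rows)
    ]


# Toy 2x2 maps standing in for saliency, context and feature maps.
saliency = [[0, 2], [4, 8]]
context = [[1, 1], [3, 3]]
feature = [[5, 0], [0, 5]]
attention = combine([saliency, context, feature], weights=[1, 1, 2])
```

Trying several weight triples and scoring each resulting attention map on the dataset would correspond to the search over ratios mentioned above.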

Testing the Model

Given an image, the model produced an Attention Map, with higher values indicating a higher likelihood of those regions being focussed upon. To determine the efficiency of our model, we considered the top 10% and 20% of the image according to this map and evaluated the model against the following two criteria:

Success as Pre-selection in Object Detection
For the model to be usable as a pre-selection step in object detection, the target object should lie in the 10% or 20% region it selects from the given image. To determine the accuracy, one measures the percentage of target objects which lie in the selected region.
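A rough sketch of this pre-selection measure is given below. It rests on assumptions not stated in the text: each target object is described by a bounding box `(top, left, bottom, right)` with exclusive bottom and right edges, and an object counts as found when at least one selected pixel lies inside its box. The function names are hypothetical.

```python
def top_fraction_mask(attention, fraction):
    """Mark the `fraction` of pixels with the highest attention values.

    Ties at the threshold are all kept, so the mask can be slightly
    larger than the requested fraction.
    """
    values = sorted((v for row in attention for v in row), reverse=True)
    k = max(1, round(fraction * len(values)))
    threshold = values[k - 1]
    return [[v >= threshold for v in row] for row in attention]


def box_hit(mask, box):
    """True if any selected pixel lies inside the bounding box.

    Assumption: overlap with the box is the hit criterion.
    """
    top, left, bottom, right = box
    return any(mask[r][c]
               for r in range(top, bottom)
               for c in range(left, right))


def preselection_accuracy(samples, fraction):
    """Percentage of (attention_map, box) samples whose box is hit."""
    hits = sum(box_hit(top_fraction_mask(att, fraction), box)
               for att, box in samples)
    return 100.0 * hits / len(samples)


# Toy 2x5 attention map with two hypothetical target boxes.
att = [[0.9, 0.1, 0.2, 0.3, 0.0],
       [0.4, 0.5, 0.8, 0.6, 0.7]]
samples = [(att, (1, 2, 2, 3)), (att, (0, 0, 1, 1))]
acc_10 = preselection_accuracy(samples, 0.10)
acc_20 = preselection_accuracy(samples, 0.20)
```

A stricter criterion, such as requiring the box centre or most of the box to be selected, would change only `box_hit`.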

Success in predicting Human Fixation
To estimate how successful our model has been in mimicking human attention, we measured the percentage of fixations which lie in the selected region. This was done separately on positive stimuli (images containing the target object) and negative stimuli (images without the target object). Of the various combinations tested on positive stimuli, the best one was then tested on negative stimuli.
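The fixation measure can be illustrated with the short sketch below. It assumes that fixations are given as `(row, col)` pixel coordinates and that the selected region is the set of pixels whose attention value is at or above the cut-off for the chosen fraction; the function name `fixation_hit_rate` and the toy data are hypothetical.

```python
def fixation_hit_rate(attention, fixations, fraction):
    """Percentage of fixations that fall in the top `fraction` of the map.

    `attention` is a 2-D list of rows; `fixations` is a list of
    (row, col) pixel coordinates (an assumed data format).
    """
    values = sorted((v for row in attention for v in row), reverse=True)
    k = max(1, round(fraction * len(values)))
    threshold = values[k - 1]  # lowest value still inside the region
    inside = sum(attention[r][c] >= threshold for r, c in fixations)
    return 100.0 * inside / len(fixations)


# Toy 2x5 attention map and four hypothetical fixation points.
attention_map = [[0.9, 0.1, 0.2, 0.3, 0.0],
                 [0.4, 0.5, 0.8, 0.6, 0.7]]
fixation_points = [(0, 0), (1, 2), (1, 4), (0, 1)]
rate_10 = fixation_hit_rate(attention_map, fixation_points, 0.10)
rate_20 = fixation_hit_rate(attention_map, fixation_points, 0.20)
```

Running the same function separately over the positive and the negative stimuli would give the two figures described above.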

A Sample Output

IMPORTANT RESULTS

Our Combined Model obtained 91.7% accuracy (with a 20% threshold) in predicting the regions of human fixations, which is higher than that of previous models. We also obtained a high accuracy (80%) with a 10% threshold, whereas previous works reported 65%.

As expected, the accuracy of the combined model was higher than that of the single-source models.