Visual Attention for Object Detection - A Computational Model
ABSTRACT
Object detection is a central problem in Computer Vision. One of the current
challenges in this area is to design detection methods that are both reliable and
fast. Many such methods rely on a computationally inexpensive pre-selection step
which limits the search to specific regions of the image. When humans look at an
image, they tend to focus on only some parts of it. Inspired by this, a popular
approach to the pre-selection step is to model Human Visual Attention and preselect
only those regions to which humans are likely to direct their attention.
In this work, we build a computational model of Visual Attention for the specific
task of object detection by combining the various cues which guide our attention. We
further analyse its success in preselecting the regions containing the target object
and its ability to predict the areas where humans would look while performing a
similar task.
METHODOLOGY IN BRIEF
To develop this model of Visual Attention, we combined the cues provided by the various mechanisms (top-down and bottom-up) that guide attention.
The top-down cues we considered were Contextual cues and Target Object Feature based maps. We used saliency maps to extract the bottom-up
information in the image.
We selected these cues to build the attention map because it is natural to assume that they guide our attention in a search task. Feature cues matter because, when searching for a rope, we tend to focus on long, thin, coiled objects in the visual field.
The relevance of contextual cues can be seen from the fact that when searching for a car (given an image of a road), we tend to ignore the sky.
Saliency is also important in guiding our attention, as we tend to focus more on objects which 'stand out' from their background.
The Combined Model
We obtained saliency, context and feature based maps for our test images. We then normalised these maps, combined them in various ratios and tested
each combination on our dataset to find the ratios which give the best results. We expected that combining these cues would give
us a better model of Visual Attention than any of them could individually.
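The combination step can be sketched as follows. This is only an illustration under stated assumptions: the text above does not specify how the maps were normalised or combined, so min-max scaling to [0, 1] and a weighted sum are assumed here, and the function names `normalise` and `combine_maps` are hypothetical. Plain Python lists are used to keep the sketch dependency-free; an array library would be the more usual choice for real images.

```python
def normalise(m):
    """Min-max scale a 2-D map (a list of rows) to the range [0, 1].

    Assumption: min-max scaling; the original normalisation is not specified.
    """
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant map carries no information, so it contributes zeros.
        return [[0.0 for _ in row] for row in m]
    return [[(v - lo) / (hi - lo) for v in row] for row in m]


def combine_maps(saliency, context, feature, weights):
    """Weighted sum of the three normalised maps (all of the same size).

    `weights` is a (saliency, context, feature) ratio, e.g. (1, 1, 2).
    Assumption: a linear combination; other combination rules are possible.
    """
    w_s, w_c, w_f = weights
    total = w_s + w_c + w_f
    s, c, f = normalise(saliency), normalise(context), normalise(feature)
    return [
        [(w_s * s[i][j] + w_c * c[i][j] + w_f * f[i][j]) / total
         for j in range(len(s[0]))]
        for i in range(len(s))
    ]


# Toy 2x2 maps standing in for real saliency, context and feature maps.
attention = combine_maps(
    saliency=[[0, 2], [4, 8]],
    context=[[1, 1], [1, 1]],
    feature=[[10, 0], [0, 10]],
    weights=(1, 1, 2),
)
```

Trying several ratios then amounts to calling `combine_maps` with different `weights` tuples and keeping the one that scores best on the evaluation criteria.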
Testing the Model
Given an image, the model produced an Attention Map in which higher values indicate
a higher likelihood of those regions being focussed upon. To determine the
effectiveness of our model, we considered the top 10% and 20% of the image
(ranked by attention value) and had the following criteria in mind -
A Sample Output
For the model to be usable as a pre-selection step in object detection, the target
object should lie within the 10% or 20% region it selects from the given image.
Accuracy is therefore measured as the percentage of target objects which fall in the
selected region.
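A minimal sketch of this first criterion is given below. It assumes that the attention map is a 2-D list of values, that each target object is described by a bounding box, and that a target counts as found when any pixel of its box lies in the selected region. None of these details are stated in the text, and all function names are hypothetical.

```python
def top_fraction_mask(attention, fraction):
    """Boolean mask of the `fraction` of pixels with the highest attention.

    Ties at the threshold are all kept, so slightly more than `fraction`
    of the image may be selected.
    """
    flat = sorted((v for row in attention for v in row), reverse=True)
    k = max(1, round(fraction * len(flat)))
    threshold = flat[k - 1]
    return [[v >= threshold for v in row] for row in attention]


def target_in_region(mask, box):
    """True if any pixel of the box overlaps the selected region.

    `box` is (top, left, bottom, right) with bottom/right exclusive.
    Assumption: any overlap counts as a hit.
    """
    top, left, bottom, right = box
    return any(mask[i][j]
               for i in range(top, bottom)
               for j in range(left, right))


def detection_rate(samples, fraction):
    """Percentage of (attention_map, box) pairs whose target is selected."""
    hits = sum(target_in_region(top_fraction_mask(a, fraction), b)
               for a, b in samples)
    return 100.0 * hits / len(samples)


# Toy 2x5 attention map; the top 20% corresponds to the 2 highest pixels.
toy_map = [[0.9, 0.1, 0.1, 0.1, 0.1],
           [0.1, 0.2, 0.3, 0.4, 0.5]]
rate = detection_rate([(toy_map, (0, 0, 1, 1)),   # box on the 0.9 pixel
                       (toy_map, (1, 1, 2, 3))],  # box on low-valued pixels
                      fraction=0.2)
```

A stricter hit rule, such as requiring the box centre or a minimum share of the box to be selected, would only change `target_in_region`.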
To estimate how successful our model has been in mimicking human attention, we
measured the percentage of human fixations which lie in the selected region. This was
done separately on positive stimuli (images containing the target object) and negative
stimuli (images without the target object). Of the various combinations tested on
positive stimuli, the best one was then tested on negative stimuli.
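The second criterion can be sketched in the same spirit. The snippet assumes that the selected region is available as a boolean mask, that fixations are given as (row, column) pixel coordinates, and that fixations are pooled over all stimuli in a group before the percentage is taken. The function name and data layout are hypothetical, not taken from the project code.

```python
def fixation_percentage_by_group(stimuli):
    """Percentage of fixations inside the selected region, per stimulus group.

    `stimuli` is a list of (mask, fixations, has_target) triples, where
    `mask` is a 2-D list of booleans and `fixations` a list of (row, col).
    Assumption: fixations are pooled within each group, not averaged per image.
    """
    counts = {"positive": [0, 0], "negative": [0, 0]}  # [inside, total]
    for mask, fixations, has_target in stimuli:
        group = "positive" if has_target else "negative"
        counts[group][0] += sum(1 for r, c in fixations if mask[r][c])
        counts[group][1] += len(fixations)
    return {
        group: (100.0 * inside / total if total else None)
        for group, (inside, total) in counts.items()
    }


# Toy example: one positive and one negative stimulus sharing the same mask.
selected = [[True, False],
            [False, False]]
scores = fixation_percentage_by_group([
    (selected, [(0, 0), (0, 1), (0, 0), (1, 1)], True),
    (selected, [(1, 0), (0, 0), (1, 1), (0, 1)], False),
])
```

Averaging per-image percentages instead of pooling would weight every stimulus equally regardless of how many fixations it received; which of the two was used in the project is not stated.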
IMPORTANT RESULTS
Links to Project Resources
Key References and Code Sources
- Shubham Tulsiani