Visual Attention for Object Detection: A Computational Model

Shubham Tulsiani (Y9574)
Mentor: Prof. Amitabha Mukherjee

ABSTRACT

Object detection is a very relevant field in Computer Vision. One of the current challenges in this area is to design reliable and fast methods for it. Many such methods rely on a computationally inexpensive pre-selection step which limits the search to specific regions of the image. When humans see an image, they tend to focus on only a part of it. Inspired by this, a popular approach to the pre-selection step is to model human Visual Attention and preselect only those regions to which humans are likely to direct their attention. In this work, we build a computational model of Visual Attention for the specific task of object detection by combining the various cues which guide our attention. We further analyse its success in preselecting the regions containing the target object and its efficiency in predicting the areas where humans would look while performing a similar task.

Contents

1 Introduction
2 Saliency Driven Guidance
3 Context Based Cues
4 Target Object Feature Maps
5 Previous Work
6 Methodology
  6.1 The Human Fixation Data
  6.2 Combining the bottom-up and top-down cues
  6.3 Determining the Success of the Model
    6.3.1 Success as Pre-selection in Object Detection
    6.3.2 Success in predicting Human Fixations
  6.4 Differences in Methodology from Previous Works
7 Results
  7.1 Results of the Individual Models for Positive Stimuli
  7.2 Negative Stimuli
  7.3 Results on Individual Images
    7.3.1 Positive Stimuli
    7.3.2 Negative Stimuli
  7.4 Accuracy for Detecting Objects
  7.5 Comparison with Previous Results
  7.6 Results with 5% Threshold
8 Observations and Conclusions
  8.1 A possible Pre-selection Step
  8.2 High Efficiency in Predicting Fixations
  8.3 Single Source vs Combined Model
  8.4 Positive vs Negative Stimuli
9 Scope for Future Work
10 Acknowledgements
11 References

1 Introduction

Humans regularly perform tasks that require visual guidance, e.g. searching for a particular object, and have thus developed a cognitive mechanism which allows them to perform such tasks efficiently.
When humans guide their attention to something, they focus on some stimuli over others; a model of Visual Attention would therefore allow us to preselect some regions of the image, and would find applications in many areas of Computer Vision.

There are various schools of thought when it comes to models of human visual attention. One advocates the importance of bottom-up stimuli (e.g. intensity, colour) in driving our attention. This approach is task independent and not memory based; the claim is that we are more likely to focus on objects which 'stand out' from their background. One of the most widely used models for bottom-up saliency is the Itti-Koch model [1]. On the other hand, some point out the relevance of various top-down aspects of the stimuli in driving our attention. These task-dependent factors include, amongst others, context-based and feature-based cues. For instance, given the task of searching for a pedestrian in an image, we are likely to focus on the road despite there being salient objects in the sky. Similarly, while searching for a car, we call upon our knowledge of what a car looks like and direct our attention to objects similar to it rather than to objects similar to, say, a pole. In our work, we have considered only context-based and feature-based cues amongst the top-down influences which drive our attention.

We develop a model of attention which combines the above influences and use it as a pre-selection step for object detection. We also study the consistency between its predictions and actual human fixations during pedestrian search, to explore the question of which of the above-mentioned features is the most dominant in driving our attention.

2 Saliency Driven Guidance

This captures the bottom-up aspects of the stimuli which guide our attention. It does not take into account any knowledge about the object being searched for, and is a task-independent factor. We take into account the basic features (orientation, colour and intensity at various scales) to determine the saliency of each region in the image, and associate a high probability of attention being guided to regions with a high saliency score. We have used the Itti-Koch model [1] to compute the saliency map of an image, and we combine it with the other, top-down influences to predict the human attention map.
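To make the flavour of this bottom-up computation concrete, here is a minimal centre-surround sketch in Python. It is only an illustration, not the Itti-Koch implementation used in this work: the full model also builds orientation (Gabor) channels, works over a multi-scale pyramid and applies its own normalisation operator, and the channel definitions and sigma values below are assumptions made for the sketch.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def centre_surround(channel, centre_sigma=2.0, surround_sigma=8.0):
    """Crude centre-surround contrast for a single feature channel."""
    centre = gaussian_filter(channel, centre_sigma)
    surround = gaussian_filter(channel, surround_sigma)
    return np.abs(centre - surround)

def toy_saliency_map(rgb):
    """Toy bottom-up saliency over intensity and colour-opponency
    channels (assumed definitions; the real Itti-Koch model adds
    orientation channels and multiple scales)."""
    rgb = rgb.astype(float) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    intensity = (r + g + b) / 3.0
    rg = r - g                 # red-green opponency
    by = b - (r + g) / 2.0     # blue-yellow opponency
    saliency = sum(centre_surround(c) for c in (intensity, rg, by)) / 3.0
    # rescale to [0, 1], matching the normalisation used later in Section 6.2
    return (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-12)
```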
3 Context Based Cues

Here we try to account for the fact that we do not focus our attention on the road while searching for a cloud, or on the sky while looking for a pedestrian. When we see an image, we immediately understand the scene and accordingly select regions where the object may be present. This information is independent of the saliency of the regions or their resemblance to the target object. Contextual cues are object driven and experience dependent, and thus the model which predicts them has to learn them over a training set. The contextual model we use is provided by Torralba [5], and we trained it using the LabelMe dataset [3]. While learning, 10 random crops of each image are generated so that the target object locations are uniformly distributed; the model then learns the correspondence between object location and global features. We trained it on a set of around 600 images for cars, pedestrians and trees.

4 Target Object Feature Maps

It is natural that we focus on objects which have features similar to the target object, e.g. thin, black objects while searching for snakes. The resemblance to the target object is computed in the feature map, which we later combine with the other cues. Computing the feature maps requires knowledge about the target object, e.g. its shape, and this is learnt over various training images. We used the model in [4] to compute feature maps after training it for pedestrians and cars on over 100 images. The training images were from the LabelMe dataset [3]. We also used a set of images and corresponding feature maps from [5] (based on the Dalal and Triggs detector) to compare our combined model with the one in [5].

5 Previous Work

A lot of work has previously been done on modelling human visual attention, much of it using only bottom-up factors ([6]) or only top-down factors ([7]) to predict human attention. We were inspired to combine top-down and bottom-up features by the work in [8], and we have based our work along the lines of [5]. It is important to note that the agreement among human observers' fixations is higher in goal-dependent tasks (e.g. target object search), even when the target is absent [5], which is why testing the model against data for goal-dependent tasks gives us a reliable measure of performance. It has been seen that a combined model performs better at modelling human attention in search tasks than any of the single source models ([5] and [2]). It has also been observed that, among the single source influences, saliency based maps are the least effective for search tasks [5], with an accuracy of around 76% [6]. We seek to verify these findings with our model and also to compare its efficiency with that of the model in [5].

6 Methodology

6.1 The Human Fixation Data

We used the fixation data for humans who had been asked to search for pedestrians in various images. These image stimuli and the human fixations are available in [5]. This is the same data used to determine the success of the model in [5], with which we will compare our results. Since the data was for pedestrian search, we also trained our model for the same task before checking its accuracy in predicting regions for human fixations.

6.2 Combining the bottom-up and top-down cues

We obtained saliency, context and feature based maps for our test images. Before combining them, we normalise these maps to matrices with each pixel represented by a value between 0 and 1. We checked various possible ways of combining the three maps: weighted addition and weighted multiplication, over three sets of weights assigned to the maps. This is elaborated in the results section. We also checked the performance of the single source models (Context, Object Features and Saliency) on the above dataset.
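A minimal sketch of this combination step, assuming the three maps are already aligned, equal-sized float arrays. The report does not pin down the exact form of 'weighted multiplication'; the sketch below reads it as a product of maps raised to their weights, in the style of [5], so that choice is an assumption.

```python
import numpy as np

def normalise(m):
    """Rescale a map so every pixel lies in [0, 1] (Section 6.2)."""
    m = m.astype(float)
    return (m - m.min()) / (m.max() - m.min() + 1e-12)

def combine_maps(features, context, saliency,
                 weights=(0.60, 0.25, 0.15), multiplicative=True):
    """Fuse feature, context and saliency maps into one attention map.
    `weights` is the features-context-saliency ratio; 60-25-15 performed
    best in our experiments (Section 7.1)."""
    maps = [normalise(m) for m in (features, context, saliency)]
    if multiplicative:
        # 'weighted multiplication' read as a product of maps raised to
        # their weights (an assumption, following the style of [5])
        combined = np.ones_like(maps[0])
        for m, w in zip(maps, weights):
            combined = combined * (m ** w)
    else:
        # weighted addition of the normalised maps
        combined = sum(w * m for m, w in zip(maps, weights))
    return normalise(combined)
```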
6.3 Determining the Success of the Model

Given an image, the model produces an Attention Map, with higher values indicating a higher likelihood of those regions being focussed upon. To determine the efficiency of our model, we selected the top 10% and top 20% of the image and applied the following criteria.

6.3.1 Success as Pre-selection in Object Detection

For the model to be implemented as a pre-selection step in object detection, the target object should lie in the 10% or 20% region it selects from the given image. To determine the accuracy, one should measure the percentage of target objects in the selected region.

6.3.2 Success in predicting Human Fixations

To estimate how successful our model is in mimicking human attention, we measured the percentage of fixations which lie in the selected region (a code sketch of this measurement follows at the end of this section). This was done separately on positive stimuli (images containing the target object) and negative stimuli (images without the target object). Of the various combinations tested on positive stimuli, the best was then tested on negative stimuli.

6.4 Differences in Methodology from Previous Works

The dataset and single source models that we have used are similar to those in [5], so it is important to highlight the differences in the methodology of the two. The points where our approach differs from [5] are:
• Before combining the maps, we normalise them, so that we get a better estimate of the relative importance of the various features; this is not possible when the maps have different scales.
• In [5], Features, Context and Saliency are combined in the ratio 85-10-5, whereas we combine them in various ratios. We found the ratio 60-25-15 most appropriate, as it functions well for positive stimuli and does not go awry because of false alarms in negative stimuli.
• In [5], the authors also use a Context Oracle instead of the Context Map for some models to obtain higher accuracy. This oracle gives a context map based on data about where human fixations lie in the specific image. We have not used this oracle, as it is not a computational model of context and cannot be applied to cases where the human fixations are not known.
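Before turning to the results, here is a hedged sketch of the fixation-based success measurement of Section 6.3: select the top fraction of the attention map and count the fixations landing inside it. The (row, column) fixation format is an assumption made for the sketch.

```python
import numpy as np

def top_fraction_mask(attention_map, fraction=0.10):
    """Boolean mask over the top `fraction` most-attended pixels."""
    threshold = np.quantile(attention_map, 1.0 - fraction)
    return attention_map >= threshold

def fixation_hit_rate(attention_map, fixations, fraction=0.10):
    """Share of human fixations, given as (row, col) pairs, that fall
    inside the selected region (criterion of Section 6.3.2)."""
    mask = top_fraction_mask(attention_map, fraction)
    hits = sum(1 for r, c in fixations if mask[r, c])
    return hits / len(fixations)
```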
7 Results

7.1 Results of the Individual Models for Positive Stimuli

The test cases for positive stimuli comprised 20 images with a total of 411 fixations. We ran the three single source models and the various combinations of the combined model on these, and measured success by the number of fixations present in the selected 10% or 20% region. The results were as follows:

Model                                        | Fixns. in 10% | Fixns. in 20%
Single Source - Saliency                     | 238           | 299
Single Source - Context                      | 158           | 336
Single Source - Features                     | 310           | 346
Combined: 85-10-5 (Weighted Addition)        | 322           | 374
Combined: 85-10-5 (Weighted Multiplication)  | 321           | 373
Combined: 60-25-15 (Weighted Addition)       | 328           | 378
Combined: 60-25-15 (Weighted Multiplication) | 329           | 377
Combined: 40-35-25 (Weighted Addition)       | 324           | 373
Combined: 40-25-35 (Weighted Multiplication) | 322           | 375

The best performance came from the models combining the cues in the ratio 60-25-15 (features-context-saliency). We tested this model on a set of negative stimuli, choosing the weighted multiplication variant because that is the technique used in [5], which makes the comparison meaningful. All further analysis was done on this computational model.

[Figure: Comparison between Models]

7.2 Negative Stimuli

We also tested the combined model on a set of negative stimuli: 10 images with 228 recorded fixations. For the 10% threshold, 149 fixations were in the selected region; for the 20% threshold, this increased to 197.

[Figure: Comparison of performance on negative stimuli]

7.3 Results on Individual Images

7.3.1 Positive Stimuli

[Figures: model outputs on individual positive stimuli]

7.3.2 Negative Stimuli

[Figures: model outputs on individual negative stimuli]

7.4 Accuracy for Detecting Objects

As can be seen from the images of the positive stimuli, the combined model always selects a region of the image containing a pedestrian. Thus, it has 100% efficiency in retaining the target object in the threshold region for the 10% as well as the 20% threshold, and can definitely be used as a pre-selection step.

7.5 Comparison with Previous Results

In this section we focus on the results regarding the model's efficiency as a model of visual attention, i.e. the percentage of fixations which lie in the selected region. It can be seen that the results obtained by our model are better than those in [5].

[Figure: Comparison of Results]

7.6 Results with 5% Threshold

To be really effective as a preprocessing step, the model should keep a reasonably high accuracy even when a very small region of the image (5%) is selected. We tested our model with only a 5% threshold on a set of 20 positive stimuli. In 18 of these, the model selected regions containing pedestrians; it gave bad results in only 2 cases. Further, 284 fixations out of 411 (69.1%) were present in the threshold region, which is better than the results in [5] with a 10% threshold. Thus, our model remains very effective even when it has to select a very small portion of the image, and can be used in scenarios where processing only a small region of the image is computationally permissible.

[Figure: Images computed for 5% threshold]

8 Observations and Conclusions

This work throws some light on various aspects of object recognition and models of visual attention, and some aspects of it, such as its high efficiency, are worth noting.

8.1 A possible Pre-selection Step

We note that in all the positive stimuli, the target object is present in the threshold regions. Thus, this model can be effectively used as a pre-selection step for object detection. This work also cements the belief that models of visual attention can achieve a high degree of success in many vision-related computations.

8.2 High Efficiency in Predicting Fixations

Our model has an 80% accuracy (for the 10% threshold) and about 90% accuracy (for the 20% threshold) in predicting the regions where humans would look in a search task. It also has a reasonably high accuracy for the negative stimuli. The results we have obtained are better than those in [5] (even though we have not used the context oracle). This is the result of the changes mentioned in our methodology and of selecting the correspondingly accurate ratios. The ratios which give the highest efficiency suggest that feature-based cues carry the highest importance, followed by contextual cues.

8.3 Single Source vs Combined Model

The accuracy of our combined model is definitely better than that of any of the single source models. This supports the claim that our attention is guided by a combination of these cues and not just one of them. The performances of the single source models also verify the claim, often found in the literature, that saliency-based models are the least efficient at predicting human fixations in a search task.

8.4 Positive vs Negative Stimuli

The performance of our model is significantly lower for negative stimuli than for positive stimuli, which is consistent with the results in [5]. This is because these models assign a higher weight to the feature map, and in the absence of a target this leads to what is called a 'false alarm': the feature map is normalised per image, so even if all its values were initially low, they get scaled up.
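A toy numeric example of this false-alarm effect, reusing the per-image normalisation from the sketch in Section 6.2:

```python
import numpy as np

# Feature map on a negative stimulus: uniformly weak responses everywhere.
weak = np.array([[0.01, 0.02],
                 [0.03, 0.02]])

# Per-image normalisation stretches the range to [0, 1], so the pixel with
# value 0.03 now looks like a confident detection (1.0) even though nothing
# in the image actually resembles the target.
normalised = (weak - weak.min()) / (weak.max() - weak.min())
print(normalised)   # [[0.   0.5 ]
                    #  [1.   0.5 ]]
```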
9 Scope for Future Work

We can try to develop visual attention models for other tasks besides object detection. Also, to obtain more accurate models of visual attention, we must study which features are influential for negative stimuli and incorporate them, so that the models perform equally well on both kinds of stimuli. This is a fast-developing field of Computer Science, and these issues will no doubt be taken care of in the years to come. Some specific improvements/changes that can be made over our work are:
• Using more recently developed feature based models, which might give higher accuracy.
• Enabling the model to work in real time by using parallel processors and possibly splitting the image into chunks.
• Generalising to a model of visual attention which is applicable to tasks other than search.

10 Acknowledgements

I would like to thank Antonio Torralba for making the Context model code used in [5], the eye movement dataset, and the LabelMe dataset and its Matlab Toolbox publicly available. I am also grateful to Ankit Awasthi for his valuable feedback regarding the approach to be used for obtaining the Context Maps.

11 References

(1) L. Itti & C. Koch, Computational modelling of visual attention, Nature Reviews Neuroscience, Vol. 2, Part 3, pp. 194-204, 2001.
(2) Ankit Awasthi, Keerti Choudhary, Top Down Attentional Guidance during Visual Search, 2010.
(3) Bryan Russell, Antonio Torralba and William T. Freeman, LabelMe, http://labelme.csail.mit.edu/ [the dataset provided was used regularly in this work].
(4) Antonio Torralba, A Simple Object Detector with Boosting, http://people.csail.mit.edu/torralba/shortCourseRLOC/boosting/boosting.html
(5) Krista Ehinger, Barbara Hidalgo-Sotelo, Antonio Torralba and Aude Oliva, Modelling search for people in 900 scenes, Visual Cognition, Vol. 17, pp. 945-978. Project page: http://cvcl.mit.edu/searchmodels/
(6) N. Bruce and J.K. Tsotsos, Saliency Based on Information Maximization, Advances in Neural Information Processing Systems, 18:155-162, 2006.
(7) Antonio Torralba, Aude Oliva, Monica Castelhano and John Henderson, Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search, Psychological Review, Vol. 113, No. 4, pp. 766-786, October 2006.
(8) Sharat Chikkerur, A computational model of Visual Attention, http://www.sharat.org/project [referred for the project proposal].