Facial Expression Classification using Visual Cues and Language

Abhishek Kar
Advisor: Dr. Amitabha Mukerjee
{akar,amit}@iitk.ac.in
Department of Computer Science and Engineering, IIT Kanpur

Abstract

In this work, we attempt to tackle the problem of correlating language with facial expressions in order to learn, in an unsupervised manner, the adjectives used to describe emotions. The problem is divided into two parts: i) a supervised method for classifying facial emotions using visual cues, and ii) an unsupervised algorithm to extract keywords from commentary on videos depicting facial expressions. We use a method based on Gabor filters and Support Vector Machines to classify emotions into 7 categories: Anger, Surprise, Sadness, Happiness, Disgust, Fear and Neutral. We explore various dimensionality reduction and feature selection methods such as PCA and AdaBoost. The Extended Cohn-Kanade database [7] is used for testing the algorithms. We achieve an accuracy of 94.72% for a 7-way forced-choice SVM classifier after feature selection using AdaBoost, a significant improvement over several previously successful approaches based on PCA + LDA and on Local Binary Patterns. In the next step, we obtain commentary on 40 videos depicting 4 emotions (Anger, Sadness, Happiness and Surprise) and cluster keywords obtained using a maximum co-occurrence method to discover descriptors for these emotions.

1 Introduction

Facial expression analysis has been a long-standing problem in computer vision, with applications in human-computer interaction, video summarization, effective indexing of videos, and finding lower-dimensional embeddings for facial actions. Consider the task of summarizing a video by the facial expressions of the subject (see Figure 1). Facial expression classification would enable us not only to achieve this task but also to efficiently answer queries such as "Frames where Subject A is smiling". Facial expression categorization systems are also used in personal robots like Sony's Aibo, ATR's RoboVie and CU Animator. A major part of facial expression classification involves defining a robust vocabulary for facial actions. The Facial Action Coding System (FACS) [4] has been the state of the art in manual coding of facial expressions and has been widely used to train supervised classifiers to recognize emotions in humans.
In this work we explore the use of Gabor filters for feature extraction and subsequent classification by SVMs to classify images into 7 basic emotion categories: Neutral, Anger, Disgust, Fear, Happy, Sadness and Surprise. We investigate various dimensionality reduction and feature selection methods like PCA and AdaBoost to efficiently represent the data. We further extend our work to discovering keywords describing four different emotions (Anger, Sadness, Happiness and Surprise) in an unsupervised manner from video commentary data.

2 Previous Work

There has been substantial effort devoted to automatic facial image analysis over the past decade. The pioneering work of Black and Yacoob [3] recognizes facial expressions by fitting local parametric motion models to regions of the face and then feeding the resulting parameters to a nearest neighbor classifier for expression recognition. Hoey [5] presented a multilevel Bayesian network to learn, in a weakly supervised manner, the dynamics of facial expression. De la Torre et al. [2] proposed a geometric-invariant clustering algorithm to decompose a stream of one person's facial behavior into facial gestures. Shan et al. [8] explore the use of local binary patterns (LBP) for the given task. Bartlett et al. [1] proposed a method based on Gabor filters and AdaSVMs for the same purpose. Their approach has proved to be one of the most successful, and it has been adapted into a commercial facial expression categorization system called the Computer Expression Recognition Toolbox [6].

Figure 1: Video summarization using facial expression clustering.

Our work is mainly inspired by the work of Bartlett et al., as it provides a simple and highly configurable method with plenty of opportunity for experimentation. Their reported accuracy is around 90% and it stands amongst the top methods for expression classification. Moreover, it is possible to build a realtime system for emotion recognition using this approach. To the best of our knowledge, correlation of language with facial expressions using computational methods has not been explored to date.

3 Methodology

3.1 Face Detection

We use the Viola-Jones [9] face detector to find the face in the image. It is the state-of-the-art face detection algorithm in use and performs exceptionally well on a number of datasets. The detector is based on Haar cascade classifiers. Each weak classifier uses a rectangular Haar feature to classify a region of the image as a positive or negative match:

    f_i = +1 if v_i >= t_i, and f_i = -1 otherwise,    (1)

where v_i is the value of the i-th Haar feature on the region and t_i is its learned threshold.
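The detection and cropping step can be prototyped directly. Below is a minimal sketch assuming OpenCV's Python bindings and its bundled frontal-face Haar cascade; the detector parameters and the choice of keeping the largest detection are illustrative rather than the exact settings used in our experiments.

```python
import cv2

def detect_and_crop_face(image_path, size=48):
    """Detect the largest frontal face and return it resized to size x size."""
    # Haar cascade shipped with opencv-python; an assumption about the install layout.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    faces = cascade.detectMultiScale(img, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # CK+ frames contain a single subject, so keep the largest detection.
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    return cv2.resize(img[y:y + h, x:x + w], (size, size))
```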
3.2 Gabor Filters

Convolution with a Gabor filter can be carried out efficiently by taking the Fourier transform of the image and multiplying it with the Fourier transform of the Gabor filter in the frequency domain. This aids in reducing computation time.

In this work, we use a bank of 72 Gabor filters (8 orientations and 9 spatial frequencies, 2 to 32 pixels per cycle in half-octave steps). The 48x48 face patch is convolved with all 72 Gabor filters to obtain a feature vector of size 48x48x72 = 165888 per image. Frequency and orientation representations of Gabor filters are similar to those of the human visual system, and they have been found to be particularly appropriate for texture representation and discrimination. In this problem, we need a feature set that appropriately models the orientation information of various facial units such as the lips, eyebrows and eyes. The bank of local Gabor filters used in our approach succeeds in achieving this, with the added advantage of frequency information.
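A minimal sketch of this filter bank is shown below, assuming OpenCV's getGaborKernel and NumPy's FFT. The orientations and wavelengths follow the description above, while the bandwidth (sigma of roughly 0.56 times the wavelength) and the aspect ratio are illustrative choices that the text does not fix.

```python
import cv2
import numpy as np

def gabor_bank(ksize=48):
    """72 Gabor kernels: 8 orientations x 9 wavelengths (2 to 32 px/cycle, half-octave steps)."""
    kernels = []
    for k in range(9):
        lambd = 2.0 * 2 ** (k / 2.0)           # 2, 2.83, 4, ..., 32 pixels per cycle
        for j in range(8):
            theta = j * np.pi / 8.0            # 8 evenly spaced orientations
            kernels.append(cv2.getGaborKernel((ksize, ksize), sigma=0.56 * lambd,
                                              theta=theta, lambd=lambd,
                                              gamma=0.5, psi=0))
    return kernels

def gabor_magnitudes(face48, kernels):
    """Filter a 48x48 patch in the frequency domain; returns 48*48*72 = 165888 values."""
    F = np.fft.fft2(face48.astype(np.float64))
    feats = []
    for k in kernels:
        K = np.fft.fft2(k, s=face48.shape)                  # zero-pad kernel to image size
        feats.append(np.abs(np.fft.ifft2(F * K)).ravel())   # convolution via multiplication
    return np.concatenate(feats)
```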
3.3 Dimensionality Reduction / Feature Selection

3.3.1 Adaptive Boosting

The AdaBoost algorithm iteratively combines many weak classifiers to approximate the Bayes classifier. Starting with the unweighted training sample, AdaBoost builds a classifier, for example a decision stump or a classification tree, that produces class labels. If a training data point is misclassified, the influence (weight) of that training data point is boosted. In the next iteration, a second classifier is built using the new weights, which are no longer equal. Again, misclassified training data have their weights adjusted and the procedure is repeated. A score is assigned to each weak classifier, and the final classifier is defined as the linear combination of the classifiers from each stage. In our problem, we select the best features (best weak learners) obtained by applying AdaBoost for every class using a one-versus-rest strategy. We do this by binning the best weak learners obtained over all iterations of every one-versus-rest sub-problem and picking the top k features. The final feature set is taken to be the union of the best features obtained for every sub-problem.

3.3.2 Principal Component Analysis

Principal component analysis is a dimensionality reduction technique wherein the data dimension is reduced by mapping the data into its eigenspace. In the process, the top-k eigenvectors are chosen to reflect the directions of maximum variability of the data. In practice, the lower-dimensional representation is computed by a singular value decomposition of the data, followed by extraction of its top-k eigenvectors (where k is chosen by the amount of energy to be retained) and projection onto the top-k eigenvector space. In this problem, we tried various values of k ranging from 10 to 359 and optimized for maximum classification accuracy. The best dimensionality was found to be 60. It is interesting to note that the Facial Action Coding System [4] has 64 action units that are used to code various emotions. We can perhaps conclude that the intrinsic dimensionality of the data is close to 60 and that PCA succeeds in mapping the data to this dimension.

3.4 Classification

3.4.1 Support Vector Machines

Support Vector Machines (SVMs) are primarily binary classifiers that find the best separating hyperplane by maximizing the margins from the support vectors. Support vectors are defined as the data points that lie closest to the decision boundary. SVMs can be extended to multiclass problems by two methods: one-versus-rest and one-versus-one classifiers. In the latter method, n(n-1)/2 SVMs are trained, one for each pair of classes, and the final class label is decided by majority voting. A similar approach can be used in the former case with n classifiers. We use both methods for classification. We use a novel way of choosing the final class in the one-versus-rest case: if a feature vector is classified into multiple classes, we choose the class with the maximum margin from the decision boundary; if a feature vector is classified into no class, we take the class with the minimum margin from the decision boundary. This results in a significant improvement over the one-versus-one case. We also use stacked 1-vs-rest and 1-vs-1 classifiers and a novel way of combining the AdaBoost features for use with SVMs, and present the results in the following section.

Figure 5: Classification using SVM.

3.4.2 Bayesian Classification

As a baseline we use a classifier based on data reduction followed by Bayesian classification. We fit multivariate normal distributions to all the classes using maximum likelihood estimation and classify based on the posteriors.
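The one-versus-rest decision rule of Section 3.4.1 can be summarized in a short sketch, here using scikit-learn's SVC rather than the libsvm and MATLAB implementations we actually used. Note that both branches of the rule (maximum margin when several classifiers accept a sample, closest boundary when none do) reduce to taking the argmax of the signed decision values.

```python
import numpy as np
from sklearn.svm import SVC

class OneVsRestMarginSVM:
    """One-versus-rest linear SVMs with margin-based tie breaking (Section 3.4.1)."""

    def __init__(self, C=1.0):
        self.C, self.classes_, self.models_ = C, None, []

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = []
        for c in self.classes_:
            clf = SVC(kernel="linear", C=self.C)
            clf.fit(X, (y == c).astype(int))          # class c vs. rest
            self.models_.append(clf)
        return self

    def predict(self, X):
        # Signed distance to each class boundary: positive means "accepted" by that classifier.
        margins = np.column_stack([m.decision_function(X) for m in self.models_])
        # Argmax covers both cases: largest positive margin, or least negative one.
        return self.classes_[np.argmax(margins, axis=1)]
```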
4 Dataset and Results

The primary database used for testing the algorithms was the Extended Cohn-Kanade (CK+) database. It consists of 593 posed sequences from 123 subjects. All images are of dimensions 640x490 and have annotated, tracked AAM feature points. Each sequence starts with a neutral expression and terminates at the peak expression. All the peak expressions in the database are FACS coded, and 327 of the 593 sequences are emotion labeled. There are 8 expressions present in the database: Angry, Disgust, Fear, Happy, Sadness, Surprise, Contempt and Neutral. We ignored the Contempt emotion as it has been reported to be confusing in the literature and ends up hampering the accuracy. We also took 50 neutral expressions from the database to keep the number of samples of all emotions roughly equal.

Figure 6: Images from the CK+ dataset.

Figure 7: Frequency of emotions in the CK+ dataset.

    Emotion     Frequency
    Neutral     50
    Angry       45
    Disgust     59
    Fear        25
    Happy       69
    Sadness     28
    Surprise    83

We detected faces in the CK+ images and resized them. This was done in OpenCV because of its speed and its built-in Viola-Jones face detector. The resized images were then read into MATLAB, and only the peak expressions of the image sequences with emotion labels were used to compute the Gabor magnitude representations. Finally, a feature set of 359x165888 elements was obtained. We reduced it to 60 dimensions using PCA. A linear SVM was trained on the PCA-reduced data. We used both one-versus-one and one-versus-rest SVMs from the libsvm library and MATLAB. We also had to modify the MATLAB SVM implementation to obtain the margin for the one-versus-one case. The results are compiled in Table 1. All reported accuracies are obtained using 10-fold cross-validation. The discrepancy between the accuracies of the 1-vs-1 and 1-vs-rest classifiers may be attributed to the better technique used to decide the final class in the latter.

In the AdaBoost approach, we found the best features corresponding to every one-vs-rest combination (7 in total) and took the union of all the features to form the reduced feature vector. In our implementation, we used 300 iterations of AdaBoost to find the top features and obtained a final set of 175 features after the union. These were further classified using SVMs and the baseline Bayesian approach. In another approach, which we call Adaboost-SVM, we used only those features to train and test the SVM for class x that were found to be the best by the AdaBoost iterations for class x vs. rest. We also used stacked classifiers wherein, if a test image was classified into more than one class in the 1-vs-rest approach, we used a 1-vs-1 SVM on top of it to disambiguate between the classes. We present the results for all these approaches in Table 1. It is worth noting that the Adaboost-SVM method gives almost the same accuracy as the SVM trained after AdaBoost feature selection. The major difference is that while the latter takes the union of all features and reduces the feature set to 175, the former uses a different set of features for training different SVMs, with feature vector sizes between 20 and 30. Thus the Adaboost-SVM method might be preferred over the traditional method if speed is a requirement.

Table 1: Accuracy of classifiers using 10-fold cross-validation.

    Features    SVM (1 vs 1)    SVM (1 vs rest)    Adaboost-SVM    Stacked SVM    Bayes
    PCA         71.08%          72.19%             -               69.66%         80.45%
    None        75.39%          88.87%             -               -              -
    Adaboost    80.43%          94.72%             94.61%          92.91%         86.64%

Table 2: 10-fold cross-validation accuracies of single emotions with and without feature selection.

    Emotion     None      Adaboost
    Neutral     97.5%     98.05%
    Angry       91.65%    95.26%
    Disgust     98.04%    99.72%
    Fear        96.1%     98.04%
    Happy       98.6%     98.89%
    Sadness     94.16%    94.99%
    Surprise    97.78%    99.17%
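The feature selection step described in Section 3.3.1 and used above can be sketched with scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision stump. The per-class top_k and the weighting of stumps by their stage scores are illustrative assumptions; in our experiments 300 rounds per class and the subsequent union yielded 175 features.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def adaboost_feature_union(X, y, n_rounds=300, top_k=30):
    """Per class: boost decision stumps one-vs-rest, bin the features the stumps split on,
    keep the top_k per class, and return the union used to train the final SVMs."""
    selected = set()
    for c in np.unique(y):
        ada = AdaBoostClassifier(n_estimators=n_rounds)   # default base learner is a stump
        ada.fit(X, (y == c).astype(int))
        counts = np.zeros(X.shape[1])
        for stump, w in zip(ada.estimators_, ada.estimator_weights_):
            feat = stump.tree_.feature[0]                 # feature chosen at the stump's root
            if feat >= 0:                                 # skip degenerate stumps with no split
                counts[feat] += w
        selected.update(int(i) for i in np.argsort(counts)[::-1][:top_k])
    return np.array(sorted(selected))
```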
5 Correlation with Language

As the next step of our work, we obtained commentary on 40 videos made from image sequences of people depicting various emotions. The methodology used for obtaining the commentaries was the following: subjects were first shown a demo video of a posed expression and were then asked to comment on the next video. No specific directions were provided as to what to comment on in the video. 60 such responses were recorded in English and subsequently transcribed.

The 40 videos on which commentary was obtained were taken from the test set of the previous step, and the labels assigned to them were the labels predicted by our algorithm, not the ground-truth labels. Extraneous words like articles and prepositions were stripped from the transcribed responses and only the keywords were retained. The keywords contained derivatives ending in -ed, -es, -ing, etc., so Levenshtein edit distance was used to match keywords. The co-occurrence values within the same emotion for different pairs of keywords were computed and the keywords were grouped by emotion. Table 4 shows some keywords discovered for the four emotion categories.
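A minimal sketch of this keyword grouping step is given below. The stop-word list, the edit-distance threshold used to merge derivatives, and the pairwise co-occurrence scoring are illustrative assumptions, not the exact procedure behind Table 4.

```python
from collections import Counter
from itertools import combinations

STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "in", "on",
              "to", "and", "he", "she", "it", "his", "her"}   # illustrative list

def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def canonical(word, vocabulary, max_dist=2):
    """Collapse derivatives (-ed, -es, -ing) onto an already seen keyword; a rough heuristic."""
    for known in vocabulary:
        if levenshtein(word, known) <= max_dist:
            return known
    vocabulary.append(word)
    return word

def keywords_per_emotion(transcripts):
    """transcripts: list of (predicted_emotion, commentary_text) pairs.
    Returns, per emotion, the keyword pairs with the highest co-occurrence counts."""
    vocab, cooccur = [], {}
    for emotion, text in transcripts:
        words = [w.lower().strip(".,!?") for w in text.split()]
        kws = {canonical(w, vocab) for w in words if w and w not in STOP_WORDS}
        counter = cooccur.setdefault(emotion, Counter())
        for pair in combinations(sorted(kws), 2):
            counter[pair] += 1
    return {emotion: counter.most_common(10) for emotion, counter in cooccur.items()}
```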
6 Conclusions

In this work, we have developed a facial expression recognition algorithm that classifies into 7 emotion categories. Our method performs better than a number of previous approaches and is a significant improvement over the classic approach of using PCA + LDA. Our method uses AdaBoost, which is a slow algorithm and takes a long time to converge. Our feature set is also very large (165888 features per image) and thus requires considerable computational power. On the other hand, the method achieves a very good recognition accuracy on the CK+ dataset.

Table 3: Accuracies of various methods on the CK+ dataset.

    Approach                           Best Accuracy
    Our method                         94.72%
    Gabor filter + AdaSVM [1]          93.3%
    Boosted LBP + SVM (linear) [8]     91.1%
    Gabor filter + SVM [6]             90.1%
    PCA + LDA [1]                      80.7%

Table 4: Keywords discovered for various emotion categories.

    Emotion       Keywords
    Happiness     Happy, Smile, Delight, Joy
    Sadness       Distress, Unhappy, Sad, Gloomy, Sleepy, Grief, Sorrow
    Anger         Anger, Curious, Frown, Furious, Irritate, Ill temper
    Surprise      Amaze, Surprise, Shock, Astonishment, Stupefy, Awe, Bewilderment

We have also developed a method to discover adjectives describing basic emotion categories in an unsupervised manner. We can hypothesize that this is how we learn to interpret various emotions: we are exposed to many expressions as we grow up, and we pick up words describing each category and form associations. This gives more importance to the visual system in the task of expression recognition. We may also hypothesize that it is language that helps us form subcategories within each emotion category, depicting various levels of the same emotion. Intensities associated with different adjectives can lead us to associate intensities with different depictions of the same basic emotion.

We would like to extend this work towards more robust recognition of keywords from the commentaries. In the current work, almost all the test images were classified correctly and the subjects recognized the emotions correctly, so a simple maximum co-occurrence method gave sufficiently good results. In case of misinterpretation of emotions by subjects, we would like to use a graph min-cut based method to cluster the keywords.

References

[1] Marian Stewart Bartlett, Gwen Littlewort, Mark Frank, Claudia Lainscsek, Ian Fasel, and Javier Movellan. Recognizing facial expression: Machine learning and application to spontaneous behavior. Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 2:568-573, 2005.

[2] F. De la Torre, J. Campoy, Z. Ambadar, and J. F. Cohn. Temporal segmentation of facial behavior. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1-8, 2007.

[3] F. De la Torre, Y. Yacoob, and L. Davis. A probabilistic framework for rigid and non-rigid appearance based tracking and recognition. In Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on, pages 491-498, 2000.

[4] P. Ekman and W. V. Friesen. Facial Action Coding System. Consulting Psychologists Press, Stanford University, Palo Alto, 1977.

[5] J. Hoey. Hierarchical unsupervised learning of facial expression categories. In Detection and Recognition of Events in Video, 2001. Proceedings. IEEE Workshop on, pages 99-106, 2001.

[6] G. Littlewort, J. Whitehill, Tingfan Wu, I. Fasel, M. Frank, J. Movellan, and M. Bartlett. The Computer Expression Recognition Toolbox (CERT). In Automatic Face and Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pages 298-305, March 2011.
[7] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The Extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 94-101, June 2010.

[8] Caifeng Shan, Shaogang Gong, and Peter W. McOwan. Facial expression recognition based on local binary patterns: A comprehensive study. Image and Vision Computing, 27(6):803-816, 2009.

[9] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 1:511, 2001.