Active Learning for Text Classification

Ongoing Project in the course CS 365 by Ankit Bhutani (ankitbhu@iitk.ac.in) under the supervision of Professor Amitabh Mukherjee

Motivation:

With the rapid growth of online information, text classification has become necessary for handling and organizing text data. Text classification techniques are used to classify news stories, to find information on the internet, and to guide a user's search through hypertext. Manually classifying and labelling text documents into different categories is both expensive and time-consuming, so labelled data is generally in short supply, whereas there can be volumes of unlabelled data. Thus, it makes intuitive sense to learn a classifier from a few labelled examples and to augment its "knowledge" using the available unlabelled data.

Several approaches exist for augmenting the classifier's knowledge using unlabelled data, one of which is Active Learning. Active learning is a form of machine learning in which the learning algorithm can interactively query the user (or some other information source) to obtain the desired labels at new data points. More details about how I am employing this active learning approach will be posted on this webpage soon.

Related work:

A lot of work has been done, especially in the last two decades, on semi-supervised text classification. (Nigam et al., 2000) proposed a fast semi-supervised method combining Expectation Maximization (EM) with Multinomial Naive Bayes (MNB), but pointed out that EM may decrease the performance of MNB when the dataset contains multiple subtopics in one class. They proposed a Common Component (CC) method using EM to address this problem. (Chawla & Karakoulas, 2005) observed that while CC may improve the AUC of naive Bayes given a small amount of labelled data, it may significantly underperform naive Bayes given larger labelled data. (Su, Shirabad and Matwin, 2011) proposed a method named semi-supervised frequency estimate (SFE) and showed that SFE significantly and consistently improves the AUC and accuracy of MNB, whereas EM+MNB can fail to improve the AUC of MNB.
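As a concrete illustration of the EM+MNB scheme: train MNB on the labelled documents, then alternate between assigning class posteriors to the unlabelled documents (E-step) and re-estimating the MNB parameters from all documents weighted by those posteriors (M-step). The following is a minimal numpy sketch of that loop; the function names and the use of hard one-hot responsibilities for labelled documents are my own choices, not code from the cited papers.

```python
import numpy as np

def train_mnb(X, post):
    # X: (n_docs, n_words) word counts; post: (n_docs, n_classes) responsibilities
    class_prior = post.sum(0) / post.sum()
    word_counts = post.T @ X                                   # (n_classes, n_words)
    word_prob = (word_counts + 1.0) / (word_counts.sum(1, keepdims=True)
                                       + X.shape[1])           # Laplace smoothing
    return np.log(class_prior), np.log(word_prob)

def predict_posterior(X, log_prior, log_word):
    joint = X @ log_word.T + log_prior                         # unnormalized log posterior
    joint -= joint.max(1, keepdims=True)                       # numerical stability
    p = np.exp(joint)
    return p / p.sum(1, keepdims=True)

def em_mnb(X_lab, y_lab, X_unlab, n_classes, n_iter=10):
    post_lab = np.eye(n_classes)[y_lab]                        # fixed one-hot for labelled docs
    log_prior, log_word = train_mnb(X_lab, post_lab)           # initialise from labelled data
    for _ in range(n_iter):
        post_unlab = predict_posterior(X_unlab, log_prior, log_word)   # E-step
        X_all = np.vstack([X_lab, X_unlab])
        post_all = np.vstack([post_lab, post_unlab])
        log_prior, log_word = train_mnb(X_all, post_all)               # M-step
    return log_prior, log_word
```

Because the model is initialised from the labelled documents, the class identities stay anchored across EM iterations; with multiple subtopics per class this simple version can still drift, which is exactly the failure mode the CC method targets.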

The above works are set in a framework where the learning process is passive, in the sense that we have some labelled data and a lot of unlabelled data whose labels we cannot obtain. Active learning relaxes this constraint.

Active learning approaches the same problem in a different way. Unlike the EM or SFE settings, the active learner can request the true class label for certain unlabelled documents it selects. However, each request is considered an expensive operation, and the goal is to perform well with as few queries as possible. Methods used in active learning for text include Query by Committee (QBC) (Freund et al., 1997), EM with pool-based active learning (McCallum and Nigam, 1998), active learning using SVMs (Tong and Koller, 2000), and multi-label active learning (Yang et al., 2009).
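The pool-based setting can be sketched as a simple loop: train on the labelled documents, score every document in the unlabelled pool by how uncertain the current model is about it, query the oracle for the single most uncertain one, and retrain. The sketch below uses plain entropy-based uncertainty sampling with a multinomial naive Bayes learner; note that McCallum and Nigam's actual method uses committee disagreement and density weighting, so this is only an illustrative stand-in, with hypothetical function names.

```python
import numpy as np

def fit_nb(X, y, n_classes):
    # Multinomial naive Bayes with Laplace smoothing on hard labels
    prior = np.bincount(y, minlength=n_classes) + 1.0
    counts = np.vstack([X[y == c].sum(0) for c in range(n_classes)]) + 1.0
    return np.log(prior / prior.sum()), np.log(counts / counts.sum(1, keepdims=True))

def posterior(X, log_prior, log_word):
    joint = X @ log_word.T + log_prior
    joint -= joint.max(1, keepdims=True)
    p = np.exp(joint)
    return p / p.sum(1, keepdims=True)

def active_learn(X_pool, oracle, seed_idx, n_classes, budget):
    labeled = list(seed_idx)
    y = {i: oracle(i) for i in labeled}                # initial labelled seed
    for _ in range(budget):
        lp, lw = fit_nb(X_pool[labeled],
                        np.array([y[i] for i in labeled]), n_classes)
        probs = posterior(X_pool, lp, lw)
        ent = -(probs * np.log(probs + 1e-12)).sum(1)  # predictive entropy per doc
        ent[labeled] = -np.inf                         # never re-query a labelled doc
        q = int(ent.argmax())                          # most uncertain document
        y[q] = oracle(q)                               # one expensive label query
        labeled.append(q)
    lp, lw = fit_nb(X_pool[labeled], np.array([y[i] for i in labeled]), n_classes)
    return lp, lw
```

The `budget` parameter makes the cost model explicit: performance is judged as a function of the number of oracle calls, not the size of the pool.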

The current project aims to combine the two ideas effectively as described below.

Proposed approach:

The basic approach is to combine the methods of (McCallum and Nigam, 1998) and (Su, Shirabad and Matwin, 2011). To be precise, the former uses Active Learning + EM and the latter uses SFE in a passive learning scenario; I plan to use Active Learning + SFE. If time permits, other methods for query selection can be experimented with.
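To make the SFE component concrete: my reading of Su, Shirabad and Matwin (2011) is that SFE estimates P(w|c) proportional to P(w) * P(c|w), where the class-given-word term P(c|w) comes from the labelled documents only (a frequency estimate) while the marginal word probability P(w) is computed from all documents, labelled and unlabelled. The sketch below encodes that reading with an invented function name; readers should consult the paper for the exact formulation before relying on it.

```python
import numpy as np

def sfe_word_probs(X_lab, y_lab, X_unlab, n_classes, alpha=1.0):
    # P(c|w) estimated from the labelled documents only (frequency estimate)
    counts_lab = np.vstack([X_lab[y_lab == c].sum(0)
                            for c in range(n_classes)]) + alpha   # (C, W)
    p_c_given_w = counts_lab / counts_lab.sum(0, keepdims=True)
    # P(w) estimated from ALL documents, labelled and unlabelled
    all_counts = X_lab.sum(0) + X_unlab.sum(0) + alpha
    p_w = all_counts / all_counts.sum()
    # SFE: P(w|c) proportional to P(w) * P(c|w), renormalised per class
    p_w_given_c = p_w * p_c_given_w
    p_w_given_c /= p_w_given_c.sum(1, keepdims=True)
    return np.log(p_w_given_c)
```

Combining this with the active learning loop above would simply mean refitting the word probabilities with `sfe_word_probs` (over the labelled set plus the remaining pool) after each oracle query, instead of fitting MNB on the labelled documents alone.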

References:

  1. Nigam, Kamal, McCallum, Andrew, Thrun, Sebastian, and Mitchell, Tom M. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103-134, 2000.
  2. Chawla, Nitesh V. and Karakoulas, Grigoris J. Learning from labeled and unlabeled data: An empirical study across techniques and domains. Journal of Artificial Intelligence Research (JAIR), 23:331-366, 2005.
  3. Freund, Y., Seung, H., Shamir, E., and Tishby, N. Selective sampling using the query by committee algorithm. Machine Learning, 28:133-168, 1997.
  4. McCallum, A. K. and Nigam, K. Employing EM in pool-based active learning for text classification. In Proceedings of ICML-98, 15th International Conference on Machine Learning (Madison, WI), pages 350-358, 1998.
  5. Tong, S. and Koller, D. Support vector machine active learning with applications to text classification. In Proceedings of the 17th International Conference on Machine Learning, pages 401-412, June 2000.
  6. Yang, Bishan, Sun, Jian-Tao, Wang, Tengjiao, and Chen, Zheng. Effective multi-label active learning for text classification. In KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 917-926, New York, NY, USA, 2009. ACM.
  7. Su, Jiang, Sayyad-Shirabad, Jelber, and Matwin, Stan. Large scale text classification using semi-supervised multinomial naive Bayes. In Proceedings of the 28th International Conference on Machine Learning, 2011.