
Clustering guided multi-label text classification

Posted on: 2013-03-31 | Degree: Ph.D. | Type: Dissertation
University: The University of Texas at Dallas | Candidate: Ahmed, Mohammad Salim | Full Text: PDF
GTID: 1458390008473749 | Subject: Computer Science
Abstract/Summary:
With the advent of social networking and mobile computing, the amount of data on the Internet has increased enormously, and a significant portion of this data is in text form. An effective automated system is therefore needed to classify text data so that knowledge can be extracted from it efficiently. However, text data is usually multi-label in nature: a single text document may be associated with multiple class labels simultaneously.

Here, we propose a solution to the problem of multi-label text classification. Classifying multi-label text data poses a number of challenges. One is determining how many labels should be associated with a particular text data instance. Another is the high and sparse dimensionality of text data. Finally, the feature space is highly shared across class labels, which makes it difficult to find features that are specific to any single class label.

We have named our proposed approach SISC (Semi-supervised Impurity based Subspace Clustering). SISC partitions the data into soft subspace clusters in which each cluster has its own dimension weight vector. The clustering is soft in that each data point may belong to multiple clusters at the same time. This approach targets two of the previously mentioned characteristics of text data: multi-labelity, and high and sparse dimensionality. Although SISC in its base form is a multi-class text classification approach, an extension called SISC-ML (SISC-MultiLabel) is provided to handle multi-labelity explicitly. We have also extended SISC to consider the correlation among the class labels present in the training data set. We have further used the correlation within the training data to propose a post-processing step that refines the probabilities SISC assigns to each test instance.

Empirical evaluation on real-world benchmark multi-class and multi-label text data sets, together with comparison against other state-of-the-art text classification and subspace clustering algorithms, shows that SISC provides superior performance for multi-label text data classification in a multi-class setting.
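
Illustrative sketch (not from the dissertation): the abstract does not give SISC's objective function or update rules, so the Python below only illustrates the two general ideas it names, soft cluster membership under per-cluster dimension weights, and assigning several labels to one document. The function names, the softmax-style membership rule, the `temperature` parameter, the per-cluster label probabilities, and the `threshold` are all assumptions made for the example.

```python
import math

def soft_memberships(x, centroids, weights, temperature=1.0):
    """Return one membership value per cluster; the values sum to 1.

    Each cluster k has its own dimension weight vector weights[k], so the
    same feature can matter a lot for one cluster and little for another.
    (Assumed rule: softmax over negative weighted squared distances.)
    """
    dists = [
        sum(wj * (xj - cj) ** 2 for xj, cj, wj in zip(x, c, w))
        for c, w in zip(centroids, weights)
    ]
    dmin = min(dists)  # shift by the minimum for numerical stability
    scores = [math.exp(-(d - dmin) / temperature) for d in dists]
    total = sum(scores)
    return [s / total for s in scores]

def predict_labels(memberships, cluster_label_probs, threshold=0.5):
    """Mix per-cluster label probabilities by membership, then threshold.

    Returns (sorted list of predicted labels, dict of label scores).
    Thresholding lets a document receive zero, one, or several labels.
    """
    label_scores = {}
    for m, probs in zip(memberships, cluster_label_probs):
        for label, p in probs.items():
            label_scores[label] = label_scores.get(label, 0.0) + m * p
    predicted = sorted(l for l, s in label_scores.items() if s >= threshold)
    return predicted, label_scores

# Toy data: 3 term features, 2 clusters (all values are hypothetical).
centroids = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
weights = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
cluster_label_probs = [
    {"sports": 0.9, "politics": 0.1},
    {"sports": 0.2, "politics": 0.9},
]

doc = [1.0, 0.0, 0.0]
memberships = soft_memberships(doc, centroids, weights)
labels, label_scores = predict_labels(memberships, cluster_label_probs)
print(memberships, labels)
```

In this toy setup the document belongs partly to both clusters, and its label scores are a membership-weighted mix of each cluster's label probabilities. How SISC actually learns the dimension weights and uses impurity or label correlation is described in the dissertation itself, not here.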
Keywords/Search Tags: Text, Data, Classification, SISC, Clustering