Clustering guided multi-label text classification

Posted on:2013-03-31

Degree:Ph.D

Type:Dissertation

University:The University of Texas at Dallas

Candidate:Ahmed, Mohammad Salim

GTID:1458390008473749

Subject:Computer Science

Abstract/Summary:

With the advent of social networking and mobile computing, the amount of data on the Internet has increased enormously, and a significant portion of this data is in text form. An effective automated system is therefore needed to classify text data so that knowledge can be extracted from it efficiently. However, text data is usually multi-label in nature: a single text document may be associated with multiple class labels simultaneously.

Here, we propose a solution to the problem of multi-label text classification. Classifying multi-label text data poses a number of challenges. One is determining how many labels should be associated with a particular text data instance. Another is the high and sparse dimensionality of text data. Finally, the feature space is highly shared across class labels, which makes it difficult to find features that are specific to any single class label.

We have named our proposed approach SISC (Semi-supervised Impurity based Subspace Clustering). SISC partitions the data into soft subspace clusters in which each cluster has its own dimension weight vector. The clustering is soft in that each data point may belong to multiple clusters at the same time. This approach targets two of the previously mentioned characteristics of text data: multi-labelity and high and sparse dimensionality. Although SISC in its base form is a multi-class text classification approach, we provide an extension, SISC-ML (SISC-MultiLabel), to handle multi-labelity explicitly. We have also extended SISC to consider the correlation among the class labels present in the training data set, and we have used that correlation to propose a post-processing step that refines the probability assigned to each test instance by SISC. Empirical evaluation on real-world benchmark multi-class and multi-label text data sets, and comparison with other state-of-the-art text classification and subspace clustering algorithms, shows that SISC provides superior performance for multi-label text data classification in a multi-class setting.
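
To make the idea of soft subspace clusters concrete, the following is a minimal illustrative sketch and not the SISC algorithm itself, whose objective function and impurity measure are not given in this abstract. It assumes clusters are already fitted, that each cluster has its own dimension weight vector, that memberships follow the standard fuzzy c-means rule with fuzzifier 2, and that labels are chosen by a fixed score threshold. All function names, parameters, and numbers are hypothetical.

```python
from typing import List, Sequence


def weighted_sq_distance(x: Sequence[float], centroid: Sequence[float],
                         weights: Sequence[float]) -> float:
    """Squared distance in which each dimension is scaled by a cluster-specific weight."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(x, centroid, weights))


def soft_memberships(x: Sequence[float], centroids: List[Sequence[float]],
                     weights: List[Sequence[float]], eps: float = 1e-12) -> List[float]:
    """Soft membership of x in every cluster (assumed fuzzy c-means rule, fuzzifier m = 2).

    Memberships are proportional to the inverse weighted squared distance
    and are normalized to sum to 1, so a point can belong to several clusters.
    """
    inverse = [1.0 / (weighted_sq_distance(x, c, w) + eps)
               for c, w in zip(centroids, weights)]
    total = sum(inverse)
    return [v / total for v in inverse]


def label_scores(memberships: Sequence[float],
                 cluster_label_probs: List[Sequence[float]]) -> List[float]:
    """Membership-weighted average of each cluster's label distribution."""
    n_labels = len(cluster_label_probs[0])
    return [sum(m * probs[j] for m, probs in zip(memberships, cluster_label_probs))
            for j in range(n_labels)]


def predict_labels(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return every label whose score reaches the (assumed) threshold."""
    return [j for j, s in enumerate(scores) if s >= threshold]


# Hypothetical toy model: two clusters over three term features.
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
dim_weights = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]      # one weight vector per cluster
cluster_label_probs = [[0.9, 0.6, 0.0], [0.2, 0.7, 0.8]]  # fraction of each label per cluster

document = [0.5, 0.5, 0.0]
memberships = soft_memberships(document, centroids, dim_weights)
scores = label_scores(memberships, cluster_label_probs)
predicted = predict_labels(scores)
```

Because the toy document lies equally close to both clusters under their respective weight vectors, it receives equal membership in each, and more than one label can pass the threshold. A fixed threshold is only one simple way to decide how many labels to assign; the abstract does not say how SISC-ML makes that decision.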

Keywords/Search Tags:

Text, Data, Classification, SISC, Clustering

Related items

1	Research Of Text Clustering And Classification Method Based On Genetic Annealing Algorithms
2	Research On Several Models In Text Classification And Clustering
3	Research And Realization Of Clustering Guided Web Chinese Text Classification Based On SVM
4	Research On The Application Of Text Classification And Clustering In Network Security Operation System
5	Research On Web Text Clustering And Classification Algorithm
6	Clustering And Classification Of Data And Text Using Such Technologies As Neural Network
7	Research Of Text Classification And Clustering Based On Hybrid Parallel Genetic Algorithm
8	Research On Text Classification Method Based On FCM Clustering
9	Design And Implementation Of Chinese WEB Documents Clustering And Classification System
10	Research On Key Problems In Text Classification And Clustering