
Research On Key Problems In Text Classification And Clustering

Posted on: 2008-04-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Yang
GTID: 1118360215483679
Subject: Signal and Information Processing
Abstract/Summary:
Text classification and clustering are among the most valuable technologies in text information processing, a field concerned with organizing, managing and processing large amounts of text data so that required information can be located swiftly, accurately and comprehensively. As the key technology for organizing and processing large amounts of text data, text classification and clustering can, to a great extent, solve the problems of information disorder and information explosion. They also serve as the technical basis of information filtering, information retrieval, search engines, text databases, digital libraries and similar applications. With the advent of the information era, text classification and clustering have attracted growing attention, not only from scientists but also from governments and industry, and the governments and industrial communities of many countries and regions have invested heavily in the relevant research.

This dissertation investigates three problems: developing processing algorithms for huge amounts of text data; improving the performance of text classification techniques; and raising traditional text clustering techniques to the level of "understanding". The main contributions are summarized as follows.

First, we discuss the application of statistical models to text classification. We begin with the conventional Bayesian method and then propose an improved weighted Bayesian method. Next, we discuss a method that combines labeled and unlabeled data using transductive inference. Finally, we examine the application of character-level statistical methods to the classification of huge text collections. This part also explores solutions to online spam filtering and short message classification tasks. Experimental results show that these easy-to-use methods can not only learn from both labeled and unlabeled data but also achieve a trade-off between processing accuracy and speed.

Second, we discuss the application of classifier ensembles to text classification. We first outline the E+V (Error-Variance) decomposition. Based on a rigorous proof of this theory, we propose a |V| index that reflects ensemble performance. We then formulate majority voting as an optimization problem with linear constraints and present the theoretical upper and lower bounds on the performance obtainable by combining classifiers through majority voting. Finally, we discuss two possible approaches to reaching the theoretical upper bound: 1) selective ensemble; 2) ensemble based on the optimization of |V|. The resulting technology has been successfully implemented in our spam filtering system.

Third, we investigate the application of nonlinear methods to text clustering and discuss how to raise traditional text clustering techniques to the level of "understanding". Using manifold analysis, we primarily study the distribution of Chinese words in a continuous semantic space, which is useful for further study of feature selection based on semantic distance. We then investigate short message clustering based on WordNet. Experimental results show that these methods can reflect the internal relations among texts.
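
As a rough illustration of the majority voting idea mentioned in the abstract, the following Python sketch combines the outputs of several classifiers by taking the most common label per document. It is not the dissertation's method: the function names (`majority_vote`, `ensemble_predict`, `accuracy`), the tie-breaking rule and the toy spam/ham labels are all assumptions made for this example.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among one document's predictions.

    Assumes a non-empty list. Ties go to the label that appears first
    (an arbitrary choice made for this sketch).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def ensemble_predict(per_classifier_labels):
    """Combine classifiers; per_classifier_labels[i][j] is the label
    that classifier i assigns to document j."""
    return [majority_vote(list(votes)) for votes in zip(*per_classifier_labels)]


def accuracy(predicted, truth):
    """Fraction of documents whose predicted label matches the truth."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical toy data: three classifiers, each wrong on a different document.
truth = ["spam", "ham", "spam", "ham", "spam"]
outputs = [
    ["spam", "ham", "spam", "ham", "ham"],   # wrong on the last document
    ["spam", "ham", "ham", "ham", "spam"],   # wrong on the third document
    ["ham", "ham", "spam", "ham", "spam"],   # wrong on the first document
]

combined = ensemble_predict(outputs)
individual_accuracies = [accuracy(o, truth) for o in outputs]
ensemble_accuracy = accuracy(combined, truth)
```

In this toy data the individual errors fall on different documents, so the vote corrects all of them. When classifiers err on the same documents the vote cannot help, which is presumably the kind of dependence that the bounds and the selective ensemble discussed in the abstract address.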
Keywords/Search Tags:text classification and clustering, statistical model, classifier ensemble, manifold learning, spam filtering, short messages