
A Study On Text Categorization Based On Machine Learning

Posted on: 2009-05-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: K Wu
Full Text: PDF
GTID: 1118360242483552
Subject: Computer software and theory
Abstract/Summary:
With the development and maturity of information technology, especially Internet-related technology, people can obtain more and more information. Faced with this deluge of information, people on the one hand want fast, accurate and comprehensive access to it. On the other hand, information is stored in unpredictable ways and therefore appears disorganized. How to organize and manage information as effectively and efficiently as possible is the focus of information processing. Consequently, text categorization has attracted extensive attention and has become one of the most important tasks in natural language processing.

This thesis covers feature selection, large-scale text classification and cross-language text classification. We address three issues: the first is how to classify efficiently and accurately, the second is how to make use of large-scale data, and the last is text categorization in a multi-language environment, that is, how to exploit a training corpus in one language to categorize documents in another language.

The main contributions of this thesis are as follows:

(1) A probabilistic multi-class feature selection algorithm is applied to text categorization. Compared with traditional feature selection algorithms, such as information gain and the χ² statistic, which consider each feature in isolation, the algorithm can pick out a good feature set based on the structural risk minimization of linear support vector machines. In our experiments, three common multi-class classifiers are used to test the algorithm. Experimental results show that the algorithm is effective on text data.

(2) Different voting strategies for K nearest neighbors (K-NN) are applied to text categorization and are combined with the Min-Max modular network to handle large-scale text data. For text data, the cumulative similarity voting strategy is usually adopted, which is very similar to the inverse distance voting strategy. In this thesis, different K-NN voting strategies from the machine learning field are introduced into text categorization and are further applied to the Min-Max modular network for large-scale text data processing. Experimental results show that the methods using the Gaussian voting strategy perform better than the methods using the other strategies.

(3) A hyperplane data decomposition is applied in the Min-Max modular support vector machine for text categorization. When a Min-Max modular network is used to handle large-scale data, there are usually three problems to be studied. The first is which classifier to ensemble, the second is the pruning of redundant modules and the third is data decomposition. This thesis studies the last problem, namely the application of a hyperplane data decomposition to text categorization. Traditional data decompositions usually use a random strategy or a clustering-based division strategy. However, random decomposition may undermine the spatial structure of the data, and decomposing the original data with a clustering method consumes a large amount of computing resources. The hyperplane data decomposition method can to some degree avoid both shortcomings. Experimental results validate the effectiveness of the hyperplane data decomposition on text data.

(4) For the first time, the use of a bilingual lexicon in cross-language text categorization is proposed. Multilingual analysis usually requires some additional bilingual resource to bridge the gap between two languages. Such a resource may be a bilingual lexicon, a large-scale parallel corpus, automatic machine translation, etc. However, there is little research on the use of a bilingual lexicon in cross-language text categorization. This thesis proposes using this bilingual resource to study the problem. In addition, a cross-language naive Bayes algorithm is proposed: we leverage a bilingual electronic dictionary to extend the traditional naive Bayes algorithm to a cross-language naive Bayes algorithm. Preliminary experimental results show the effectiveness of the proposed algorithm.

(5) A refinement framework for cross-language text categorization is proposed. The limited coverage of a bilingual lexicon may affect the performance of the resulting classification. Consequently, this thesis proposes using the original corpus in the target language to refine the initial labels produced by the model transferred via the bilingual lexicon. Preliminary experimental results show that the proposed framework is effective.
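To make the K-NN voting strategies named in contribution (2) concrete, the following is a minimal sketch of how the neighbor weights might differ between strategies. It is not the thesis's implementation: the function name `knn_vote`, the use of `1 - cosine similarity` as the distance, the small constant added to avoid division by zero, and the kernel width `sigma` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def knn_vote(neighbors, strategy="similarity", sigma=0.2):
    """Pick a class label from the k nearest neighbors.

    neighbors: list of (label, cosine_similarity) pairs for the k nearest
               training documents (hypothetical input format).
    strategy:  "majority", "similarity", "inverse_distance" or "gaussian".
    sigma:     width of the Gaussian kernel (an assumed parameter).
    """
    scores = defaultdict(float)
    for label, sim in neighbors:
        dist = 1.0 - sim  # assumption: distance derived from cosine similarity
        if strategy == "majority":
            weight = 1.0                      # every neighbor counts equally
        elif strategy == "similarity":
            weight = sim                      # cumulative similarity voting
        elif strategy == "inverse_distance":
            weight = 1.0 / (dist + 1e-9)      # closer neighbors weigh more
        elif strategy == "gaussian":
            weight = math.exp(-(dist ** 2) / (2.0 * sigma ** 2))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        scores[label] += weight
    # Return the label with the largest accumulated vote.
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # One very close "sports" neighbor versus two moderately close "politics" ones.
    nbrs = [("sports", 0.9), ("politics", 0.5), ("politics", 0.5)]
    for s in ("majority", "similarity", "inverse_distance", "gaussian"):
        print(s, "->", knn_vote(nbrs, strategy=s))
```

In this toy example the strategies disagree: unweighted and similarity-weighted voting favor the two moderately similar neighbors, while inverse distance and a narrow Gaussian kernel favor the single very close neighbor. Which behavior is preferable depends on the data and on the chosen `sigma`.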
Keywords/Search Tags: Text categorization, Feature selection, Min-max modular network, Nearest neighbor, Support vector machines, Cross-language text categorization, Naive Bayes