Research And Improvement To Text Classification Algorithm

Posted on: 2011-10-22
Degree: Master
Type: Thesis
Country: China
Candidate: J Q Gong
Full Text: PDF
GTID: 2178330332988145
Subject: Computer application technology
Abstract/Summary:
Text categorization can give information retrieval more efficient search strategies and better query results. With the rapid growth of information resources on the Internet, information processing has become increasingly important.

Automatic text classification based on machine learning became mainstream after the 1980s. Its advantages are a short turnaround time, high efficiency, and highly consistent results. Despite these merits, its accuracy is still not satisfactory. With the rapid development of the Internet, text classification has found a wide range of applications, and current research focuses mainly on improving the accuracy of classification results.

This thesis gives a detailed introduction to the key techniques of automatic text categorization, including the text classification system. It then analyzes the Bayes classifier model and algorithm, including text representation, feature extraction, and the classification method. Moreover, to overcome the weakness of the Naive Bayes method's independence hypothesis, dispersion of mutual information (DMI) is used to measure the relevance between features and to merge similar ones. The process is as follows: extract the original feature words from the training text set, remove stop words, remove ambiguous word senses, and perform dimension reduction (DR) based on the relevance between features within a local domain using DMI. Compared with the vector before dimension reduction, the resulting vector has fewer low-frequency feature words and more high-frequency ones. The high-frequency features are strengthened, the number of feature words is reduced, and the dimension is reduced. The resulting vector is more strongly associated with the class it belongs to, and represents that class better, than the original vector.
The goal of dimension reduction is thus achieved.

Building on this, an improved text classification method that raises classification efficiency is proposed. In addition, the policy for keeping or discarding features during DR, such as threshold selection and its basis, is studied further. Experiments show that the improved text classification model is well suited to text classification and outperforms the existing one.
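
The pipeline described in the abstract (stop-word removal, feature scoring, threshold-based dimension reduction, Naive Bayes classification) can be illustrated with a minimal Python sketch. The abstract does not define DMI, so this sketch substitutes ordinary mutual information between term presence and class label as a stand-in, and it does not merge similar features. The stop-word list, function names, threshold value, and toy documents are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list (assumption, not from the thesis).
STOP_WORDS = {"the", "a", "of", "and", "to", "is", "in"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def mutual_information(docs, labels):
    """Mutual information (in nats) between term presence and class label.
    Plain MI is used here as a stand-in for the thesis's DMI measure."""
    n = len(docs)
    class_count = Counter(labels)
    term_class = defaultdict(Counter)  # term -> class -> docs containing term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_class[term][label] += 1
    scores = {}
    for term, per_class in term_class.items():
        n_t = sum(per_class.values())  # docs containing the term
        mi = 0.0
        for c, n_c in class_count.items():
            n_tc = per_class.get(c, 0)
            cells = (
                (n_tc, n_t, n_c),            # term present, class c
                (n_c - n_tc, n - n_t, n_c),  # term absent, class c
            )
            for joint, n_x, n_y in cells:
                if joint > 0:
                    mi += (joint / n) * math.log(joint * n / (n_x * n_y))
        scores[term] = mi
    return scores

def select_features(scores, threshold):
    """Dimension reduction: keep only terms scoring at or above the threshold."""
    return {t for t, s in scores.items() if s >= threshold}

def train_nb(docs, labels, vocab):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    n = len(docs)
    class_count = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(t for t in tokens if t in vocab)
    model = {}
    for c, n_c in class_count.items():
        total = sum(term_counts[c].values())
        log_like = {
            t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
        model[c] = (math.log(n_c / n), log_like)
    return model

def classify(model, tokens):
    """Return the class with the highest log posterior; unseen terms are ignored."""
    def score(c):
        log_prior, log_like = model[c]
        return log_prior + sum(log_like[t] for t in tokens if t in log_like)
    return max(model, key=score)

# Toy training set (hypothetical data, for illustration only).
train = [
    ("the match ended with a late goal today", "sport"),
    ("the team won the cup final", "sport"),
    ("the bank raised interest rates today", "finance"),
    ("stock market and bank shares fell", "finance"),
]
docs = [tokenize(text) for text, _ in train]
labels = [label for _, label in train]

scores = mutual_information(docs, labels)
vocab = select_features(scores, threshold=0.1)  # threshold is an arbitrary choice
model = train_nb(docs, labels, vocab)
prediction = classify(model, tokenize("bank shares fell today"))
print(prediction)
```

In this toy data the word "today" occurs equally often in both classes, so its mutual information is zero and the threshold removes it, while class-specific words such as "bank" are kept. How the threshold should actually be chosen is one of the questions the thesis itself studies; the value here is not taken from it.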
Keywords/Search Tags: Text Categorization, Naive Bayes, Independence Hypothesis, Mutual Information