Research And Improvement To Text Classification Algorithm

Posted on:2011-10-22

Degree:Master

Type:Thesis

Country:China

Candidate:J Q Gong

GTID:2178330332988145

Subject:Computer application technology

Abstract/Summary:

Text categorization gives information retrieval more efficient search strategies and better query results. With the rapid growth of information resources on the Internet, information processing has become increasingly important. Automatic text classification based on machine learning became mainstream after the 1980s. It offers a short processing period, high efficiency, and highly consistent results. Despite these merits, its accuracy is still not satisfactory. Text classification has found a wide range of applications with the rapid development of the Internet, and current research focuses mainly on improving the accuracy of classification results.

This thesis gives a detailed introduction to the key techniques of automatic text categorization, including the text classification system. It then analyzes the Bayes classifier model and algorithm, covering text representation, feature extraction, and the classification method. To overcome the shortcoming of the Naive Bayes classification method, namely its independence hypothesis, dispersion of mutual information (DMI) is used to measure the relevance between features and to merge similar ones. The process is as follows: extract the original feature words from the training text set, remove stop words, eliminate ambiguous meanings, and perform dimension reduction (DR) based on the relevance between features within a local domain, as measured by DMI. Compared with the vector before dimension reduction, the resulting vector has fewer low-frequency feature words and more high-frequency ones: the high-frequency words are strengthened, the number of feature words is reduced, and the dimension is lowered. The reduced vector is more strongly associated with the class it belongs to and represents that class better than the original vector, so the goal of DR is achieved.

On this basis, an improved text classification method that can improve classification efficiency is proposed. In addition, the accept-or-reject policy applied during DR, such as threshold selection and its basis, is studied further. Experiments show that the improved text classification model is well suited to text classification and improves on the performance of the existing one.
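The pipeline described in the abstract (stop-word removal, mutual-information-based dimension reduction with a threshold, then Naive Bayes classification) can be illustrated with a minimal Python sketch. The abstract does not define DMI, so the sketch substitutes plain pointwise mutual information between a term and a class; it does not merge similar features, and it is not the thesis's method. The stop-word list, the toy corpus, the threshold of 0.5, and all function names are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict

# Illustrative stop-word list (assumption; not taken from the thesis).
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def mutual_information(docs, labels):
    """Score each term by its largest pointwise MI with any class,
    log(P(t, c) / (P(t) * P(c))), estimated from document counts.
    This is a stand-in for DMI, which the abstract does not define."""
    n = len(docs)
    class_count = Counter(labels)
    term_count, joint = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        for t in set(tokens):
            term_count[t] += 1
            joint[(t, label)] += 1
    scores = {}
    for (t, c), n_tc in joint.items():
        mi = math.log(n_tc * n / (term_count[t] * class_count[c]))
        scores[t] = max(scores.get(t, float("-inf")), mi)
    return scores

def select_features(docs, labels, threshold):
    """Dimension reduction: keep terms whose score reaches the threshold."""
    scores = mutual_information(docs, labels)
    return {t for t, s in scores.items() if s >= threshold}

def train_nb(docs, labels, vocab):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(t for t in tokens if t in vocab)
    model = {}
    for c, n_c in class_docs.items():
        total = sum(word_counts[c].values())
        log_prior = math.log(n_c / len(docs))
        log_like = {t: math.log((word_counts[c][t] + 1) / (total + len(vocab)))
                    for t in vocab}
        model[c] = (log_prior, log_like)
    return model

def classify(model, tokens):
    """Return the class with the highest log posterior score."""
    def score(c):
        log_prior, log_like = model[c]
        return log_prior + sum(log_like[t] for t in tokens if t in log_like)
    return max(model, key=score)

# Toy corpus (made up for this sketch).
texts = [
    "the team won the match today",
    "a late goal decided the match",
    "the stock market fell today",
    "the bank raised the stock price",
]
labels = ["sports", "sports", "finance", "finance"]
docs = [tokenize(t) for t in texts]

vocab = select_features(docs, labels, threshold=0.5)
model = train_nb(docs, labels, vocab)
prediction = classify(model, tokenize("the match had a late goal today"))
print(prediction)
```

In this toy corpus the word "today" appears equally often in both classes, so its score is zero and the threshold drops it, while class-specific words such as "match" and "stock" are kept. Pointwise mutual information is known to favor rare terms, which is one reason a threshold policy of the kind the abstract mentions needs care.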

Keywords/Search Tags:

Text Categorization, Naive Bayes, Independence Hypothesis, Mutual Information

Related items

1	The Research And Implement Of Naive Bayes Text Classification Algorithm
2	Text Categorization Based On Naive Bayes Method
3	The Research And Application Of Text Categorization Arithmetic In Spam Filtering
4	The Research Of Multi-layer Hidden Naive Bayes Algorithm Based On Mutual Information
5	A Study On Text Categorization Based On Machine Learning
6	The Study Of Naive Bayes Text Classification System Based On Artificial Intelligence
7	The Study Of Chinese Text Categorization Based On Naïve Bayes
8	Research And Improvement Of Automatic Text Classification Algorithm Based On The Vector Space Model
9	The Study On Feature Selection Methods For Automatic Text Categorization
10	Data Mining Systems And Their Applications - Improve The Performance Of The Naive Bayes Text Classifier, Associated Characteristics