Research Of Text Classification Algorithm Based On Comparative Feature

Posted on: 2009-07-06
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhao
Full Text: PDF
GTID: 2178360245453580
Subject: Computer application technology
Abstract/Summary:
With the rapid development of computing, communication, and network technology and the growing popularity of the Internet, the number of electronic documents keeps increasing. To make efficient use of this unstructured data, there is great demand for highly efficient content-based text search, retrieval, and filtering systems. Text mining is key to building such systems, and text classification, as an important part of text mining, has attracted considerable research attention. Many methods have been applied in this field, such as Naive Bayes, SVM, KNN, and neural networks. Among them, Naive Bayes uses prior information to provide a model and a method for reasoning under uncertainty; it is efficient and easy to implement, so it has been widely used. Neural networks are also very popular: they can learn, are fault-tolerant, and need no assumptions about the underlying probability model. In text classification, however, Naive Bayes cannot reflect semantic relations, and the accuracy of a general neural network is not high. To address these problems, the main contributions of the thesis are as follows:
(1) After introducing several common text classification algorithms, the thesis focuses on two classical ones, Naive Bayes and the Self-Organizing Feature Map. The two algorithms are analyzed and compared in detail through a literature review and experiments.
(2) Combining the two algorithms and following the idea of "divide and conquer", the thesis puts forward the concepts of comparative feature and comparative threshold and then proposes a new text classification algorithm based on comparative feature. The analysis, design, and operation of this algorithm are described in detail.
(3) The thesis analyzes and compares the characteristics of the Chinese and English corpora and the problems of preprocessing them, and presents the methods and results of preprocessing the English corpus. It also analyzes and compares the results of the three algorithms on the two corpora.
(4) The thesis evaluates the three algorithms on the Chinese and English corpora, comparing the newly proposed algorithm based on comparative feature with the traditional Naive Bayes and Self-Organizing Feature Map algorithms.
Experimental results show that the text classification algorithm based on comparative feature achieves satisfactory results and is a highly efficient algorithm for text classification.
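For readers unfamiliar with the Naive Bayes baseline named in the abstract, a minimal sketch of a multinomial Naive Bayes text classifier with Laplace smoothing may help. It is illustrative only and is not the thesis's implementation: the function names, the model layout, and the toy documents are all assumptions, and it does not attempt to reproduce the comparative feature or comparative threshold, which the abstract does not define.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Count documents per class and words per class.

    docs: list of (tokens, label) pairs (hypothetical input format).
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts, "vocab": vocab}


def log_scores(model, tokens):
    """Return log P(class) + sum of log P(token | class) for each class."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # class prior
        for tok in tokens:
            if tok in model["vocab"]:  # skip words never seen in training
                # Laplace (add-one) smoothing avoids log(0) for unseen pairs.
                score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        scores[label] = score
    return scores


def classify(model, tokens):
    """Pick the class with the highest log score."""
    scores = log_scores(model, tokens)
    return max(scores, key=scores.get)


# Toy training data, invented purely for illustration.
toy_docs = [
    ("stock market price rise".split(), "finance"),
    ("bank interest rate market".split(), "finance"),
    ("team win match goal".split(), "sports"),
    ("player score goal season".split(), "sports"),
]
model = train_naive_bayes(toy_docs)
```

Each token contributes to the score independently of the others. This is the independence assumption behind the limitation the abstract notes, namely that Naive Bayes cannot reflect semantic relations between words.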
Keywords/Search Tags: Text Classification, Naive Bayes, Self-Organizing Feature Map, Comparative Feature, Comparative Threshold