Data Mining Systems And Their Applications - Improve The Performance Of The Naive Bayes Text Classifier, Associated Characteristics

Posted on: 2004-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: L J Zhang
GTID: 2208360095450961
Subject: Computer software and theory
Abstract/Summary:
Text categorization is the task of automatically assigning natural language texts to predefined categories based on their content. With the rapid growth of online digital text data, text categorization has become one of the key techniques for handling and organizing such data. Text categorization techniques can be used to classify news stories, to find interesting information on the WWW, and to guide a user's search through hypertext.

The Naive Bayes (NB) classifier has long been considered a core methodology in text categorization, mainly because of its simplicity and computational efficiency. Its main drawback is that it uses single words as features and assumes that, given the class, the presence or absence of a word is independent of the presence or absence of any other word. This assumption clearly does not hold in practice, which reduces the performance of the Naive Bayes classifier. If the assumption can be relaxed to some extent, the performance of the classifier can be improved.

To relax the assumption, we propose the concept of association terms. An association term is a set of single words (called primitive terms) that frequently occur together. Association terms make better features than primitive terms. In many cases, an association term describes a concept better than its component words do; in other cases, the concept is described only by the association term and not by its component words.

We devised a simple algorithm based on Apriori for mining association terms. To reduce the size of the feature space and improve the efficiency and effectiveness of classification, we propose two algorithms for selecting useful association terms: a redundancy pruning algorithm and a feature selection algorithm based on information gain.
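The Apriori-style mining step described above can be illustrated with a minimal sketch. It is not the thesis's actual algorithm, whose details the abstract does not give. It assumes that support is counted as the number of documents containing every word of a set, and the names `mine_association_terms`, `min_support`, and `max_size` are hypothetical.

```python
from itertools import combinations


def mine_association_terms(docs, min_support, max_size=3):
    """Return {frozenset(words): document_count} for word sets of size >= 2
    whose words co-occur in at least `min_support` documents.

    Assumption: each document is treated as a set of words, and support is
    a raw document count (the abstract does not define support precisely).
    """
    doc_sets = [set(d) for d in docs]

    # Level 1: count single words (primitive terms) and keep the frequent ones.
    counts = {}
    for words in doc_sets:
        for w in words:
            key = frozenset([w])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}

    result = {}
    k = 2
    while frequent and k <= max_size:
        prev = set(frequent)
        candidates = set()
        # Join step: merge pairs of frequent (k-1)-sets into k-sets.
        for a, b in combinations(prev, 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in prev for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count step: one pass over the documents per level.
        counts = {c: sum(1 for words in doc_sets if c <= words) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result
```

The candidate generation in the join and prune steps is what becomes expensive when the vocabulary is large or the support threshold is low.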
Using association terms as features, we built our text classifier, NBAT (Naive Bayes text classifier by Association Terms). NBAT was run on the ten most populated categories of the Reuters-21578 dataset. The experimental results showed that the performance of the Naive Bayes text classifier can be improved by using association terms: the macro-averaged break-even point increased by 6.9% and the macro-averaged F1 measure by 12.2%.

Finally, we point out the weakness of our algorithm. Its main weakness is low mining efficiency, especially when the feature space is very large or the minimum support threshold is very low. We suggest a possible solution: using an association term mining algorithm that does not generate candidate frequent itemsets, such as FP-tree. Lastly, we outline research in this field that should be done in the future.
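As a rough illustration of how association terms can serve as features for a Naive Bayes classifier, the sketch below treats each association term contained in a document as one extra binary feature. It is not the NBAT implementation. The Bernoulli event model, the Laplace smoothing, the choice to keep primitive terms alongside association terms, and the names `extract_features` and `BernoulliNB` are all assumptions, since the abstract specifies none of them.

```python
import math
from collections import Counter, defaultdict


def extract_features(words, association_terms):
    """Primitive terms plus every association term fully contained in the document."""
    present = set(words)
    feats = set(present)
    feats.update(t for t in association_terms if t <= present)
    return feats


class BernoulliNB:
    """Minimal Bernoulli Naive Bayes over sets of hashable features."""

    def fit(self, feature_sets, labels):
        self.classes = sorted(set(labels))
        self.vocab = set().union(*feature_sets)
        self.n_docs = Counter(labels)      # documents per class
        self.df = defaultdict(Counter)     # class -> feature -> document count
        for feats, y in zip(feature_sets, labels):
            self.df[y].update(feats)
        self.total = len(labels)
        return self

    def log_posterior(self, feats, c):
        # log P(c) + sum over the vocabulary of log P(f present or absent | c)
        score = math.log(self.n_docs[c] / self.total)
        for f in self.vocab:
            p = (self.df[c][f] + 1) / (self.n_docs[c] + 2)  # Laplace smoothing
            score += math.log(p if f in feats else 1.0 - p)
        return score

    def predict(self, feats):
        return max(self.classes, key=lambda c: self.log_posterior(feats, c))
```

Features that never occurred in training are ignored at prediction time because scoring iterates over the training vocabulary only.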
Keywords/Search Tags: Data mining, Text mining, Text Categorization, Naive Bayes, Association Term