Information Filtering Technologies Based On Heuristic Rules And Text Classification

Posted on: 2008-08-27 | Degree: Master | Type: Thesis
Country: China | Candidate: Y L Wang | Full Text: PDF
GTID: 2178360242474603 | Subject: Computer applications
Abstract/Summary:
With the rapid growth of the World Wide Web, the Internet has become an important window on the world, offering a rich supply of information and services. It also carries a large amount of bad information: reactionary, violent, or superstitious content spread by malicious organizations, as well as pornographic content, appearing on BBS forums, blogs, junk email, and pornographic websites. Detecting and filtering bad information on the Web has become an important problem in the information filtering field. Most bad information on the Internet takes the form of text, so bad-text filtering is a main research direction in this domain.
This thesis first introduces the basics of information filtering: the concept of information filtering, its categories, how it differs from other information processing technologies, and some common information filtering models. Second, it describes text preprocessing in detail: Chinese word segmentation, stop-word removal, and feature selection. Third, it surveys automatic text classification algorithms, such as the Naive Bayes (NB) classifier, the K-Nearest Neighbor (KNN) algorithm, and the Support Vector Machine (SVM) classifier.
The thesis then introduces the Discriminative Naive Bayes (DNB) text classifier. It first presents the two common models of the Naive Bayes text classifier and the procedure of the two-category Naive Bayes classifier, then describes the advantages and the procedure of the discriminative Bayes classifier, and finally presents the DNB text classifier and applies it to text information filtering.
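
As a rough illustration of the two-category Naive Bayes procedure mentioned above, the following minimal sketch trains a standard multinomial Naive Bayes classifier with add-one smoothing. It is not the thesis's Discriminative Naive Bayes (DNB) classifier, whose details the abstract does not give. The stop-word list, the toy documents, and all names are hypothetical, and the documents are assumed to be already segmented into words.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "at", "of"}


def remove_stop_words(tokens):
    """Drop stop words from an already segmented document."""
    return [t for t in tokens if t not in STOP_WORDS]


class TwoClassNaiveBayes:
    """Standard multinomial Naive Bayes with add-one smoothing (not DNB)."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: parallel list of class names.
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(remove_stop_words(tokens))
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        self.total_docs = len(docs)
        return self

    def log_posterior(self, tokens, c):
        # log P(c) + sum over words of log P(w | c), up to a shared constant.
        total = sum(self.word_counts[c].values())
        score = math.log(self.doc_counts[c] / self.total_docs)
        for w in remove_stop_words(tokens):
            if w in self.vocab:  # words never seen in training are ignored
                count = self.word_counts[c][w]
                score += math.log((count + 1) / (total + len(self.vocab)))
        return score

    def predict(self, tokens):
        return max(self.classes, key=lambda c: self.log_posterior(tokens, c))


# Toy training set with the two categories used in the thesis.
train_docs = [
    ["free", "casino", "win"],
    ["casino", "bonus", "win"],
    ["meeting", "agenda", "notes"],
    ["project", "meeting", "schedule"],
]
train_labels = ["bad", "bad", "healthy", "healthy"]

model = TwoClassNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["win", "at", "the", "casino"]))
print(model.predict(["meeting", "schedule"]))
```

Working in log space avoids numeric underflow on long documents, and add-one smoothing keeps a word seen in only one category from driving the other category's probability to zero.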
In text filtering, a text classifier assigns a given document to one of two predefined categories: healthy or bad. This thesis proposes the Discriminative Naive Bayes (DNB) text classifier for this task.
Finally, the thesis proposes a multilevel model for filtering bad information that combines heuristic rules with a text classifier. It introduces the information filtering model based on heuristic rules and designs filtering rules for the different forms of bad information on the Web. After comparing the rule-based filtering model with the model based on text classification, it proposes a multilevel filtering model based on heuristic rules and the Discriminative Naive Bayes text classifier (RDNB). Given a test document, the model first performs rough filtering with the heuristic rules and then uses the Discriminative Naive Bayes classifier for precise filtering, i.e. to classify the document. The experimental results show that the model achieves higher precision and F1 measure.
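
The two-stage flow described above can be sketched as follows. This is an illustration under stated assumptions, not the RDNB model itself: the rules are hypothetical, the classifier is a stand-in callable where the thesis uses DNB, and the sketch assumes that the rough stage only flags candidate documents while the classifier makes the final decision on them. The abstract does not specify how the two stages interact. The helper at the end computes precision, recall, and F1 from their standard definitions.

```python
import re

# Hypothetical heuristic rules; the thesis designs its own rule set.
URL_PATTERN = re.compile(r"https?://\S+")
KEYWORD_PATTERN = re.compile(r"\bcasino\b", re.IGNORECASE)


def rough_filter(text):
    """Stage 1: return True if any heuristic rule flags the text."""
    too_many_links = len(URL_PATTERN.findall(text)) >= 3
    has_keyword = KEYWORD_PATTERN.search(text) is not None
    return too_many_links or has_keyword


def multilevel_filter(text, classify):
    """Stage 2 runs only on documents flagged by stage 1.

    `classify` is any callable mapping text to "healthy" or "bad".
    """
    if not rough_filter(text):
        return "healthy"
    return classify(text)


def precision_recall_f1(gold, predicted, positive="bad"):
    """Precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def toy_classifier(text):
    """Stand-in for a trained classifier (an assumption, not DNB)."""
    return "bad" if "win" in text.lower() else "healthy"


print(multilevel_filter("Weekly meeting notes", toy_classifier))
print(multilevel_filter("Win big at the casino tonight", toy_classifier))
```

Running the cheap rules first means the classifier only sees documents that already look suspicious, which is one plausible reading of "rough filtering" followed by "precise filtering".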
Keywords/Search Tags: Heuristic rules, Text classification, Discriminative Naive Bayes, Information filtering