
BBS Spam Filtering Model Based on Word Co-occurrence

Posted on: 2010-07-05
Degree: Master
Type: Thesis
Country: China
Candidate: C Chen
Full Text: PDF
GTID: 2198360332957852
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet technology and the spread of network application services, BBS (Bulletin Board System) sites have given general Internet users a space for free expression and have accumulated a large amount of information. An effective BBS search engine helps people draw more knowledge and information from these resources. However, the mass of BBS information contains a great deal of spam, such as advertising, which bothers most users. Spam can be excluded with hand-built rules or by manual review, but posts are often deliberate or arbitrary and manual operation is too costly, so it is difficult to guarantee that all information in a BBS system is well-formed and meaningful. The BBS information indexed by common search engines therefore contains a lot of spam, which degrades the search results.

To overcome these problems, this thesis proposes a vector space model based on word co-occurrence for filtering text within a search engine framework. The main research topics are:
(1) Construction of a BBS information search engine comprising information collection, BBS web page processing, indexing, and retrieval modules.
(2) A method that uses a vector space model based on word co-occurrence frequency to compute the correlation between the feature vectors of a text's title and its content.
(3) An experimental comparison of the word co-occurrence vector space model with a model based on HowNet semantic similarity.

On the same training set and test set, the text relevance results of the proposed model are better than those of the HowNet-based semantic similarity calculation when no semantic analysis is performed; when semantic analysis is performed, the HowNet-based semantic similarity calculation is slightly better than the proposed model.

The proposed model has little overhead and is self-learning. It can be used in information retrieval, information filtering, and natural language processing research, and has broad application prospects.
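
The abstract does not give the model's formulas, so the following is only a minimal Python sketch of the general idea of scoring title and content relevance with word co-occurrence vectors. The function names (build_cooccurrence, expand, cosine, relevance), the document-level co-occurrence window, the way co-occurrence counts are added to the term vector, and the toy pre-tokenized data are all assumptions made for illustration, not the thesis's actual method. Real BBS text would first need word segmentation, which is omitted here.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(docs):
    """Count how often two distinct words appear in the same training document.

    Assumption: co-occurrence is measured at document level.
    """
    cooc = defaultdict(Counter)
    for tokens in docs:
        for a, b in combinations(sorted(set(tokens)), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def expand(tokens, cooc):
    """Build a vector from a token list: the tokens' own counts plus the
    co-occurrence counts of every word seen together with them in training."""
    vec = Counter(tokens)
    for t in tokens:
        for other, n in cooc.get(t, {}).items():
            vec[other] += n
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def relevance(title_tokens, body_tokens, cooc):
    """Score how well a post's body matches its title (0.0 means unrelated)."""
    return cosine(expand(title_tokens, cooc), expand(body_tokens, cooc))


# Hypothetical, already-tokenized training posts.
training_docs = [
    ["laptop", "battery", "charger"],
    ["laptop", "screen", "battery"],
    ["cheap", "pills", "discount"],
]
cooc = build_cooccurrence(training_docs)

title = ["laptop"]
on_topic_score = relevance(title, ["battery", "charger"], cooc)
off_topic_score = relevance(title, ["cheap", "pills"], cooc)
print(on_topic_score > off_topic_score)
```

In this sketch the on-topic body shares no word with the title, yet it still scores above the off-topic body because its words co-occurred with the title word in training. A filter could flag posts whose score falls below a threshold tuned on labeled data; the threshold itself would be another assumption.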
Keywords/Search Tags: word co-occurrence, vector space model, spam, information filtering, information retrieval, BBS