
Research On Chinese Spam Filtering Based On Semantic Body And Text Clustering

Posted on: 2013-12-27 | Degree: Master | Type: Thesis
Country: China | Candidate: P Wang | Full Text: PDF
GTID: 2248330374955605 | Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, e-mail has become an indispensable means of communication in daily life. At the same time, spam, an unwanted by-product of e-mail, has also grown rapidly. In recent years especially, the rapid development of Chinese e-commerce and the mobile Internet has increased the volume of text-based enterprise mail and mobile mail. In addition, Chinese is more complex than English, so spam filtering techniques developed abroad are not directly applicable to Chinese spam.

At present, content-based anti-spam technology mainly relies on keyword-based, semantically unaware filtering methods, such as Bayesian, case-based, and text classification methods, which lack a precise quantitative description of the idea expressed by the e-mail content. New-type spam, however, disguises itself as normal mail by using synonyms and near-synonyms, so traditional methods can hardly distinguish it from normal mail. Therefore, building on semantic similarity and the advantages of HowNet-based semantic analysis, this thesis proposes a Chinese spam filtering method based on the semantic body and text clustering. The thesis mainly studies two aspects:

1. Feature extraction from new-type spam that uses synonyms and near-synonyms. After word segmentation and stop-word removal, each remaining word is assigned a single meaning through word sense disambiguation, which benefits e-mail feature extraction. Each word in the word set initially forms a lexical chain, and chains with the same or similar meaning are merged into one. Only one word per chain is then selected, using TF-IDF, as the representative of that chain. Finally, a specified number of words is extracted as the features of the e-mail, and this feature set is named the Semantic Body. The simulation results show that this feature extraction method yields more accurate results.

2. Spam filtering with a text clustering algorithm based on semantic distance, applied once the semantic body has been obtained. First, the mail collection is clustered using HowNet-based similarity calculation. Then, to avoid the influence of the input order of the e-mails on the result, the first clustering result is clustered a second time, which makes the clusters more accurate. Finally, the clustering results are used to filter spam.

The experiments show that this method filters new-type spam containing synonyms or near-synonyms effectively. Compared with traditional spam filtering methods, the proposed method judges message content more objectively. It has a considerably greater advantage in recall when the way the spam expresses its meaning is not known in advance.
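The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the exact algorithms, so the toy synonym table standing in for HowNet similarity, the smoothed TF-IDF formula, the single-pass first clustering, the reassignment rule in the second pass, the thresholds, and all function names (word_sim, semantic_body, body_sim, two_pass_cluster) are assumptions made for illustration. Word segmentation and word sense disambiguation are assumed to have been done already.

```python
from collections import Counter
from math import log

# Assumption: a toy synonym table stands in for HowNet-based word similarity.
SYNONYMS = [{"cheap", "discount", "bargain"}, {"invoice", "receipt"}]

def word_sim(a, b):
    """Return 1.0 for identical or listed-synonym words, else 0.0."""
    if a == b:
        return 1.0
    return 1.0 if any(a in g and b in g for g in SYNONYMS) else 0.0

def semantic_body(tokens, doc_freq, n_docs, k=3, threshold=0.8):
    """Merge similar words into lexical chains and keep one word per chain."""
    tf = Counter(tokens)
    # Smoothed TF-IDF (an assumed variant; the abstract only says "TFIDF").
    tfidf = {w: tf[w] * (log((1 + n_docs) / (1 + doc_freq.get(w, 0))) + 1)
             for w in tf}
    chains = []
    for w in sorted(tf):  # sorted so the result is deterministic
        for chain in chains:
            if word_sim(w, chain[0]) >= threshold:
                chain.append(w)  # same or similar meaning: merge into chain
                break
        else:
            chains.append([w])  # otherwise the word starts its own chain
    # The highest-scoring word represents each chain; keep the top k.
    reps = [max(chain, key=lambda w: tfidf[w]) for chain in chains]
    return sorted(reps, key=lambda w: -tfidf[w])[:k]

def body_sim(a, b):
    """Symmetric average of best word-to-word matches between two bodies."""
    if not a or not b:
        return 0.0
    ab = sum(max(word_sim(x, y) for y in b) for x in a) / len(a)
    ba = sum(max(word_sim(y, x) for x in a) for y in b) / len(b)
    return (ab + ba) / 2

def two_pass_cluster(bodies, threshold=0.5):
    """Return one cluster label per mail after two clustering passes."""
    # Pass 1: single-pass clustering against each cluster's first member.
    # The outcome depends on the order in which mails arrive.
    clusters = []
    for i, body in enumerate(bodies):
        best = max(clusters, key=lambda c: body_sim(body, bodies[c[0]]),
                   default=None)
        if best is not None and body_sim(body, bodies[best[0]]) >= threshold:
            best.append(i)
        else:
            clusters.append([i])
    labels = [0] * len(bodies)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k

    # Pass 2: reassign each mail to the cluster whose other members are most
    # similar on average, which reduces the dependence on input order.
    def avg_sim(i, members):
        others = [j for j in members if j != i]
        if not others:
            return 0.0
        return sum(body_sim(bodies[i], bodies[j]) for j in others) / len(others)

    result = []
    for i in range(len(bodies)):
        scores = [avg_sim(i, members) for members in clusters]
        best_k = max(range(len(clusters)), key=scores.__getitem__)
        result.append(best_k if scores[best_k] >= threshold else labels[i])
    return result
```

In this sketch, two mails that use different words for the same idea (for example "cheap invoice" and "discount receipt") end up in the same cluster because similarity is measured on meaning rather than on exact keywords. How a cluster is finally judged to be spam is not stated in the abstract, so that step is left out.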
Keywords/Search Tags: Semantic body, word similarity, semantic distance, text clustering, HowNet, Chinese spam