An Implementation Of An Improved TF-IDF Algorithm And Its Application In Junk Email Identification

Posted on: 2013-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: X Z Song
Full Text: PDF
GTID: 2248330371985158
Subject: Software engineering
Abstract/Summary:
As Internet technology has carried us into the information age of the 21st century, producing and disseminating information has become unprecedentedly convenient. However, Internet technology is a double-edged sword: this same convenience has also led to the proliferation of spam. Identifying spam and filtering it out of the mass of incoming information is increasingly one of the hot issues in the computer field. E-mail, one of the most important services on the Internet, is under constant interference from spam. A practical method for identifying and separating spam is therefore needed to protect normal communication and work.

This thesis proposes a spam identification strategy based on an improved TF-IDF (term frequency-inverse document frequency) algorithm. The strategy improves on the TF-IDF algorithm that is widely used in search engines. It addresses problems such as the algorithm's incomplete selection of spam feature words and the insufficient discriminating power of those feature words, and it takes into account the distribution of features across categories as well as content and position weights.

The main improvements in this thesis are the following:
(1) An information entropy coefficient is introduced to correct the feature weights of the TF-IDF algorithm.
(2) Because the traditional TF-IDF algorithm does not adequately consider content and position weights, position and content weight values are introduced as a correction in the IDF calculation.
(3) The concept of an independence coefficient is introduced as a correlation parameter to measure how a feature term relates to each category.
(4) Because spam identification is a binary classification problem, the corresponding parameters of the IDF value are simplified.

A comparison on corpus data indicates that the improved TF-IDF algorithm is greatly improved over the traditional TF-IDF algorithm in recall rate, error rate, and F1 value.

Further, we introduce support vector machine theory from machine learning and combine it with the improved TF-IDF algorithm to establish a classification model for identifying spam. The model consists of three main modules: a training module, a test module, and a statistics module. Through word segmentation of the message text, extraction and screening of feature terms, data-model conversion, and similarity computation, these modules respectively train the system, determine the classification of unknown e-mail, and compile mail statistics. The system was tested on the test messages in the corpus. The experiment shows that the Chinese spam identification system we implemented can identify and isolate most spam. Compared with spam identification systems based on the traditional TF-IDF algorithm and with the one Tencent has used, it shows a significant improvement and basically achieves the separation and filtering of users' spam, protecting users' normal communication and work needs.
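The entropy-based weight correction in item (1) can be illustrated with a small example. The abstract does not give the formulas, so the following is only a minimal Python sketch under stated assumptions: the baseline is textbook TF-IDF (relative term frequency times natural-log IDF), and the entropy coefficient is a hypothetical stand-in, defined here as one minus the normalized Shannon entropy of a term's document counts across classes. The function names `tf_idf`, `entropy_coefficient`, and `corrected_tf_idf` are illustrative and are not taken from the thesis.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Textbook TF-IDF. `docs` is a list of token lists.

    Returns one {term: weight} dict per document, with
    weight = (count / doc_length) * ln(n_docs / doc_frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


def entropy_coefficient(term, docs, labels):
    """ASSUMED correction factor in [0, 1], not the thesis's formula.

    1.0 when the term occurs in documents of a single class only,
    0.0 when its document counts are spread evenly over all classes.
    """
    n_classes = len(set(labels))
    # Count, per class, the documents that contain the term.
    per_class = Counter(label for doc, label in zip(docs, labels) if term in doc)
    total = sum(per_class.values())
    if total == 0 or n_classes < 2:
        return 1.0
    entropy = -sum((n / total) * math.log2(n / total) for n in per_class.values())
    return 1.0 - entropy / math.log2(n_classes)


def corrected_tf_idf(docs, labels):
    """Scale each baseline TF-IDF weight by the term's entropy coefficient."""
    base = tf_idf(docs)
    coeff = {t: entropy_coefficient(t, docs, labels) for doc in docs for t in doc}
    return [{t: w * coeff[t] for t, w in doc_w.items()} for doc_w in base]


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    docs = [
        ["free", "prize", "click"],
        ["free", "click", "now"],
        ["meeting", "now"],
        ["project", "meeting"],
    ]
    labels = ["spam", "spam", "ham", "ham"]
    for label, doc_weights in zip(labels, corrected_tf_idf(docs, labels)):
        print(label, doc_weights)
```

In this toy corpus, "free" appears only in spam messages and keeps its full baseline weight, while "now" appears once in each class and is scaled to zero. Counting documents rather than normalizing by class size is a simplification that only makes sense when the classes are roughly balanced.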
Keywords/Search Tags: E-mail, TF-IDF Algorithm, Information Entropy, Spam Identification System