
Text Clustering With Noise And Application In Anti-spam

Posted on: 2013-02-06; Degree: Master; Type: Thesis
Country: China; Candidate: X Zhou; Full Text: PDF
GTID: 2248330371981005; Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, the volume of text data is growing exponentially. Text mining technology emerged in response, as a way to uncover the intrinsic relationships and implied information in such data. Cluster analysis plays a very important role in text mining and is a core function of data mining; this paper discusses text clustering in the presence of interference information (noise). Traditional text mining methods first represent text in a vector space model, then convert documents to vector form using TF-IDF weights, and finally compute text similarity in the vector space. The traditional vector space model does not consider the conceptual similarity between words, which reduces the accuracy of clustering. To solve this problem, a similarity measure for Chinese text based on the HowNet model and a semantic inner product is proposed.

However, this method is not suited to the spam problem. To evade mail filters, spam senders edit their messages by replacing sensitive keywords with other words, inserting symbols, reordering words, or rewriting words phonetically, while keeping the text understandable to human readers. Traditional methods apply a series of preprocessing steps that filter out this interference information, which lowers the accuracy of the similarity measure and ultimately leads to poor clustering quality. In this paper, a method based on the Needleman-Wunsch algorithm is proposed to measure the similarity among spam messages, whose texts usually contain a lot of noise. Based on this similarity measure, an efficient clustering algorithm is devised.
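As an illustration of the similarity measure named above, the following is a minimal sketch of a Needleman-Wunsch global alignment score over two sequences. The scoring values (match = 1, mismatch = -1, gap = -1) and the normalization to the range [0, 1] in `nw_similarity` are assumptions made for this example; the abstract does not state the parameters or normalization used in the thesis.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score of sequences a and b.

    The scoring parameters are illustrative assumptions, not values
    taken from the thesis.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] with an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


def nw_similarity(a, b):
    """Map the alignment score to [0, 1] (hypothetical normalization).

    With match=1, mismatch=-1, gap=-1 the score always lies between
    -max(len(a), len(b)) and +max(len(a), len(b)).
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return (nw_score(a, b) + longest) / (2 * longest)


if __name__ == "__main__":
    # Inserted symbols cost only gap penalties, so the noisy variant
    # still scores close to the clean text.
    print(nw_score("free money", "fr*ee mon-ey"))
    print(nw_similarity("free money", "fr*ee mon-ey"))
```

Because inserted symbols are absorbed as gaps rather than removed by preprocessing, an obfuscated message stays close to its clean form, which is the property the abstract relies on.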
Text clustering is then carried out with this algorithm. Compared with the vector space model, computing text similarity with the Needleman-Wunsch algorithm avoids word segmentation, reduces semantic loss, and retains all of the text information, which safeguards the quality of the clustering. Preprocessing the document content into Chinese characters, English strings, and symbol strings alleviates the data sparseness problem and reduces the number of character comparisons, which speeds up processing. In simulations against a traditional clustering algorithm, both clustering quality and efficiency are greatly improved. This shows that the proposed clustering algorithm is suitable for spam clustering and provides a valid e-mail spam filtering technique. Specifically, spam and legitimate e-mail are clustered using the method proposed in the paper: messages are grouped into different categories according to their document similarity values, and the spam and legitimate mail are then identified.
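The preprocessing and clustering steps described above might be sketched as follows. Everything here is an assumption made for illustration: the regular expression (which treats digits as part of symbol strings), the greedy single-pass clustering, the 0.6 threshold, and the names `tokenize`, `token_similarity`, and `cluster`. To keep the block short, the similarity function is a stand-in built on Python's `difflib.SequenceMatcher`, not the Needleman-Wunsch measure of the thesis; any pairwise similarity in [0, 1] can be passed in instead.

```python
import re
from difflib import SequenceMatcher

# One token per Chinese character, per run of English letters, or per run
# of other non-space characters (symbols and digits). Assumed pattern.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+|[^\u4e00-\u9fffA-Za-z\s]+")


def tokenize(text):
    """Split text into Chinese characters, English strings, symbol strings."""
    return TOKEN_RE.findall(text)


def token_similarity(a, b):
    """Stand-in similarity over token lists (NOT Needleman-Wunsch)."""
    return SequenceMatcher(None, a, b).ratio()


def cluster(docs, similarity, threshold=0.6):
    """Greedy clustering: join the first cluster whose representative
    (its first member) is similar enough, otherwise start a new cluster.

    Returns a list of clusters, each a list of indices into docs.
    """
    clusters = []
    for i, doc in enumerate(docs):
        for members in clusters:
            if similarity(docs[members[0]], doc) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


if __name__ == "__main__":
    mails = ["免费发票 free!!!", "免*费发-票 free!!", "Meeting at 10am tomorrow"]
    tokens = [tokenize(m) for m in mails]
    print(tokens[1])
    print(cluster(tokens, token_similarity))
```

Comparing token lists instead of raw characters means an English word or a run of symbols counts as a single comparison, which is one way to read the abstract's claim that preprocessing reduces the number of comparisons.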
Keywords/Search Tags: text similarity, text clustering, Needleman-Wunsch algorithm, non-metric method, spam