A Comparative Study Of Chinese Character Feature And Word Feature In SMS SPAM Filtering

Posted on: 2012-02-24  Degree: Master  Type: Thesis
Country: China  Candidate: O P Feng  Full Text: PDF
GTID: 2178330335460292  Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Over the past several years, the short message service (SMS) has grown rapidly in the telecommunication markets of many countries. With the huge volume of short messages, however, SMS spam has become increasingly widespread and disrupts the work and daily lives of more and more people.
Word features are numerous and high-dimensional, so they must undergo feature selection; otherwise the filtering performance for SMS spam is seriously degraded. In Chinese text, Chinese word segmentation is an indispensable step in extracting word features. The drawbacks of Chinese word segmentation are its algorithmic complexity and its heavy demand for computation and computational resources. These drawbacks mean that using word features requires powerful computing equipment and long processing time, two conditions that are rarely met in practical SMS spam filtering applications.
To address these drawbacks of word features, this experiment applies character features to Chinese SMS spam filtering. The dimension of the character feature space is bounded, and extracting character features does not require Chinese word segmentation. These advantages can effectively save computational resources, reduce the amount of calculation, shorten processing time, and resolve the practical problems of using word features in SMS spam filtering.
This experiment uses four text classifiers, three feature selection methods and five feature space dimensions. The classifiers are the Bernoulli Bayes classifier, the Multinomial Bayes classifier, the Radial Basis Function Support Vector Machine classifier and the C4.5 Decision Tree classifier. The feature selection methods are Odds Ratio, Information Gain and Mutual Information. The dimensions are 500, 1000, 1500, 2000 and the whole (unreduced) dimension. By combining different classifiers, feature selection methods and dimensions, this experiment constructs 52 filtering conditions in total.
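The abstract does not spell out how the 52 conditions arise. One reading that matches the count, offered here only as an assumption, is that each classifier is paired with every feature selection method at the four reduced dimensions, plus a single run on the whole dimension where no selection is needed. The sketch below enumerates the grid under that assumption; all variable and function names are illustrative.

```python
from itertools import product

# Design factors as listed in the abstract.
classifiers = ["Bernoulli Bayes", "Multinomial Bayes", "RBF SVM", "C4.5 Decision Tree"]
selectors = ["Odds Ratio", "Information Gain", "Mutual Information"]
reduced_dims = [500, 1000, 1500, 2000]


def build_conditions():
    """Enumerate (classifier, selector, dimension) triples.

    Assumption (not stated in the abstract): the whole dimension is used
    once per classifier, without any feature selection method.
    """
    conditions = []
    for clf in classifiers:
        for sel, dim in product(selectors, reduced_dims):
            conditions.append((clf, sel, dim))
        conditions.append((clf, None, "whole"))
    return conditions


conditions = build_conditions()
# Each condition is run once per feature type.
runs = [(feature, cond) for feature in ("character", "word") for cond in conditions]

print(len(conditions))  # 4 * (3 * 4 + 1) = 52
print(len(runs))        # 52 * 2 = 104
```

Under this reading, each classifier contributes 13 conditions, and running every condition once per feature type gives 104 runs.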
Each condition is run separately with Chinese character features and with word features, giving 104 filtering results. The results show that with the C4.5 Decision Tree classifier, with the Bernoulli Bayes classifier at low dimensions, and in feature vector spaces selected by Mutual Information, Chinese character features filter better than word features; under the other conditions, Chinese character features perform slightly worse than word features.
The analysis of the results indicates that, when computing equipment and processing time are limited, using Chinese character features with the C4.5 Decision Tree or Bernoulli Bayes classifier in SMS spam filtering can achieve fairly good filtering performance.
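To illustrate why character features avoid word segmentation, here is a minimal sketch of character feature extraction. It assumes that each distinct Chinese character is one feature and that non-Chinese characters (digits, punctuation, Latin letters) are dropped; the abstract does not describe the thesis's actual preprocessing. The sample messages and function names are made up for illustration.

```python
from collections import Counter


def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def char_features(message):
    """Count Chinese characters in a message; no segmenter is needed."""
    return Counter(ch for ch in message if is_cjk(ch))


def bernoulli_vector(message, vocabulary):
    """Binary presence vector, the input form a Bernoulli model expects."""
    present = set(char_features(message))
    return [1 if ch in present else 0 for ch in vocabulary]


def multinomial_vector(message, vocabulary):
    """Count vector, the input form a multinomial model expects."""
    counts = char_features(message)
    return [counts[ch] for ch in vocabulary]


# Hypothetical sample messages, not taken from the thesis data.
messages = ["恭喜您中奖了，请回复领奖", "今晚一起吃饭吗"]

# The vocabulary is the set of characters seen in the corpus. Its size is
# bounded by the number of Chinese characters in use.
vocabulary = sorted({ch for m in messages for ch in char_features(m)})

for m in messages:
    print(bernoulli_vector(m, vocabulary))
```

A word-feature pipeline would need a segmentation step before counting, which is the extra cost the abstract refers to.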
Keywords/Search Tags: Chinese character feature, word feature, SMS spam filtering, text categorization