
Research On Key Technologies Of Image Spam Filtering

Posted on: 2014-05-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P Li
Full Text: PDF
GTID: 1268330392472591
Subject: Computer system architecture
Abstract/Summary:
Email has become an indispensable communication tool in daily life. However, in recent years it has also become a convenient channel for people with ulterior motives to send advertising, pornographic material, malicious fraud, and reactionary ideology and rhetoric. Text-based filters have grown in sophistication and effectiveness at filtering spam email. In response, since 2006 spammers have adopted a number of countermeasures to circumvent these text-based filters. Currently, one of the most popular spam construction techniques is to embed the text message in an image, often with deformed characters and various kinds of noise to further defeat the filters, which poses a new challenge for spam researchers. Image spam is harder to detect, consumes more network bandwidth, computing, and storage resources, and poses a greater security risk to the community, so effective filtering is urgently needed. To prevent the further proliferation of image spam, we study the key issues according to the different characteristics of spam images and the requirements of practical applications.

An analysis of how image spam is generated and sent shows that spam images are usually sent in batches. Spam images from the same source are often generated from the same template and therefore commonly share similar structure and regions. Based on this observation, this paper analyzes the main problems in near-duplicate image detection (NDII) and proposes a novel scheme that combines the neighborhood information of a single local feature with the global geometric consistency of multiple local features to improve the accuracy of near-duplicate image detection.
First, we construct geometric contextual information for local image features to enhance the distinctiveness of visual words. Then, we verify the global geometric consistency of subsets of features to further improve the accuracy of the retrieval results. Experimental results show that the proposed method markedly improves the accuracy of NDII, which benefits image spam filtering based on sample images.

One of the most important characteristics of spam images is that they often contain large amounts of text. Therefore, as in text-based spam filtering, we can check whether an email image contains certain sensitive keywords. This paper proposes a new approach to image keyword spotting using visual phrases of character primitives. First, maximally stable extremal regions are extracted from a given image and normalized to serve as character primitives. The primitives of the same keyword usually fall within the same phrase. Then, we measure similarity using element similarity and geometric structure consistency. This method requires no image binarization, layout analysis, or text area localization, and it is more flexible and robust.

In addition, this paper proposes a method based on geometric blur descriptors for image keyword spotting in cluttered scenes. It reduces the impact of noise interference by blurring the image with variable Gaussian kernels. First, we obtain the initial correspondences of local feature points using geometric blur and filter out mismatches by layout analysis. Because Chinese characters often share the same radicals, we use the ratio of the area of the unmatched feature points in the sample image to the area of the whole image to further improve matching accuracy. Experimental results show that our method can recognize and spot keyword images with high accuracy.
It also resists the kinds of noise used in spam images.

Spam images are diverse, and different kinds of spam images have different types of features. Furthermore, false positives cause greater losses for email users, whereas false negatives can be tolerated to some extent in practice. Therefore, this paper proposes to describe spam images with both local and global features and to use a cascade of classifiers for hierarchical filtering of different types of spam images. To avoid false positives, we use classification entropy to decide whether an image should be judged again or treated as a normal image. Experimental results show that this approach both keeps the false positive rate of the filter as low as possible and improves its accuracy.

Spam images commonly contain many background noise components intended to defeat spam filters. The presence of background noise can therefore be taken as an indication that an email is spam. Based on this, this paper proposes to obtain a noise feature image using the wavelet transform and then presents a method for noise measurement and classification by connected component analysis of the noise feature image. This technique is intended to be used as a specific module of a spam filter, whose output indicates the “amount” and “type” of noise in email images. Because noise can also be present in legitimate images, the results of noise analysis cannot establish with certainty that an email is spam, but they can be taken as an indication of tricks introduced to defeat OCR tools.
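
To make the noise analysis step more concrete, the following is a minimal Python sketch, not the dissertation's implementation. It assumes a one-level Haar wavelet, a fixed threshold on the diagonal detail coefficients, and 4-connected components. The function names, the threshold of 50.0, and the use of component size as a rough proxy for noise "type" are all hypothetical choices made for illustration.

```python
from collections import deque

def haar_diagonal_detail(img):
    """One-level 2D Haar transform on a grayscale image (list of rows).
    Returns the magnitude of the diagonal detail coefficient per 2x2 block."""
    rows, cols = len(img) // 2, len(img[0]) // 2
    detail = []
    for i in range(rows):
        row = []
        for j in range(cols):
            a, b = img[2 * i][2 * j], img[2 * i][2 * j + 1]
            c, d = img[2 * i + 1][2 * j], img[2 * i + 1][2 * j + 1]
            row.append(abs(a - b - c + d) / 2.0)
        detail.append(row)
    return detail

def connected_component_sizes(mask):
    """Sizes of the 4-connected components of True cells in a boolean grid."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes

def noise_summary(img, detail_threshold=50.0, small_component=2):
    """Hypothetical summary: 'amount' is the fraction of high-detail cells;
    component counts by size stand in for a coarse noise 'type'."""
    detail = haar_diagonal_detail(img)
    mask = [[v > detail_threshold for v in row] for row in detail]
    sizes = connected_component_sizes(mask)
    total_cells = len(mask) * len(mask[0])
    return {
        "amount": sum(sizes) / total_cells,
        "small_components": sum(1 for s in sizes if s <= small_component),
        "large_components": sum(1 for s in sizes if s > small_component),
    }
```

In a filter, such a summary would be one signal among several, since, as the abstract notes, legitimate images can also contain noise.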
Keywords/Search Tags: image spam, spam image, near-duplicate image detection, sensitive image keyword spotting, hierarchical filtering, noise