Image Spam Filtering Technology Research

Posted on: 2010-11-26
Degree: Master
Type: Thesis
Country: China
Candidate: M C Wan
Full Text: PDF
GTID: 2208360275983196
Subject: Information and Communication Engineering
Abstract/Summary:
The continuing growth of spam poses a serious threat to people's online communications. To counter this threat and contain the spread of spam, researchers have proposed many effective detection methods. Currently, identifying spam by its text content features is the main approach and is used in most anti-spam systems. To circumvent text-based filters, spammers embed the spam message in an image and send it as an attachment, a technique called "image spam". Image spam defeats almost all anti-spam tools that classify messages by text content. This dissertation analyzes this new kind of spam in detail and discusses how to filter it.

First, this dissertation provides an overview of the state of the art in image spam detection research, including the difficulties of detecting image spam, the features that have been used to recognize it, the classification algorithms that have been proposed for filtering it, and the corresponding evaluation methods. Although most research in this field has so far focused on identifying image spam by the characteristics of the images, how to define the problem efficiently and how to characterize a spam image effectively remain unsolved. This dissertation therefore focuses on the central problem of defining and extracting effective features of image spam.

Second, because the text contained in images provides important information for image spam filtering, this dissertation proposes a new method for detecting the corners of text in images. The algorithm first extracts edges using the Color-Roberts algorithm and threshold segmentation, and then applies a circular template to obtain corner information. Edge extraction and threshold segmentation eliminate most of the noise in the image, and the circular template makes the algorithm insensitive to the orientation of the text. Experiments show that the new algorithm performs better than the SUSAN algorithm and also yields the angle of each corner at the same time. Using the corner information obtained by the proposed corner detector, this dissertation proposes an improved text region localization algorithm, Edge Classification based Text-Region Localization (ECTL). The basic idea of ECTL is to identify and discard non-text edges using selected edge features such as corner features. Experiments show that ECTL identifies 96% of the text contained in images, with a precision of up to 97.6%.

Third, this dissertation proposes two approaches for identifying image spam. The first classifies image spam using text region features, extracted with the ECTL algorithm, together with some properties of the image files; it identifies more than 98% of image spam. The second identifies image spam using color and corner features and does not require text region localization. In experiments on real benchmark datasets, the first method performs slightly better than the second, but its feature extraction takes longer: extracting text region features and image properties takes 400 ms per image, whereas extracting color and corner features takes 112 ms. Both algorithms have been implemented as functional modules and integrated into AONE, a prototype anti-spam system developed by the author's group for research on spam filtering techniques.
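
The corner detection pipeline described above (edge extraction, thresholding, then a circular template) can be illustrated with a minimal sketch. It is not the dissertation's implementation, and it rests on several assumptions: it applies the standard grayscale Roberts cross operator rather than the Color-Roberts variant, it scores each edge pixel with a SUSAN-style similarity count inside the circular template, and the function names, the parameters `radius`, `sim_tol` and `min_deviation`, and their default values are all hypothetical. The reported angle is only a rough estimate: the fraction of the template that matches the center pixel's intensity, scaled to 360 degrees.

```python
import math

def roberts_edges(gray, threshold):
    """Binary edge map: Roberts cross gradient magnitude, then a fixed threshold."""
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h - 1):
        for x in range(w - 1):
            g1 = gray[y][x] - gray[y + 1][x + 1]      # first diagonal difference
            g2 = gray[y][x + 1] - gray[y + 1][x]      # second diagonal difference
            if math.hypot(g1, g2) >= threshold:
                edges[y][x] = 1
    return edges

def circular_offsets(radius):
    """Offsets (dy, dx) of all pixels inside a disc, excluding the center."""
    r2 = radius * radius
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if 0 < dy * dy + dx * dx <= r2]

def corner_candidates(gray, edges, radius=3, sim_tol=20, min_deviation=0.3):
    """Return (y, x, angle_estimate) for edge pixels that look like corners.

    Assumption: on a straight edge roughly half of the template matches the
    center pixel, so a fraction far from 0.5 suggests a corner.
    """
    h, w = len(gray), len(gray[0])
    offsets = circular_offsets(radius)
    found = []
    # Only visit pixels where the whole template fits inside the image.
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            if not edges[y][x]:
                continue  # non-edge pixels are skipped, which suppresses noise
            similar = sum(
                1 for dy, dx in offsets
                if abs(gray[y + dy][x + dx] - gray[y][x]) <= sim_tol
            )
            fraction = similar / len(offsets)
            if abs(fraction - 0.5) > min_deviation:
                found.append((y, x, 360.0 * fraction))
    return found
```

Because the template is a disc, the similarity count does not change when the image content is rotated about the center pixel (up to discretization), which is one plausible reading of why a circular template makes corner detection insensitive to text orientation. Restricting the template to edge pixels also keeps the cost proportional to the number of edge pixels rather than the image size.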
Keywords/Search Tags:image spam, spam detection, image feature, corner detection, text region localization