Machine learning for image spam detection: From server to client solution

Posted on: 2011-08-09
Degree: Ph.D.
Type: Dissertation
University: Northwestern University
Candidate: Gao, Yan
Full Text: PDF
GTID: 1448390002452230
Subject: Engineering
Abstract/Summary:
Spam has become a public hazard for email users around the world. While spammers earn significant amounts of money by sending spam emails en masse, globally they cause substantial economic loss to both individual and enterprise users by wasting valuable network resources. Although spam filtering technologies have advanced significantly, malicious spammers are constantly creating sophisticated new weapons in their arms race with anti-spam technologies, the latest of which is image spam.

Image spam is a type of email spam that embeds text content into graphical images to bypass traditional spam filters based on statistics of text characters. While ensuring that the embedded text remains readable, image spammers leverage a set of image processing technologies to vary the visual content of individual messages, e.g., by changing foreground colors, backgrounds, and font types, or even by rotating the images and adding artifacts to them. Image spam thus poses great challenges to conventional spam filters, since filtering it requires partly solving visual recognition problems, which are in general difficult to address.

To effectively detect spam images, it is desirable to apply image content analysis technologies to identify them on both the server side and the client side. Because of the fundamentally adversarial behavior of image spammers, we extensively employ various machine learning technologies, ranging from unsupervised cluster analysis and semi-supervised or supervised classification to more interactive active learning algorithms, to effectively analyze the statistics of visual features. We are thus able to achieve a comprehensive solution for spam filtering that meets different kinds of system and usage requirements.
Compared to previous work, which mostly filters spam images on the client side, we present a more desirable comprehensive solution that embraces both server-side filtering and client-side detection to effectively mitigate image spam.

On the server side, depending on how much human labor we may expend to collect labeled data, we design and investigate several different image spam filtering systems. In particular, when no manual labeling effort is available, we propose a nonnegative sparsity-induced similarity metric for cluster analysis of spam images. When only a limited amount of labeled data is available, we propose a spam filtering system based on a novel semi-supervised algorithm, namely regularized discriminant EM (RDEM), which effectively utilizes the scarce labeled image data and the manifold structure of the unlabeled data for classification analysis. Last but not least, when we have accumulated enough labeled data, we can further leverage supervised machine learning algorithms such as the probabilistic boosting tree (PBT) to build a fully automated classifier for identifying spam images.

On the client side, we employ the principle of active learning, in which the learning machinery guides the users to label as few images as possible while maximizing the classification accuracy. We systematically study two active learning algorithms, based on an SVM and a Gaussian process classifier, respectively. The semi-supervised algorithm RDEM and the supervised algorithm PBT can also be applied on the client side when more labeled data, or a large amount of it, can be collected.

The server-side filtering identifies suspicious spam sources, and further analysis can be performed to identify the real sources and block them from the beginning. For those spam images that survive the server-side filtering, our active learner on the client side further guides the users to filter them out interactively and efficiently.
Our experiments on an image spam data set collected from the email server of our department demonstrate the efficacy of the proposed comprehensive solution.
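
The client-side idea described above, a learner that chooses which images the user should label, can be illustrated with a minimal pool-based sketch. This is not the dissertation's method: the abstract names an SVM and a Gaussian process classifier, whereas the sketch below swaps in a nearest-centroid classifier with margin-based uncertainty sampling so that it runs with the standard library alone. The two numeric features per image, the toy data, and every function name are assumptions made for illustration only.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def fit(pool, labeled):
    """Return (spam_centroid, ham_centroid) computed from the labeled subset."""
    spam = [pool[i] for i, y in labeled.items() if y == 1]
    ham = [pool[i] for i, y in labeled.items() if y == 0]
    return centroid(spam), centroid(ham)


def score(x, model):
    """Signed score: positive means x is closer to the spam centroid."""
    c_spam, c_ham = model
    return math.dist(x, c_ham) - math.dist(x, c_spam)


def active_learn(pool, ask_user, seed_indices, budget):
    """Query the user for `budget` extra labels, most uncertain image first.

    seed_indices must contain at least one spam and one non-spam image,
    otherwise one of the centroids cannot be computed.
    """
    labeled = {i: ask_user(i) for i in seed_indices}
    for _ in range(budget):
        model = fit(pool, labeled)
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # The image whose score is closest to zero is the least certain one.
        query = min(unlabeled, key=lambda i: abs(score(pool[i], model)))
        labeled[query] = ask_user(query)
    return fit(pool, labeled), labeled


def make_toy_pool(n_per_class=50, seed=0):
    """Hypothetical 2-D features: non-spam in [0,1]^2, spam in [3,4]^2."""
    rng = random.Random(seed)
    ham = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(n_per_class)]
    spam = [[rng.uniform(3, 4), rng.uniform(3, 4)] for _ in range(n_per_class)]
    return ham + spam, [0] * n_per_class + [1] * n_per_class


pool, truth = make_toy_pool()
# The lambda stands in for the user; a real client would show the image.
model, labeled = active_learn(
    pool, lambda i: truth[i], seed_indices=[0, len(pool) - 1], budget=8
)
predictions = [1 if score(x, model) > 0 else 0 for x in pool]
accuracy = sum(p == t for p, t in zip(predictions, truth)) / len(truth)
print(f"labels requested: {len(labeled)}, accuracy: {accuracy:.2f}")
```

The toy classes are deliberately well separated, so the sketch only shows the shape of the query loop; it says nothing about how the dissertation's algorithms perform on real image spam.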
Keywords/Search Tags: Spam, Server, Client, Machine learning, Solution, Data, Email, Users