A sublexical unit based hash model approach for spam detection

Posted on: 2010-06-11
Degree: Ph.D.
Type: Dissertation
University: The University of Texas at San Antonio
Candidate: Zhang, Like
Full Text: PDF
GTID: 1448390002489500
Subject: Computer Science
Abstract/Summary:
This research introduces an original anomaly detection approach for application-level content, based on a sublexical unit hash model. The approach improves on earlier methods that rely on arbitrarily defined payload keywords or 1-gram frequency analysis. Drawing on the split fovea theory of human recognition, it uses a special hash function to identify groups of neighboring words. The frequency distribution of the resulting hash values is then calculated to build a profile for a specific content type. The dissertation illustrates the algorithm by applying it to the detection of spam and phishing emails. It begins with a brief review of network intrusion and anomaly detection, followed by a discussion of recent research on application-level anomaly detection and of earlier results for payload keyword and byte frequency based anomaly detection. The drawback of N-gram analysis, which most related research has applied, is discussed at the end of Chapter 2. The importance of text content analysis to application-level anomaly detection is also explained. After background on the split fovea theory from psychological research, the proposed method based on sublexical unit hash frequency distributions is presented. The dissertation examines how human recognition theory serves as the foundation of the proposed hashing algorithm and then demonstrates how that algorithm is applied to anomaly detection, with spam email as the main example. Spam and phishing emails were chosen for the experiments because detailed experimental data are available and the test data can be analyzed in depth. The proposed algorithm is also compared with several popular commercial spam filters used by Google and Yahoo, and the outcome shows the benefits of the proposed approach.
The last chapter reviews the research and explains how the earlier payload keyword approach evolved into the hash model solution. It also discusses the possibility of extending hash model based anomaly detection to other areas, including Unicode applications.
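
The general idea of a hash frequency profile can be sketched as follows. This is a minimal illustration only, not the dissertation's algorithm: the abstract does not specify the hash function, so CRC32 stands in for it, and the window size, the bucket count, the L1 distance, and all function names (`word_groups`, `hash_profile`, `profile_distance`) are assumptions made for the example.

```python
import re
import zlib
from collections import Counter

NUM_BUCKETS = 64  # assumed profile size; not taken from the dissertation
WINDOW = 2        # assumed number of neighboring words per group


def word_groups(text, window=WINDOW):
    """Return overlapping groups of `window` neighboring lowercase words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + window])
            for i in range(len(words) - window + 1)]


def hash_profile(texts, buckets=NUM_BUCKETS):
    """Build a normalized hash frequency distribution over a set of texts.

    CRC32 is a stand-in hash; the dissertation's special hash function
    is not described in the abstract.
    """
    counts = Counter()
    for text in texts:
        for group in word_groups(text):
            counts[zlib.crc32(group.encode("utf-8")) % buckets] += 1
    total = sum(counts.values())
    return [counts[b] / total if total else 0.0 for b in range(buckets)]


def profile_distance(p, q):
    """L1 distance between two profiles (an assumed similarity measure)."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical toy corpora, for illustration only.
spam_profile = hash_profile(["Buy cheap pills now", "Cheap pills, buy now"])
ham_profile = hash_profile(["The meeting is moved to Friday afternoon"])

message = hash_profile(["Buy cheap pills today"])
closer_to_spam = (profile_distance(message, spam_profile)
                  < profile_distance(message, ham_profile))
```

Under these assumptions, a message is compared against the profile of each content type, and the closer profile suggests its class. A real detector would need far larger training sets and a threshold chosen from experimental data.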
Keywords/Search Tags:Hash model, Detection, Sublexical unit, Approach, Spam, Application level