Research And Implementation Of Classification Algorithm Based On Message Content And User Behavior Relationship

Posted on: 2017-10-22
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Song
Full Text: PDF
GTID: 2348330485486053
Subject: Computer software and theory
Abstract/Summary:
Email plays an increasingly important role in human communication. While it brings convenience, it also forces people to spend a great deal of time handling large volumes of mail, and as e-mail has grown more popular, more and more human and financial resources go into dealing with it. Constructing a new, effective email classification algorithm has therefore become particularly urgent.
This thesis focuses on the problem of mail classification, in which imbalanced data sets are the key difficulty. Classification of imbalanced data sets has been a popular research topic in recent years. An imbalanced data set is one in which the number of samples differs greatly between categories. During classification, the imbalance biases the classifier toward the categories with more samples, so it performs poorly on the categories with few samples, which are the ones we pay more attention to. At present there are two popular solutions: changing the data distribution and adjusting the classification algorithm. Combining the two, this thesis proposes a multi-level classifier algorithm that combines e-mail content with user behavior relationships. The levels of the algorithm filter the samples themselves, continuously reducing the sample imbalance, so that the data are relatively balanced by the final stage.
In addition, current e-mail classification algorithms generally work on e-mail content and ignore the role of the e-mail address. In fact, the same message sent by different people will be treated differently because of the relationship between sender and recipient. This thesis therefore takes full account of e-mail address information and classifies messages by combining user behavior relationships with message content. In implementing the algorithm, I used a number of traditional machine learning classification algorithms, such as naive Bayes, support vector machines, and random forests. A classification model trained on e-mail addresses is combined with multi-level classification of message content to classify imbalanced mail, and the approach achieved relatively good results.
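
The multi-level idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the two-stage layout, the label names `normal` and `important`, the confidence threshold of 0.8, the smoothed sender-frequency estimate, and the toy training mails are all assumptions made for illustration, and a small hand-written naive Bayes stands in for the SVM and random forest models the thesis also mentions. Stage 1 uses only the sender address and filters out mail it can confidently assign to the majority class; stage 2 classifies the remaining, more balanced mail by content.

```python
import math
from collections import Counter, defaultdict

MAJORITY, MINORITY = "normal", "important"  # hypothetical label names


class SenderStage:
    """Stage 1 (assumed design): estimate P(majority | sender) from past labels."""

    def __init__(self, threshold=0.8):  # threshold is an illustrative choice
        self.threshold = threshold
        self.counts = defaultdict(Counter)

    def fit(self, emails):
        for sender, _text, label in emails:
            self.counts[sender][label] += 1
        return self

    def majority_confidence(self, sender):
        c = self.counts.get(sender, Counter())
        # Add-one smoothing: a sender never seen before gets confidence 0.5.
        return (c[MAJORITY] + 1) / (c[MAJORITY] + c[MINORITY] + 2)

    def is_confident(self, sender):
        return self.majority_confidence(sender) >= self.threshold


class NaiveBayesStage:
    """Stage 2: multinomial naive Bayes over words, with add-one smoothing."""

    def fit(self, emails):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        for _sender, text, label in emails:
            self.label_counts[label] += 1
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            words = self.word_counts[label]
            denom = sum(words.values()) + len(self.vocab)
            score = math.log(n / total)
            for w in text.lower().split():
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((words[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_cascade(emails, threshold=0.8):
    stage1 = SenderStage(threshold).fit(emails)
    # Only mail that stage 1 cannot confidently call majority reaches stage 2,
    # so the stage-2 training set is less imbalanced than the full set.
    remaining = [e for e in emails if not stage1.is_confident(e[0])]
    stage2 = NaiveBayesStage().fit(remaining)
    return stage1, stage2, remaining


def classify(stage1, stage2, sender, text):
    if stage1.is_confident(sender):
        return MAJORITY
    return stage2.predict(text)


# Toy data (invented for this sketch): 11 majority mails, 3 minority mails.
TRAIN = (
    [("news@shop.example", "weekly sale discount offer", MAJORITY)] * 8
    + [("boss@corp.example", "please review the budget report", MINORITY)] * 2
    + [("boss@corp.example", "lunch menu for the office", MAJORITY)] * 2
    + [("peer@corp.example", "urgent deadline for the report", MINORITY)]
    + [("peer@corp.example", "funny video discount link", MAJORITY)]
)

stage1, stage2, remaining = train_cascade(TRAIN)
print(Counter(label for _s, _t, label in TRAIN))      # class counts before filtering
print(Counter(label for _s, _t, label in remaining))  # class counts reaching stage 2
print(classify(stage1, stage2, "boss@corp.example", "urgent budget report"))
print(classify(stage1, stage2, "news@shop.example", "urgent budget report"))
```

In this toy run the same text is labelled differently depending on the sender, and the mail that reaches the content stage is split 3 to 3 rather than 11 to 3. A real system would need held-out evaluation and a threshold tuned so that minority mail is not filtered out by mistake.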
Keywords/Search Tags: email classification, unbalanced data, multi-level classifier, confidence, random forest, SVM, naive Bayes