Mail Content Classification Method Based On SVM

Posted on: 2014-01-07  Degree: Master  Type: Thesis
Country: China  Candidate: H Su  Full Text: PDF
GTID: 2248330398950223  Subject: Control theory and control engineering
Abstract/Summary:
In the army, the emails sent to leaders carry a large amount of information that other departments and offices can use to summarize their work and improve their service. As the number of emails has soared, manual classification can no longer meet practical requirements. Automatic classification of emails (ACE) is therefore significant for reducing workload and improving efficiency across departments and offices.

Based on the information in the headline and content of each email, ACE assigns it to one of five categories: military, political, logistics, management, and blessing. In this thesis, we extract the headline and text of each email and obtain the original feature set of the mail content by performing word segmentation and stop-word removal with the Chinese Academy of Sciences ICTCLAS segmentation system. Using the Vector Space Model (VSM), we convert the mail content text into data vectors that a computer can process. After analyzing commonly used feature selection methods, we propose a modified CHI method for feature selection, which reduces the dimension of the original feature set. We obtain the mail content classification model by training a support vector machine (SVM) on the text data. The kernel is the radial basis function (RBF), whose optimal parameters are determined by 5-fold cross-validation and grid search, and a binary tree multi-class SVM classification model is built using the maximum separation interval as the class separation measure. With this model, we classify emails of unknown category.

To verify the mail content classification system, we chose the emails sent to leaders of one army unit in 2012 as the data source, 656 emails in total. Within this training set, we classified 232 of the emails by mail content.
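The abstract does not spell out the modified CHI method, so the following is only a minimal sketch of the traditional CHI statistic it is compared against, chi2(t, c) = N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)), where A, B, C, and D count the documents with or without term t, inside or outside category c. The function names, the toy documents, and the choice of scoring each term by its maximum over categories are assumptions made for illustration and are not taken from the thesis.

```python
def chi_square(docs, labels, term, category):
    """Traditional CHI statistic for one (term, category) pair.

    docs   : list of token lists (segmented, stop words already removed)
    labels : list of category names, one per document
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1  # term present, document in the category
        elif has_term:
            b += 1  # term present, document outside the category
        elif in_category:
            c += 1  # term absent, document in the category
        else:
            d += 1  # term absent, document outside the category
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


def select_features(docs, labels, k):
    """Keep the k terms with the highest CHI score over all categories.

    Scoring a term by its maximum over categories is an assumption here;
    averaging over categories is another common choice.
    """
    vocabulary = sorted({t for tokens in docs for t in tokens})
    categories = sorted(set(labels))
    scores = {
        t: max(chi_square(docs, labels, t, c) for c in categories)
        for t in vocabulary
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(vocabulary, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus, standing in for segmented mail text.
TOY_DOCS = [
    ["troop", "training"],
    ["troop", "exercise"],
    ["supply", "fuel"],
    ["supply", "training"],
]
TOY_LABELS = ["military", "military", "logistics", "logistics"]

if __name__ == "__main__":
    print(select_features(TOY_DOCS, TOY_LABELS, 2))
```

In this toy corpus, "training" appears equally often in both categories and therefore scores zero, which is the kind of uninformative term that feature selection is meant to discard.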
The experimental results show that the best overall classification performance is obtained when 200 features are selected; the recall and precision of the modified CHI method are 1.3% and 0.8% higher, respectively, than those of the traditional CHI method; and the classification results of the binary tree method are close to those of the DAG method and are 0.9% and 2.6% higher than those of the "One-Against-Rest" and "One-Against-Rest" methods, respectively.
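The recall and precision figures quoted above can be read with the following minimal sketch, which computes per-category precision and recall from true and predicted labels. The abstract does not say how the per-category values are combined into a single figure, so the macro average used here is an assumption, as are the function names and the toy labels.

```python
def precision_recall(true_labels, predicted_labels, category):
    """Precision and recall for a single category."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def macro_average(true_labels, predicted_labels):
    """Unweighted mean of per-category precision and recall (assumed)."""
    categories = sorted(set(true_labels))
    results = [
        precision_recall(true_labels, predicted_labels, c) for c in categories
    ]
    mean_precision = sum(p for p, _ in results) / len(results)
    mean_recall = sum(r for _, r in results) / len(results)
    return mean_precision, mean_recall


# Hypothetical labels for four emails.
TRUE = ["military", "military", "political", "logistics"]
PRED = ["military", "political", "political", "logistics"]

if __name__ == "__main__":
    print(precision_recall(TRUE, PRED, "military"))
    print(macro_average(TRUE, PRED))
```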
Keywords/Search Tags:Mail Content Classification, SVM, CHI Statistical, Binary Tree SVM, Cross Validation