Construction Of Chinese Email Corpus

Posted on: 2007-11-03
Degree: Master
Type: Thesis
Country: China
Candidate: J H Li
Full Text: PDF
GTID: 2178360185478456
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of the Internet, email has become increasingly important in daily life because of its speed and convenience. Internet users are accustomed to communicating by email and have to deal with large volumes of messages every day, so there is a pressing need to classify emails into folders automatically. Many text categorization (TC) techniques, such as Bayes, KNN, and SVM, have been applied to email categorization (EC). However, no open Chinese email corpus is available on the Internet, and researchers have to collect emails themselves to run experiments. In addition, the performance of a given EC approach varies with the training corpus used, and a high-quality corpus may lead to satisfactory results.

For that purpose, this thesis proposes a way to construct a Chinese email corpus system that is diverse, dynamic, and semi-automated.

First, popular email corpora are surveyed and the infrastructure of a corpus is introduced. Based on that analysis, the thesis proposes the system framework and shows the flow of email proposal. Second, dozens of emails in MIME format are collected and parsed in accordance with the MIME standard, and the information in their fields is extracted. A series of operations is then performed on each email to provide richer information, which helps improve the performance of email categorization and clustering. Third, in constructing the email representation model, features extracted from the email header are collected, and the Maximum Entropy Model is applied to email categorization. The categorization results obtained with different email feature fields, numbers of features, and numbers of iterations are presented and discussed, and the performance of hierarchical categorization is compared with that of direct categorization; appropriate settings are then chosen according to the empirical results. Fourthly, traditional text...
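As an illustration of the MIME parsing step described in the abstract, the following is a minimal sketch using Python's standard `email` package. It is not the system built in the thesis: the sample message, the `extract_fields` helper, and the choice of fields are all assumptions made for the example.

```python
from email import message_from_string, policy

# Hypothetical sample message: a base64-encoded Chinese subject and body.
RAW = """\
From: alice@example.com
To: bob@example.com
Subject: =?utf-8?b?5rWL6K+V?=
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

5L2g5aW9
"""


def extract_fields(raw):
    """Parse one MIME message and return a few decoded fields as a dict."""
    # policy.default decodes RFC 2047 encoded words in headers automatically.
    msg = message_from_string(raw, policy=policy.default)
    # Prefer the plain-text part; for a single-part message this is msg itself.
    body = msg.get_body(preferencelist=("plain",))
    return {
        "from": str(msg["From"]),
        "to": str(msg["To"]),
        "subject": str(msg["Subject"]),
        "body": body.get_content().strip() if body is not None else "",
    }


fields = extract_fields(RAW)
```

Fields extracted this way (sender, recipient, subject, body) are the kind of header and content information that could later be turned into features for categorization or clustering.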
Keywords/Search Tags: Chinese Email Corpus, Email Categorization, Maximum Entropy Model, Email Clustering, Corpus-increment