Chinese Text Data Classification

Posted on: 2005-05-17
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Li
GTID: 2208360122480451
Subject: Computer application technology
Abstract/Summary:
With the development of information technology and the spread of Internet applications, the amount of information on the Internet has grown exponentially, and processing large volumes of information and automatically organizing large text collections has become an important research subject. One effective way to manage texts is to classify them, a task known as text classification.

Automatic text classification is an intelligent information processing technology and a foundation of text retrieval. It is applied to news categorization, electronic conferencing, e-mail categorization, information filtering, and similar tasks. It plays an important role in traditional information retrieval, in building web index architectures, and in web information retrieval. Driven by web mining technology, automatic text classification has become an active research area in data mining and web mining.

This thesis introduces the technical foundation of Chinese text classification, the Vector Space Model, discusses Chinese word segmentation, and analyzes several text feature selection algorithms and the Bayes categorization model. Through extensive experiments, the thesis studies and evaluates feature selection algorithms including Mutual Information, Information Gain, Chi-square evaluation, and Weight of Evidence for Text, and it proposes an improvement to Mutual Information. Because the Naive Bayes model on its own performs poorly for text classification, the thesis proposes integrating Boosting theory from machine learning into the classification process, boosting the Naive Bayes categorization model through repeated rounds of training. Experiments show that the improved Mutual Information combined with Naive Bayes integrated with Boosting achieves high precision, recall, and F1 score.
Keywords/Search Tags: text categorization, feature selection, Vector Space Model, automatic word segmentation, Naive Bayes
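
To illustrate the feature selection step named in the abstract, the following is a minimal sketch of the standard document-frequency estimate of Mutual Information between a term t and a class c, MI(t, c) = log(P(t, c) / (P(t) P(c))). It is not the thesis's improved variant, which the abstract does not specify. The function names, the toy corpus, and the rule of ranking each term by its maximum score over classes are assumptions made for this example. For Chinese text, the token lists would come from a word segmenter rather than being written by hand.

```python
import math
from collections import Counter, defaultdict


def mutual_information_scores(docs, labels):
    """Return {(term, label): MI score} estimated from document counts.

    MI(t, c) = log( A * N / (df(t) * n(c)) ), where A is the number of
    documents of class c that contain t, N is the total number of documents,
    df(t) is the number of documents containing t, and n(c) is the number of
    documents in class c. Pairs that never co-occur are omitted (MI = -inf).
    """
    n_docs = len(docs)
    class_count = Counter(labels)      # n(c): documents per class
    term_count = Counter()             # df(t): documents containing each term
    joint_count = defaultdict(int)     # A: documents of class c containing t
    for tokens, label in zip(docs, labels):
        for term in set(tokens):       # count each term once per document
            term_count[term] += 1
            joint_count[(term, label)] += 1

    scores = {}
    for (term, label), a in joint_count.items():
        scores[(term, label)] = math.log(
            a * n_docs / (term_count[term] * class_count[label])
        )
    return scores


def select_features(docs, labels, k):
    """Keep the k terms with the highest MI over any class (assumed rule)."""
    best = {}
    for (term, _), score in mutual_information_scores(docs, labels).items():
        best[term] = max(score, best.get(term, float("-inf")))
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(best, key=lambda t: (-best[t], t))[:k]


# Hypothetical, already segmented toy corpus.
docs = [
    ["stock", "market", "rise"],
    ["stock", "fund", "market"],
    ["football", "match", "market"],
    ["football", "player", "match"],
]
labels = ["finance", "finance", "sports", "sports"]

scores = mutual_information_scores(docs, labels)
selected = select_features(docs, labels, k=6)
```

On this toy corpus, "market" appears in both classes, so it receives the lowest score and is the one term dropped when k is 6. A term that appears in a single document, such as "rise", scores the same as "stock", which appears in two. This reflects the known tendency of plain Mutual Information to favor rare terms, one common motivation for modified variants.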