
The Research And Implementation Of Multi-Lingual And Multi-Category Text Classification System

Posted on: 2011-09-29
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Liu
GTID: 2248330395958425
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of information technology and the Internet, the volume of available information is growing quickly, and people urgently need tools that help them find what they actually need in it. As practical demands grow, traditional single-language, single-category text classification can no longer satisfy users. An efficient automatic classification system that performs multi-lingual, multi-category classification of massive text data is therefore needed.

This thesis addresses the research and implementation of an automatic text classification system for a multi-lingual, multi-category setting. The system is based on the N-Gram, information gain (IG), and Naive Bayes algorithms, and was developed with Microsoft Visual C/C++. Model training for the automatic classifier has been implemented, and the text classification function has been achieved. For users, the system supports model training for all languages with their corresponding complete classification systems, model training for a given language with a given classification system, mail analysis, background mail processing, enumeration of all categories, deletion of a selected category, and addition of a new category. Testing shows that the system processes massive texts quickly and classifies them accurately, so it can meet users' needs and improve the efficiency of text processing.

The thesis first introduces the background of the system's development, the development goals, and the system description, and presents the structure of the thesis and the related technologies used in the system. For the system design, it begins with the requirements analysis and then presents the functional structure diagram, the design principles, and the overall design. For the implementation, the author describes the training process of the system's core classifier and discusses in detail text segmentation, feature selection, classification training, and text classification in relation to each functional module. For testing, the author plots how classification accuracy changes as the feature threshold varies, then discusses the accuracy, stability, and performance tests, gives a graph relating training text size to training time, and draws conclusions from the tests. Finally, the author summarizes the work and discusses future work.
Keywords/Search Tags:Naive Bayes Classifier, multi-lingual and multi-category text classification system, feature selection, IG
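
The abstract names three building blocks: N-Gram segmentation, feature selection by information gain (IG), and a Naive Bayes classifier. The sketch below shows one way these pieces can fit together. It is not the thesis implementation, which was written in Microsoft Visual C/C++. The function names, the choice of character bigrams, the Laplace smoothing, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=2):
    """Set of character n-grams; needs no language-specific word segmenter."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def entropy(counts):
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, feature):
    """IG = H(C) - H(C | feature present or absent in the document)."""
    n = len(docs)
    present = Counter(l for d, l in zip(docs, labels) if feature in d)
    absent = Counter(l for d, l in zip(docs, labels) if feature not in d)
    h_cond = 0.0
    for part in (present, absent):
        size = sum(part.values())
        if size:
            h_cond += size / n * entropy(part.values())
    return entropy(Counter(labels).values()) - h_cond

def select_features(docs, labels, k):
    """Keep the k n-grams with the highest IG (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda f: (-information_gain(docs, labels, f), f))
    return set(ranked[:k])

def train_nb(docs, labels, features):
    """Naive Bayes over selected features, with add-one (Laplace) smoothing."""
    class_docs = Counter(labels)
    feat_counts = defaultdict(Counter)
    for d, l in zip(docs, labels):
        feat_counts[l].update(d & features)
    model = {}
    for c, n_c in class_docs.items():
        total = sum(feat_counts[c].values())
        log_like = {f: math.log((feat_counts[c][f] + 1) / (total + len(features)))
                    for f in features}
        model[c] = (math.log(n_c / len(docs)), log_like)
    return model

def classify(model, text, features, n=2):
    """Return the category with the highest log posterior score."""
    doc = char_ngrams(text, n) & features
    scores = {c: prior + sum(like[f] for f in doc)
              for c, (prior, like) in model.items()}
    return max(scores, key=scores.get)

# Toy training data (hypothetical; not taken from the thesis).
texts = ["goal goal", "team goal", "bank rises", "stock bank"]
labels = ["sports", "sports", "finance", "finance"]
docs = [char_ngrams(t) for t in texts]

features = select_features(docs, labels, k=8)
model = train_nb(docs, labels, features)
print(classify(model, "goal team", features))   # sports
print(classify(model, "bank stock", features))  # finance
```

In a multi-lingual setting, one plausible arrangement is to train a separate model of this kind per language and category system, which matches the abstract's description of training "a given language" against "a given classification system". The value of `k` plays the role of the feature threshold whose effect on accuracy the thesis reports.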