
Research On Automatic Text Categorization System Based On Neuron Network

Posted on: 2009-09-14
Degree: Master
Type: Thesis
Country: China
Candidate: S P Li
Full Text: PDF
GTID: 2178360245455159
Subject: Computer application technology
Abstract/Summary:
Automatic text categorization (ATC), the task of automatically sorting a set of documents into categories from a predefined set, has become a research focus in information processing and is a core task of text mining. ATC is an effective means of organizing and managing massive information resources and is widely applied in information processing. Research on ATC therefore has broad business prospects and practical significance.

In this thesis, we first study the traditional solutions to several key technical problems in ATC: Chinese word segmentation, feature selection, feature weighting, and the categorization algorithm. After comparing and analyzing these implementation technologies, we propose an automatic text categorization prototype based on a neural network. The prototype has a modular design, with the key algorithms and functions packaged in modules, which makes it portable. Its core modules are preprocessing, text representation, and the classifier.

In the preprocessing module, we use ICTCLAS, developed by the Chinese Academy of Sciences, for word segmentation and then select the features that are useful for ATC. Most existing automatic text classification systems select features with a stop list alone, but the results are not satisfactory in practice. We propose a new feature selection method: features are first selected by part of speech, and the remaining useless words are then eliminated with a stop list. We find that the new method improves the efficiency of the prototype.

In the text representation module, we use different feature selection and weighting methods to extract the features from which the text vector space is constructed.
In this module, users can choose the dimension of the vector space in two ways.

In the classifier module, we choose a neural network as the classification algorithm because of its self-learning ability, robustness, and ease of design. The classifier can be called in two modes, training or testing.

We use VC++ as the development platform, and the prototype was designed following software engineering methods. The final part of the thesis presents the main data structures and algorithms of the key modules. The core algorithms are written in standard C++, so the prototype is easy to migrate to other platforms. Detailed exception handling ensures the stability and robustness of the prototype.
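The two-stage feature selection described above (part-of-speech filter first, stop list second) and a subsequent weighting step might look roughly like the sketch below. It is not the thesis code: the `Token` struct, the function names, and the tag strings such as "n" and "v" are hypothetical, the input is assumed to be already segmented and tagged (no ICTCLAS call is shown), and TF-IDF is used only as one common weighting scheme, since the abstract does not name the schemes it compares.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <set>
#include <string>
#include <vector>

// One segmented token: the surface word and its part-of-speech tag.
// The tag strings ("n", "v", ...) are illustrative assumptions.
struct Token {
    std::string word;
    std::string pos;
};

// Stage 1 keeps only tokens whose tag is in keptTags.
// Stage 2 drops any surviving token that appears in the stop list.
std::vector<std::string> selectFeatures(const std::vector<Token>& tokens,
                                        const std::set<std::string>& keptTags,
                                        const std::set<std::string>& stopList) {
    std::vector<std::string> features;
    for (const Token& t : tokens) {
        if (keptTags.count(t.pos) == 0) continue;   // part-of-speech filter
        if (stopList.count(t.word) != 0) continue;  // stop-list filter
        features.push_back(t.word);
    }
    return features;
}

// Counts how often each selected feature occurs in one document.
std::map<std::string, int> termFrequencies(const std::vector<std::string>& features) {
    std::map<std::string, int> tf;
    for (const std::string& w : features) ++tf[w];
    return tf;
}

// TF-IDF weight tf * ln(N / df), where N is the number of documents in the
// corpus and df is the number of documents that contain the term.
double tfidf(int tf, int docCount, int docFreq) {
    assert(docCount > 0 && docFreq > 0 && docFreq <= docCount);
    return tf * std::log(static_cast<double>(docCount) / docFreq);
}
```

Running the part-of-speech filter first means the stop list only has to cover words in the retained categories, which is one plausible reason the combined method would be cheaper than a stop list alone. The weights produced by `tfidf` would then fill one document's coordinates in the text vector space.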
Keywords/Search Tags: Automatic Text Categorization, Part of Speech, Feature Selection, Feature Weighting, Neural Network