
Chinese Text Classification Based On Structural Covering Algorithm

Posted on: 2008-08-15
Degree: Master
Type: Thesis
Country: China
Candidate: J Meng
Full Text: PDF
GTID: 2178360215996599
Subject: Computer application technology
Abstract/Summary:
Text classification is the basis and core of text mining and plays an important role in traditional information retrieval, the construction of web site architecture, and web information search. It has become an active research topic in recent years. Automatic text classification is an important application of natural language processing and an efficient, increasingly necessary replacement for laborious manual classification. With the development of Internet technology, the network has become an effective platform for exchanging and processing information, and the volume of digital information grows rapidly every day. Manual classification cannot cope with this volume and must be replaced by automatic text classification.

Recent research on automatic text classification has focused mostly on exploring and improving classification algorithms. However, feature selection has always been a key technology in text classification. It is therefore necessary to study feature selection algorithms alongside classification algorithms.

The main work and contributions of this thesis are as follows:

1. The traditional solutions to several key technical problems in text classification (TC) are studied first. The thesis then presents its main topic, a Structural Covering Algorithm-Based Chinese Text Classification System (CCTCS for short). The key techniques implemented in this classifier are discussed in detail: text preprocessing, feature selection, dimension reduction, and the structural covering algorithm and its improvement.

2. The first step in CCTCS is Chinese word segmentation, performed with ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), the lexical analysis system provided by the Institute of Computing Technology of the Chinese Academy of Sciences. Function words and adjectives are discarded and only nouns and verbs are retained. During text preprocessing, stop words and rare words are then deleted, so that the dimension of the texts is roughly halved on average, which gives a coarse dimension reduction. The main problem in CCTCS, however, is feature selection for textual data: which features to select and how large the dimension of the feature space should be. To address this problem, the thesis uses a feature selection method that combines Information Gain (IG) and Principal Component Analysis (PCA).

3. CCTCS uses an Artificial Neural Network (ANN) as the classifier. The recorded term weights form the original feature vector, whose components correspond one-to-one with the neurons in the input layer of the ANN. In the training stage, CCTCS trains the ANN on labeled texts. In the classification stage, CCTCS feeds the feature vector of an unlabeled text to the trained ANN, which determines its class.

4. The ANN classifier is designed with the Structural Covering Algorithm (also called the Alternative Covering Algorithm). The thesis first analyzes the Generic Alternative Covering Algorithm (GACA) and identifies several disadvantages. During classification, rejection (the classifier refuses to identify a sample) and misclassification occur, which greatly affect the system's recognition rate and accuracy. To overcome these disadvantages, the Alternative Covering Algorithm is improved, and the detailed process of the improved algorithm is presented. Experiments show that the Improved Alternative Covering Algorithm (IACA) outperforms the generic one overall. The improved algorithm not only speeds up training but also reduces the number of test samples that fall outside the spherical neighborhoods obtained in training, and it improves recognition accuracy.

5. The experiments compare and analyze the classification performance of different feature selection methods and of the different Alternative Covering Algorithms used to design the classifier. They show that, for Chinese text classification based on the Alternative Covering Algorithm, the proposed IG+PCA feature selection method is superior to using IG alone, and that IACA performs better overall than GACA. The experiments also show that the performance of the neural network designed with the Alternative Covering Algorithm is highest when the feature dimension is around 200.

This thesis has completed some work on Chinese text classification. Further research could address the following aspects:

1. All conclusions in this thesis were obtained under experimental conditions; their effectiveness should be validated in practical applications.

2. The proposed feature selection algorithm can be applied to English text classification. A more networked, intelligent, and multifunctional classification system could also be designed and applied to popular practical applications such as email filters and search engines.

3. To improve the extensibility of the method, PCA could be performed separately for each class, and a maximal linearly independent set of the principal components of each class could then be sought to obtain the overall feature set. This is one focus of our future research.
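The covering idea in item 4 of the main work (spherical neighborhoods, rejected samples, and a way to reduce rejections) can be illustrated with a small sketch. This is a simplified illustration of the general covering approach, not the thesis's GACA or IACA: the abstract does not give the algorithms' details, so the function names (`train_covers`, `classify`), the deterministic choice of cover centers, the midpoint threshold, the nearest-cover `fallback`, and the two-dimensional toy data are all assumptions made for this example.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def lift_to_sphere(x, radius):
    """Append one coordinate so the lifted point lies on a sphere of the given radius."""
    gap = radius * radius - dot(x, x)
    return list(x) + [math.sqrt(max(gap, 0.0))]  # clamp guards against rounding


def train_covers(samples, labels):
    """Greedily build spherical covers; returns ([(center, theta, label)], radius)."""
    radius = max(math.sqrt(dot(x, x)) for x in samples)
    pts = [lift_to_sphere(x, radius) for x in samples]
    uncovered = set(range(len(pts)))
    covers = []
    while uncovered:
        c = min(uncovered)  # assumption: deterministic choice of the next center
        center, label = pts[c], labels[c]
        sims = [dot(center, p) for p in pts]  # on a sphere, larger means closer
        # d1: similarity to the closest sample of any other class.
        d1 = max((s for s, l in zip(sims, labels) if l != label),
                 default=-math.inf)
        # d2: similarity to the farthest same-class sample still closer than d1.
        d2 = min((s for s, l in zip(sims, labels) if l == label and s > d1),
                 default=sims[c])
        theta = (d1 + d2) / 2  # assumption: threshold halfway between d1 and d2
        covers.append((center, theta, label))
        uncovered -= {j for j in uncovered
                      if labels[j] == label and sims[j] >= theta}
        uncovered.discard(c)  # guarantees progress even in degenerate cases
    return covers, radius


def classify(x, covers, radius, fallback=False):
    """Return the label of the first cover containing x, or None (rejection).

    With fallback=True a rejected sample is assigned to the cover whose
    boundary it is closest to (an assumed remedy, not the thesis's method).
    """
    p = lift_to_sphere(x, radius)
    margins = [(dot(center, p) - theta, label) for center, theta, label in covers]
    for margin, label in margins:
        if margin >= 0:
            return label
    if fallback and margins:
        return max(margins, key=lambda m: m[0])[1]
    return None


# Hypothetical toy data: two well-separated classes in two dimensions.
samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["sports"] * 3 + ["finance"] * 3
covers, radius = train_covers(samples, labels)
print(len(covers))                                      # 2
print(classify((0.5, 0.5), covers, radius))             # sports
print(classify((5, -5), covers, radius))                # None (rejected)
print(classify((5, -5), covers, radius, fallback=True)) # sports
```

In this sketch a test sample that falls outside every cover is rejected, which corresponds to the "refusal of identification" problem the abstract attributes to the generic algorithm. The `fallback` option shows one plausible way to trade rejections for a best-guess label.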
Keywords/Search Tags: Neural Network (NN), Alternative Covering Algorithm, Text Classification, Feature Selection, Principal Component Analysis