Text Classification System Based on Neural Networks (NNTCS): Design and Realization

Posted on: 2004-09-22
Degree: Master
Type: Thesis
Country: China
Candidate: G Liu
Full Text: PDF
GTID: 2208360095456156
Subject: Computer application technology
Abstract/Summary:
Text classification (TC) is the basis and core of text mining, and plays an important role in traditional information retrieval, the construction of web site architectures, and web information search. It has become an active research topic in recent years. This thesis first studies the traditional solutions to several key technical problems in TC, discusses the core techniques and system architectures of typical TC systems, and describes applications of TC. It then presents, as its main topic, a text classifier based on neural networks (NNTCS). The key techniques implemented in this classifier, such as feature extraction, dimension reduction, hierarchical classification, and classifier training, are discussed in detail.

The first step in NNTCS is Chinese word segmentation of the Chinese documents. Feature terms are then selected from the documents, and the frequency of each term is recorded.

NNTCS uses an artificial neural network (ANN) as the classifier. The recorded term frequencies form the original feature vector, whose components map one-to-one onto the neurons in the input layer of the ANN. In the training stage, NNTCS feeds labeled documents to the ANN, and the error back propagation (BP) algorithm adjusts the weights of the network. After training, the final fixed weights are saved as the classification knowledge. In the classification stage, NNTCS inputs the feature vector of the document to be classified, runs the network with the fixed weights, and compares the output with a predefined threshold to decide the class of the unlabeled document.

NNTCS adopts Latent Semantic Indexing (LSI), an established technique from the field of information retrieval, for dimension reduction. LSI transforms the original vector space into an abstract k-dimensional semantic space.
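
The path from term frequencies through BP training to the threshold decision can be illustrated with a minimal sketch. It is not the thesis's implementation: the names `term_frequencies` and `TinyBPNet`, the four-term vocabulary, the toy documents, the hidden-layer size, the learning rate, and the 0.5 threshold are all assumptions chosen for illustration. Documents are assumed to be already segmented into tokens, and the LSI step is left out.

```python
import math
import random

def term_frequencies(tokens, vocabulary):
    """Count how often each vocabulary term occurs in a pre-segmented document."""
    return [float(tokens.count(term)) for term in vocabulary]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyBPNet:
    """One hidden layer and one sigmoid output, trained by error back propagation."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each row holds the weights of one hidden neuron plus a trailing bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
                  for row in self.w_hidden]
        out = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, out

    def train(self, samples, epochs=2000, lr=0.5):
        for _ in range(epochs):
            for x, target in samples:
                hidden, out = self.forward(x)
                # Output delta for squared error with a sigmoid unit.
                d_out = (out - target) * out * (1.0 - out)
                # Hidden deltas: propagate the output error backwards.
                d_hidden = [d_out * self.w_out[j] * h * (1.0 - h)
                            for j, h in enumerate(hidden)]
                for j, v in enumerate(hidden + [1.0]):
                    self.w_out[j] -= lr * d_out * v
                for j, row in enumerate(self.w_hidden):
                    for i, v in enumerate(x + [1.0]):
                        row[i] -= lr * d_hidden[j] * v

    def classify(self, x, threshold=0.5):
        """Compare the network output with a predefined threshold."""
        return self.forward(x)[1] >= threshold

# Hypothetical two-class toy data: label 1.0 = sports, 0.0 = finance.
vocabulary = ["goal", "match", "stock", "market"]
labeled = [
    (["goal", "match", "goal"], 1.0),
    (["match", "goal", "team"], 1.0),
    (["stock", "market", "stock"], 0.0),
    (["market", "price", "stock"], 0.0),
]
samples = [(term_frequencies(doc, vocabulary), label) for doc, label in labeled]

net = TinyBPNet(n_in=len(vocabulary), n_hidden=3)
net.train(samples)

sport_result = net.classify(term_frequencies(["match", "goal", "match"], vocabulary))
finance_result = net.classify(term_frequencies(["stock", "market", "market"], vocabulary))
```

Each input neuron receives exactly one term frequency, which is why the input layer grows with the vocabulary and why a dimension reduction step becomes attractive for realistic vocabularies.
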
With LSI, the very high dimensionality of the original vector space is greatly reduced, which also improves training speed and system performance.

ANNs are often used in general pattern recognition systems, but rarely in TC, because the vector space is so large that the performance of the ANN suffers. LSI's strength in dimension reduction avoids this weakness, so the combination benefits both ANN and LSI.

NNTCS employs a genetic algorithm (GA) in the training stage to optimize the initial weights of the ANN. Because a GA searches globally, it helps avoid the ANN's problem of converging to local optima. The advantages of both GA and ANN are thus fully exploited.

Finally, an open test is carried out on the developed NNTCS system. The experimental results show that NNTCS achieves both high precision and high recall on average.
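
The idea of letting a GA pick the initial weights before BP training can be sketched as follows. This is a hedged illustration only: the abstract does not specify the GA's encoding, operators, or parameters, so the truncation selection, one-point crossover, Gaussian mutation, population size, and the function names `network_error` and `ga_initial_weights` are all assumptions. A tiny 2-2-1 network on XOR data stands in for the text classifier, since XOR is a familiar case where gradient training can stall in a poor local optimum.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def network_error(weights, samples):
    """Squared error of a 2-2-1 sigmoid network whose 9 weights are given as a flat list."""
    total = 0.0
    for (x1, x2), target in samples:
        h1 = sigmoid(weights[0] * x1 + weights[1] * x2 + weights[2])
        h2 = sigmoid(weights[3] * x1 + weights[4] * x2 + weights[5])
        out = sigmoid(weights[6] * h1 + weights[7] * h2 + weights[8])
        total += (out - target) ** 2
    return total

def ga_initial_weights(samples, n_weights=9, pop_size=30, generations=40, seed=0):
    """Search weight space globally; return (best weights, best error of the initial population)."""
    rng = random.Random(seed)
    population = [[rng.uniform(-3.0, 3.0) for _ in range(n_weights)]
                  for _ in range(pop_size)]
    initial_best = min(network_error(ind, samples) for ind in population)
    for _ in range(generations):
        population.sort(key=lambda ind: network_error(ind, samples))
        survivors = population[:pop_size // 2]      # truncation selection
        children = [survivors[0][:]]                # elitism: keep the best unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_weights)       # one-point crossover
            child = mother[:cut] + father[cut:]
            for i in range(n_weights):              # Gaussian mutation
                if rng.random() < 0.1:
                    child[i] += rng.gauss(0.0, 0.5)
            children.append(child)
        population = children
    best = min(population, key=lambda ind: network_error(ind, samples))
    return best, initial_best

# Toy stand-in data (XOR), not text features.
xor_samples = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
               ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
start_weights, initial_best = ga_initial_weights(xor_samples)
final_error = network_error(start_weights, xor_samples)
```

The returned `start_weights` would then serve as the starting point for BP training instead of purely random values. Because the best individual is always carried over unchanged, the error of the returned weights can never be worse than the best error in the initial random population.
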
Keywords/Search Tags: Text Classification, Neural Networks, Feature Extraction, LSI, GA