Statistics-based Text Classification

Posted on: 2004-06-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Cheng
Full Text: PDF
GTID: 1118360092995619
Subject: Library science
Abstract/Summary:
People find the information on the Internet abundant, yet they also find it hard to locate the information they need. The reason is that existing information systems do not organize information resources well. Many technologies can help with this problem, and content-based text management is one of them. Text classification is the foundation of content-based text management. This paper therefore studies text classification at three levels (theory, technology, and application) and applies it to problems of information retrieval, management, and gathering in the "Chinese American Digital Academy Library (CADAL)" and "Chinese Scientific Digital Library (CSDL)" projects.

In the information retrieval research, the paper analyses the relationship between text retrieval and text classification. "Text Retrieval" and "Text Classification" are two inseparable methods for finding useful information, and they should not be used separately as they are now. A text retrieval system that combines them in the retrieval process can improve its search results significantly. The "Hierarchy Retrieval System" of CADAL is designed to combine a taxonomy with a retrieval system. The "Content-based Information Recommendation System" uses text classification technology to provide content-based services. A prototype system was built and its core algorithm was studied. Theoretically, the system can support retrieval over one million books and return search results within one second.

In the information management research, the starting point is that managing massive information resources by hand is no longer possible, yet the precision of automatic text classification is still lower than that of manual classification. The paper identifies the theoretical problem behind this: there is a contradiction between hierarchical taxonomies and text classification algorithms. They do not integrate well because they use different data models. In the "Chinese English Physical Websites Classification System", the paper analyses the taxonomy and the training data set and draws some useful lessons for information management.

In the information gathering research, the paper uses the "Redundancy Webpage Filtering System" to show how to resolve the reconstruction problem in the information gathering process. It mainly discusses how to reduce the time complexity and space complexity of the filtering system. For the TREC 2002 text filtering competition, the paper compares normal subjects with combined subjects and explains how to apply text classification technologies in the filtering system. The conclusion drawn from this system is that a text classification system achieves good results only when it takes all the related factors into account.

The paper also studies evaluation strategies for text classification and text retrieval. Precision and recall are sometimes used to evaluate text classification results, but the evaluation measure in common use is accuracy. The paper uses an example to show a problem that may arise when a system is evaluated with precision and recall alone, and it analyses the relationship between precision, recall, and accuracy.

To enhance the quality of content-based information services, the text classification algorithm itself must be improved. The paper therefore studies statistical text classification algorithms in detail. It divides the text classification system into three parts: term selection, term weighting, and classifier construction.

1) Term selection: The paper studies term reduction and the N-Gram model. To represent a document fully, the text classification system uses multi-level terms: it extracts terms at the level of Chinese characters, words, and descriptors. This method improves classification performance under most test conditions.

2) Term weighting: Based on the tests that other s...
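
The abstract's remark on evaluation can be made concrete with a small sketch. The abstract does not reproduce the dissertation's own example, so the counts below are invented purely for illustration. The sketch relies only on the standard definitions: precision and recall are computed without the true negatives, while accuracy includes them. Two test collections can therefore share identical precision and recall and still differ in accuracy.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for one category in an in-category / not-in-category decision."""
    tp: int  # documents correctly assigned to the category
    fp: int  # documents wrongly assigned to the category
    fn: int  # documents of the category that were missed
    tn: int  # documents correctly left out of the category

    def precision(self) -> float:
        assigned = self.tp + self.fp
        return self.tp / assigned if assigned else 0.0

    def recall(self) -> float:
        relevant = self.tp + self.fn
        return self.tp / relevant if relevant else 0.0

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0


# Hypothetical counts (not from the dissertation): tp, fp and fn are
# identical; only the number of true negatives differs.
small_collection = Confusion(tp=90, fp=10, fn=10, tn=0)
large_collection = Confusion(tp=90, fp=10, fn=10, tn=890)

for name, c in [("small", small_collection), ("large", large_collection)]:
    print(name, c.precision(), c.recall(), round(c.accuracy(), 3))
```

Both collections give precision 0.9 and recall 0.9, but accuracy is about 0.818 for the small collection and 0.98 for the large one. This is one way in which precision and recall alone leave part of a system's behaviour unreported. Whether it is the problem the dissertation has in mind cannot be confirmed from the abstract.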
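
The multi-level term representation described under term selection can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the function names `char_ngrams` and `multi_level_terms` are hypothetical, and the word and descriptor lists are assumed to come from an external word segmenter and an indexing step that the abstract does not describe.

```python
from collections import Counter
from typing import List


def char_ngrams(text: str, n: int) -> List[str]:
    """Return all overlapping character n-grams of text, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def multi_level_terms(text: str, words: List[str], descriptors: List[str]) -> Counter:
    """Count terms at three levels, tagging each term with its level.

    `words` and `descriptors` are assumed inputs (hypothetical): the output
    of a word segmenter and of a descriptor assignment step, respectively.
    """
    terms: Counter = Counter()
    terms.update(("char", g) for g in char_ngrams(text, 1))
    terms.update(("word", w) for w in words)
    terms.update(("descriptor", d) for d in descriptors)
    return terms


# Illustrative input only; the segmentation and descriptor are made up.
text = "文本分类"
print(char_ngrams(text, 2))
print(multi_level_terms(text, words=["文本", "分类"], descriptors=["文本分类"]))
```

Tagging each term with its level keeps a single-character term distinct from a word or descriptor that happens to have the same surface form, so the levels can later be weighted or pruned separately.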
Keywords/Search Tags: Text Classification, Term Selection, Classifier Construction, Text Filtering, Digital Library