Statistical Automatic Text Classification Methods In Digital Libraries

Posted on: 2003-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: B Liu
Full Text: PDF
GTID: 2178360185995499
Subject: Computer application technology
Abstract/Summary:
Digital libraries are used to preserve massive amounts of digitized information and knowledge. Automatic text classification, defined as the task of assigning pre-defined category labels to documents, is the key technique for organizing and managing information in digital libraries. This thesis studies statistical automatic text classification methods in the National Science Digital Library.

To improve document representation, this thesis proposes a multi-level feature selection method. The method extracts statistical text features at three levels: Chinese characters, the common wordlist, and the professional wordlist. Together these features capture more of the statistical characteristics of the document set and help improve system performance.

To overcome the weaknesses of the standard KNN algorithm, this thesis proposes a kernel-based distance-weighted KNN algorithm. The algorithm addresses multi-peak class distributions and overlapping class boundaries in the sample set, and it lets the classifier make more precise decisions.

The Internet and text databases contain many pre-classified training samples. Some of these samples are redundant or of poor quality, which greatly impairs classifier performance. To remove redundant documents, this thesis proposes a new fingerprint algorithm based on sorted text features. To address the quality of the training samples, it uses sample weightiness analysis to select training samples.

Libraries provide many theme words for every subject, and these contain much information about theme-word mappings. We use mutual information to measure how well each word discriminates between classes, and we use the word-mapping information to improve performance. This is the foundation for our future research.
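As a rough illustration of the distance-weighted KNN idea mentioned in the abstract, the sketch below lets each of the k nearest neighbours vote with a weight that decays with its distance to the query. The abstract does not say which kernel, distance measure, or parameters the thesis uses, so the Gaussian kernel, the Euclidean distance, the `bandwidth` parameter, and all function names here are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def gaussian_kernel(distance, bandwidth=1.0):
    """Assumed kernel: the weight shrinks smoothly as distance grows."""
    return math.exp(-(distance ** 2) / (2 * bandwidth ** 2))


def weighted_knn_predict(train, query, k=3, bandwidth=1.0):
    """Return the label with the largest kernel-weighted vote.

    train: list of (feature_vector, label) pairs (hypothetical format).
    """
    # Keep the k training samples closest to the query.
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = defaultdict(float)
    for vector, label in neighbours:
        # Closer neighbours contribute more to their label's score.
        votes[label] += gaussian_kernel(euclidean(vector, query), bandwidth)
    return max(votes, key=votes.get)


# Toy data: one very close "A" sample and two distant "B" samples.
samples = [((0.1, 0.0), "A"), ((2.0, 0.0), "B"), ((2.1, 0.0), "B")]
print(weighted_knn_predict(samples, (0.0, 0.0), k=3))
```

With plain majority voting the two "B" samples would outvote the single "A" sample for this query. With distance weighting, the nearby "A" sample dominates instead, which is the kind of behaviour a distance-weighted scheme is meant to provide near class boundaries.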
Keywords/Search Tags: Digital Library, Automatic Text Classification, Multi-level Feature, Kernel-based Distance-weighted KNN Algorithm, Fast Redundant Document Detection, Sample Weightiness Analysis, Subject Theme Word