Statistical Automatic Text Classification Methods In Digital Libraries

Posted on:2003-02-08

Degree:Master

Type:Thesis

Country:China

Candidate:B Liu

GTID:2178360185995499

Subject:Computer application technology

Abstract/Summary:

Digital libraries preserve massive amounts of digitized information and knowledge, and automatic text classification is a key technique for organizing and managing that information. Automatic text classification is the task of assigning predefined category labels to documents. This thesis studies statistical automatic text classification methods in the National Science Digital Library.

To improve document representation, this thesis proposes a multi-level feature selection method. The method extracts statistical text features at three levels: Chinese characters, a common wordlist, and a professional wordlist. Together these features capture more of the statistical characteristics of the document set and help improve system performance.

To address the weaknesses of the standard KNN algorithm, this thesis proposes a kernel-based distance-weighted KNN algorithm. It addresses the multi-peak distribution problem and the overlapping-boundary problem of the sample set, as well as the classifier's precise decision problem.

The Internet and text databases contain many pre-classified training samples. Some of these samples are redundant or of poor quality, which greatly impairs classifier performance. To remove redundant documents, this thesis proposes a new fingerprint algorithm based on sorted text features. To address the quality problem of the training samples, it uses sample weightiness analysis to select training samples.

Libraries provide many theme words for every subject, and these contain much information on theme-word mappings. We use mutual information to measure how strongly each word discriminates between classes, and use the word-mapping information to improve performance. This is the foundation for our future research.
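
The abstract does not say which kernel or distance measure the kernel-based distance-weighted KNN algorithm uses, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes a Gaussian kernel, Euclidean distance, and a `bandwidth` parameter; the function names `gaussian_kernel`, `euclidean`, and `weighted_knn_predict` are hypothetical and chosen for illustration. Each of the k nearest neighbours votes with a weight that shrinks as its distance grows, so a single very close sample can outweigh several distant ones.

```python
import math
from collections import defaultdict


def gaussian_kernel(distance, bandwidth=1.0):
    """Map a distance to a weight in (0, 1]; closer neighbours weigh more.

    The Gaussian form is an assumption made for this sketch.
    """
    return math.exp(-(distance ** 2) / (2.0 * bandwidth ** 2))


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_knn_predict(train, query, k=3, bandwidth=1.0):
    """Return the label with the largest summed kernel weight.

    train: list of (feature_vector, label) pairs.
    query: feature vector to classify.
    """
    # Keep the k training samples closest to the query.
    neighbours = sorted(
        ((euclidean(vec, query), label) for vec, label in train),
        key=lambda pair: pair[0],
    )[:k]

    # Each neighbour votes with its kernel weight instead of a flat vote of 1.
    scores = defaultdict(float)
    for dist, label in neighbours:
        scores[label] += gaussian_kernel(dist, bandwidth)
    return max(scores, key=scores.get)


# Toy data: one very close "A" sample and two more distant "B" samples.
train = [
    ((0.1, 0.0), "A"),
    ((2.0, 0.0), "B"),
    ((2.1, 0.0), "B"),
]
prediction = weighted_knn_predict(train, (0.0, 0.0), k=3)
```

With this toy data an unweighted majority vote over the three neighbours would pick "B", while the kernel-weighted vote picks "A" because the nearby sample dominates. That is the kind of boundary behaviour distance weighting is meant to change.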

Keywords/Search Tags:

Digital Library, Automatic Text Classification, Multi-level Feature, Kernel-based Distance-weighted KNN Algorithm, Fast Redundant Document Detection, Sample Weightiness Analysis, Subject Theme Word

Related items

1	Word Frequency Extraction And Automatic Text Classification Methods In The Digital Library
2	Research On Document Distance Calculation Based On Word Embedding And Its Application
3	Applications And Research On Possibilistic Fuzzy Kernel Clustering Algorithm Based On Sample-feature Weighted
4	Document-Level Sentiment Analysis Of Deep Learning Incorporating Topic Features
5	An Algorithm To Hierarchical Text Classification Based On Feature Selection
6	Research On Internet Short Text Message Oriented Multi-Document Automatic Summarization
7	Research On Digital Library And Automatic Document Classification
8	Official Document Writing Assistant System Design And Implementation
9	Adaptive Weighted KNN Text Classification
10	Document-level Sentiment Classification Based On Dynamic Word Embeddings And Hierarchical Neural Networks