Word Frequency Extraction And Automatic Text Classification Methods In The Digital Library

Posted on: 2003-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: M R Ren
GTID: 2208360065964011
Subject: Computer software and theory
Abstract/Summary:
The digital library is a new field of computer application that draws on many technologies, including networking, multimedia, data warehousing, data mining, and copyright protection; research on it is still at an early stage. Our group has developed a parallel digital library system that runs in a parallel computing environment. Besides the general functions of existing digital libraries, it supports queries based on structure and content, a function not realized in other digital library systems. The system can also build adaptive digital libraries for users with special needs. This thesis designs and implements the word frequency extraction and automatic text categorization subsystems. The automatic text categorization subsystem exploits the hierarchical structure of the predefined class scheme to construct a hierarchical classifier, overcoming a shortcoming of other text categorization systems, which treat the classes as flat. For the word frequency extraction subsystem, the thesis designs an efficient hash algorithm tailored to the characteristics of English and Chinese words; the algorithm effectively improves the performance of word frequency extraction and counting. In addition, a text classification system based on the Vector Space Model is studied, and a new method for calculating word weights is proposed.
Keywords/Search Tags: Automatic Text Categorization, Word Frequency Extraction, Bayesian Theory, Vector Space Model
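
The abstract does not give the details of the hash algorithm or of the new word-weighting method, so neither is reproduced here. As a rough illustration of the general pipeline it describes (hash-based word frequency counting feeding a Vector Space Model), the sketch below uses Python's built-in hash table and the textbook TF-IDF weight in place of the thesis's own techniques. The function names, the tokenizer (lowercased English words, single Chinese characters as tokens), and the weighting formula are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: runs of ASCII letters are English words,
    # and each CJK ideograph is treated as a token of its own.
    return re.findall(r"[a-z]+|[\u4e00-\u9fff]", text.lower())


def word_frequencies(text):
    # Counter is a hash table mapping token -> count; it stands in
    # for the custom hash algorithm mentioned in the abstract.
    return Counter(tokenize(text))


def tfidf_vectors(docs):
    # Textbook TF-IDF, NOT the new weighting method of the thesis.
    counts = [word_frequencies(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each document counts once per token
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append(
            {w: (f / total) * math.log(n_docs / doc_freq[w]) for w, f in c.items()}
        )
    return vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

In a Vector Space Model classifier of this kind, a document would typically be assigned to the class whose representative vector gives the highest cosine similarity; a hierarchical classifier would repeat that decision at each level of the class tree.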