
Automatic Text Categorization Based On Rough Set Theory

Posted on: 2006-09-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Y Zhang
Full Text: PDF
GTID: 1118360155958701
Subject: Computer Application
Abstract/Summary:
Automatic text categorization (ATC) is currently one of the most active research topics in information retrieval and natural language processing. Since the 1990s, machine learning approaches have been widely applied to ATC. Although they have achieved better performance than traditional methods, they still face significant problems. This dissertation studies the application of rough set theory to ATC. The research results are described in detail as follows.

First, a language-independent approach to the text representation of Chinese and English documents is presented. Text representation approaches with term weighting schemes, such as the commonly used TF/IDF, are widely used to extract indexing terms from documents. Term frequency and document frequency are usually computed over the whole document collection, so considerable computational cost and storage space are required. In addition, these approaches, originally developed for English documents, cannot be applied directly to Chinese documents without Chinese word segmentation techniques, which has limited the performance of Chinese text representation. This dissertation presents an approach that is independent of both word segmentation techniques and the text collection. In this approach, a GF/GL weighting scheme is proposed to measure the content importance of each N-gram within an individual document, and an algorithm is then developed to filter the final keywords. The experimental results show that the proposed approach extracts indexing terms from Chinese and English documents more effectively than the TF/IDF-based approach.

Second, a model is proposed to handle the semantic heterogeneity of indexing terms. The indexing terms extracted by text representation approaches are normally uncontrolled, so a single concept is often indexed by different terms.
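For context, the TF/IDF baseline mentioned above can be sketched as follows. This is the standard formulation (term frequency times log inverse document frequency), not the dissertation's own code; note that the `idf` factor requires document-frequency statistics over the whole collection, which is exactly the cost the proposed approach avoids.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Standard TF/IDF weights: tf(t, d) * log(N / df(t)),
    where df(t) is the number of documents containing term t.
    Requires a pass over the whole collection to compute df."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [
    ["rough", "set", "theory"],
    ["text", "categorization", "theory"],
    ["rough", "set", "classification"],
]
w = tf_idf(docs)
```

Terms occurring in every document get weight zero, while rarer terms are boosted; this is why collection-wide statistics are needed before any single document can be weighted.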
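The abstract does not give the exact GF/GL formula, so the sketch below only illustrates the general idea: enumerate character N-grams of a single document (no word segmentation, no collection statistics) and score them within that document. The scoring function here (within-document frequency weighted by gram length) is a hypothetical stand-in for GF/GL, and the function names are illustrative.

```python
from collections import Counter

def extract_ngrams(text, n_min=2, n_max=4):
    """Enumerate all character N-grams of one document.
    Works on Chinese or English text without word segmentation."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

def score_ngrams(text, n_min=2, n_max=4):
    """Score each N-gram from within-document statistics only.
    Frequency * gram length is a stand-in for the GF/GL weighting,
    whose exact formula is not given in the abstract."""
    counts = Counter(extract_ngrams(text, n_min, n_max))
    return {g: c * len(g) for g, c in counts.items() if c > 1}

def top_keywords(text, k=5):
    """Filter the highest-scoring N-grams as candidate keywords."""
    scored = score_ngrams(text)
    return sorted(scored, key=scored.get, reverse=True)[:k]
```

Because every statistic is computed inside the individual document, the method applies unchanged to Chinese and English text and needs no storage for collection-level frequencies.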
In machine learning based text categorization, the semantic heterogeneity of indexing terms can increase computational complexity and degrade classification performance. This dissertation proposes a rough set-based transfer (RST) model to create semantic transfer relations between the terms of different indexing languages. Experiments show that RST effectively overcomes the problems of intellectual (manual) methods and classical similarity measures.
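The abstract does not spell out the RST construction, but it builds on the core machinery of rough set theory: approximating a concept from below and above using indiscernibility classes. The sketch below shows those standard lower/upper approximations; the synonym classes and term sets are purely illustrative.

```python
def approximations(blocks, target):
    """Rough-set lower and upper approximations of a target set.

    `blocks` is a partition of the universe into indiscernibility
    classes (e.g. groups of indexing terms treated as equivalent);
    `target` is the concept to approximate."""
    target = set(target)
    lower, upper = set(), set()
    for block in blocks:
        block = set(block)
        if block <= target:      # class lies entirely inside the concept
            lower |= block
        if block & target:       # class overlaps the concept at all
            upper |= block
    return lower, upper

# Hypothetical example: indexing terms grouped into equivalence classes,
# and a concept indexed by several heterogeneous terms.
blocks = [{"rough set", "RS theory"}, {"text mining", "TC"}, {"keyword"}]
target = {"rough set", "RS theory", "text mining"}
low, up = approximations(blocks, target)
```

Elements in the lower approximation certainly belong to the concept, those outside the upper approximation certainly do not, and the gap between the two captures the uncertainty that heterogeneous indexing introduces.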
Keywords/Search Tags:Text representation, Keyword extraction, Rough set theory, Heterogeneous concept treatment, Automatic text categorization, Classification algorithm, Classification rules, Machine learning