
Automatic Text Categorization Based On Rough Set Theory

Posted on: 2006-09-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Y Zhang
Full Text: PDF
GTID: 1118360155958701
Subject: Computer Application
Abstract/Summary:
Automatic text categorization (ATC) is currently one of the most active research topics in information retrieval and natural language processing. Since the 1990s, machine learning approaches have been widely applied to ATC. Although they have achieved better performance than traditional methods, they still face significant problems. This dissertation studies the application of rough set theory to ATC. The research results are described in detail as follows.

First, a language-independent approach to the text representation of Chinese and English documents is presented. Text representation approaches with term weighting schemes, such as the commonly used TF/IDF, are widely used to extract indexing terms from documents. Term frequency and document frequency are usually computed over the whole document collection, so considerable computational cost and storage space are required. In addition, these approaches, originally developed for English documents, cannot be applied directly to Chinese documents without Chinese word segmentation techniques, which has limited the performance of Chinese text representation. This dissertation presents an approach that is independent of both word segmentation techniques and the text collection. In this approach, a GF/GL weighting scheme is proposed to measure the content importance of each N-gram within an individual document, and an algorithm is then developed to filter the final keywords. The experimental results show that the proposed approach extracts indexing terms from Chinese and English documents more effectively than the TF/IDF-based approach.

Second, a model is proposed to handle the semantic heterogeneity of indexing terms. The indexing terms extracted by text representation approaches are normally uncontrolled, so a single concept is often indexed by different terms.
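For context, the TF/IDF baseline mentioned above can be sketched as follows. This is the standard formulation (term frequency times log inverse document frequency), not the dissertation's own code; note that the `idf` factor requires document-frequency statistics over the whole collection, which is exactly the cost the proposed approach avoids.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Standard TF/IDF weights: tf(t, d) * log(N / df(t)),
    where df(t) is the number of documents containing term t.
    Requires a pass over the whole collection to compute df."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [
    ["rough", "set", "theory"],
    ["text", "categorization", "theory"],
    ["rough", "set", "classification"],
]
w = tf_idf(docs)
```

Terms occurring in every document get weight zero, while rarer terms are boosted; this is why collection-wide statistics are needed before any single document can be weighted.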
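The abstract does not give the exact GF/GL formula, so the sketch below only illustrates the general idea: enumerate character N-grams of a single document (no word segmentation, no collection statistics) and score them within that document. The scoring function here (within-document frequency weighted by gram length) is a hypothetical stand-in for GF/GL, and the function names are illustrative.

```python
from collections import Counter

def extract_ngrams(text, n_min=2, n_max=4):
    """Enumerate all character N-grams of one document.
    Works on Chinese or English text without word segmentation."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

def score_ngrams(text, n_min=2, n_max=4):
    """Score each N-gram from within-document statistics only.
    Frequency * gram length is a stand-in for the GF/GL weighting,
    whose exact formula is not given in the abstract."""
    counts = Counter(extract_ngrams(text, n_min, n_max))
    return {g: c * len(g) for g, c in counts.items() if c > 1}

def top_keywords(text, k=5):
    """Filter the highest-scoring N-grams as candidate keywords."""
    scored = score_ngrams(text)
    return sorted(scored, key=scored.get, reverse=True)[:k]
```

Because every statistic is computed inside the individual document, the method applies unchanged to Chinese and English text and needs no storage for collection-level frequencies.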
In machine learning based text categorization, the semantic heterogeneity of indexing terms can increase computational complexity and degrade classification performance. This dissertation proposes a rough set-based transfer (RST) model to create semantic transfer relations between the terms of different indexing languages. Experiments show that RST effectively overcomes the problems of intellectual (manual) methods and classical similarity measures.
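The abstract does not spell out the RST construction, but it builds on the core machinery of rough set theory: approximating a concept from below and above using indiscernibility classes. The sketch below shows those standard lower/upper approximations; the synonym classes and term sets are purely illustrative.

```python
def approximations(blocks, target):
    """Rough-set lower and upper approximations of a target set.

    `blocks` is a partition of the universe into indiscernibility
    classes (e.g. groups of indexing terms treated as equivalent);
    `target` is the concept to approximate."""
    target = set(target)
    lower, upper = set(), set()
    for block in blocks:
        block = set(block)
        if block <= target:      # class lies entirely inside the concept
            lower |= block
        if block & target:       # class overlaps the concept at all
            upper |= block
    return lower, upper

# Hypothetical example: indexing terms grouped into equivalence classes,
# and a concept indexed by several heterogeneous terms.
blocks = [{"rough set", "RS theory"}, {"text mining", "TC"}, {"keyword"}]
target = {"rough set", "RS theory", "text mining"}
low, up = approximations(blocks, target)
```

Elements in the lower approximation certainly belong to the concept, those outside the upper approximation certainly do not, and the gap between the two captures the uncertainty that heterogeneous indexing introduces.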
Keywords/Search Tags:Text representation, Keyword extraction, Rough set theory, Heterogeneous concept treatment, Automatic text categorization, Classification algorithm, Classification rules, Machine learning