Research On Text Classification With Bayesian Model And Relational Technologies

Posted on:2007-04-26

Degree:Doctor

Type:Dissertation

Country:China

Candidate:P Gu

GTID:1118360212468468

Subject:Computer application technology

Abstract/Summary:

With the rapid growth of the Internet and the increase in electronic text, automatic text classification (ATC) has become one of the most active research topics in information retrieval and natural language processing. Although Bayesian classifiers are simple, intuitive, and stable in performance, they still face significant problems in some complex text classification tasks. This dissertation studies ATC with modified Bayesian classifiers. The main results are described in detail as follows.

(1) By explaining the limitations of the Bayesian classifier in text classification, the dissertation introduces some typical generalized Naïve Bayesian classifiers: the Naïve Bayesian classifier, the Semi-Naïve Bayesian classifier, and the Tree Augmented Naïve Bayesian classifier. A theoretical and experimental analysis of each classifier's learning and classification methods provides evidence for further investigation and improvement of the Naïve Bayesian classifier.

(2) A new feature selection algorithm with associative feature extension. Feature selection (FS) can greatly affect the performance of text classification: on different feature sets, even the same classifier can vary widely in classification performance. The dissertation analyzes and presents three main problems with existing feature selection algorithms: imperfection of the feature space, redundancy within the feature set, and low efficiency of the FS algorithm. To solve these problems, a new feature selection algorithm based on correlation analysis is proposed. It first extends the original feature set with associative features, and then employs a modified correlation measure and heuristic formulae for redundancy elimination and feature selection. Because the new FS algorithm avoids pairwise correlation analysis, it has a time complexity of O(N log N). In addition, redundancy elimination increases the information gain of the selected feature set.

(3) Text classification with a Bayesian latent semantic model (BLSM). Unlike previous Bayesian models, we propose an enhancement of the classical document representation through concepts extracted from an ontology. With concepts included in the Bayesian model, mappings between concepts and words and between concepts and classes are constructed. As a result, we can capture the intended word sense and boost the classifiers within the context of documents. Faced with the problem of missing data and...
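
For readers unfamiliar with the baseline discussed in item (1), the following is a minimal sketch of a standard multinomial Naïve Bayesian text classifier with add-one (Laplace) smoothing. It is not the dissertation's modified classifier. The function names `train_naive_bayes` and `predict`, the pre-tokenized input format, and the choice to skip unseen words are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Collect class frequencies and per-class word counts.

    docs   -- list of token lists (tokenization is assumed done elsewhere)
    labels -- list of class labels, one per document
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(tokens, model):
    """Return the class with the highest log posterior for the given tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # Log prior: fraction of training documents in this class.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # assumption: words unseen in training are ignored
            # Add-one smoothing avoids zero probabilities for words
            # that never occurred in this class.
            likelihood = (word_counts[label][tok] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The independence assumption is visible in the inner loop: each word contributes its own log likelihood term regardless of the other words in the document. Semi-Naïve and Tree Augmented variants relax this assumption by letting some features depend on one another.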

Keywords/Search Tags:

Text Classification, Semantic, N-Gram, Correlation Analysis, Semi-supervised Learning, Co-training

Related items

1	Text Classification Based On Semi-supervised Learning
2	Research On Semi-supervised Short Text Classification Based On Co-operative Training
3	Research On Short Text Classification Of Semi-supervised Pre-training Based On Autoencoders And Word Order Dependencies
4	Research On The Text Classification Based On The Semi-supervised Learning
5	Research On Multi-label Text Classification Based On Semi-Supervised Learning
6	Chinese Question Classification, Based On Semi-supervised Learning
7	Research On Text Classification Algorithms Based On Machine Learning
8	Research On Text Classification Algorithms Based On Semi-supervised Learning
9	Research On Emotion Classification Of Weibo Based On Improved Semi-supervised Tri-training
10	Research On Text Entity Relation Extraction Based On Semi-supervised Learning