Research On Text Classification With Bayesian Model And Relational Technologies

Posted on: 2007-04-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P Gu
Full Text: PDF
GTID: 1118360212468468
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of the Internet and the increase in electronic text, automatic text classification (ATC) has become one of the hottest research issues in information retrieval and natural language processing. Although Bayesian classifiers are simple, intuitive, and stable in performance, they still face significant problems in some complex text classification tasks. This dissertation studies the problems of ATC with modified Bayesian classifiers; the main results are described below.

(1) After explaining the limitations of the Bayesian classifier for text classification, the dissertation introduces several typical generalized Naïve Bayesian classifiers: the Naïve Bayesian classifier, the Semi-Naïve Bayesian classifier, and the Tree Augmented Naïve Bayesian classifier. Theoretical and experimental analysis of each classifier's learning and classification method provides evidence for further investigation and improvement of the Naïve Bayesian classifier.

(2) A new feature selection algorithm with associative feature extension. Feature selection (FS) can greatly affect the performance of text classification: on different feature sets, even the same classifier can differ considerably in classification performance. The dissertation analyzes and presents three main problems with existing feature selection algorithms: imperfection of the feature space, redundancy within the feature set, and low efficiency of the FS algorithm. To solve these problems, a new feature selection algorithm based on correlation analysis is proposed. It first extends the original feature set with associative features, and then applies a modified correlation measure and heuristic formulae for redundancy elimination and feature selection. Because the new FS algorithm avoids pairwise correlation analysis, its time complexity is O(N log N). In addition, redundancy elimination increases the information gain of the selected feature set.

(3) Text classification with a Bayesian latent semantic model (BLSM). Unlike previous Bayesian models, the proposed model enhances the classical document representation with concepts extracted from an ontology. With concepts included in the Bayesian model, mappings between concepts and words and between concepts and classes are constructed. As a result, the intended word sense can be captured within the context of a document, which boosts the classifiers. Faced with the problem of missing data and...
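
For readers unfamiliar with the baseline named in item (1), the following is a minimal sketch of a standard multinomial Naïve Bayesian text classifier with Laplace smoothing. It is not the dissertation's modified classifier; the function names `train_naive_bayes` and `classify`, the smoothing parameter `alpha`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents.

    `alpha` is the Laplace smoothing constant (an illustrative default).
    """
    vocab = {word for doc in docs for word in doc}
    docs_per_class = Counter(labels)

    # Count how often each word occurs in the documents of each class.
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in docs_per_class.items()}
    log_likelihood = {}
    for c in docs_per_class:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def classify(doc, log_prior, log_likelihood, vocab):
    """Return the class with the highest posterior log score.

    Words outside the training vocabulary are ignored.
    """
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in doc if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy training data, invented for this example only.
train_docs = [
    ["goal", "match", "team"],
    ["team", "win", "score"],
    ["stock", "market", "price"],
    ["market", "trade", "price"],
]
train_labels = ["sports", "sports", "finance", "finance"]

model = train_naive_bayes(train_docs, train_labels)
prediction = classify(["team", "score", "goal"], *model)
print(prediction)
```

The sketch treats every word as conditionally independent given the class, which is the assumption that the Semi-Naïve and Tree Augmented variants mentioned in the abstract relax.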
Keywords/Search Tags: Text Classification, Semantic, N-Gram, Correlation Analysis, Semi-supervised Learning, Co-training