Research And Implementation Of A Rich Format Text Classification Method

Posted on:2007-04-28

Degree:Master

Type:Thesis

Country:China

Candidate:F Zhu

GTID:2178360185978457

Subject:Computer application technology

Abstract/Summary:

This thesis introduces a classification approach for rich format text, aimed at OpenOffice.org documents that follow the OpenDocument standard. Common plain text classification methods perform poorly when applied directly to rich format text. The thesis analyzes the reasons for this, proposes the factors that should be taken into account when modeling rich format text classification, and groups them into seven aspects. It then parses OpenOffice.org documents from a text classification viewpoint and describes methods for extracting the content, formatting, structure and descriptive information most relevant to classification. On this basis it constructs three classification models for OpenOffice.org documents: the label components classifier, the structure components classifier and the comprehensive classifier. All three classifiers are implemented with Naïve Bayes. The thesis reports closed tests on the Fudan corpus and open tests on a corpus of documents randomly downloaded from the Internet, and analyzes the results in detail. The results indicate that the three classifiers described in the thesis can automatically classify OpenOffice.org documents and perform well.
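
As an illustration of the general idea only, and not the thesis's actual models, the following Python sketch reads headings and paragraphs from the content.xml of an OpenDocument text file and feeds weighted token counts to a small multinomial Naive Bayes classifier. The names extract_tokens, NaiveBayes and make_odt, the heading_weight parameter and the whitespace tokenizer are all assumptions made for this example. The abstract does not specify how formatting and structure enter the three classifiers.

```python
import io
import math
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Namespace of OpenDocument text elements (text:h = heading, text:p = paragraph).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_tokens(odt_bytes, heading_weight=3):
    """Return weighted token counts from an .odt file given as bytes.

    heading_weight is a hypothetical knob: heading tokens are counted that
    many times, one simple way to let formatting influence the model.
    """
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    counts = Counter()
    for elem in root.iter():
        if elem.tag == f"{{{TEXT_NS}}}h":
            weight = heading_weight
        elif elem.tag == f"{{{TEXT_NS}}}p":
            weight = 1
        else:
            continue
        # Naive whitespace tokenizer; Chinese text would need a real segmenter.
        for token in "".join(elem.itertext()).lower().split():
            counts[token] += weight
    return counts


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()
        self.class_tokens = defaultdict(Counter)
        self.vocab = set()

    def fit(self, samples):
        # samples: iterable of (token_counts, label) pairs.
        for counts, label in samples:
            self.class_docs[label] += 1
            self.class_tokens[label].update(counts)
            self.vocab.update(counts)
        return self

    def predict(self, counts):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            tokens = self.class_tokens[label]
            denom = sum(tokens.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for token, n in counts.items():
                if token not in self.vocab:
                    continue  # ignore tokens never seen in training
                score += n * math.log((tokens[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def make_odt(heading, paragraph):
    """Build a minimal in-memory zip holding only content.xml, for demos."""
    xml = (
        '<office:document-content '
        'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
        f'xmlns:text="{TEXT_NS}"><office:body><office:text>'
        f'<text:h>{heading}</text:h><text:p>{paragraph}</text:p>'
        '</office:text></office:body></office:document-content>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", xml)
    return buf.getvalue()
```

A real OpenDocument file also carries styles.xml and meta.xml, which would be the natural sources for formatting and descriptive information. This sketch reads only content.xml to stay short.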

Keywords/Search Tags:

text classification, rich format text classification, OpenDocument, classification modeling, feature selection

Related items

1	Research On Text Classification Of Web Text Mining
2	Research On Key Techniques And Applications In Text Classification
3	A Study Of Text Classification Algorithms Based On Feature Selection
4	Research On Text Classification And Its Related Technologies
5	Classification Research On News Text Classification Based On Feature Selection Method
6	Research And Improvement Of Automatic Classification Technology For Chinese Text
7	Research And Improvement On Feature Selection And Classification Algorithms For Text Classification Based On KNN
8	Based On The Rapid Large-scale Text Hierarchical Classification Problem Of Centralized
9	Contributions To Several Key Issues Of Associative Text Classification
10	Research On Feature Generation Methods For Text Sentiment Classification