Research And Implementation Of A Rich Format Text Classification Method

Posted on: 2007-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: F Zhu
Full Text: PDF
GTID: 2178360185978457
Subject: Computer application technology
Abstract/Summary:
This thesis introduces a classification approach for rich format text, aimed at classifying OpenOffice.org documents that follow the OpenDocument standard. Common plain text classification methods perform poorly when applied directly to rich format texts. By analyzing the reasons, the thesis identifies the factors that should be taken into account when modeling rich format text classification and groups them into seven aspects. It analyzes and parses OpenOffice.org documents from a text classification viewpoint, describes methods for extracting the content, formatting, structure and descriptive information most relevant to classification, and then constructs three classification models for OpenOffice.org documents: the label components classifier, the structure components classifier and the comprehensive classifier. All three classifiers are implemented with Naïve Bayes. The thesis carries out closed testing on the Fudan corpus and open testing on a corpus randomly downloaded from the Internet, and analyzes the results in detail. The results indicate that the three classifiers described in the thesis can automatically classify OpenOffice.org documents and work quite well.
Keywords/Search Tags: text classification, rich format text classification, OpenDocument, classification modeling, feature selection
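
Illustrative sketch (not taken from the thesis): the abstract does not give the details of the three classifiers, so the Python code below only illustrates the general idea of using document structure in a Naive Bayes classifier. It relies on two facts about the OpenDocument standard: a document is a ZIP package whose body is stored in content.xml, and headings and paragraphs are marked with text:h and text:p elements. Everything else is an assumption made for illustration: the helper make_toy_odt, the function names, the heading weight of 3, and the Laplace smoothing are hypothetical choices and should not be read as the thesis's label components, structure components or comprehensive models.

```python
import io
import math
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Namespaces defined by the OpenDocument standard.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def make_toy_odt(heading, paragraph):
    """Hypothetical helper: build a minimal in-memory package that holds
    only content.xml (a real .odt file has more members)."""
    xml = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}"><office:body><office:text>'
        f"<text:h>{heading}</text:h><text:p>{paragraph}</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", xml)
    return buf.getvalue()


def extract_components(odt_bytes):
    """Split the text of content.xml into headings and paragraphs."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    parts = {"heading": [], "paragraph": []}
    for el in root.iter():
        if el.tag == f"{{{TEXT_NS}}}h":
            parts["heading"].append("".join(el.itertext()))
        elif el.tag == f"{{{TEXT_NS}}}p":
            parts["paragraph"].append("".join(el.itertext()))
    return parts


def features(parts, heading_weight=3):
    """Weighted term counts; heading terms count heading_weight times
    (the weight is an assumption, not a value from the thesis)."""
    counts = Counter()
    for kind, weight in (("heading", heading_weight), ("paragraph", 1)):
        for text in parts[kind]:
            for token in re.findall(r"[a-z]+", text.lower()):
                counts[token] += weight
    return counts


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per class
        self.term_counts = defaultdict(Counter)  # weighted term counts per class
        self.vocab = set()

    def fit(self, docs, labels):
        for counts, label in zip(docs, labels):
            self.class_docs[label] += 1
            self.term_counts[label].update(counts)
            self.vocab.update(counts)

    def predict(self, counts):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            # log prior plus weighted, smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            denom = sum(self.term_counts[label].values()) + len(self.vocab)
            for term, weight in counts.items():
                score += weight * math.log(
                    (self.term_counts[label][term] + 1) / denom
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

To try it, train a NaiveBayes instance on features(extract_components(data)) for a few labeled documents and call predict on the features of a new one. Setting heading_weight to 1 reduces the sketch to a plain text classifier, which makes it easy to compare the two settings on the same data.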