This thesis introduces a classification approach for rich-format text, aimed at classifying OpenOffice.org documents that follow the OpenDocument standard. Common plain-text classification methods perform poorly when applied directly to rich-format texts. The thesis analyzes the reasons for this, identifies the factors that should be taken into account when modeling rich-format text classification, and groups them into seven aspects. It then analyzes and parses OpenOffice.org documents from a text classification viewpoint and describes methods for extracting the content, formatting, structure, and descriptive information most relevant to classification. On this basis it constructs three classification models for OpenOffice.org documents: the label components classifier, the structure components classifier, and the comprehensive classifier. All three classifiers are implemented with Naïve Bayes. The thesis reports closed testing on the Fudan corpus and open testing on a corpus randomly downloaded from the Internet, together with a detailed analysis of the results. The results indicate that the three classifiers described in the thesis can automatically classify OpenOffice.org documents and work quite well.
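
As a rough illustration of the extraction step mentioned above, the sketch below pulls heading and paragraph text out of an OpenDocument text file. It relies only on two facts about the format: an OpenDocument file is a ZIP package whose body is stored in `content.xml`, and headings and paragraphs appear there as `text:h` and `text:p` elements. The function names `extract_components` and `build_sample_document`, the returned dictionary layout, and the sample text are assumptions made for this example; this is not the thesis's own extractor, and it ignores formatting and descriptive metadata.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the OpenDocument "text" vocabulary (text:h, text:p, text:span).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
HEADING_TAG = f"{{{TEXT_NS}}}h"
PARAGRAPH_TAG = f"{{{TEXT_NS}}}p"


def extract_components(odf_bytes):
    """Return heading and paragraph text from an OpenDocument text file.

    Hypothetical helper: the dictionary layout is an assumption for this
    sketch, not a structure defined by the thesis.
    """
    with zipfile.ZipFile(io.BytesIO(odf_bytes)) as package:
        root = ET.fromstring(package.read("content.xml"))

    components = {"headings": [], "paragraphs": []}
    for element in root.iter():
        if element.tag == HEADING_TAG:
            target = components["headings"]
        elif element.tag == PARAGRAPH_TAG:
            target = components["paragraphs"]
        else:
            continue
        # itertext() also collects text nested in inline elements such as text:span.
        text = "".join(element.itertext()).strip()
        if text:
            target.append(text)
    return components


def build_sample_document():
    """Build a tiny in-memory package holding only content.xml (illustrative data)."""
    content = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<office:document-content '
        'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
        f'xmlns:text="{TEXT_NS}">'
        '<office:body><office:text>'
        '<text:h text:outline-level="1">Market report</text:h>'
        '<text:p>Stock prices rose <text:span>sharply</text:span> today.</text:p>'
        '</office:text></office:body>'
        '</office:document-content>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("content.xml", content)
    return buffer.getvalue()


sample = extract_components(build_sample_document())
print(sample)
```

Keeping headings and paragraphs in separate lists is one simple way to let a classifier weight structural components differently; the text collected for each component could then be tokenized and passed to a Naïve Bayes classifier.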