Study And Implementation Of Document Content Extraction And Feature Selection

Posted on: 2012-06-28
Degree: Master
Type: Thesis
Country: China
Candidate: J Jie
Full Text: PDF
GTID: 2248330395955678
Subject: Computer software and theory
Abstract/Summary:
As more and more information exists in the form of electronic documents, the amount of text-processing software keeps growing. Traditional document processing systems can no longer meet the demands for scalability and versatility. Meanwhile, existing feature selection techniques take only word frequency and semantics into account and ignore the importance of words within the original documents.

To address these issues, and based on military document processing, this thesis defines a unified semi-structured text model for multi-format documents. The model unifies documents of different formats and retains the semantic structure information of the words in the original documents. On the basis of this model, content extraction and feature selection for HTML, XML, PDF and Word documents are studied. For HTML content extraction, the basic DOM tree-based content extraction algorithm is improved. In addition, a new feature selection algorithm based on content properties and information gain is proposed to solve the problems of existing feature selection algorithms. This algorithm improves the performance of feature selection, and the selected feature set carries more discriminative information.

A multi-format document extraction system is designed and implemented to verify the algorithms in the thesis. Finally, the experimental results show that the improved DOM tree-based HTML content extraction algorithm is feasible and effective, and that the feature selection algorithm based on content properties and information gain outperforms traditional methods.
Keywords/Search Tags: Content Extraction, Feature Selection, Information Gain, DOM Tree
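
The abstract does not give the formula of the proposed algorithm, so the following is only a minimal sketch of the classical information gain criterion that it builds on, not the thesis's method. It assumes each document is a set of tokens with one class label, and it does not include the content-property weighting described above. The function names (`entropy`, `information_gain`, `select_features`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """Classical information gain of a term: H(C) - H(C | term present/absent).

    docs   -- list of token sets, one per document (assumed representation)
    labels -- list of class labels, aligned with docs
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    total = len(labels)
    conditional = (
        len(with_term) / total * entropy(with_term)
        + len(without_term) / total * entropy(without_term)
    )
    return entropy(labels) - conditional


def select_features(docs, labels, k):
    """Return the k terms with the highest information gain (ties broken by name)."""
    vocabulary = set().union(*docs)
    ranked = sorted(
        vocabulary, key=lambda t: (-information_gain(docs, labels, t), t)
    )
    return ranked[:k]


# Toy corpus, invented purely for illustration.
docs = [
    {"tank", "radar"},
    {"tank", "supply"},
    {"budget", "supply"},
    {"budget", "radar"},
]
labels = ["ops", "ops", "admin", "admin"]
top_terms = select_features(docs, labels, 2)
```

In this toy corpus, "tank" and "budget" each separate the two classes perfectly, while "radar" and "supply" occur equally often in both classes and carry no information about the label. A variant in the spirit of the thesis would additionally weight each term by where it appears in the original document (for example, in a title versus body text), but the abstract does not specify how.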