Study And Implementation Of Document Content Extraction And Feature Selection

Posted on:2012-06-28

Degree:Master

Type:Thesis

Country:China

Candidate:J Jie

GTID:2248330395955678

Subject:Computer software and theory

Abstract/Summary:

As more and more information is stored in electronic documents, the amount of text processing software is growing. Traditional document processing systems, however, can no longer meet the demands for scalability and versatility. Meanwhile, existing feature selection techniques consider only word frequency and semantics, ignoring the importance of words in the original documents.

To address these issues, and in the context of military document processing, this thesis defines a unified semi-structured text model for multi-format documents. The model unifies multi-format documents effectively and retains the semantic structure information of the words in the original documents. On the basis of this model, content extraction and feature selection for HTML, XML, PDF and Word documents are studied. For HTML content extraction, the basic DOM tree-based content extraction algorithm is improved. In addition, a new feature selection algorithm based on content properties and information gain is proposed to address the shortcomings of existing feature selection algorithms. This algorithm improves the performance of feature selection, and the selected feature set carries more distinguishing information. A multi-format document extraction system is designed and implemented to verify the algorithms in the thesis.

Finally, the experimental results show that the improved HTML DOM tree-based content extraction algorithm is feasible and valid, and that the feature selection algorithm based on content properties and information gain performs better than traditional methods.
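The abstract names information gain as one ingredient of the proposed feature selection algorithm but does not give a formula. As a point of reference only, the following minimal sketch computes the standard information gain of a term over labeled documents, IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t). It is not the thesis's algorithm: the content-property weighting is not described here and is left out. The function names, the toy documents, and the class labels are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """Standard information gain of `term` with respect to the class labels.

    `docs` is a list of term sets, one per document; `labels` holds the
    class of each document in the same order. Assumes at least one document.
    """
    with_term = [c for d, c in zip(docs, labels) if term in d]
    without_term = [c for d, c in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


# Hypothetical toy corpus: four documents, two classes.
docs = [
    {"tank", "deploy"},
    {"tank", "supply"},
    {"budget", "supply"},
    {"budget", "report"},
]
labels = ["ops", "ops", "admin", "admin"]

# Rank the vocabulary by information gain, highest first.
vocabulary = sorted(set().union(*docs))
ranked = sorted(
    vocabulary,
    key=lambda t: information_gain(docs, labels, t),
    reverse=True,
)
```

In this toy corpus, "tank" and "budget" each split the two classes perfectly and so score highest, while "supply" appears once in each class and scores zero. A feature selector built on this measure would keep the top-ranked terms.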

Keywords/Search Tags:

Content Extraction, Feature Selection, Information Gain, DOM Tree

Related items

1	Study On Model And Algorithm Of Dynamic Feature Fusion Based On Information Sources Selection And Sequential Extraction
2	Self-Adaptive Webpage Content Extraction Via Tag Path Features
3	Maximum Information Gain Relief Algorithm And Its Application On Telecommunication Data Feature Selection
4	Research On Information Gain Based Software Birthmark
5	Research On The Influencing Factors Of Enterprise Microblog Forwarding Effectiveness Based On Content Features
6	Research On Feature Selection Method For Short Text
7	Feature Extraction And Feature Fusion For Content-Based Image Retrieval
8	Improved Feature Selection Methods For Web Pages Based On DIV Iterative Search And Information Gain
9	The Study Of Some Issues For Unsupervised And Semi-supervised Dimensionality Reduction
10	Research On Feature Selection Algorithm Of Spam Filtering