Table Information Extraction Based On Web Structure

Posted on:2013-07-25

Degree:Master

Type:Thesis

Country:China

Candidate:Y Liu

Full Text:PDF

GTID:2248330377960738

Subject:Computer application technology

Abstract/Summary:

In recent years, with the development of the Internet, the Web has become the largest and most complex knowledge base in the world, and more and more people obtain the information they need from it. Web information extraction methods have emerged as a result. The Web contains various kinds of data, such as structured table data, semi-structured Web data, and unstructured text pages. Web tables are widespread in practice, appearing in online shopping pages, supply-and-demand information pages, and search result pages, so extracting structured table data from Web tables is a necessary and significant problem. However, these semi-structured Web tables are difficult to use directly in Web application systems such as user recommendation and supply-and-demand analysis. The main research topic of this dissertation is therefore information extraction from semi-structured Web tables.

Web pages can be parsed into tree structures. Analysis of examples shows that Web table information presents a conspicuous hierarchical structure in the parse tree. Moreover, for homologous Web table data regions, the corresponding sub-tree structures are similar. This paper proposes EtractDRs, a data region extraction method based on top-down tree edit distance. It uses the tree edit distance to measure the similarity of tree structures, merges structures whose edit distances are less than a pre-specified threshold to form candidate table data regions, and applies heuristic rules to obtain the final data regions. The main work of this paper is as follows:

(1) A supervised learning algorithm. It makes full use of the page structure to process data: web pages are parsed into DOM trees, and tree paths are used to extract web table data without analyzing the specific content of the pages. This method is simple and applicable.

(2) An unsupervised learning algorithm. We use a restricted top-down tree edit distance method; given the HTML page code and the structural characteristics of the parse tree, the top-down tree edit distance is the most appropriate approach for comparing the structures of information pages.
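The idea of measuring sub-tree similarity with a top-down tree edit distance and merging similar structures under a threshold can be illustrated with a minimal sketch. It is not the thesis's implementation: the `Node` class, the unit edit costs, the function names, and the rule of grouping only adjacent siblings are all assumptions made for illustration. In the top-down variant assumed here, the two roots are always mapped to each other, and insertions and deletions apply to whole sub-trees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical parse-tree node: a tag name plus ordered children."""
    tag: str
    children: tuple = ()


def size(node):
    """Number of nodes in the sub-tree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def top_down_distance(a, b):
    """Top-down tree edit distance with unit costs (an assumed cost model).

    Roots are mapped to each other (cost 1 if the tags differ). Children are
    aligned like a string edit distance, where deleting or inserting a child
    costs the size of its whole sub-tree and matching two children costs
    their recursive distance.
    """
    relabel = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),      # delete sub-tree
                dp[i][j - 1] + size(b.children[j - 1]),      # insert sub-tree
                dp[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + dp[m][n]


def group_similar_siblings(parent, threshold):
    """Merge runs of adjacent children whose distance is below `threshold`.

    Each run of two or more children is returned as one candidate data
    region. Heuristic filtering of the candidates is not modelled here.
    """
    groups, current = [], []
    for child in parent.children:
        if current and top_down_distance(current[-1], child) < threshold:
            current.append(child)
        else:
            if len(current) > 1:
                groups.append(current)
            current = [child]
    if len(current) > 1:
        groups.append(current)
    return groups
```

Because the comparison looks only at tags and tree shape, the sketch never inspects cell text, which matches the structure-only spirit of the approach described above. A real system would build `Node` trees from an HTML parser and would likely normalize the distance by sub-tree size before applying the threshold.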

Keywords/Search Tags:

Web table, information extraction, tree edit distance, DOM tree, string edit distance

Related items

1	Web Information Extracting Based On Tree Edit Distance
2	Research On Deep Web Information Extraction Technology
3	The Research Of Semi-structured Web Pages Information Extraction
4	Research And Application Of String Approximate Matching Algorithm Based On Multivariate Information
5	Workflow Application Of Clustering Tree Edit Distance
6	Research On Automatic Web Information Extraction Technique
7	Sequentially Matching Similarity String Algorithm Research
8	Storage Optimization And Tree Vertical Merging Algorithm Of Tai Tree Editing Distance Algorithm
9	Efficient Approximate String Matching With Edit Distance Constraint
10	Algorithms Based On Visual Similarity Of The Research In Information Extraction And Implementation