
Table Information Extraction Based On Web Structure

Posted on: 2013-07-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
Full Text: PDF
GTID: 2248330377960738
Subject: Computer application technology
Abstract/Summary:
In recent years, with the development of the Internet, the Web has become the largest and most complex knowledge base in the world. More and more people get the information they need from the Web, and Web information extraction methods have emerged as a result. The Web holds many kinds of data, such as structured table data, semi-structured Web data, and unstructured text pages. Web tables are widespread, appearing in online shopping pages, supply-and-demand information pages, and search result pages, so extracting structured table data from Web tables is a necessary and significant problem. However, these semi-structured Web tables are difficult to use directly in Web application systems such as user recommendation and supply-and-demand analysis. The main research topic of this dissertation is therefore information extraction from semi-structured Web tables.

Web pages can be parsed into tree structures. Analysis of examples shows that Web table information presents a conspicuous hierarchical structure in the parse tree. Moreover, for Web table data regions of the same origin, the corresponding sub-tree structures are similar to one another. This paper proposes EtractDRs, a data region extraction method based on top-down tree edit distance. It uses the tree edit distance to measure the similarity of tree structures, merges structures whose edit distances are less than a pre-specified threshold to form candidate table data regions, and applies heuristic rules to obtain the final data regions. The main work of this paper is as follows:

(1) A supervised learning algorithm that makes full use of the page structure to process data. Web pages are parsed into DOM trees, and tree paths are used to extract Web table data without analyzing the specific content of the pages. This method is simple and practical.

(2) An unsupervised learning algorithm that uses a restricted top-down tree edit distance. Given the HTML page code and the structural characteristics of the parse tree, the top-down tree edit distance is the most appropriate way to compare the structures of information pages.
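As an illustration of the tree-path idea in item (1), the following is a minimal sketch, not the thesis implementation. It assumes well-formed HTML and a known path from the root to the table rows; the names `PathExtractor`, `extract_rows`, `row_path`, and `cell_tag` are hypothetical and chosen only for this example. Cells are selected purely by their position in the parse tree, without inspecting their content.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they are not pushed on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathExtractor(HTMLParser):
    """Collect cell text located at a fixed tag path (hypothetical helper)."""

    def __init__(self, row_path, cell_tag="td"):
        super().__init__()
        self.row_path = list(row_path)              # e.g. html/body/table/tr
        self.cell_path = self.row_path + [cell_tag]  # path of one table cell
        self.stack = []                              # open tags, root first
        self.rows = []                               # one list of strings per row

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if self.stack == self.row_path:
            self.rows.append([])        # a new row starts here
        elif self.stack == self.cell_path:
            self.rows[-1].append("")    # a new cell starts in the current row

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Keep text found in a cell or in any element nested inside it.
        if self.stack[:len(self.cell_path)] == self.cell_path:
            self.rows[-1][-1] += data.strip()


def extract_rows(html, row_path, cell_tag="td"):
    """Return the table rows found at row_path as lists of cell strings."""
    parser = PathExtractor(row_path, cell_tag)
    parser.feed(html)
    parser.close()
    return parser.rows


SAMPLE = """<html><body><table>
<tr><td>Pen</td><td>1.20</td></tr>
<tr><td>Ink</td><td>3.50</td></tr>
</table></body></html>"""

rows = extract_rows(SAMPLE, ["html", "body", "table", "tr"])
```

In this sketch the path is supplied by hand; in a supervised setting it would presumably be derived from labeled example pages, which is not shown here.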
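The top-down tree edit distance and the threshold-based merging can be sketched as follows. This is a sketch under stated assumptions, not the EtractDRs algorithm itself: it assumes a Selkow-style top-down distance (relabel the root at cost 1, insert or delete whole subtrees at a cost equal to their size), represents a node as a `(label, children)` tuple, and groups only adjacent sibling subtrees. The function names and the threshold value are hypothetical, and the heuristic rules that select the final data regions are not shown.

```python
def size(tree):
    """Number of nodes in a (label, children) tree."""
    return 1 + sum(size(child) for child in tree[1])


def top_down_distance(a, b):
    """Selkow-style top-down edit distance between two ordered trees.

    Assumed costs: relabeling a node costs 1; inserting or deleting a
    subtree costs its number of nodes.
    """
    root_cost = 0 if a[0] == b[0] else 1
    ca, cb = a[1], b[1]
    m, n = len(ca), len(cb)
    # d[i][j]: cost of turning the first i children of a
    # into the first j children of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(ca[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(cb[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(ca[i - 1]),          # delete a subtree
                d[i][j - 1] + size(cb[j - 1]),          # insert a subtree
                d[i - 1][j - 1] + top_down_distance(ca[i - 1], cb[j - 1]),
            )
    return root_cost + d[m][n]


def candidate_regions(siblings, threshold):
    """Group adjacent sibling subtrees whose distance is below threshold."""
    regions, current = [], []
    for node in siblings:
        if current and top_down_distance(current[-1], node) < threshold:
            current.append(node)
        else:
            if len(current) > 1:
                regions.append(current)
            current = [node]
    if len(current) > 1:
        regions.append(current)
    return regions


# Toy sub-trees: two similar table rows and one unrelated navigation block.
row2 = ("tr", [("td", []), ("td", [])])
row3 = ("tr", [("td", []), ("td", []), ("td", [])])
nav = ("div", [("a", [])])

regions = candidate_regions([nav, row2, row3, row2], threshold=2)
```

With these toy trees, the two row shapes differ by one inserted cell, while the navigation block differs from a row in both its root label and its children, so only the rows are merged into a candidate region.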
Keywords/Search Tags: Web table, information extraction, tree edit distance, DOM tree, string edit distance