Research On Data Extraction Of Deep Web Based On Visual Information And Tree Match

Posted on: 2016-04-19
Degree: Master
Type: Thesis
Country: China
Candidate: S W Fan
GTID: 2308330461984230
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of network technology, the Web has become an enormous repository of information. The Deep Web consists of Web databases that can be accessed online. Because it holds a large amount of highly structured information with comprehensive coverage of many domains, the Deep Web has great value for analysis and data mining applications. Driven by demand from electronic commerce, market intelligence, and other applications, how to obtain data from the Deep Web for in-depth analysis, and thereby provide more valuable services such as price comparison systems and metasearch, has become a hot research topic. Deep Web data integration emerged to make effective use of the Deep Web. It comprises data acquisition, data extraction, and data integration, of which data extraction is the key step.

Because the Deep Web is massive and heterogeneous, data extraction is very challenging. The main difficulties are: (1) The Deep Web spans many domains and holds a large amount of data, so extraction must be automatic. (2) Different Deep Web pages have different appearances, so an extraction method must be adaptable enough to remain accurate and efficient.

Based on visual information and tree matching techniques, this thesis proposes an automatic extraction method for Deep Web pages that contain a list of semi-structured data records. The main contributions and innovations are the following two points:

(1) Identification and extraction of data records in the list page. Web pages are designed for users to browse, so they carry a wealth of visual information such as font, layout, and background. To make this visual information convenient to use, we introduce a page representation called the Visual Block Tree.
Compared with other page segmentation techniques such as VIPS, the Visual Block Tree relies on no assumptions or heuristic rules and is therefore more objective. To extract data records, we first identify the data region. Combining the visual features of list pages, we propose a data region identification algorithm that is more adaptable than traditional methods. To identify the data records within the data region, this thesis uses a sequence division strategy. The basic idea is to first cluster the subtrees of the data region node and then divide the sequence of subtrees according to the clustering result. After filtering out noise nodes, we determine the boundary of each data record and finally extract the data records from the data region.

(2) Alignment of data items based on tree matching. Aligning data items means placing items with the same semantics from different records in the same column of a relational table, that is, generating the relational schema. This thesis treats each data record as a tree, so that schema generation can be treated as multiple sequence alignment. First, a strict tree matching pattern is defined; then the Simple Tree Matching (STM) algorithm is used to obtain a maximum matching between two trees. Using the Visual Block Tree data structure, we can prune some operations in STM, reducing the algorithm's complexity from O(n²) to almost linear. Finally, based on STM, we give a schema generation algorithm.
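
As an illustration of the tree matching step, the following is a minimal sketch of the classic Simple Tree Matching recurrence in its textbook form. It is not the thesis's own implementation: the `Node` class, its `label` field, and the example trees are assumptions made for this sketch, and the pruning based on the Visual Block Tree described above is not shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node: a label (e.g. a tag name) plus ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum matching between trees a and b.

    Textbook STM: roots must carry the same label, and children are aligned
    in order with a dynamic program similar to longest common subsequence.
    """
    if a.label != b.label:
        return 0  # different roots: nothing below them can be matched

    m, n = len(a.children), len(b.children)
    # best[i][j] = best matching size using the first i children of a
    # and the first j children of b
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            best[i][j] = max(best[i][j - 1],          # skip child j of b
                             best[i - 1][j],          # skip child i of a
                             best[i - 1][j - 1] + w)  # pair the two children
    return best[m][n] + 1  # +1 for the matched roots


# Two small record trees that differ by one optional item.
record_a = Node("tr", [Node("td", [Node("a")]), Node("td", [Node("span")]), Node("td")])
record_b = Node("tr", [Node("td", [Node("a")]), Node("td")])
print(simple_tree_matching(record_a, record_b))  # 4
```

The unpruned recursion compares every pair of children at each level, which is where the quadratic cost mentioned above comes from. A pruned variant would skip child pairs that cannot match before recursing.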
Keywords/Search Tags:List Page, Visual Block Tree, Deep Web Data Extraction, Tree Matching