Deep Web Mining Combining Vision And DOM Information

Posted on: 2015-11-19
Degree: Master
Type: Thesis
Country: China
Candidate: X M Zou
Full Text: PDF
GTID: 2298330428951941
Subject: Computer system architecture
Abstract/Summary:
The rapid development of the Internet has made it a vast store of information resources covering many fields of the real world. Compared with the Surface Web, the Deep Web contains more abundant data, has more visitors, and grows faster. However, Deep Web pages are generated dynamically, which makes them hard for traditional search engines to index. How to obtain and use Deep Web data effectively has therefore become an important research direction. Deep Web data is presented in query result pages, but the data in these pages is varied and unstructured: easy for users to browse, yet difficult to use. We study automatic data extraction from Deep Web query result pages based on visual and DOM information. Our work includes the following aspects:

(1) Locate the data region. We first analyze the characteristics of Deep Web query result pages to find visual features that can locate the data region. We then collect related Deep Web pages as samples and label the nodes in them. A decision tree that locates the data region from visual information is obtained with Weka, and the data region node is then located by the rules corresponding to the decision tree.

(2) Extract data records. Data record extraction is divided into two steps: locating data records and denoising. In the first step, we propose an algorithm for locating data records according to the DOM structure and visual features of the data record nodes in the page. The results, however, contain not only all the data records but also a small number of noise nodes. In the second step, the similarity of data record nodes is defined by their XPath, and the noise nodes are eliminated by comparing similarity, which yields the data record nodes.

(3) Align data items. The data records are first divided into their data items. A data structure is then designed to facilitate data item alignment. Finally, an XPath-based algorithm for aligning data items is put forward.

(4) Template. Templates are proposed for the data region, data record, and data item according to their features. Using the templates not only increases extraction speed but also makes it easy to extract data from consecutive pages.

The innovations of this paper are as follows:

(1) The concept of XPath is introduced: the similarity of nodes is defined by their XPath, noise nodes are eliminated by comparing similarity, and data items are aligned by comparing XPaths.

(2) The concept of data item granularity is proposed, together with a corresponding way to divide a data record into data items.

Based on the above research, we design and develop an automatic data extraction system for Deep Web query result pages, and solve other problems encountered during extraction, such as the extraction of data loaded asynchronously via AJAX. Experiments show that our method can extract data from Deep Web query result pages quickly and accurately.
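
As an illustration of the XPath-based denoising and alignment steps described above, the following Python sketch shows one way such comparisons could work. The abstract does not state the thesis's similarity formula, threshold, or alignment data structure, so everything here is an assumption: the function names (`generalize`, `xpath_similarity`, `drop_noise`, `align_items`), the prefix-based similarity measure, the 0.8 threshold, and the dictionary representation of a record are all hypothetical.

```python
import re


def generalize(xpath):
    """Split an absolute XPath into steps and drop positional indexes.

    '/html/body/div[2]/ul/li[3]' -> ('html', 'body', 'div', 'ul', 'li')
    """
    steps = [s for s in xpath.split("/") if s]
    return tuple(re.sub(r"\[\d+\]", "", s) for s in steps)


def xpath_similarity(a, b):
    """Assumed measure (not taken from the thesis): length of the common
    leading run of generalized steps, divided by the longer path length."""
    sa, sb = generalize(a), generalize(b)
    if not sa or not sb:
        return 0.0
    common = 0
    for x, y in zip(sa, sb):
        if x != y:
            break
        common += 1
    return common / max(len(sa), len(sb))


def drop_noise(candidates, threshold=0.8):
    """Keep candidates whose mean similarity to all other candidates
    reaches the (assumed) threshold; the rest are treated as noise."""
    kept = []
    for i, cand in enumerate(candidates):
        others = [o for j, o in enumerate(candidates) if j != i]
        if not others:
            kept.append(cand)
            continue
        mean = sum(xpath_similarity(cand, o) for o in others) / len(others)
        if mean >= threshold:
            kept.append(cand)
    return kept


def align_items(records):
    """Align data items across records by their relative XPath.

    Each record is assumed to be a dict mapping a relative XPath to the
    item text. Returns the column paths (in first-seen order) and one row
    per record, with None where a record lacks an item.
    """
    columns = []
    for record in records:
        for path in record:
            if path not in columns:
                columns.append(path)
    rows = [[record.get(path) for path in columns] for record in records]
    return columns, rows


if __name__ == "__main__":
    candidates = [
        "/html/body/div[2]/ul/li[1]",
        "/html/body/div[2]/ul/li[2]",
        "/html/body/div[2]/ul/li[3]",
        "/html/body/div[2]/div[1]/span[1]",  # a stray node, not a record
    ]
    print(drop_noise(candidates))
    print(align_items([{"a": "Title 1", "span": "$10"}, {"a": "Title 2"}]))
```

With this sample input, the three `li` nodes share an identical generalized path and are kept, while the `span` node shares only a three-step prefix with them and falls below the threshold. A real system would likely need a more tolerant measure, since records on actual pages can differ in structure.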
Keywords/Search Tags: Deep Web, vision information, DOM, data extraction, xpath