
Research On Deep Web Data Acquisition Based On Visual Information And DOM Tree

Posted on: 2015-03-06
Degree: Master
Type: Thesis
Country: China
Candidate: X H Li
Full Text: PDF
GTID: 2268330428998402
Subject: Computer software and theory
Abstract/Summary:
With the rapid expansion of information on the Internet in recent years, the commercial value of data is continuously being explored to provide value-added services. Applications such as opinion analysis, meta search, comparison shopping, and big data applications are mostly based on Deep Web data acquisition and integration. As more and more back-end databases appear that hold high-quality, domain-specific information, Deep Web data acquisition and integration remains a popular research field.

To retrieve tuples from the target database effectively and to extract structured data from dynamically generated pages, the main contents of this paper are as follows:

1) Because query interfaces have multiple attributes and top-k features, we first build a data space tree model and prune the tree using heuristic information. Second, we give a dynamic strategy for selecting values of the text field in mixed-attribute interfaces. Experiments verify that this scheme effectively improves data siphoning efficiency.

2) To locate the main data area of a Deep Web page automatically, this paper gives a set of heuristic features and a method for quantifying them, and puts forward a linear weighted method based on the quantized values to mine the main data region.

3) To extract the search results, this paper proposes an algorithm named block-regrouping for data record extraction. It uses the visual information of the search results page together with the DOM label tree of the page to compute visual block similarity. Experiments are conducted to verify the efficiency of this method.

4) To shorten extraction time for records that share the same template, we propose a method to generate a wrapper for the data source.

5) On the basis of the existing work, we design a prototype Deep Web data extraction system. In addition, this paper conducts experiments over controlled and real site databases to illustrate the feasibility of this system.
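The linear weighted method in item 2) can be illustrated with a minimal sketch. The abstract does not list the actual heuristic features or their weights, so the feature names (`area_ratio`, `centrality`, `child_repetition`), the weight values, and the `Region` type below are all hypothetical placeholders. The only assumption carried over from the text is that each candidate region has quantized feature values and that the region with the highest weighted sum is taken as the main data region.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A candidate page region with quantized heuristic features in [0, 1]."""
    name: str
    features: dict


# Hypothetical weights; the thesis abstract does not state the real ones.
WEIGHTS = {"area_ratio": 0.4, "centrality": 0.2, "child_repetition": 0.4}


def score(region, weights=WEIGHTS):
    """Linear weighted sum of the region's quantized feature values."""
    return sum(w * region.features.get(k, 0.0) for k, w in weights.items())


def main_data_region(regions, weights=WEIGHTS):
    """Return the candidate region with the highest weighted score."""
    return max(regions, key=lambda r: score(r, weights))


if __name__ == "__main__":
    candidates = [
        Region("nav", {"area_ratio": 0.1, "centrality": 0.2, "child_repetition": 0.6}),
        Region("results", {"area_ratio": 0.7, "centrality": 0.9, "child_repetition": 0.8}),
    ]
    best = main_data_region(candidates)
    print(best.name, round(score(best), 2))
```

In practice the weights would likely be tuned on labeled pages; here they are fixed by hand purely for illustration.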
Keywords/Search Tags: Deep Web, Data siphoning, Data region mining, Record extraction, Wrapper