Research On Domain-Specific Techniques For Deep Web Data Acquisition

Posted on: 2013-04-21 | Degree: Master | Type: Thesis
Country: China | Candidate: J B Guo | Full Text: PDF
GTID: 2248330371493549 | Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Internet technology, the high-quality information resources hidden in Web databases have received widespread attention because of their well-defined structure and large volume of data. However, these resources are rendered as HTML pages only when a user submits a query through a Web query interface, so traditional search engines cannot reach them; this part of the Web is called the Deep Web. To make full use of Deep Web resources, it is therefore necessary to retrieve the data hidden behind query interfaces and to extract structured data from the query results.

This thesis studies domain-specific techniques for Deep Web data acquisition. The work consists of two parts: data retrieval and data record extraction. The main contributions are summarized as follows:

1) For range-type attributes in Deep Web query interfaces, a sampling-based domain partition method is proposed, which improves the efficiency of data retrieval through top-k interfaces.
2) For classification-type attributes, a data retrieval method based on a hierarchical tree model is employed. The method adjusts the order in which queries are submitted and reduces the number of queries submitted.
3) For text-type attributes, a method for selecting candidate values is proposed. The method screens candidate values according to their distribution in the sample library and increases the average query harvest.
4) Based on the distribution of feature nodes in the query result page, an algorithm for locating data areas is proposed. The algorithm combines the page structure with the attribute features of data records, so as to reduce the influence of page structure variation on the extraction results.
5) For the data record extraction phase, a method is proposed that combines characteristic sequence division with tree similarity. The method not only improves the accuracy of data record extraction but also aligns the extracted data records.

Finally, experiments verify that the proposed algorithms are effective. In addition, an e-commerce Deep Web information integration system is designed on the basis of these results.
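
To make contribution 1) more concrete, the following is a minimal Python sketch of how a range-type attribute might be partitioned for a top-k interface. It is an illustration only and is not taken from the thesis: the simulated interface make_top_k_interface, the helper sample_boundaries, the driver crawl_range, the parts parameter, and the assumption of integer attribute values are all hypothetical. The idea shown is that when a range query is truncated at k results, the range is split at quantiles of a previously collected sample, so that dense regions receive narrower sub-ranges.

```python
from bisect import bisect_left


def make_top_k_interface(records, k):
    """Hypothetical stand-in for a Web query interface that returns at most k matches."""
    ordered = sorted(records)

    def query(lo, hi):
        # Match values in the half-open range [lo, hi).
        start = bisect_left(ordered, lo)
        end = bisect_left(ordered, hi)
        matches = ordered[start:end]
        # The flag reports whether the result list was cut off at k.
        return matches[:k], len(matches) > k

    return query


def sample_boundaries(sample, lo, hi, parts):
    """Choose cut points at sample quantiles inside [lo, hi)."""
    inside = sorted(v for v in sample if lo <= v < hi)
    cuts = {lo, hi}
    if inside:
        for i in range(1, parts):
            cuts.add(inside[i * len(inside) // parts])
    return sorted(cuts)


def crawl_range(query, lo, hi, sample, parts=4):
    """Return (records, queries_issued) for the integer range [lo, hi)."""
    results, truncated = query(lo, hi)
    if not truncated or hi - lo <= 1:
        # Either everything was returned or the range cannot be split further.
        return list(results), 1
    bounds = sample_boundaries(sample, lo, hi, parts)
    if len(bounds) == 2:
        # The sample offers no usable cut point, so fall back to bisection.
        bounds = [lo, (lo + hi) // 2, hi]
    collected, issued = [], 1
    for a, b in zip(bounds, bounds[1:]):
        part, n = crawl_range(query, a, b, sample, parts)
        collected.extend(part)
        issued += n
    return collected, issued


if __name__ == "__main__":
    # Toy data: 200 records over 50 distinct integer values, with k = 10.
    records = [i % 50 for i in range(200)]
    query = make_top_k_interface(records, k=10)
    got, issued = crawl_range(query, 0, 50, sample=records[::7])
    print(len(got), "records retrieved with", issued, "queries")
```

In this sketch a sub-range is re-queried only when its result is still truncated, and a range holding a single value is never split, so more than k records sharing one value cannot be fully retrieved through this attribute alone.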
Keywords/Search Tags:Deep Web, data acquisition, data retrieve, data extraction