
Research On Data Source Selection And Result Cache On Deep Web

Posted on: 2010-08-25
Degree: Master
Type: Thesis
Country: China
Candidate: Z D Qu
Full Text: PDF
GTID: 2218330368999990
Subject: Computer software and theory
Abstract/Summary:
With the development of information technology, the amount of information on the Web is growing rapidly. By the depth at which information resides, the Web can be divided into two categories: the Surface Web and the Deep Web, which differ in how they are accessed. The Surface Web consists of pages that can be reached through hyperlinks and indexed by traditional search engines. In contrast, the data in the Deep Web is hidden in web databases; it cannot be reached directly through static URL links, only through the databases' query interfaces. The Deep Web contains 400-500 times as much information as the Surface Web, and its data is domain-specific and very valuable, so it is worth making the best use of it. To make full use of Deep Web data, two problems must be solved: ensuring high-quality results and ensuring query efficiency.

To ensure result quality, data source selection is a very important step. Existing strategies focus only on the query interfaces of data sources, which is not enough to select the best data sources within the same domain. To solve this problem, this paper proposes an integrated Data Source Selection Model (DSSM) that considers the interface schema, the search mode, the contents of the background databases, and the quality of the data sources together. The model can therefore select the data sources that best satisfy user queries.

To ensure query efficiency, caching is essential. Because of the characteristics of the Deep Web, existing cache systems are not well suited to Deep Web data integration, so this paper proposes a Result Cache Model (RCM). The objects cached in RCM are result records and their pages. The model addresses the definition of the storage structure, data consistency, balanced distributed storage, and cache optimization. Applying the cache yields higher query efficiency.
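The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the DSSM or RCM defined in the thesis: the abstract names the four factors but not how they are measured or combined, so the equal weights, the `SourceScores` fields, and the `rank_sources` function are assumptions made for illustration. Likewise, `ResultCache` uses a simple time-to-live expiry as a stand-in for the data consistency mechanism, which the abstract does not describe.

```python
import time
from dataclasses import dataclass


@dataclass
class SourceScores:
    """Per-source scores in [0, 1]; how each is measured is left open here."""
    name: str
    schema: float   # match between the interface schema and the query
    search: float   # fit of the source's search mode
    content: float  # coverage of the background database contents
    quality: float  # quality of the data source


# Hypothetical equal weights; the abstract does not say how factors combine.
WEIGHTS = {"schema": 0.25, "search": 0.25, "content": 0.25, "quality": 0.25}


def rank_sources(sources, k):
    """Return the names of the k sources with the highest weighted score."""
    def score(s):
        return sum(w * getattr(s, field) for field, w in WEIGHTS.items())
    return [s.name for s in sorted(sources, key=score, reverse=True)[:k]]


class ResultCache:
    """Toy result cache: query -> result records, expired after ttl seconds.

    TTL expiry is an assumed, simplified consistency policy.
    """

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}

    def get(self, query):
        entry = self._store.get(query)
        if entry is None:
            return None
        stored_at, records = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[query]  # stale: drop so the source is queried again
            return None
        return records

    def put(self, query, records):
        self._store[query] = (self.clock(), records)
```

In an integration system along these lines, a query would first be looked up with `get`; only on a miss would the sources chosen by `rank_sources` be queried and their records stored with `put`.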
Keywords/Search Tags: Deep Web, data source selection, cache, schema, instance, quality