Research On Query Planning For Deep Web

Posted on: 2013-11-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Wang
Full Text: PDF
GTID: 2248330377958802
Subject: Computer application technology
Abstract/Summary:
A popular trend in data dissemination involves online data sources known as Web databases, which are hidden behind query forms and together form what is referred to as the Deep Web. In the surface Web, HTML pages are static and data is stored as document files; Deep Web data, by contrast, is stored in backend databases, and dynamic HTML pages are generated only after a user submits a query by filling in an online form. According to statistics from BrightPlanet, Deep Web databases store 500 times as much data as static pages do, and the number of such data sources is still increasing rapidly every year. Research on the Deep Web is therefore both essential and of far-reaching significance.

Web databases are large in scale, autonomous, heterogeneous and dynamic, and sources may have diverse and limited query capabilities. As a result, query processing in Deep Web data integration is more challenging than in a traditional distributed environment. To deal with source autonomy and heterogeneity, this thesis presents a method for describing data sources.

How large is the schema vocabulary that such descriptions must cover? To answer this question, we performed an informal survey: using search engines (e.g., google.com) and Web directories (e.g., invisibleweb.com), we collected a total of 200 sources, with 50 in each of the Movies, Books, Automobiles and MusicRecords domains. The survey found that while sources proliferate, their aggregate schema vocabulary tends to converge at a relatively small size. Inspired by this result, we create an inverted index for each vocabulary. We also present a modular scheme for generating efficient feasible query plans for target queries.
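The idea of an inverted index over a small schema vocabulary can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the source names, the attribute lists and the helper names `build_inverted_index` and `relevant_sources` are hypothetical, and case-folding is assumed as the only normalization.

```python
from collections import defaultdict


def build_inverted_index(source_schemas):
    """Map each attribute term to the set of sources whose schema uses it."""
    index = defaultdict(set)
    for source, attributes in source_schemas.items():
        for attribute in attributes:
            # Assumption: lower-casing is enough to merge spelling variants.
            index[attribute.strip().lower()].add(source)
    return dict(index)


def relevant_sources(index, query_attributes):
    """Return every source that mentions at least one queried attribute."""
    found = set()
    for attribute in query_attributes:
        found |= index.get(attribute.strip().lower(), set())
    return found


# Hypothetical query-form schemas for three book sources.
SCHEMAS = {
    "books_a": ["Title", "Author", "ISBN"],
    "books_b": ["title", "publisher"],
    "books_c": ["author", "price"],
}
INDEX = build_inverted_index(SCHEMAS)
```

Because the aggregate vocabulary stays small, such an index has few keys even as the number of sources grows, so looking up the sources relevant to a query attribute remains a cheap dictionary access.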
Five modules work together to accomplish these tasks: expansion, pretreatment, rewriting, relevant-source search and plan generation. We describe in detail an algorithm for effectively generating logic plans based on the inverted index, and an algorithm for finding an executable ordering for logic plans.

We also show that, because sources restrict how their information can be retrieved, sources not mentioned in a logic plan can contribute to generating efficient feasible query plans, since they can provide useful bindings. We identify the cases in which these off-query accesses are useless, and prove that in those cases efficient feasible query plans can be generated using only the sources in the logic plan. For the cases where off-query accesses are necessary, we propose an algorithm for finding all the useful sources for a logic plan. Experiments show that our algorithm for generating executable query plans has good efficiency, accuracy and scalability.
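To make the notions of executable ordering and off-query access concrete, here is a minimal Python sketch. It uses a generic greedy strategy over binding restrictions and is not the algorithm of the thesis; the function `executable_ordering`, the source names and the attribute sets are all hypothetical. Each source is modeled as a pair of attribute sets: the inputs that must be bound before it can be called, and the outputs it then binds.

```python
def executable_ordering(sources, initially_bound):
    """Greedily order sources so each is called only once its inputs are bound.

    sources: dict mapping name -> (required_inputs, outputs), both sets.
    Returns a list of source names, or None if some source can never run.
    """
    bound = set(initially_bound)
    remaining = dict(sources)
    order = []
    while remaining:
        # Sources whose required inputs are already available (sorted so
        # the result is deterministic).
        ready = sorted(n for n, (req, _) in remaining.items() if req <= bound)
        if not ready:
            return None  # stuck: no remaining source is callable
        name = ready[0]
        order.append(name)
        bound |= remaining.pop(name)[1]  # its outputs become new bindings
    return order


# Hypothetical logic plan: find prices of books by a given author.
PLAN = {
    "titles_by_author": ({"author"}, {"title"}),
    "price_by_isbn": ({"isbn"}, {"price"}),
}
# Hypothetical off-query source that supplies the missing "isbn" binding.
OFF_QUERY = {"isbn_by_title": ({"title"}, {"isbn"})}

plan_only = executable_ordering(PLAN, {"author"})
with_off_query = executable_ordering({**PLAN, **OFF_QUERY}, {"author"})
```

In this example the plan's own sources cannot be ordered, because nothing binds `isbn`; adding the off-query source yields the ordering `titles_by_author`, `isbn_by_title`, `price_by_isbn`. When the plan's sources can already be ordered on their own, the extra source is not needed for feasibility.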
Keywords/Search Tags: Web database, query capabilities, feasible query plans