
Studies On Efficient Query Scheduling And Data Acquiring Techniques Of WEB Data

Posted on: 2018-02-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Y Jiang
Full Text: PDF
GTID: 1368330512486010
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Web technology and the increasing prevalence of network communication devices, Web data grows exponentially and Web services permeate people's daily lives. Meanwhile, the number of Web users also increases over time, and these users rely on Web services to improve their lives as much as possible. The explosion of Web data and Web users drives the development of Web data management systems (WDMSs) that satisfy users' personalized data requests and provide high-quality, value-added data services by acquiring data from Web data sources. In these WDMSs, efficient query scheduling and data acquisition are the keys to user satisfaction and to the success of the systems. Query scheduling prioritizes query executions in order to improve overall performance. Efficient query scheduling in WDMSs gives users a better experience, so system administrators can potentially gain more revenue. On the other hand, the goal of a user query is to obtain high-quality data for decision support, so these data must be acquired from a large number of Web data sources. However, the autonomy, dynamism, overlap and volume of Web data sources make it extremely challenging to acquire fresh and complete Web data efficiently. To provide Web users with high-quality data services, studies on efficient query scheduling and data acquisition for Web data have important social significance and economic value.

Focusing on Web data, this dissertation studies how to improve the performance of query scheduling by sharing the results of common sub-expressions, how to crawl the dynamic deep Web efficiently and incrementally through top-k queries, and how to improve the efficiency of selecting data sources with high relevance and low overlap from a large number of Web data sources. Specifically, our studies include the following four key works:

(1) Query scheduling based on shared results of common sub-expressions

Users register queries in their personal dataspaces and acquire the data they need through Web data management systems. However, the dynamism of Web data calls for efficient query scheduling so that new Web data can be acquired quickly. Existing research ignores the relationships among queries when scheduling them, so redundant work is performed repeatedly and query efficiency degrades. This dissertation proposes a new query scheduling approach that shares the results of common sub-expressions among queries. First, to measure query efficiency and effectiveness comprehensively, it defines user satisfaction as the Query Harvest Rate (QHR), the ratio of the number of new tuples in a query's results to the query processing time. Then it proposes a query splitting strategy based on query expressions to extract common sub-expressions and remove redundant ones. Finally, it derives query priorities that optimize total user satisfaction, so that common results are shared efficiently. Experiments on TPC-H datasets show that the approach can effectively optimize total QHR.

(2) Incremental deep Web crawling through top-k queries

In dynamic deep Web data sources that only allow top-k queries, changed tuples are returned together with unchanged ones, resulting in low crawl efficiency. This dissertation proposes a bottom-up incremental crawling approach based on a query tree to efficiently obtain changed tuples under the query type restriction (top-k queries) and with limited query resources. First, valid leaf queries are generated from a query tree through top-k queries, and their changes and the corresponding query costs are estimated from historical data and domain knowledge. Second, given the estimated query cost and data quality, the incremental Web crawling problem is modeled as a knapsack problem, and an approximation algorithm is proposed that selects an appropriate set of queries to maximize total data quality globally under limited query resources. Experimental results on the Microsoft Academic Graph dataset show that the approach can improve both the efficiency and the effectiveness of crawling dynamic deep Web data sources.

(3) Stratified sampling based overlapping source selection

In many deep Web data sources, the overlap of query results across different sources results in low query efficiency. This dissertation proposes a tuple-level stratified sampling approach for efficiently selecting data sources with high relevance and low overlap. First, we design an error-bounded, tuple-level stratified sampling method to obtain sample tuples and accurately estimate the coverage of a given query in each data source. Second, we propose a partial-sample-based overlap estimation strategy that uses the given samples and the query results of the sources selected so far. Last, we design a kNN-like heuristic method to discover data sources with high relevance and low overlap. Experimental results on the TPC-W synthetic dataset and the Abebooks real dataset show that our approach not only ensures the accuracy of user query results but also improves efficiency over state-of-the-art methods.

(4) T-Music: Personalized Web Music System

Based on the above research on Web data management and its key technologies, this work developed a personalized Web music system that provides multimedia music data services and data management functions. The system architecture comprises three layers: a data servicing layer, a data managing layer, and a data acquiring layer. To improve system performance, T-Music uses efficient query scheduling to acquire more new data, incremental crawling to improve performance under top-k query constraints, and source selection to choose among overlapping sources efficiently. The developed prototype has been widely used. We test the system performance with a real dataset crawled from Sogou Music, and the experimental results show the superiority of this work in the field of Web data management.
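
As a rough illustration of two ideas from items (1) and (2), the Python sketch below computes QHR as the abstract defines it (new tuples divided by processing time) and selects queries under a cost budget. The abstract does not describe the dissertation's approximation algorithm, so the selection step here is a standard greedy quality-per-cost heuristic used as a stand-in. All names (`query_harvest_rate`, `CandidateQuery`, `greedy_select`) and the input estimates are hypothetical.

```python
from dataclasses import dataclass


def query_harvest_rate(new_tuples: int, processing_seconds: float) -> float:
    """QHR per the abstract: new result tuples per unit of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return new_tuples / processing_seconds


@dataclass(frozen=True)
class CandidateQuery:
    name: str           # hypothetical label for a leaf query
    est_quality: float  # assumed estimate of data quality gained
    est_cost: float     # assumed estimate of query cost (must be > 0)


def greedy_select(queries, budget):
    """Greedy knapsack heuristic (a stand-in, not the dissertation's algorithm).

    Visits queries in descending quality/cost order and keeps each one
    that still fits within the remaining budget.
    """
    chosen = []
    remaining = budget
    ranked = sorted(queries, key=lambda q: q.est_quality / q.est_cost, reverse=True)
    for q in ranked:
        if q.est_cost <= remaining:
            chosen.append(q)
            remaining -= q.est_cost
    return chosen
```

For example, a query that returns 30 new tuples in 6 seconds has a QHR of 5 tuples per second. The greedy heuristic is simple but not optimal in general; an exact dynamic-programming knapsack would be an alternative when costs are small integers.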
Keywords/Search Tags:query scheduling, data acquiring, top-k query, incremental crawl, stratified sampling, data source selection