
Research And Implementation On Distributed Holistic Twig Query Processing Based On Nodes Distribution

Posted on: 2016-04-15 Degree: Master Type: Thesis
Country: China Candidate: S Chen Full Text: PDF
GTID: 2428330542489385 Subject: Computer system architecture
Abstract/Summary:
XML has been adopted as a standard format for web data representation and exchange. As the scale of XML data grows sharply, managing massive amounts of XML data effectively while providing fast and efficient retrieval has become a research focus in the field of data mining. Most existing storage and retrieval systems for XML data rely on native XML databases or relational databases, but these systems are not effective on large-scale XML data, and techniques for storing and retrieving XML data in distributed environments are not yet mature. MapReduce is an effective solution for processing huge amounts of data; however, few research results address query processing over massive XML data. Moreover, existing distributed twig query algorithms require structural join operations in the Map phase, which in most cases may produce a large number of useless intermediate results. In addition, algorithms of this kind often need an extra step that decomposes the twig query pattern.

To solve these problems, this thesis puts forward two query schemes based on nodes distribution, the NDTH and DTH algorithms, to realize twig queries over massive XML data. Instead of performing structural joins, both schemes distribute all the nodes that may contribute to the query solution to the same computing node in the Map phase. The Reduce phase can therefore adopt any holistic matching algorithm and choose the one that performs best for the characteristics of the query, such as an algorithm optimized for ancestor-descendant relationships or one optimized for parent-child relationships.

We first propose the NDTH algorithm, which collects global keys through a coordinator site in ComMapReduce. This algorithm can improve the efficiency of query processing and guarantees that the final query solution is complete. Next, based on a study of XML data structure and the MapReduce framework, we analyze the limitations of the fragmentation techniques used in existing MapReduce-based XML query processing methods. We put forward the relax-fragment algorithm, which can fragment XML arbitrarily without depending on query information. Then, based on the relax fragmentation strategy (RFS), we propose the DTH algorithm. The DTH algorithm uses the RF index, which stores ancestor information, to speed up query processing and to guarantee the correctness and completeness of the query results.

Finally, extensive experiments are conducted on real-world datasets, and the results of the two proposed distributed twig query processing algorithms are analyzed. The experimental results show that NDTH and DTH can reduce query processing time on massive XML data, achieving high efficiency and good performance.
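
The nodes-distribution idea described in the abstract can be illustrated with a toy, single-process sketch. This is not the NDTH or DTH algorithm from the thesis. It assumes a region encoding `(start, end, level)` for each element, a twig query restricted to one root tag with ancestor-descendant branches, and a brute-force containment check standing in for a real holistic matcher. All names here (`Node`, `map_phase`, `shuffle`, `reduce_phase`, `twig_query`, the key `"q0"`) are hypothetical.

```python
from collections import defaultdict
from itertools import product
from typing import NamedTuple


class Node(NamedTuple):
    tag: str
    start: int   # position of the opening tag (assumed region encoding)
    end: int     # position of the closing tag
    level: int   # depth in the document tree


def is_ancestor(a: Node, d: Node) -> bool:
    # With region encoding, a contains d exactly when d's interval nests in a's.
    return a.start < d.start and d.end < a.end


def map_phase(fragment, query_tags, key="q0"):
    # No structural join here: just forward every node whose tag occurs in
    # the query, all under the same key so they meet at one reducer.
    for node in fragment:
        if node.tag in query_tags:
            yield key, node


def shuffle(pairs):
    # Stand-in for the framework's grouping of map output by key.
    groups = defaultdict(list)
    for key, node in pairs:
        groups[key].append(node)
    return groups


def reduce_phase(nodes, root_tag, branch_tags):
    # Brute-force matching; a real system would plug in a holistic
    # twig matching algorithm at this point.
    by_tag = defaultdict(list)
    for n in nodes:
        by_tag[n.tag].append(n)
    matches = []
    for root in sorted(by_tag[root_tag], key=lambda n: n.start):
        per_branch = [[d for d in by_tag[t] if is_ancestor(root, d)]
                      for t in branch_tags]
        for combo in product(*per_branch):
            matches.append((root, *combo))
    return matches


def twig_query(fragments, root_tag, branch_tags):
    query_tags = {root_tag, *branch_tags}
    pairs = [kv for frag in fragments for kv in map_phase(frag, query_tags)]
    result = []
    for nodes in shuffle(pairs).values():
        result.extend(reduce_phase(nodes, root_tag, branch_tags))
    return result


# Two fragments that cut through the first <a> subtree on purpose.
fragments = [
    [Node("a", 1, 10, 1), Node("b", 2, 3, 2), Node("c", 4, 7, 2)],
    [Node("b", 5, 6, 3), Node("d", 8, 9, 2),
     Node("a", 11, 16, 1), Node("c", 12, 13, 2), Node("d", 14, 15, 2)],
]

for match in twig_query(fragments, "a", ["b", "d"]):
    print([f"{n.tag}@{n.start}" for n in match])
```

Because every candidate node is routed to the same reducer, the match that spans both fragments is still found without any join in the Map phase. The sketch ignores load balancing, key collection, and indexing, which are the parts the thesis actually addresses.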
Keywords/Search Tags: XML, nodes distribution, MapReduce, Twig query processing, holistic query processing