
Research And Implementation Of Distributed Twig Query Processing Over Massive XML Documents In The Cloud

Posted on: 2015-10-08
Degree: Master
Type: Thesis
Country: China
Candidate: P Zhang
Full Text: PDF
GTID: 2308330482455549
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, XML, a semi-structured data format, has become an important standard for data storage and exchange. As XML documents grow rapidly in size, query processing over XML data has become an increasingly active research direction. Twig pattern queries are one of the core operations of XML querying, but traditional twig algorithms cannot be applied to query processing over massive XML documents. Because previous work has produced few results on distributed twig query processing, how to process twig queries efficiently over massive XML data is a new research topic.

Cloud computing is currently the main technology for processing massive data. Research on massive data management in cloud computing includes data partitioning and distributed query processing. This thesis focuses on XML data partitioning and distributed query processing in order to implement twig query processing over massive XML documents in a cloud computing environment. It first presents a random fragmentation strategy for XML documents named AF (Arbitrarily Fragmentation), which ensures the randomness of fragmentation by recording node information and avoids the partitioning restrictions imposed by structural information. It then proposes a cloud-based distributed query processing algorithm, DTS (DTwigStack), which effectively handles twig query processing over massive XML data. Using the dividing-node information from AF segmentation, all fragments are processed in parallel, and the intermediate query results are combined into the final results. To guarantee that all intermediate results belonging to the same final result are gathered at the same reducer, the ComMapReduce framework, which adds a Coordinator node, is introduced. DTS uses the Coordinator to collect key information from all sites and then returns the summarized global key information to DTS on each mapper.

This thesis designs a series of experiments that evaluate DTS performance with varying numbers of Hadoop slave nodes and varying dataset sizes. Speedup, sizeup and scaleup are also analyzed from the experimental results. The performance evaluation verifies both the effectiveness and the efficiency of DTS for twig query processing over massive XML documents.
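The abstract names speedup, sizeup and scaleup without defining them. The sketch below uses the definitions commonly applied to parallel data processing, which are assumed here and not quoted from the thesis: speedup fixes the dataset and grows the cluster, sizeup fixes the cluster and grows the dataset, and scaleup grows both by the same factor. The function names and the timings in the usage lines are hypothetical placeholders, not measurements from the thesis.

```python
def speedup(time_base_cluster: float, time_larger_cluster: float) -> float:
    """Same dataset, more nodes: ideal value equals the node growth factor."""
    return time_base_cluster / time_larger_cluster


def sizeup(time_base_data: float, time_larger_data: float) -> float:
    """Same cluster, larger dataset: ideal value is at most the data growth factor."""
    return time_larger_data / time_base_data


def scaleup(time_base: float, time_scaled_both: float) -> float:
    """Nodes and data grown by the same factor: ideal value is 1.0."""
    return time_base / time_scaled_both


if __name__ == "__main__":
    # Hypothetical running times in seconds, for illustration only.
    print(speedup(120.0, 40.0))   # base cluster vs. a larger cluster, same data
    print(sizeup(120.0, 300.0))   # base dataset vs. a larger dataset, same cluster
    print(scaleup(120.0, 150.0))  # base setup vs. nodes and data both scaled
```

A speedup close to the node growth factor and a scaleup close to 1.0 would indicate that the distributed algorithm parallelizes well under these definitions.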
Keywords/Search Tags: cloud computing, distributed computing, mass data, XML, twig query processing