Research On Parallel Processing Mechanism Of Large-scale XML Data

Posted on: 2020-05-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: K F Song
GTID: 1368330590958834
Subject: Computer system architecture

Abstract/Summary:
With the rapid development of Internet application technology and the continuous expansion of data centers, eXtensible Markup Language (XML) has become a hot topic in distributed applications. As a special form of semi-structured data, XML has become an important standard for Internet information storage and data exchange, and is widely used in data exchange middleware, Web services, RSS subscription, and large-scale data storage and analysis. Given the rapid growth of XML data and the rapid development of cloud computing and related technologies, three problems in the field of distributed XML applications urgently need solving: how to effectively improve the distributed computing and storage capabilities for large-scale XML data in data centers, how to retrieve the information fragments most relevant to user needs from massive data, and how to cluster documents dynamically to obtain a higher recall rate in retrieval.

In recent years, research on XML data has covered many aspects: from data exchange to heterogeneous system middleware, from node encoding to keyword query, from single applications to distributed applications, and from conventional data to large-scale data. However, traditional XML management methods still have many shortcomings when processing large-scale applications. This thesis studies the key links of XML data management: document node encoding, keyword query, dynamic document clustering, and the computing architecture for XML data exchange. The main contents and innovations are as follows:

(1) To manage XML data effectively and improve query accuracy, one of the most common preprocessing methods is to encode XML nodes. XML data carries complete semantic and structural information. To avoid destroying the contextual semantic relationships when multiple computing nodes logically partition the same document in a distributed environment (i.e., a File is divided into Blocks), an XML node encoding algorithm based on MapReduce is proposed: the Sequence-Depth-Offset Labeling (SDOL) algorithm. The algorithm builds on interval encoding and prefix encoding, and supports the logical representation of ancestor-descendant, parent-child, and sibling relationships between labeled nodes. In addition, a global block index and an inverted hash table index are established for the XML data blocks stored on different computing nodes. Experimental results on real data sets show that the encoding algorithm balances encoding storage space utilization and retrieval efficiency in a distributed environment. Compared with the representative dynamic prefix encoding algorithm IDDewey and the updatable interval encoding method SEQU, the overall performance of the SDOL algorithm is improved by 22.1% on average.

(2) Currently, most XML keyword query algorithms are based on finding the Smallest Lowest Common Ancestor (SLCA) node. On deeply nested XML data, traditional algorithms have several problems, such as repeated computation of common ancestor nodes and a large number of structural join operations when merging multi-keyword query results. To further improve the efficiency of XML data queries, a bottom-up keyword search algorithm, B-SLCA, is proposed; it combines the SDOL encoding scheme with a hash index table and is suited to distributed processing. The algorithm traverses upward from the nodes where the keywords are located until it reaches the SLCA node that covers all of the keywords, or until it reaches the root node. The efficiency and performance of the algorithm are analyzed theoretically and experimentally. The results show that the overall efficiency of the B-SLCA search algorithm based on the hash index is about 11% higher than that of the Twig method based on the CAT index, and about 30% higher than that of the Stack-tree method based on the B+ tree index. This is because the Stack-tree method needs more structural join operations to judge the combination of multi-keyword query results, and the Twig method must traverse two CAT trees in the initial stage of a query, which consumes more disk I/O time, while the B-SLCA method only needs to examine the encoding attributes (node ID, depth, and parent node ID). The B-SLCA keyword search algorithm still maintains good query performance when the number of keywords in a query is high and the frequency of the keywords in the documents is low.

(3) Because the keywords entered by users are diverse, complex, and dynamic, how quickly matching documents can be found among massive XML documents is an important factor in keyword query performance. At present, document clustering mostly adopts static classification, which ignores the dynamic nature of keyword queries: even when the same keywords are queried in a different order, the results differ. To address these problems, a dynamic clustering method for XML documents based on attribute correlation is proposed. When the nodes of XML documents are encoded, static probability statistics are computed on the correlation between the attributes of document nodes; that is, the probability that any two nodes in a document are in a parent-child relationship or a sibling relationship is calculated. Principal component analysis (PCA) is then used to extract the key features of each document and reduce the dimension of the vector space, shrinking the feature vector space from the original hundreds of dimensions to fewer than 10 and effectively simplifying computation in the high-dimensional vector space. Before querying, the correlation of keywords in XML documents is sorted and preprocessed, and the naive Bayes method is used to classify the XML documents automatically. Experimental results show that the recall rate of the dynamic document clustering algorithm based on attribute correlation is nearly 20.5% higher than that of the traditional vector space model clustering method, and that the correlation between query results and user needs is improved to some extent.

(4) XML-based component interfaces and message middleware are important methods for exchanging data between multi-source heterogeneous systems. In large-scale data applications, however, this approach suffers from poor exchange performance and insufficient computing power. To improve the data exchange and computing capability of the whole data center, an XML data exchange and computing architecture based on the OpenFlow protocol is designed. The control functions of the network switching devices, which in traditional networks are distributed across different domains, are centralized on a central controller to realize an architecture with centralized control and distributed forwarding. In addition, the NameNode of the Hadoop platform is introduced into the control plane, and an exchange-level XML model, a cache structure, and a cached data exchange strategy are designed. Experimental results show that the architecture is feasible for large-scale XML data and distributed applications, and that it improves the performance of XML data processing at the network layer of the data center.
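
The bottom-up search described in (2) can be illustrated with a minimal sketch. It is not the thesis's B-SLCA implementation: the names `Label` and `bottom_up_slca`, the in-memory dictionaries standing in for the hash index, and the bitmask bookkeeping are all assumptions made for illustration. It only shows how SLCA nodes can be found by walking upward through parent IDs, stopping early when an ancestor has already been visited for the same keyword.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Set


class Label(NamedTuple):
    """Hypothetical node label holding the attributes named in the abstract."""
    node_id: int
    depth: int                 # carried for completeness; not needed by this sketch
    parent_id: Optional[int]   # None for the root node


def bottom_up_slca(labels: Dict[int, Label],
                   index: Dict[str, List[int]],
                   keywords: List[str]) -> Set[int]:
    """Return the IDs of SLCA nodes for the given keywords.

    labels: node ID -> Label (assumed lookup table)
    index:  keyword -> IDs of nodes that directly contain it (assumed hash index)
    """
    if not keywords:
        return set()
    full = (1 << len(keywords)) - 1
    mask: Dict[int, int] = defaultdict(int)

    # Walk upward from every keyword node, marking which keywords each
    # ancestor covers. Stop as soon as an ancestor is already marked for
    # this keyword, so shared ancestors are not recomputed.
    for bit, keyword in enumerate(keywords):
        flag = 1 << bit
        for node_id in index.get(keyword, []):
            current: Optional[int] = node_id
            while current is not None and not mask[current] & flag:
                mask[current] |= flag
                current = labels[current].parent_id

    # Nodes whose subtree covers every keyword.
    candidates = {n for n, m in mask.items() if m == full}
    # A candidate with a candidate child is not "smallest", so drop it.
    parents_of_candidates = {labels[n].parent_id for n in candidates}
    return candidates - parents_of_candidates


# Small example tree: 1 is the root; 2 and 3 are its children;
# 4 and 5 are children of 2; 6 is a child of 3.
example_labels = {
    1: Label(1, 0, None),
    2: Label(2, 1, 1),
    3: Label(3, 1, 1),
    4: Label(4, 2, 2),
    5: Label(5, 2, 2),
    6: Label(6, 2, 3),
}
example_index = {"xml": [4, 6], "query": [5, 3]}
example_result = bottom_up_slca(example_labels, example_index, ["xml", "query"])
```

In this example, nodes 2 and 3 each cover both keywords and have no smaller covering descendant, while the root is excluded because its children already qualify.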

Keywords/Search Tags: XML Data, Parallel Processing, Keyword Query, Attribute Relativity, Document Dynamic Clustering