A Distributed Graph Storage And Query System For Web Data Management

Posted on: 2010-10-26
Degree: Master
Type: Thesis
Country: China
Candidate: D Tao
GTID: 2178360275491626
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the World Wide Web (WWW), web data has grown explosively in both quantity and scale, making it the largest database in the world. Other data associated with the web, such as search engine logs and click records of web services, is also growing rapidly. Compared with traditional data, web data is semi-structured, grows quickly, and covers a wide variety of data types, so it cannot be handled in the same way as traditional data. There is now strong demand for web data analysis technology in many fields, and the topic is attracting increasing attention in database research. We therefore introduce CWI, a new query and analysis tool for massive data. Realistic applications need to store and query large-scale data and to support both keyword search and queries on data structure. As parts of CWI, TLGM and TLGM-QL meet these demands. We focus on implementing the TLGM data model in a distributed environment, and we design and implement four basic operators that support TLGM-QL. During the design phase, we found that the uneven distribution of real-world data degrades the storage and query algorithms and increases their time cost. To solve this problem, we put forward a series of algorithms that keep the difference in storage and computation load between data nodes within an acceptable range. On this basis, we present a new subgraph reconstruction algorithm to support queries on graph structure. We also propose several balancing methods to ensure the efficiency of this algorithm.
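The abstract does not describe the balancing algorithms themselves. As a rough illustration of the goal only (keeping the load difference between data nodes small when partition sizes are skewed), the following minimal sketch uses a generic greedy "largest partition first, least-loaded node first" heuristic. The function name `assign_partitions`, the partition identifiers, and the sizes are hypothetical and are not taken from the thesis.

```python
import heapq


def assign_partitions(partition_sizes, num_nodes):
    """Greedy balancing sketch (an assumption, not the thesis's algorithm).

    partition_sizes: dict mapping a partition id to its size (e.g. edge count).
    Returns (assignment, loads), where assignment maps partition id -> node
    index and loads[i] is the total size placed on node i.
    """
    # Min-heap of (current_load, node_index): the least-loaded node is on top.
    heap = [(0, node) for node in range(num_nodes)]
    heapq.heapify(heap)

    assignment = {}
    # Place the largest partitions first so small ones can fill the gaps.
    ordered = sorted(partition_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for pid, size in ordered:
        load, node = heapq.heappop(heap)
        assignment[pid] = node
        heapq.heappush(heap, (load + size, node))

    loads = [0] * num_nodes
    for pid, node in assignment.items():
        loads[node] += partition_sizes[pid]
    return assignment, loads


if __name__ == "__main__":
    # Hypothetical skewed sizes: one very large partition and several small ones.
    sizes = {"a": 90, "b": 10, "c": 10, "d": 10, "e": 10, "f": 50}
    assignment, loads = assign_partitions(sizes, num_nodes=3)
    print(assignment)
    print(loads)
```

With skewed inputs the largest partition still bounds the best achievable maximum load, which is one reason a real system may also need to split or replicate heavy partitions.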
We design and run experiments on synthetic and real-world data to demonstrate the system's efficiency. The major contributions of this thesis are:
1. We analyze the characteristics of web data and introduce the TLGM model to illustrate how web data differs from traditional data in storage, querying, and indexing. We first store graph data in a relational database, design several queries, and run experiments on them. The results show the limitations of centralized storage.
2. We analyze the TLGM model and explain its implementation in a distributed environment. We also summarize the query language supported by this model and propose four basic operators. We use examples to show that these operators are flexible, and we provide their pseudocode.
3. We propose a novel subgraph reconstruction algorithm to support queries on graph structure. We implement this algorithm in the MapReduce framework, which gives it good scalability, and we use a caching strategy to improve its efficiency. Because real-world data often causes load imbalance between data nodes, we also improve the balancing methods. Extensive experiments are performed to verify the efficiency of our algorithm.
We believe our work is a good practical example of web data storage and querying, since we not only provide key solutions for storing web data as a graph but also implement a novel framework to index and query massive web data. We consider this work important for the area of web data storage.
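The abstract states that subgraph reconstruction is implemented in the MapReduce framework but gives no details. The following minimal sketch only illustrates the general map, shuffle, and reduce pattern such an approach could follow: edge records spread over several shards are filtered by a requested vertex set and regrouped into adjacency lists. It runs in a single process, and all names (`map_phase`, `shuffle`, `reduce_phase`, `reconstruct_subgraph`) and the sample data are assumptions for illustration, not the thesis's operators or TLGM's storage layout.

```python
from collections import defaultdict


def map_phase(edge_records, wanted):
    """Emit (src, dst) for every edge whose endpoints are both requested."""
    for src, dst in edge_records:
        if src in wanted and dst in wanted:
            yield src, dst


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Build one sorted adjacency list per source vertex."""
    return {src: sorted(dsts) for src, dsts in groups.items()}


def reconstruct_subgraph(shards, wanted):
    """Rebuild the subgraph induced by `wanted` from edge shards.

    shards: list of edge lists, one per (hypothetical) data node.
    """
    pairs = (pair for shard in shards for pair in map_phase(shard, wanted))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    # Hypothetical edge records stored on two data nodes.
    shards = [
        [("a", "b"), ("b", "c")],
        [("a", "c"), ("c", "d")],
    ]
    print(reconstruct_subgraph(shards, wanted={"a", "b", "c"}))
```

In a real MapReduce job the shuffle step is performed by the framework, and a vertex with very many edges sends a large group to a single reducer, which is the kind of load imbalance the balancing methods mentioned above are meant to address.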
Keywords/Search Tags:TLGM data model, distributed storage, Web data management