
Research On Data Extraction And Distributed Graph Data Management

Posted on: 2017-04-20
Degree: Master
Type: Thesis
Country: China
Candidate: Q L Ding
Full Text: PDF
GTID: 2278330488466894
Subject: Computer software and theory
Abstract/Summary:
Graph databases are members of the NoSQL family and have a distinct advantage in handling complex, highly interrelated data: they provide fast queries and efficient use of big data that has a graph-like structure. How to quickly extract-transform-load (ETL) relational data into graph data, and how to efficiently analyze and use that graph data, are two important problems in graph data application research. Although there is existing domestic and international research on ETL, it has the following problems: 1) the converted graph data is of poor quality; 2) the transformation is inefficient; 3) the transformed results are not suitable for distributed storage. As for efficiently analyzing and using graph data, most methods fall short in the distributed storage and distributed computing of graph data.

Therefore, this thesis focuses on improving the design of the ETL method and on the efficient management of large-scale graph data. The key contributions are as follows:

(1) To overcome the limitations of current ETL methods, a sub-schema-based ETL method for transforming relational data to graph data is proposed. By splitting the schema of a relational database into several sub-schemas, this method improves the algorithm and procedure of the traditional ETL method and provides an efficient method for parallel ETL. The transformed results satisfy the requirements of distributed storage and can serve as the base data for the Spark GraphX computing framework.

(2) For complex graph data, this thesis designs a distributed graph data management method based on a graph database, which manages distributed storage and schedules a distributed computing framework for data analysis.

Finally, J2EE and Neo4j were used to implement an ETL prototype system for experimental verification (referred to as BSS-ETLS), and Neo4j and Spark GraphX were used to implement a prototype of the distributed graph data management method (referred to as GCDMS). Experimental results show that the improved ETL method yields better performance than traditional methods, and that GCDMS has obvious advantages in handling massive, strongly structured graph data.
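
The abstract does not say how the schema is split, so the following is only a minimal Python sketch of the general idea behind contribution (1), not the BSS-ETLS implementation. It assumes that a sub-schema is a group of tables connected by foreign keys, that every table has an `id` primary key column, and that rows become nodes while foreign-key values become edges. The function names `split_sub_schemas` and `transform`, and the sample tables, are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_sub_schemas(tables, foreign_keys):
    """Group tables into sub-schemas (assumption: connected components
    of the foreign-key graph, so no foreign key crosses two groups)."""
    adjacent = defaultdict(set)
    for src, _column, dst in foreign_keys:
        adjacent[src].add(dst)
        adjacent[dst].add(src)
    seen, groups = set(), []
    for table in tables:
        if table in seen:
            continue
        seen.add(table)
        stack, component = [table], []
        while stack:
            current = stack.pop()
            component.append(current)
            for neighbour in adjacent[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(component))
    return groups


def transform(sub_schema, rows, foreign_keys):
    """Turn one sub-schema into graph data: rows -> nodes, FK values -> edges."""
    nodes, edges = [], []
    for table in sub_schema:
        for row in rows.get(table, []):
            nodes.append(((table, row["id"]), dict(row)))
    for src, column, dst in foreign_keys:
        if src not in sub_schema:
            continue
        for row in rows.get(src, []):
            if row.get(column) is not None:
                edges.append(((src, row["id"]), column, (dst, row[column])))
    return nodes, edges


# Hypothetical sample data: (source table, FK column, referenced table).
tables = ["customer", "order", "sensor"]
foreign_keys = [("order", "customer_id", "customer")]
rows = {
    "customer": [{"id": 1, "name": "Ann"}],
    "order": [{"id": 10, "customer_id": 1}],
    "sensor": [{"id": 7}],
}

sub_schemas = split_sub_schemas(tables, foreign_keys)

# Each sub-schema is independent of the others, so they can be transformed in parallel.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda s: transform(s, rows, foreign_keys), sub_schemas))
```

Because no foreign key crosses two sub-schemas under this assumption, each `(nodes, edges)` result can be loaded or stored separately, which is one way the output of such a split could fit distributed storage.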
Keywords/Search Tags:Graph Database, Distributed Computing, BSS-ETLS, GCDMS