
Integration Of Hadoop And MongoDB For Massive Data Processing

Posted on: 2016-01-09    Degree: Master    Type: Thesis
Country: China    Candidate: Q Zeng    Full Text: PDF
GTID: 2428330473464946    Subject: Software engineering
Abstract/Summary:
With the exponential growth of data volume and variety, NoSQL technology and the MapReduce model for scalable parallel analysis have attracted wide attention. MongoDB is a representative NoSQL database that supports both scalable indexing and flexible queries over massive data, while Hadoop is the most popular open-source implementation of the MapReduce framework for parallel computing. In view of this, we integrate MongoDB and Hadoop into a single platform and build a Mongo-Hadoop system that combines the strengths of both, in order to better handle the storage, computation, and querying of large data sets.

This paper first introduces the basic architecture of Hadoop and MongoDB and examines their working mechanisms in depth. We also compare their respective advantages, deficiencies, and similarities, and reach two conclusions. First, for data computation, MongoDB's built-in MapReduce has serious limitations and cannot satisfy the computation and analysis of complex data. Second, for data storage, HDFS, the distributed file system underlying Hadoop, is designed for high-throughput access and cannot support efficient queries on the data.

To address the first problem, we implemented the Mongo-Hadoop connector, a plug-in through which Hadoop MapReduce can read data from MongoDB and process it efficiently. In a two-node experiment, Hadoop MapReduce achieved on average five times the performance of MongoDB's MapReduce. For the second problem, we built an integration framework based on Hadoop and MongoDB and proposed four different integration schemes for large-scale data processing under different requirements.

Because Mongo-Hadoop integrates MongoDB with Hadoop, cluster deployment and parameter configuration are particularly important for good compatibility. We analyzed the roles of the nodes in the MongoDB and Hadoop clusters and derived a deployment strategy for the Mongo-Hadoop cluster that considers node locality, resource utilization, and scalability. We also studied and tuned the parameters that affect how Mongo-Hadoop operates and its overall performance.

To identify the optimal integration scheme and better understand the performance trade-offs of using these two technologies together, we designed three benchmarks to test Mongo-Hadoop under different scenarios. The experimental results show that a well-chosen integration scheme yields up to three times the performance of the other schemes; compared with the alternative architectures, it improves performance by 28% while occupying only 50% of the nodes.
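The abstract does not reproduce the connector code itself. As a concrete illustration of the first contribution, the following is a minimal sketch of how a Hadoop MapReduce job can read from and write to MongoDB through the Mongo-Hadoop connector's MongoInputFormat and MongoOutputFormat classes. The database URIs, the "text" field name, and the word-count logic are illustrative assumptions, not details taken from the thesis.

    // Sketch: a word-count-style Hadoop job over a MongoDB collection.
    // Assumes the Mongo-Hadoop connector jar is on the classpath.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.bson.BSONObject;

    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;

    public class MongoHadoopWordCount {

        // Each input record is one MongoDB document: key = _id, value = BSON body.
        public static class TokenMapper
                extends Mapper<Object, BSONObject, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, BSONObject doc, Context ctx)
                    throws IOException, InterruptedException {
                Object text = doc.get("text");  // "text" is an assumed field name
                if (text == null) return;
                for (String token : text.toString().split("\\s+")) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the job at MongoDB instead of HDFS; URIs are placeholders.
            conf.set("mongo.input.uri",  "mongodb://localhost:27017/demo.docs");
            conf.set("mongo.output.uri", "mongodb://localhost:27017/demo.counts");

            Job job = Job.getInstance(conf, "mongo-hadoop word count");
            job.setJarByClass(MongoHadoopWordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // The connector supplies the input splits and record readers.
            job.setInputFormatClass(MongoInputFormat.class);
            job.setOutputFormatClass(MongoOutputFormat.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

In this arrangement the connector computes input splits directly over the MongoDB collection, so deployment choices and split-related parameters (for example, the documented mongo.input.split_size key) influence data locality and overall job performance, which is consistent with the tuning work the thesis describes.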
Keywords/Search Tags: Integration, MongoDB, Hadoop, Big data, Cluster deployment, Parameter optimization