
Research And Implementation Of Integration Of R Language And Hadoop

Posted on: 2015-04-10  Degree: Master  Type: Thesis
Country: China  Candidate: W F Liu  Full Text: PDF
GTID: 2308330452956910  Subject: Software engineering
Abstract/Summary:
Nowadays we are experiencing a data explosion which, in many domains, helps us understand complex phenomena in a variety of fields. Because the knowledge buried in these large volumes of data is not obvious, we are eager to apply sophisticated statistical analysis methods to extract valuable information from large-scale datasets. R provides ample functionality for data analysis, but it falls short in distributed environments. It therefore makes sense to integrate R with a distributed computing environment.

Distributed computing was complex until Google published several papers describing GFS (Google File System), MapReduce (a distributed computing framework), and BigTable (Google's NoSQL database). Apache then developed a distributed computing framework named Hadoop based on these papers. Hadoop provides ample APIs for programmers to meet a variety of programming requirements. R, which evolved from the S language, offers powerful capabilities for large-scale data analysis and charting, with one major problem: concurrency, which is a bottleneck for R's efficiency. This problem can be solved by integrating R and Hadoop, so that R gains distributed computing capability from Hadoop and, in return, supplies Hadoop with data analysis capability.

This thesis builds on the latest research on existing large-scale datasets and distributed computing frameworks. The experimental results show that the integration of R and Hadoop provides a scalable solution for large-scale statistical computing in R, and that R runs approximately two times faster on three Hadoop nodes than on a single local machine. Additionally, appropriate configuration helps Hadoop adapt to a variety of large-scale data processing jobs, which also improves R's running efficiency.
Keywords/Search Tags:Large-scale data, Distributed computing, Distributed storage, Hadoop, Integration of R Language and Hadoop