Design And Implementation Of Big Data Processing Platform Based On Hadoop

Posted on: 2018-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: T F He
GTID: 2348330569485791
Subject: Software engineering
Abstract/Summary:
With the rapid development of science and technology, more and more data is being produced, which has made big data processing one of the most active areas of technology research in recent years. In practice, however, the adoption of big data processing technology lags far behind the rate at which data is generated. As a result, many companies cannot process their data in a timely manner and therefore cannot extract its value. How to process large data sets efficiently is the main subject of this thesis. The processing consists of data extraction, data transformation, and data loading, collectively known as the ETL process. This thesis builds a big data processing platform that combines the Hadoop big data storage architecture, Hive, Flume for data acquisition, and Sqoop for data synchronization in order to process large data sets efficiently.

Hadoop is currently the most popular framework for big data processing. Its advantages include high reliability, high scalability, high efficiency, and low cost. Hadoop's implementation of the MapReduce computing framework is an efficient parallel framework. Hadoop users must write a dedicated MapReduce program for each task, but because Hadoop exposes only low-level interfaces, even a simple task requires a large amount of code, and that code is hard to reuse. The emergence of Hive largely solves this problem. Hive is an open-source data warehouse tool built on Hadoop that supports an SQL-like language, HQL. Hive compiles HQL into MapReduce programs, so it can take advantage of Hadoop's efficient parallel processing. As a result, Hive users can develop rapidly while writing only a small amount of code. For this reason, this thesis chooses Hive as the tool for data cleaning and processing.

Based on in-depth research into these big data technologies, especially Hadoop and Hive, this thesis develops a big data processing platform based on Hadoop. Within the ETL process, data transformation takes the longest time. The thesis therefore focuses on the principles and methods of Hive QL optimization and applies them to optimize the Hive QL used for processing actual business data.
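
To make the contrast between hand-written MapReduce and Hive concrete, the following is a minimal single-process Python sketch of the map, shuffle, and reduce steps behind a grouped count. It is an illustration only, not the thesis's implementation and not Hadoop's or Hive's API: the function names, the `city` field, and the `visits` table mentioned in the comments are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, str]
Pair = Tuple[str, int]


def map_phase(records: Iterable[Record],
              mapper: Callable[[Record], Iterator[Pair]]) -> Iterator[Pair]:
    """Apply the mapper to every input record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]],
                 reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Task-specific code that a MapReduce user would have to write by hand.
def count_mapper(record: Record) -> Iterator[Pair]:
    yield record["city"], 1  # "city" is a hypothetical column name


def count_reducer(key: str, values: List[int]) -> int:
    return sum(values)


def count_by_city(records: Iterable[Record]) -> Dict[str, int]:
    # Conceptually the same result as the one-line query (hypothetical table):
    #   SELECT city, COUNT(*) FROM visits GROUP BY city;
    return reduce_phase(shuffle(map_phase(records, count_mapper)), count_reducer)


if __name__ == "__main__":
    rows = [{"city": "Wuhan"}, {"city": "Beijing"}, {"city": "Wuhan"}]
    print(count_by_city(rows))  # {'Wuhan': 2, 'Beijing': 1}
```

Even in this toy form, the mapper, the reducer, and the wiring between them must be written for each new task, whereas the SQL-like query in the comment expresses the same grouped count in one line. A real Hadoop job would additionally run the phases in parallel across a cluster.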
Keywords/Search Tags: Big Data Processing, Hive, Hadoop, ETL, Optimization