Research On Approaches To Large-scale Data Analysis

Posted on: 2011-06-15
Degree: Master
Type: Thesis
Country: China
Candidate: G Q Wang
GTID: 2178360308452374
Subject: Computer system architecture

Abstract/Summary:
With the development of information technology, the construction of information systems is in a transition phase in many fields. Take the financial field as an example. In the past, IT construction centered on business trading systems. Now many companies pay more attention to information management systems aimed at customers, risk control, and profit analysis. This transition requires collecting data from all business systems and sharing it across departments and platforms.

MapReduce is a simple and flexible parallel programming model proposed by Google for large-scale data processing in a distributed computing environment. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world large-scale data analysis tasks are expressible in this model.

A parallel database is a high-performance database management system that combines parallel processing technology with database technology, and it improves the efficiency of relational databases. Parallel databases commonly follow one of three architectures: shared memory, shared disk, and shared nothing.

Based on an analysis of MapReduce and parallel databases, this thesis proposes a more general and more extensible architecture for large-scale data processing, and then tests a related product.

First, we analyze MapReduce and parallel databases, reviewing their development and design ideas, and compare the two. We then propose three methods for combining MapReduce and SQL: building a SQL layer on top of a MapReduce engine, MapReduce invoking SQL, and SQL invoking MapReduce. We consider the last method the best.

Next, we propose a general architecture that combines MapReduce and a parallel database. It consists of a client, a master host, and segments. The master host collects and processes information from the segments, and the segments execute the tasks. We then extend SQL with MapReduce user-defined functions, propose the mode of SQL invoking MapReduce, and describe the data distribution strategy and data mirroring.

Finally, we test the parallel database Greenplum using the business data of a real securities company. The tests include data loading and statistical analysis, among others. We conclude that Greenplum is well suited to large-scale data analysis.
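
The MapReduce model summarized in the abstract can be illustrated with a minimal single-process sketch. This is not the thesis's implementation or any real framework's API: the function names (`map_trade`, `reduce_total`, `run_mapreduce`) and the sample trade records are hypothetical, chosen only to show how a map function emits intermediate key/value pairs and a reduce function merges the values that share a key.

```python
from collections import defaultdict


def map_trade(key, record):
    """Map step: turn one input record into intermediate (key, value) pairs.

    `key` is the input key (here a record offset, unused); `record` is a
    hypothetical (customer_id, amount) trade tuple.
    """
    customer_id, amount = record
    yield customer_id, amount


def reduce_total(key, values):
    """Reduce step: merge all intermediate values that share one key."""
    return key, sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map, group by intermediate key (the shuffle), then reduce.

    A real engine would spread these phases over many machines; here
    everything runs in one process to keep the data flow visible.
    """
    groups = defaultdict(list)
    for key, value in inputs:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


# Hypothetical input: (offset, (customer_id, amount)) pairs.
trades = [(0, ("c1", 100)), (1, ("c2", 50)), (2, ("c1", 25))]
totals = run_mapreduce(trades, map_trade, reduce_total)
print(totals)  # {'c1': 125, 'c2': 50}
```

The same per-customer total could be written in SQL as a `GROUP BY` with `SUM`, which is one reason the two models can be combined.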
Keywords/Search Tags: large-scale data, distributed file system, parallel database, load balancing, data distribution