Research On Approaches To Large-scale Data Analysis

Posted on:2011-06-15

Degree:Master

Type:Thesis

Country:China

Candidate:G Q Wang

Full Text:PDF

GTID:2178360308452374

Subject:Computer system architecture

Abstract/Summary:

With the development of information technology, the construction of information systems is in a transition phase in many fields. Take the financial field as an example. In the past, the core of IT construction was the business trading system. Now, many companies pay more attention to information management systems aimed at customers, risk control, and gain analysis. This transition requires collecting the data of all business systems and sharing it across departments and platforms.

MapReduce is a simple and flexible parallel programming model proposed by Google for large-scale data processing in a distributed computing environment. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world large-scale data analysis tasks are expressible in this model.

A parallel database is a high-performance database management system that combines parallel computing technology with database technology. It improves the processing efficiency of relational databases. Common parallel databases fall into three architectures: shared memory, shared disk, and shared nothing.

Based on an analysis of MapReduce and parallel databases, this thesis proposes a more general and more extensible architecture for large-scale data processing, and then tests a related product.

First, we analyze MapReduce and parallel databases, reviewing their development and the practical thinking behind them, and compare the two. We then propose three methods for combining MapReduce and SQL: a SQL layer built on a MapReduce engine, MapReduce invoking SQL, and SQL invoking MapReduce. We consider the last method the best.

Next, we propose a general architecture that combines MapReduce and a parallel database. The architecture consists of a client, a master host, and segments. The master host is in charge of collecting and processing information from the segments, and the segments are responsible for executing the tasks. We then extend SQL with MapReduce user-defined functions, propose the mode in which SQL invokes MapReduce, and describe the data distribution strategy and data mirroring.

Finally, we test an excellent parallel database named Greenplum. The test is based on the business data of a real securities company and covers data loading, statistical analysis, and other operations. We conclude that Greenplum is well suited to large-scale data analysis.
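
The MapReduce model described in the abstract can be illustrated with a minimal single-process sketch. It is not taken from the thesis and is not how Greenplum or Google's system is implemented; the names `map_reduce`, `map_trade`, and `reduce_total` and the sample trade records are hypothetical, chosen only to show a map function emitting intermediate key/value pairs and a reduce function merging the values that share a key.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map, shuffle, and reduce over (key, value) records in one process."""
    groups = defaultdict(list)
    for key, value in records:
        # Map phase: each input pair yields zero or more intermediate pairs.
        for ikey, ivalue in map_fn(key, value):
            # Shuffle: collect intermediate values under their intermediate key.
            groups[ikey].append(ivalue)
    # Reduce phase: merge all values associated with the same intermediate key.
    return {ikey: reduce_fn(ikey, ivalues) for ikey, ivalues in groups.items()}


def map_trade(trade_id, trade):
    # Hypothetical record layout: (customer, amount).
    customer, amount = trade
    yield customer, amount


def reduce_total(customer, amounts):
    # Total traded amount per customer.
    return sum(amounts)


# Hypothetical sample data, keyed by trade id.
trades = [(1, ("alice", 100)), (2, ("bob", 250)), (3, ("alice", 50))]
totals = map_reduce(trades, map_trade, reduce_total)
print(totals)
```

In a distributed setting the shuffle step would move intermediate pairs between machines; here a dictionary stands in for that step so the data flow stays visible.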

Keywords/Search Tags:

large-scale data, distributed file system, parallel database, load balancing, data distribution

Related items

1	Distributed File System Zd-dfs Design And Implementation
2	Research On The Reliability Assurance Technology Of Distributed Storage System For Large-Scale Data
3	Distributed Database Cluster System Zd-ddb Design And Implementation
4	Research And Implementation On Technologies Of Meta-data Load Balancing In Distributed File System
5	Research On The Key Techniques For Parallel File Storage System
6	Research On Hadoop Cluster Optimization In Large Scale Network Data Environment
7	The Research Of Data Storage In Distributed Main Memory Database
8	Research On Load Balancing Technology In Distributed File System
9	The Design And Implementation Of Distributed Data Management System For Large-Scale Virtual Screening
10	A Distributed Data Storage System With Security Auditing And Load Balancing