Design And Optimization Of Massive Relational Data Processing Technology Based On MapReduce

Posted on:2019-06-05

Degree:Master

Type:Thesis

Country:China

Candidate:Q P Huang

Full Text:PDF

GTID:2348330542955569

Subject:Communication and Information System

Abstract/Summary:


Since the Internet opened to the world in 1995, and with the rapid growth of information technology, data volumes have become massive. Relational data are data that are directly or indirectly related to one another, with a variety of relationships implicit in them, so attention has turned to the processing of massive relational data. In recent years, data deduplication and join query processing and optimization have gradually become research focuses. To reduce the impact of redundancy in massive data and to improve the efficiency of join queries, this thesis draws on traditional massive data processing technology and proposes a massive data processing system based on MapReduce. The design concept, architecture, and workflow of the system are discussed.

First, taking the deduplication of massive WiFi log lines as an example, the large-scale parallel data processing framework MapReduce is applied to the deduplication of massive relational data. Next, a MapReduce-based join query processing method is designed, which handles two-table and multi-table join queries using Reduce-side and Map-side joins. MapReduce is then improved: the results of the Map phase, together with the historical data, are fed into the intermediate stage of the MapReduce job and then pushed as a stream to the Map phase of the following task, so that data can be shared globally among MapReduce jobs. The data are repartitioned before entering the Reduce stage, which solves the memory overflow that MapReduce encounters when deduplicating and counting massive log data. In addition, in view of the characteristics of join query tasks on relational data, the thesis reviews the optimization ideas of the SMapReduce framework and, on this basis, proposes the Commander join query processing algorithm. A Commander node is added to receive, store, and update a small amount of global information; this node communicates the global information to each Map node so that the Map data can be filtered, which avoids transmitting and sorting unnecessary tuples, reduces processing cost, and improves the efficiency of the join query processing algorithm.

Finally, experiments compare the deduplication performance of traditional deduplication technology, MapReduce, and the improved system, to verify that the improved system improves deduplication performance, raises operating efficiency on massive data, and avoids memory overflow. Further experiments compare the join query performance of the improved system with that of MapReduce and SMapReduce. The results show that the improved framework handles join queries effectively, filters out a large amount of unnecessary intermediate output, and performs well.
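
As a rough illustration of two ideas named in the abstract, deduplication with MapReduce and a join performed in the Reduce phase, the following is a minimal in-memory sketch. It is not the thesis's implementation: `run_mapreduce`, the mapper and reducer names, the table names `users` and `logins`, and the sample WiFi log lines are all hypothetical, and a real job would run on a cluster rather than in one Python process.

```python
from collections import defaultdict
from itertools import product


def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # sorted keys stand in for the shuffle/sort step
        output.extend(reducer(key, groups[key]))
    return output


# Deduplication: the whole record is the key, so identical lines
# arrive at the same reduce call and are emitted once.
def dedup_mapper(line):
    yield line, None


def dedup_reducer(line, _values):
    yield line


# Reduce-side join: each tuple is tagged with its source table in the
# map step; the reduce step pairs tuples from both tables per join key.
def join_mapper(tagged_row):
    table, key, payload = tagged_row
    yield key, (table, payload)


def join_reducer(key, values):
    left = [payload for table, payload in values if table == "users"]
    right = [payload for table, payload in values if table == "logins"]
    for l, r in product(left, right):
        yield key, l, r


# Hypothetical sample data (assumed for illustration only).
logs = ["aa:01 ap1 10:00", "aa:02 ap1 10:01", "aa:01 ap1 10:00"]
rows = [
    ("users", "aa:01", "alice"),
    ("users", "aa:02", "bob"),
    ("logins", "aa:01", "ap1"),
    ("logins", "aa:01", "ap2"),
    ("logins", "aa:03", "ap9"),
]

unique_logs = run_mapreduce(logs, dedup_mapper, dedup_reducer)
joined = run_mapreduce(rows, join_mapper, join_reducer)
print(unique_logs)
print(joined)
```

In this sketch the unmatched tuples for `aa:02` and `aa:03` are still mapped, grouped, and sorted before the reducer discards them. That wasted intermediate work is the kind of cost the abstract says the Commander node is meant to avoid by filtering tuples on the Map side.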

Keywords/Search Tags:

MapReduce, data processing, duplicated data deletion, query processing, relational data


Related items

1	A Design Science Approach to Deletion in Transactional Processing Relational Databases
2	Stream-Oriented Processing Of SQL Query Plan Generation Technology Research
3	Research And Implementation Of Data Cleaning System Based On Pre-Processing Techniques
4	Analytical query processing in data intensive applications
5	Research XML Data Management Based On Relational Database
6	Efficient structural query processing in XML databases
7	Research On Key Techniques Of Query Processing Over Wireless Sensor Networks
8	Research On NLP-Based Duplicated Web Pages Deletion Algorithm
9	Research On Key Technologies Of Distributed Rank-aware Query Processing
10	Research On Approximate Query Processing Techniques In The DataWarehouse