Design And Optimization Of Massive Relational Data Processing Technology Based On MapReduce

Posted on: 2019-06-05
Degree: Master
Type: Thesis
Country: China
Candidate: Q P Huang
Full Text: PDF
GTID: 2348330542955569
Subject: Communication and Information System
Abstract/Summary:
With the worldwide spread of the Internet beginning in 1995 and the rapid growth of information technology, data has shown a trend toward "massification." Relational data is data whose items are directly or indirectly related to one another; because such data implies a variety of relationships, the processing of massive relational data has drawn growing attention. In recent years, data deduplication, join (connection) query processing, and optimization techniques have gradually become research focuses. To reduce the impact of redundancy in massive data and improve the efficiency of join queries, this thesis draws on traditional massive data processing technology and proposes a massive data processing system based on MapReduce. The design concept, architecture, and workflow of the system are discussed.
First, taking the deduplication of massive WiFi log lines as an example, the large-scale parallel data processing framework MapReduce is applied to the deduplication of massive relational data. Next, a MapReduce-based join query processing method is designed, in which two-table and multi-table join queries are handled with Reduce-side and Map-side join methods. MapReduce is then improved: the output of the Map phase is fed, together with historical data, into the intermediate stage of the MapReduce job and then pushed as a stream to the Map phase of the subsequent job, so that data can be shared globally among MapReduce jobs. The data is repartitioned before entering the Reduce stage, which addresses the memory overflow that arises when MapReduce deduplicates massive log data. In addition, in view of the characteristics of join query tasks on relational data, the thesis reviews the optimization ideas of the SMapReduce framework and, on this basis, proposes the Commander join query processing algorithm. A Commander node is added to receive, store, and update a small amount of global information. This information is communicated to each Map node through the Commander node and used to filter the Map data, which avoids transmitting and sorting unnecessary tuples, reduces processing cost, and improves the efficiency of the join query processing algorithm.
Finally, experiments compare the deduplication performance of traditional deduplication technology, MapReduce, and the improved system, in order to verify that the improved deduplication process raises the system's operating efficiency on massive data and avoids memory overflow. The join query performance of the improved system is likewise compared experimentally with that of MapReduce and SMapReduce to measure the improvement. The results show that the improved framework handles join queries effectively, filters out a large amount of unnecessary intermediate output, and performs well.
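The deduplication and Reduce-side join steps described above can be illustrated with a minimal single-process sketch in Python. It is not the thesis's implementation: the run_mapreduce helper, the sample WiFi log lines, the table tags "L" and "R", and the allowed_keys filter are all hypothetical names introduced here. In particular, the abstract does not specify what global information the Commander node holds, so the key-set filter below is only an assumed stand-in for the general idea of dropping tuples in the Map phase before they are transmitted and sorted.

```python
from collections import defaultdict
from itertools import product


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": collect values per key
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Deduplication: the whole log line is the key, so duplicates meet in one group.
def dedup_mapper(line):
    yield line, None


def dedup_reducer(line, _values):
    yield line  # emit each distinct line exactly once


# Reduce-side join: tag each tuple with its source table and join per key.
def make_join_mapper(allowed_keys=None):
    def join_mapper(tagged_row):
        table, key, payload = tagged_row
        # Assumed map-side filter (hypothetical): skip tuples whose key is
        # known not to join, so they are never shuffled or sorted.
        if allowed_keys is not None and key not in allowed_keys:
            return
        yield key, (table, payload)
    return join_mapper


def join_reducer(key, tagged_values):
    left = [payload for table, payload in tagged_values if table == "L"]
    right = [payload for table, payload in tagged_values if table == "R"]
    for l_payload, r_payload in product(left, right):
        yield key, l_payload, r_payload


# Hypothetical sample data, for illustration only.
logs = ["ap1 10:00 devA", "ap2 10:01 devB", "ap1 10:00 devA"]
unique_logs = run_mapreduce(logs, dedup_mapper, dedup_reducer)

devices = [("L", "devA", "alice"), ("L", "devC", "carol")]
sessions = [("R", "devA", "ap1"), ("R", "devB", "ap2"), ("R", "devA", "ap3")]
shared_keys = {key for _, key, _ in devices} & {key for _, key, _ in sessions}
joined = run_mapreduce(devices + sessions, make_join_mapper(shared_keys), join_reducer)
```

In this sketch the filter does not change the join result; it only reduces how many tuples reach the grouping step, which is the kind of saving the abstract attributes to filtering Map data with global information.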
Keywords/Search Tags: MapReduce, data processing, duplicate data deletion, query processing, relational data