
The Implementation Of Parallel Algorithm Based On Hadoop And The Instance Analysis Of GPS Data

Posted on: 2016-09-14  Degree: Master  Type: Thesis
Country: China  Candidate: Z B Rong  Full Text: PDF
GTID: 2308330461968869  Subject: Computer application technology
Abstract/Summary:
With the rapid development of cloud computing, the Internet of Things (IoT) and the mobile Internet, big data has become a new focus of information technology and a new direction of industrial development, with a major effect on production and daily life. Big data originates from the Internet, enterprise systems and IoT systems; big data processing systems analyze and mine it to generate new knowledge that supports decision-making and business intelligence. The big data era poses new challenges for data management and analysis, and the soundness and timeliness of data processing methods have become a focus of big data statistical analysis. In recent years, big data analysis based on data mining algorithms has been an important research direction, but most of the work improves traditional data mining algorithms in a stand-alone environment. Limited by memory and scalability, such algorithms cannot meet the surging demand for massive data processing, so this paper studies how to implement traditional data mining algorithms in parallel in a MapReduce environment. At the same time, this paper analyzes the forms in which massive data exists and the performance bottleneck the Hadoop platform faces when processing large numbers of small files, and proposes a processing strategy for massive small files. Taking taxi GPS data as an example, it verifies the efficiency of MapReduce for short-term traffic flow prediction and, in the Hadoop environment, improves the K-nearest-neighbor (KNN) short-term traffic flow prediction algorithm, which raises prediction accuracy.
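To make the idea of a parallel KNN concrete, here is a minimal single-machine sketch of how the work might be split in a map/reduce style. It is an illustration under assumptions, not the implementation from the thesis: the function names (`map_partition`, `reduce_neighbors`, `knn_classify`) and the toy data are hypothetical, and the "map" calls run in a plain loop rather than on a Hadoop cluster. Each map call returns only the k nearest candidates from its own data split, and the reduce step merges those candidates and takes a majority vote.

```python
import heapq
import math
from collections import Counter

def map_partition(partition, query, k):
    """Map step (hypothetical): k nearest (distance, label) pairs within one data split."""
    scored = ((math.dist(features, query), label) for features, label in partition)
    return heapq.nsmallest(k, scored)

def reduce_neighbors(local_results, k):
    """Reduce step (hypothetical): merge per-split candidates, then majority vote."""
    merged = heapq.nsmallest(k, (pair for result in local_results for pair in result))
    votes = Counter(label for _, label in merged)
    return votes.most_common(1)[0][0]

def knn_classify(partitions, query, k=3):
    # On a real cluster each map_partition call would run on a different node;
    # here they run sequentially to keep the sketch self-contained.
    local_results = [map_partition(p, query, k) for p in partitions]
    return reduce_neighbors(local_results, k)

# Toy labeled points, split into two partitions to mimic two input splits.
partitions = [
    [((0.0, 0.0), "low"), ((0.1, 0.2), "low")],
    [((5.0, 5.0), "high"), ((5.2, 4.9), "high"), ((0.2, 0.1), "low")],
]

print(knn_classify(partitions, (0.1, 0.1), k=3))
print(knn_classify(partitions, (5.1, 5.0), k=3))
```

Because every split forwards at most k candidates, the data sent to the reduce step grows with the number of splits rather than with the size of the training set, which is the property that makes this decomposition attractive for large datasets.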
Based on the above, this paper completes the following three tasks:
(1) In a stand-alone environment, traditional data mining algorithms suffer from high memory consumption, low computing performance, and poor scalability and reliability when analyzing large-scale data. This paper therefore puts forward an implementation method for the KNN, Apriori and K-Means algorithms based on the MapReduce parallel environment. Speedup, scalability and reliability are selected as indicators, and the method is verified on real datasets of different sizes with different numbers of cluster nodes. The results show that the method is feasible and valid and improves the overall performance and efficiency of KNN, Apriori and K-Means to meet the needs of large-scale data mining.
(2) Since Hadoop has inherent defects of high memory overhead and low computing performance when processing massive small files, this paper implements three methods and proposes two strategies for the small files problem. First, CombineFileInputFormat (CFIF), Hadoop Archives (HA) and Sequence Files (SF) are implemented for massive small files processing. Next, strategies are proposed for selecting among these methods according to the actual needs of different users. Finally, the implemented methods and proposed strategies are verified in two experiments that evaluate NameNode memory consumption and MapReduce running speed. The experimental results show that these methods and strategies enhance the overall performance of Hadoop and improve the efficiency of massive small files processing.
(3) Taking massive taxi GPS data as a case study, the MapReduce-based KNN algorithm is used to address the low efficiency of short-term traffic flow prediction. Before prediction, the small files processing strategy is applied to preprocess the massive taxi GPS data files, compensating for the slow read and write speed and low processing efficiency of massive small files. In the MapReduce environment, the state vector and distance vector of the K-nearest-neighbor short-term traffic flow prediction algorithm are improved to address the accuracy of short-term traffic flow prediction.
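The abstract does not say how the state vector or the distance measure is modified, so the following is only a generic sketch of KNN-based short-term traffic flow prediction, not the improved algorithm from the thesis. It assumes that the state vector is the flow observed in the last few time intervals, that similarity is Euclidean distance, and that the forecast is an inverse-distance-weighted average of what followed the most similar historical states. The names `build_samples` and `predict_next` and the sample series are hypothetical.

```python
import math

def build_samples(flows, lag):
    """Pair each window of `lag` past flow values (the state vector) with the value that followed."""
    return [(tuple(flows[i - lag:i]), flows[i]) for i in range(lag, len(flows))]

def predict_next(history, current_state, k=3):
    """Predict the next flow value from the k historical states closest to `current_state`."""
    samples = build_samples(history, len(current_state))
    nearest = sorted(samples, key=lambda s: math.dist(s[0], current_state))[:k]
    # Inverse-distance weights (an assumed weighting scheme); the small constant
    # avoids division by zero when a historical state matches exactly.
    weights = [1.0 / (math.dist(state, current_state) + 1e-9) for state, _ in nearest]
    weighted_sum = sum(w * nxt for w, (_, nxt) in zip(weights, nearest))
    return weighted_sum / sum(weights)

# Hypothetical vehicle counts per time interval on one road segment.
history = [10, 20, 30, 10, 20, 30, 10, 20, 30]
print(predict_next(history, (10, 20), k=3))
```

In a MapReduce setting, the search over historical states is the part that would be distributed: each mapper scans its own share of the history and forwards its k closest states, and the reducer computes the weighted average over the global k closest.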
Keywords/Search Tags:Big data, MapReduce, Small files, Parallelization, Traffic flow prediction