
Design and Implementation of Small File Processing and Algorithm Parallelization Based on Hadoop

Posted on: 2016-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: H C Guan
GTID: 2308330479984910
Subject: Computer technology

Abstract/Summary:
With the rapid development of the global information industry, companies have accumulated vast amounts of data, and the volume of small files in particular is growing quickly. Faced with massive numbers of small files, traditional stand-alone systems can no longer meet the demand for storage capacity, nor can they analyze and compute over the data effectively. To overcome these limitations, various distributed systems have gradually been applied to mass data processing. Hadoop, developed by Apache, is a solid distributed computing platform: its distributed file system HDFS and its distributed programming model MapReduce provide strong support for data storage and computation, respectively, and data processing systems based on Hadoop have been widely studied. Hadoop, however, was originally designed to store large log files, and it performs poorly when storing huge numbers of small files. The first question to study in a massive-small-file processing system is therefore how to optimize Hadoop so that it stores the data files efficiently. Yet storage is only the first step of the system's work; the data must also be computed over, using Hadoop to analyze it statistically and to mine its potential value, which is likewise a problem that requires study.

This dissertation analyzes the working principles of HDFS and MapReduce and, guided by the requirements of a mass small-file processing system, investigates the two key technologies of data storage and data analysis in a Hadoop-based small-file processing system. The main work is as follows:

Firstly, we study the deficiencies of Hadoop in storing massive small files and propose a strategy of merging small files before storage. The method uses Hadoop's built-in archive tool to merge small files, which effectively improves the system's small-file storage performance; in addition, the merged files can serve directly as input data, so subsequent MapReduce analysis tasks in the system can process them easily (a usage sketch of the archive tool is given after this abstract).

Secondly, on the basis of the small-file storage layer, we study methods for implementing classical data mining algorithms in parallel on Hadoop. For clustering analysis and frequent pattern mining, two techniques frequently used in data analysis, we select the k-means algorithm and the FP-Growth algorithm and design and implement parallel versions of them on Hadoop (a minimal k-means MapReduce sketch also follows).

Finally, we build an experimental environment on the Hadoop platform and carry out simulation experiments on the two key technologies. The experimental results show that the proposed merge-before-storage method effectively improves the system's performance in storing massive small files, and that the data mining algorithms parallelized under the MapReduce model show good performance and stability, providing the system with efficient computing capability.
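As a concrete illustration of the merge-before-storage strategy, Hadoop's built-in archive tool (HAR) packs many small HDFS files into a single archive so that they consume far fewer NameNode metadata entries. The invocation below is a minimal sketch; the directory names are hypothetical, not taken from the thesis:

    hadoop archive -archiveName small.har -p /user/data small-files /user/data/merged

This archives everything under /user/data/small-files into small.har in /user/data/merged. MapReduce jobs can then read the archived files directly through the har:// filesystem scheme, which is what allows the merged files to serve as job input without being unpacked first.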
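The k-means parallelization follows the usual MapReduce pattern: each iteration is one job in which mappers assign points to their nearest centroid and reducers average each cluster's points into a new centroid, with a driver repeating the job until the centroids stabilize. The Java sketch below shows one such iteration; it assumes input points are stored one per line as comma-separated coordinates and, for brevity, passes the current centroids through the job configuration. The class names, configuration key, and centroid-distribution mechanism are illustrative assumptions, not details from the thesis.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // One k-means iteration as a MapReduce job (sketch).
    public class KMeansIteration {

      // Map phase: emit (index of nearest centroid, point).
      public static class AssignMapper
          extends Mapper<LongWritable, Text, IntWritable, Text> {

        private final List<double[]> centroids = new ArrayList<>();

        @Override
        protected void setup(Context context) {
          // Hypothetical: the previous iteration's centroids arrive in the
          // job configuration as "x1,y1;x2,y2;..."; a real job might instead
          // read them from a file on HDFS distributed to every mapper.
          String conf = context.getConfiguration().get("kmeans.centroids");
          for (String c : conf.split(";")) {
            String[] parts = c.split(",");
            double[] p = new double[parts.length];
            for (int i = 0; i < parts.length; i++) p[i] = Double.parseDouble(parts[i]);
            centroids.add(p);
          }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] parts = value.toString().split(",");
          double[] point = new double[parts.length];
          for (int i = 0; i < parts.length; i++) point[i] = Double.parseDouble(parts[i]);

          // Find the nearest centroid by squared Euclidean distance.
          int best = 0;
          double bestDist = Double.MAX_VALUE;
          for (int c = 0; c < centroids.size(); c++) {
            double d = 0;
            for (int i = 0; i < point.length; i++) {
              double diff = point[i] - centroids.get(c)[i];
              d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
          }
          context.write(new IntWritable(best), value);
        }
      }

      // Reduce phase: average each cluster's points into a new centroid.
      public static class RecomputeReducer
          extends Reducer<IntWritable, Text, IntWritable, Text> {

        @Override
        protected void reduce(IntWritable clusterId, Iterable<Text> points, Context context)
            throws IOException, InterruptedException {
          double[] sum = null;
          long count = 0;
          for (Text t : points) {
            String[] parts = t.toString().split(",");
            if (sum == null) sum = new double[parts.length];
            for (int i = 0; i < parts.length; i++) sum[i] += Double.parseDouble(parts[i]);
            count++;
          }
          StringBuilder sb = new StringBuilder();
          for (int i = 0; i < sum.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(sum[i] / count);
          }
          // The driver (not shown) feeds these centroids into the next
          // iteration and stops once they move less than a threshold.
          context.write(clusterId, new Text(sb.toString()));
        }
      }
    }

A parallel FP-Growth implementation would follow a similar shape, with mappers grouping transactions by frequent items and reducers mining the conditional pattern bases of their group locally.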
Keywords/Search Tags: HDFS, MapReduce, Small File, Data Mining