Design And Implementation Of Data Transform Platform Based On Big Data

Posted on: 2016-04-30, Degree: Master, Type: Thesis
Country: China, Candidate: B Wang, Full Text: PDF
GTID: 2308330503977802, Subject: Software engineering
Abstract/Summary:
With the rapid development of computer technology, the volume of data that people come into contact with is growing explosively. This drastic and continuous increase in data scale brings enormous value and profit, but it also poses severe challenges, and massive data processing has become a hot research topic. Many sophisticated algorithms exist for issue-specific data processing, but in terms of both efficiency and computational complexity, traditional data processing algorithms can no longer meet the needs of massive information. The development of cloud computing provides a new research direction for massive data processing. Cloud computing distributes storage and computing capability across multiple nodes in a cloud cluster, which makes it possible to store and process huge data sets. To respond to the challenges posed by big data, it has become a mainstream trend for companies to develop their own cloud computing platforms for data processing and analysis.
In this thesis, building on research into massive data processing, a customizable data transform platform is proposed to simplify massive data processing. To improve data quality, outlier detection must be conducted on data sets. Because traditional algorithms have rather high time complexity in the clustering process, this thesis proposes a parallel scheme for outlier detection based on a traditional clustering algorithm.
In the data transform platform solution, an "action flow" approach is designed to abstract data processing actions, which enables users to customize data processing methods and processes according to actual needs. To spare customers from writing SQL statements and program code, "input-process-output" statements in the form of a configuration file are proposed.
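The abstract does not give the format of the configuration file, so the following is only a minimal sketch of how an "input-process-output" description could drive an action flow. The configuration keys, the action names (`strip`, `drop_empty`, `cast_int`), and the `run_flow` helper are all hypothetical and are not taken from the thesis; the sketch runs on a single machine rather than on a cloud cluster.

```python
# Hypothetical "input-process-output" configuration.
# Every key and action name here is an illustrative assumption.
CONFIG = {
    "input": {
        "records": [
            {"id": 1, "price": " 10 "},
            {"id": 2, "price": ""},
            {"id": 3, "price": "7"},
        ]
    },
    "process": [
        {"action": "strip", "field": "price"},
        {"action": "drop_empty", "field": "price"},
        {"action": "cast_int", "field": "price"},
    ],
    "output": {"fields": ["id", "price"]},
}


def strip(records, field):
    """Remove surrounding whitespace from one string field."""
    return [{**r, field: r[field].strip()} for r in records]


def drop_empty(records, field):
    """Discard records whose field is an empty string."""
    return [r for r in records if r[field] != ""]


def cast_int(records, field):
    """Convert one string field to an integer."""
    return [{**r, field: int(r[field])} for r in records]


# Registry that maps action names in the configuration to functions.
ACTIONS = {"strip": strip, "drop_empty": drop_empty, "cast_int": cast_int}


def run_flow(config):
    """Apply the configured actions in order, then project the output fields."""
    records = config["input"]["records"]
    for step in config["process"]:
        records = ACTIONS[step["action"]](records, step["field"])
    fields = config["output"]["fields"]
    return [{f: r[f] for f in fields} for r in records]


result = run_flow(CONFIG)
```

In this sketch the user changes only `CONFIG` to alter the processing steps, which is the kind of customization without SQL or program code that the abstract describes.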
In the outlier detection solution, to handle massive data, the thesis designs and implements a parallel version of the traditional K-Medoids clustering algorithm. In addition, a distance sum-based method for outlier detection is designed that requires no parameters to be set in advance. The experimental results show that efficiency and accuracy are improved considerably.
The thesis proposes solutions adapted to massive data processing that save a great deal of code-writing time, together with the distance sum-based method for outlier detection, and the project as a whole has good practical value.
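The abstract does not specify how the distance sum is turned into an outlier decision, so the following is a minimal single-machine sketch of the general idea, not the thesis's parallel implementation. It scores each point by the sum of its Euclidean distances to all other points and, as an assumption made here to avoid a preset threshold, cuts at the largest gap between consecutive sorted scores. The names `distance_sums` and `flag_outliers` are hypothetical.

```python
import math


def distance_sums(points):
    """Sum of Euclidean distances from each point to every other point."""
    return [sum(math.dist(p, q) for q in points) for p in points]


def flag_outliers(points):
    """Return indices of points whose distance sum lies above the largest gap.

    The largest-gap rule is an assumption of this sketch; it always flags
    at least one point when two or more points are given.
    """
    if len(points) < 2:
        return []
    sums = distance_sums(points)
    # Indices ordered by increasing distance sum.
    order = sorted(range(len(points)), key=lambda i: sums[i])
    # Gap between each consecutive pair of sorted distance sums.
    gaps = [sums[order[i + 1]] - sums[order[i]] for i in range(len(order) - 1)]
    cut = max(range(len(gaps)), key=lambda i: gaps[i])
    return sorted(order[cut + 1:])


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
outliers = flag_outliers(points)
```

Here the four corner points of the unit square have similar distance sums, while `(10, 10)` has a much larger one and is the only index returned. Computing all pairwise distances is quadratic in the number of points, which is the kind of cost that motivates a parallel design for massive data.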
Keywords/Search Tags: data processing, cloud platform, outlier detection, Hadoop, K-Medoids