Design And Implementation Of Parallel Data Mining System Based On Spark

Posted on:2018-03-21

Degree:Master

Type:Thesis

Country:China

Candidate:H L Su

Full Text:PDF

GTID:2348330518994420

Subject:Computer technology

Abstract/Summary:

With the rapid development and wide application of big data technology, a large body of related research has appeared in academia and IT enterprises, including improvements to Hadoop and Spark and performance optimization of data mining algorithms. However, little research has focused on offering data mining platforms as a service, and the use of data mining platforms such as Spark remains at an elementary level. When using such platforms, application developers have to manage cluster deployment, operation, and maintenance on their own, and even the underlying resources, which consumes a great deal of time and energy.

This paper builds on container technology, which is lightweight and quick to start: a Spark cluster can be created quickly on containers and expanded or contracted dynamically, achieving automatic deployment and management. The Spark cluster thus becomes an on-demand service offered to data mining application developers.

The main contents of this paper fall into two parts. First, a Spark data mining service system is designed and implemented, covering unified management of the underlying physical resources, container-based automatic configuration and modification of Spark clusters, and management of data mining application submission. Requirements are analyzed from the developer's point of view in both functional and non-functional terms, and the module division of the system is then worked out layer by layer, from the top down, through to implementation. Second, this paper proposes a data dependency model, based on dependency analysis techniques, to analyze the dependences among data elements in algorithms. It also proposes a parallelization analysis algorithm that assesses the feasibility of data-parallel computation on the source dataset, reducing developers' workload and improving the efficiency of solving these problems.

Keywords/Search Tags:

big data, spark, container, parallel computation, dependency analysis

Related items

1	Optimization And Application Of SVM Algorithm Based On Spark
2	Spark-based Distributed Functional Dependency Discovery Algorithm
3	Research And Implementation Of Efficient WEB Container Log Processing System Based On Spark
4	Research On Parallel Feature Selection Algorithm Based On Spark
5	Study And Implementation On Distributed Large Scale Matrix Computation Algorithms With Spark
6	Research On Customs Commodity Risk Tax Detection Based On Spark Platform
7	Research On Fast Data Cube Computation Method Based On Spark Platform
8	The Implementation Of Remote-Memory Management System And Performance Optimization In Spark
9	Performance Analysis Of Parallel Mind Evolutionary Computation
10	Research And Improvement Of Big Data Parallel Clustering Algorithm Based On Spark