Design And Implementation Of Parallel Data Mining System Based On Spark

Posted on: 2018-03-21 | Degree: Master | Type: Thesis
Country: China | Candidate: H L Su
GTID: 2348330518994420 | Subject: Computer technology
Abstract/Summary:
With the rapid development and wide application of big data technology, a large body of related research has appeared in academia and industry, including improvements to Hadoop and Spark and performance optimization of data mining algorithms. However, little of this research has focused on delivering the data mining platform as a service, and the use of platforms such as Spark remains at an elementary level. When using such platforms, application developers have to handle cluster deployment, operation, and maintenance on their own, and even manage the underlying resources, which consumes a great deal of time and effort.

This paper builds on container technology, which is lightweight and starts quickly. Spark clusters can be created rapidly on containers and scaled out and in dynamically, achieving automated deployment and management, so that the Spark cluster becomes an on-demand service offered to data mining application developers.

The main content of this paper falls into two parts. The first is the design and implementation of a Spark data mining service system, covering unified management of the underlying physical resources, container-based automatic configuration and modification of Spark clusters, and management of data mining application submission. Requirements are analyzed from the developer's point of view in both functional and non-functional terms; the module division of the system is then analyzed layer by layer, from the top down, through to implementation. The second is a data dependency model, based on dependency analysis techniques, for analyzing the dependences among data elements in algorithms. A parallelization analysis algorithm is also proposed that analyzes the feasibility of data-parallel computation on the source dataset, reducing developers' workload and improving the efficiency of solving these problems.
Keywords/Search Tags: big data, Spark, container, parallel computation, dependency analysis