Research Of Large-scale Data Mining Technology Based On Spark

Posted on:2019-07-30

Degree:Master

Type:Thesis

Country:China

Candidate:W Gui

Full Text:PDF

GTID:2428330548477010

Subject:Management Science and Engineering

Abstract/Summary:

Extracting valuable information from massive data has become an important research topic in the era of big data, and running data mining algorithms over such data is a central problem within it. However, serial algorithms in a traditional stand-alone environment take a long time to process data at this volume, which makes it increasingly difficult to keep up with computation tasks of growing scale. Distributed computing technology provides a technical solution for mining massive data. Spark is a memory-based computing framework, and highly iterative data mining algorithms often run faster on it than on Hadoop. This thesis builds a distributed Spark cluster and, on that basis, parallelizes classical algorithms.

The FP-Growth algorithm in a stand-alone environment cannot meet the demand of mining frequent itemsets from massive data. To address this, the support counting and grouping steps of the algorithm are improved within the DAG (Directed Acyclic Graph) in-memory computation framework, building on the theory and technology of Spark Core. The Spark Core resource scheduling parameters are then adjusted, and the number of processes on each sub-node and their corresponding CPU cores are set so that computing resources are allocated evenly while the algorithm runs. The experimental results show that the improved parallel algorithm achieves better time performance, so it can be applied to frequent itemset mining on large-scale data.

In the K-means algorithm, the choice of the K value is uncertain, and random selection of the initial center points causes a large error. This thesis improves the selection of the initial cluster centers by defining a probability function, and simplifies the distance calculation formula. The improved algorithm is parallelized on Spark. In the experimental simulation stage, a reasonable scheme for selecting K is given by clustering multiple times and evaluating each result by the minimum sum of squared errors. The results show that the improved algorithm achieves better time performance and clustering accuracy.

Based on the improved parallel algorithms above, this thesis takes massive taxi driving data from Jinan as a case study and draws the road network topology of Jinan City using GIS (Geographic Information System) technology. Finally, a traffic hotspot map of taxi operation and the static subareas of the Jinan traffic network system are constructed from the experimental results, providing a technical reference for locating temporary waiting points.
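
The K-means steps described in the abstract (probability-based choice of initial centers, a simplified distance, and choosing K by repeated clustering scored with the sum of squared errors) can be illustrated with a small single-machine sketch. The abstract does not give the thesis's probability function or its simplified distance formula, so this sketch makes two assumptions: it uses distance-weighted seeding in the style of k-means++ as a stand-in for the probability function, and it uses squared Euclidean distance (no square root) as the simplified distance. All function names and the toy data are hypothetical, and nothing here uses Spark.

```python
import random


def sq_dist(a, b):
    # Squared Euclidean distance; skipping the square root is an assumed
    # simplification, not the formula from the thesis.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seed_centers(points, k, rng):
    # Assumed stand-in for the thesis's probability function: pick each new
    # center with probability proportional to its squared distance from the
    # nearest center chosen so far (k-means++ style seeding).
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            centers.append(rng.choice(points))  # all points already covered
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, rng, max_iter=50):
    # Plain Lloyd iterations starting from the weighted seeds.
    centers = seed_centers(points, k, rng)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Sum of squared errors of every point to its nearest center.
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, sse


def sse_by_k(points, k_values, runs=5, seed=0):
    # Cluster several times per K and keep the smallest SSE for each K.
    rng = random.Random(seed)
    return {k: min(kmeans(points, k, rng)[1] for _ in range(runs))
            for k in k_values}


# Toy data: three well-separated 2-D groups (purely illustrative).
_rng = random.Random(42)
points = [(cx + _rng.gauss(0, 0.5), cy + _rng.gauss(0, 0.5))
          for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(30)]
scores = sse_by_k(points, range(1, 6))
for k, sse in scores.items():
    print(k, round(sse, 2))
```

Reading the printed SSE values against K, a reasonable K is the point after which adding more clusters stops reducing the error sharply. In a Spark setting the assignment step would be distributed across partitions, which this sketch does not attempt.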

Keywords/Search Tags:

Data mining, Spark, Association rules, Clustering

Related items

1	Association Rules Mining And Its Applications In Microarray Gene Expression Data
2	Research On The Optimization Of Association Rules
3	Research On Association Rules Mining In Data Streams And Its Application
4	Distributed Association Rules Algorithm Based On The Spark
5	Research And Application Of Association Rules Mining Algorithm Based On Spark
6	The Research And Application Of Data Mining In Mining Rules Of Medical Diagnosis
7	The Research On The Algorithm Of Mining Quantitative Association Rules
8	Research On The Algorithm Of Telecom Business Association Rules
9	Parallelizable Algorithms Research Of Association Rules Mining
10	Research And Application On The Technologies In Mining Association Rules