Research Of Large-scale Data Mining Technology Based On Spark

Posted on: 2019-07-30
Degree: Master
Type: Thesis
Country: China
Candidate: W Gui
Full Text: PDF
GTID: 2428330548477010
Subject: Management Science and Engineering
Abstract/Summary:
Extracting valuable information from massive data sets has become an important research topic in the era of big data, and running data mining algorithms over data at that scale is a central part of it. Serial algorithms in a traditional stand-alone environment take a long time to process such volumes, which makes it increasingly difficult to keep up with computation tasks of growing scale. Distributed computing offers a technical solution for mining massive data. Spark is a memory-based computing framework, and highly iterative data mining algorithms often run faster on it than on Hadoop. This thesis builds a distributed Spark cluster and, on that basis, parallelizes classical data mining algorithms.

The FP-Growth algorithm in a stand-alone environment cannot meet the demands of mining frequent itemsets from massive data. Based on the theory and technology of Spark Core, the support counting and grouping steps of the algorithm are improved within the DAG (Directed Acyclic Graph) in-memory computation framework. The Spark Core resource scheduling parameters are then adjusted, and the number of processes on each sub-node and the CPU cores assigned to them are set so that computing resources are allocated evenly while the algorithm runs. The experimental results show that the improved parallel algorithm has better time performance, so it can be applied to frequent itemset mining on large-scale data.

In the K-means algorithm, the choice of the K value is uncertain and random selection of the initial center points causes large errors. This thesis improves the selection of the initial cluster centers by defining a probability function, simplifies the distance calculation formula, and parallelizes the improved algorithm on Spark. In the experimental simulation stage, a reasonable scheme for selecting K is obtained by running the clustering multiple times and evaluating the results by the minimum sum of squared errors. The results show that the improved algorithm has better time performance and clustering accuracy.

Building on these improved parallel algorithms, the thesis takes massive taxi driving data from Jinan as a case study and draws the road network topology of Jinan City using GIS (Geographic Information System) technology. From the experimental results, a traffic hotspot map of taxi operation and a static subarea partition of the Jinan traffic network are constructed, providing a technical reference for locating temporary waiting points.
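The abstract does not give the probability function or the simplified distance formula it uses, so the following is only a minimal single-machine sketch of the general idea, not the thesis's method. It assumes a k-means++-style rule as a stand-in (each new initial center is drawn with probability proportional to its squared distance from the nearest center already chosen) and evaluates several K values by their sum of squared errors (SSE). All function names here (`seed_centers`, `kmeans`, `sse_by_k`) are hypothetical, and the code is plain Python rather than Spark.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance (no square root needed for comparisons)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def seed_centers(points, k, rng):
    """Pick k initial centers; assumed D^2 weighting, not the thesis's function."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen center.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a center; fall back to a uniform draw.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, iters=20, seed=0):
    """Run Lloyd's iterations and return (centers, SSE)."""
    rng = random.Random(seed)
    centers = seed_centers(points, k, rng)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[idx].append(p)
        # Recompute each center as its cluster mean; keep the old center if empty.
        centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, sse


def sse_by_k(points, k_values, seed=0):
    """Cluster once per candidate K and report the SSE for each."""
    return {k: kmeans(points, k, seed=seed)[1] for k in k_values}


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    for k, sse in sse_by_k(pts, [1, 2, 3, 4]).items():
        print(k, round(sse, 3))
```

Because SSE generally keeps shrinking as K grows, a scheme like this usually looks for the K after which the decrease levels off rather than for the smallest SSE outright. On a Spark cluster, the assignment step would be distributed across partitions instead of looping over `points` on one machine.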
Keywords/Search Tags: Data mining, Spark, Association rules, Clustering