
The Optimization Of Clustering And Classification Algorithms Based On SPARK

Posted on: 2018-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Dang
GTID: 2348330512981347
Subject: Engineering
Abstract/Summary:
In recent years, with the rapid development of the Internet, massive amounts of data have been produced every day. These data contain a great deal of valuable information, but it is difficult to extract that information from such large volumes with traditional methods. Distributed computing technology emerged to solve this problem of mass data processing. The storage and processing capacity of a computer cluster makes it possible to overcome the performance bottleneck of the traditional approach. However, traditional data mining algorithms are not suited to a distributed computing environment, so they must be optimized for parallel execution to meet the requirements of cluster computing. This has been one of the active research topics in the field of big data in recent years.

The mainstream distributed computing frameworks include Storm, Hadoop and Spark, which are mainly divided into batch processing and stream processing modes. Storm was developed mainly for real-time data flow scenarios in stream processing mode. Hadoop and Spark were developed mainly for large-scale data storage and processing with batch computing models. HDFS offers high reliability, high scalability and high fault tolerance, so it is suitable for large-scale data storage. MapReduce is a parallel programming model that greatly simplifies programming. RDD is a simpler programming model and is much more efficient than Hadoop, especially for iterative algorithms. The emergence of these distributed computing frameworks makes large-scale data processing far more convenient.

This thesis studies the parallelization of the DBSCAN clustering algorithm and the L1-SVM classification algorithm, combining the characteristics of the Spark framework with those of the two algorithms to extend them for parallel execution. The main work is as follows:

1. The traditional clustering algorithm DBSCAN is analyzed to study how its parallelization can be improved. Because the data distribution is uneven, the parallel DBSCAN algorithm introduces a new way to partition the data in order to solve the load-balancing problem of data segmentation. First, the parallel DBSCAN algorithm clusters the data on each local node, using the same clustering procedure as the traditional algorithm. Finally, the clustering results of the nodes are merged to form the final result. RDDs are used in this process to keep data iteration efficient.

2. The parallelization of the L1-SVM classification algorithm is improved. First, it is necessary to solve the problem that each iteration of the nonlinear SVM depends on the result of the previous iteration. In this thesis, the results of multiple discriminant functions are used to approximate the global result, which avoids the data duplication and efficiency problems of some algorithms.

3. A Spark experimental environment is constructed, and the parallelized algorithms are compared with the traditional algorithms under different parameters and on various indicators.
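The partition, local-clustering and merge steps described for the parallel DBSCAN can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not specify the partitioning scheme or the merge rule, so the equal-count strips along the first coordinate, the eps-wide overlap, and the names `partition_with_halo` and `parallel_dbscan` are all hypothetical. The loop over partitions stands in for what would be one task per partition on Spark.

```python
from collections import defaultdict
from math import dist


def dbscan(points, eps, min_pts):
    """Plain DBSCAN. Returns (labels, core); label -1 means noise."""
    n = len(points)
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbors]  # a point counts itself
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels, core


def partition_with_halo(points, k, eps):
    """Assumed scheme: k strips of roughly equal point count along x,
    each widened by eps so that boundary points appear in both neighbors."""
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    size = -(-len(order) // k)  # ceiling division
    parts = []
    for start in range(0, len(order), size):
        chunk = order[start:start + size]
        lo = points[chunk[0]][0] - eps
        hi = points[chunk[-1]][0] + eps
        parts.append([i for i in order if lo <= points[i][0] <= hi])
    return parts


def parallel_dbscan(points, eps, min_pts, k):
    """Cluster each partition locally, then merge local clusters that share
    a point which is a core point in at least one partition."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # global point index -> [((partition id, local label), is_core), ...]
    memberships = defaultdict(list)
    for pid, idxs in enumerate(partition_with_halo(points, k, eps)):
        labels, core = dbscan([points[i] for i in idxs], eps, min_pts)
        for local, gi in enumerate(idxs):
            if labels[local] != -1:
                memberships[gi].append(((pid, labels[local]), core[local]))

    for entries in memberships.values():
        if any(is_core for _, is_core in entries):
            first = entries[0][0]
            for key, _ in entries[1:]:
                parent[find(first)] = find(key)

    roots = {}
    result = [-1] * len(points)
    for gi, entries in memberships.items():
        root = find(entries[0][0])
        result[gi] = roots.setdefault(root, len(roots))
    return result


# Two dense runs of points plus one isolated point, split into 3 partitions.
points = ([(0.5 * i, 0.0) for i in range(10)]
          + [(20.0 + 0.5 * i, 0.0) for i in range(10)]
          + [(100.0, 100.0)])
merged = parallel_dbscan(points, eps=0.6, min_pts=3, k=3)
single, _ = dbscan(points, eps=0.6, min_pts=3)
```

The eps-wide overlap is what makes the merge step possible in this sketch: a point near a strip boundary sees its full neighborhood in the strip that owns it, so a cluster cut by a boundary shows up on both sides with at least one shared core point, and the union-find pass joins the two halves.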
Keywords/Search Tags: Spark, DBSCAN, L1-SVM, Parallelization