Research On High Efficient Data Mining Algorithm Under The Distributed Environment

Posted on:2018-10-03

Degree:Master

Type:Thesis

Country:China

Candidate:Q Zhang


GTID:2348330518458010

Subject:Computer application technology

Abstract/Summary:

With the emergence and development of the mobile Internet, the Internet of Things, social networks, the digital home, e-commerce, and other new-generation information technologies, a large amount of data is constantly being generated. Cloud computing platforms provide storage and computing capabilities for this massive and diverse big data. After a series of steps such as management, processing, analysis, and optimization, the results are fed back to these applications through various distributed platforms, bringing considerable economic and social value. Data mining mainly consists of four parts: clustering analysis, prediction modeling, correlation analysis, and anomaly detection. By applying data mining algorithms, the required information can be extracted from complex data. Some classical algorithms, such as clustering, classification, and association rule algorithms, have been widely used in practice. Based on the distributed framework Spark, this thesis focuses on two topics: mining maximal frequent itemsets and density clustering.

For frequent itemset mining, valuable high-level information tends to be hidden in the longer frequent itemsets, so mining maximal frequent itemsets is especially worthwhile. Combining the advantages of existing algorithms, the refined algorithm addresses frequent itemset mining on large-scale, high-dimensional datasets and avoids the loss of efficiency that conventional maximal frequent itemset mining algorithms suffer on such data. It first uses depth-first search to generate the candidate maximal frequent itemsets, then sorts the acquired candidates by length and repeatedly tests each one against its possible supersets.

For density clustering, an overlapping density clustering algorithm based on the G-DBSCAN algorithm and the Spark framework is proposed after analyzing the limitations of existing density clustering algorithms in handling large data. The algorithm refines the way local clustering results from different partitions are merged. It first traverses the dataset to generate overlapping groups of fixed size, then merges clusters locally within each group, and finally merges all partitions globally. The improved algorithm not only increases the accuracy and time efficiency of the clustering results, but also reduces the hardware requirements for processing massive data. Experiments show that both improved algorithms are feasible and effective.
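The maximal frequent itemset steps described in the abstract (depth-first candidate generation, sorting by length, superset testing) can be illustrated with a minimal single-machine sketch. This is not the thesis's Spark implementation, whose details are not given here; the function names `dfs_candidates` and `keep_maximal`, the brute-force support counting, and the absolute `min_support` threshold are all assumptions made for illustration.

```python
def dfs_candidates(transactions, min_support):
    """Depth-first search that collects frequent itemsets which cannot be
    extended by any later item (candidate maximal itemsets).
    Hypothetical helper; support is counted by brute force for clarity."""
    tsets = [frozenset(t) for t in transactions]
    items = sorted({i for t in tsets for i in t})
    found = []

    def support(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for t in tsets if itemset <= t)

    def dfs(prefix, start):
        extended = False
        for k in range(start, len(items)):
            cand = prefix | {items[k]}
            if support(cand) >= min_support:
                extended = True
                dfs(cand, k + 1)
        # A non-empty prefix with no frequent extension is a candidate.
        if prefix and not extended:
            found.append(prefix)

    dfs(frozenset(), 0)
    return found


def keep_maximal(candidates):
    """Sort candidates by length (longest first) and drop every candidate
    that is a proper subset of an itemset already kept."""
    maximal = []
    for c in sorted(set(candidates), key=len, reverse=True):
        if not any(c < m for m in maximal):
            maximal.append(c)
    return maximal


if __name__ == "__main__":
    # Toy data, invented for this example.
    data = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"b", "d"}, {"b", "d"}]
    for itemset in keep_maximal(dfs_candidates(data, min_support=2)):
        print(sorted(itemset))
```

Sorting longest-first means any proper superset of a candidate has already been examined by the time that candidate is tested, so a single pass over the sorted list is enough for the superset check.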

Keywords/Search Tags:

big data, data mining, frequent items, density clustering, the distributed technology

Related items

1	Research On Frequent Items Mining And Clustering Algorithms Of Data Stream
2	Research On Frequent Items Mining Technology Based On Trajectory Data
3	Study On Key Technologies Of Frequent Items Mining And Clustering On Data Streams
4	Research On Algorithms For Mining Frequent Patterns In Data Streams
5	Research On Count-based Algorithm For Mining Frequent Items Over Data Stream
6	Theoretical Analysis And Algorithm Study On Improvement Of Finding Frequent Items In Data Streams
7	Research And Application Of Key Technologies Of Distributed Computing Over Data Streams
8	Research On The Algorithm For Mining Frequent Items From Data Streams
9	Research On Frequent Items Problem Using Lower Bound In Massive Data Stream
10	The Research Of Frequent Itemsets Mining Algorithm Over Data Streams