
Research On High Efficient Data Mining Algorithm Under The Distributed Environment

Posted on: 2018-10-03
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhang
Full Text: PDF
GTID: 2348330518458010
Subject: Computer application technology
Abstract/Summary:
With the emergence and development of the mobile Internet, the Internet of Things, social networks, digital homes, e-commerce, and other new-generation information technologies, large volumes of data are generated continuously. Cloud computing platforms provide the storage and computing capacity for this massive and diverse data. After a series of steps such as management, processing, analysis, and optimization, the results are fed back to these applications through various distributed platforms, creating considerable economic and social value. Data mining comprises four main tasks: clustering analysis, predictive modeling, association analysis, and anomaly detection. Applying data mining algorithms extracts the required information from complex data, and classical algorithms for clustering, classification, and association rule mining are widely used in practice. Based on the distributed framework Spark, this thesis studies maximal frequent itemset mining and density-based clustering.

For frequent itemset mining, longer frequent itemsets carry higher-value, higher-level information, so mining maximal frequent itemsets is especially valuable. Combining the advantages of existing algorithms, the refined algorithm mines frequent itemsets from large-scale, high-dimensional datasets while avoiding the loss of efficiency that conventional maximal frequent itemset mining algorithms suffer on such data. It first uses depth-first search to generate the candidate maximal frequent itemsets, then sorts the resulting candidates by length and iteratively tests them for supersets.

For density clustering, the thesis analyzes the limitations of existing density clustering algorithms on large datasets and proposes an overlapping density clustering algorithm based on the G-DBSCAN algorithm and the Spark framework. The algorithm refines how local clustering results from different partitions are merged: it first generates fixed-size overlapping groups by traversing the dataset, then merges clusters locally within each group, and finally merges all partitions globally. The improved algorithm not only increases the accuracy and time efficiency of clustering but also reduces the hardware requirements for processing massive data. Experiments show that both improved algorithms are feasible and effective.
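
As a rough illustration of the three steps named for the frequent itemset part (depth-first candidate generation, sorting by length, superset testing), the following is a minimal single-machine sketch in Python. It is not the thesis's Spark implementation: the function name `maximal_frequent_itemsets`, the `min_support` parameter (an absolute transaction count), and the naive support counting are all assumptions made for the example.

```python
from typing import FrozenSet, Iterable, List


def maximal_frequent_itemsets(
    transactions: Iterable[Iterable[str]], min_support: int
) -> List[FrozenSet[str]]:
    """Illustrative helper (hypothetical, not taken from the thesis)."""
    txns = [frozenset(t) for t in transactions]

    def support(itemset: FrozenSet[str]) -> int:
        # Naive support count: number of transactions containing the itemset.
        return sum(1 for t in txns if itemset <= t)

    # Frequent single items in a fixed order, so the DFS visits each set once.
    items = sorted({i for t in txns for i in t})
    items = [i for i in items if support(frozenset([i])) >= min_support]

    candidates: List[FrozenSet[str]] = []

    def dfs(current: FrozenSet[str], start: int) -> None:
        extended = False
        for k in range(start, len(items)):
            nxt = current | {items[k]}
            if support(nxt) >= min_support:
                extended = True
                dfs(nxt, k + 1)
        # Step 1: a set with no frequent extension in DFS order is a candidate.
        if not extended and current:
            candidates.append(current)

    dfs(frozenset(), 0)

    # Step 2: sort candidates longest first.
    candidates.sort(key=len, reverse=True)

    # Step 3: keep a candidate only if no already-kept set is a proper superset.
    maximal: List[FrozenSet[str]] = []
    for c in candidates:
        if not any(c < m for m in maximal):
            maximal.append(c)
    return maximal


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c", "d"}, {"d"}]
    for itemset in maximal_frequent_itemsets(data, min_support=2):
        print(sorted(itemset))
```

Sorting longest first means every possible superset of a candidate has already been examined by the time the candidate is tested, so one pass over the sorted list is enough. A distributed version would additionally need to partition the support counting, which this sketch does not attempt.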
Keywords/Search Tags: big data, data mining, frequent items, density clustering, distributed technology