
Research And Application Of Association Rule Algorithm Based On Spark Platform

Posted on: 2019-01-25 | Degree: Master | Type: Thesis
Country: China | Candidate: P Xu | Full Text: PDF
GTID: 2428330566499383 | Subject: Computer technology
Abstract/Summary:
With the development of the Internet and mobile Internet, digital devices, the Internet of Things and sensors, and other technologies, and with the continuous advance of social informatization, the rate of global data production is growing rapidly. The sheer scale, ubiquity, and explosive growth of this data mean that we truly live in a big data age. Association rule mining is an important branch of data mining. It is mainly devoted to discovering and analyzing related items in large volumes of transaction data, where previously unknown and potentially valuable knowledge and rules exist between these items. In the big data environment, classical association rule mining techniques cannot cope with characteristics such as large data volume, diverse data structures, and distributed data storage. It is therefore necessary to combine existing mining theories and models with the service characteristics of massive data, and to improve traditional association rule algorithms so that they suit today's popular big data computing engines. Based on the characteristics of big data development and the research status of association rule mining, this thesis studies and applies association rule algorithms on the popular big data computing engine Spark. The main work is as follows:

First, the thesis analyzes the big data platform. It examines Hadoop's distributed file storage component, the member of the Hadoop ecosystem that stores large amounts of data. It also focuses on the big data analysis engine Spark, including its system structure, program execution, and the operation logic and caching mechanism of Spark RDDs.

Second, the thesis studies association rule algorithms and their parallel mechanisms. It analyzes the classical association rule algorithm in depth, along with R-Apriori, an existing parallel algorithm based on the Spark platform. The classical Apriori algorithm has two defects: (1) its iterative process requires frequent scans of the transaction data set, which generates a large amount of I/O overhead; (2) computing frequent itemsets produces a large number of candidate sets. The thesis adopts two methods to improve the algorithm: changing the data storage structure and optimizing the candidate join process. The experimental results show that the improved Apriori algorithm reduces the time spent scanning the database, and that the generation time of candidate sets decreases exponentially with the number of items. The thesis then turns to Spark's parallel mechanism, analyzing the Spark computing engine in depth and studying the data transformation process of Spark's resilient distributed datasets (RDDs). In addition, it illustrates the parallel algorithm on the Spark platform and verifies its feasibility.

Finally, building on the above methods and theories, the thesis overcomes the inherent limitations of the traditional serial Apriori algorithm, builds an experimental Spark cluster, and implements the parallelized improved Apriori algorithm under the Spark computing framework. The results show that the algorithm is superior to the R-Apriori algorithm under the Spark framework in terms of data scalability, speedup ratio, and scalability.
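To make the two defects of classical Apriori concrete, the following is a minimal Python sketch of the textbook serial algorithm. It is not the thesis's improved or parallel version, whose details the abstract does not give. The function name `apriori`, the `scans` counter, and the sample transactions are assumptions introduced only for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Textbook serial Apriori (illustrative sketch, not the thesis's code).

    Returns ({frozenset(itemset): support_count}, number_of_full_scans).
    Items must be mutually sortable (for example, all strings).
    """
    transactions = [frozenset(t) for t in transactions]
    scans = 0  # hypothetical counter: full passes over the transaction data

    # Pass 1: count every single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    scans += 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: merge two frequent (k-1)-itemsets that share a (k-2)-prefix.
        prev = sorted(tuple(sorted(s)) for s in frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:-1] != prev[j][:-1]:
                    break  # sorted order: no later itemset shares this prefix
                cand = frozenset(prev[i]) | frozenset(prev[j])
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(frozenset(sub) in frequent
                       for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
        if not candidates:
            break

        # Counting step: defect (1), one more full scan of the data per level.
        # Defect (2) shows up as the size of `candidates` on larger inputs.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        scans += 1

        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1

    return result, scans


# Hypothetical sample data for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "diaper", "beer", "eggs"],
    ["milk", "diaper", "beer", "cola"],
    ["bread", "milk", "diaper", "beer"],
    ["bread", "milk", "diaper", "cola"],
]
itemsets, scans = apriori(baskets, min_support=3)
```

In this sketch every level k costs one additional pass over all transactions, and the join step can emit many candidates before pruning. These are the two costs that the thesis targets by changing the data storage structure and optimizing the candidate join.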
Keywords/Search Tags: big data, association rule algorithm, Hadoop, Spark, Apriori, parallelization
Related items