
The Researches Of Data Mining Technology Based On Data Stream In The Big Data Environment

Posted on: 2016-07-27
Degree: Master
Type: Thesis
Country: China
Candidate: J Wang
Full Text: PDF
GTID: 2348330476955762
Subject: Software engineering
Abstract/Summary:
With the growth of the internet and e-commerce and the explosive increase in data volume, we have entered the era of big data. Fields such as sensor data transmission, web log collection and mobile phone interaction place new demands on data management and analysis, and these demands are in essence about managing and analyzing data streams. The new challenge for data mining in the big data environment is how to extract useful information for enterprises from huge amounts of data quickly, accurately and at low cost. At the same time, the popularity of e-commerce has moved shopping from offline to online. Recommendation systems have emerged in this environment, which also points to the problem of information overload.

This thesis focuses on two issues: (1) data stream algorithms for mining frequent elements; (2) mining users' recommendation information with parallel algorithms. Both are hot topics in data mining. When traditional serial algorithms are applied to these problems, their efficiency declines sharply as the amount of data grows, because of the limited computing power or memory capacity of a single machine. This thesis analyzes and improves the traditional algorithms and compares them experimentally.

(1) For mining frequent elements in data streams, this thesis discusses methods for finding the most common elements and the most popular elements. The most-common-element problem is solved with an improved DGIM algorithm and a distributed algorithm on the Hadoop platform. The most-popular-element problem is solved with an improved exponential decay window model. By adjusting the number of buckets allowed at each size, the improved DGIM algorithm's upper bound on the error converges to a small value.
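The bucket adjustment described above can be illustrated with a minimal single-machine sketch of the textbook DGIM algorithm, generalized so that up to `max_per_size` buckets of each size are kept before the two oldest are merged. This is not the thesis's improved algorithm or its Hadoop implementation; the class name `DGIM`, the parameter `max_per_size` and the list-based bucket layout are assumptions made for illustration only.

```python
class DGIM:
    """Approximate count of 1s in the last `window` bits of a bit stream."""

    def __init__(self, window, max_per_size=2):
        self.window = window
        # Hypothetical name for the tunable number of same-size buckets.
        self.max_per_size = max_per_size
        self.t = 0
        # Each bucket is (timestamp of its newest 1, size); newest bucket first.
        self.buckets = []

    def add(self, bit):
        self.t += 1
        # Drop buckets whose newest 1 has slid out of the window.
        while self.buckets and self.buckets[-1][0] <= self.t - self.window:
            self.buckets.pop()
        if bit:
            self.buckets.insert(0, (self.t, 1))
            self._merge()

    def _merge(self):
        size = 1
        while True:
            idx = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(idx) <= self.max_per_size:
                return
            # Merge the two oldest buckets of this size into one of double size,
            # keeping the timestamp of the newer of the two.
            newer, older = idx[-2], idx[-1]
            self.buckets[newer] = (self.buckets[newer][0], 2 * size)
            del self.buckets[older]
            size *= 2  # the merge may overflow the next size up

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        # Count only half of the oldest bucket, since it may straddle
        # the window boundary.
        return total - self.buckets[-1][1] // 2
```

Raising `max_per_size` keeps the oldest bucket smaller relative to the total, which tightens the estimate at the cost of storing more buckets.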
The improved exponential decay window model gains computational efficiency by applying a threshold to the model.

(2) For mining users' recommendation information with parallel algorithms, this thesis analyzes association rules and collaborative filtering recommendation. For association rule mining, the traditional Apriori algorithm is improved using the Hadoop platform. By adding local candidate itemsets, the improved algorithm greatly reduces the number of iterations and mines the global frequent itemsets in only two passes over the data set, which makes it better suited to mining in the big data environment. For collaborative filtering recommendation, the similarity computation is parallelized on Hadoop, which distributes the calculation across the machine nodes, and Hadoop-based common words analysis is used to fill the utility matrix. The Hadoop-based common words analysis algorithm does not need to load the utility matrix into local memory. By transforming row and column vectors to fit the MapReduce model, which computes on row vectors, the time complexity of the algorithm is reduced from O(m^3 n) to O(mn), so the improved algorithm is better suited to the big data environment.

In the experimental section, the recommendation process is simulated on a dual-platform architecture. The Hadoop-based computing platform provides good support for effective data processing.
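The two-pass idea behind the Apriori improvement can be sketched in a single process as follows. The sketch follows the well-known partition-based scheme: pass one mines local candidate itemsets in each partition, pass two counts those candidates over the whole data set. Whether the thesis uses exactly this scheme is an assumption, and the function names `local_frequent` and `two_pass_frequent`, the parameter `n_parts` and the in-memory partitioning are hypothetical stand-ins for the Hadoop map and reduce phases.

```python
import math
from collections import Counter
from itertools import combinations


def local_frequent(chunk, min_count):
    """Plain Apriori on one in-memory partition; returns frequent itemsets."""
    chunk = [frozenset(t) for t in chunk]
    counts = Counter(frozenset([item]) for t in chunk for item in t)
    level = {s for s, c in counts.items() if c >= min_count}
    found = set(level)
    k = 2
    while level:
        # Join step: build size-k candidates from frequent (k-1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        counts = Counter(c for t in chunk for c in candidates if c <= t)
        level = {c for c, n in counts.items() if n >= min_count}
        found |= level
        k += 1
    return found


def two_pass_frequent(transactions, min_support, n_parts=2):
    """Return {itemset: count} for itemsets with support >= min_support."""
    parts = [transactions[i::n_parts] for i in range(n_parts)]
    # Pass 1 (map side): local candidates, with the threshold scaled to the
    # partition size. A globally frequent itemset is frequent in at least
    # one partition, so no true result is lost.
    candidates = set()
    for part in parts:
        local_min = max(1, math.ceil(min_support * len(part)))
        candidates |= local_frequent(part, local_min)
    # Pass 2 (reduce side): count every candidate over the full data set.
    counts = Counter()
    for t in transactions:
        items = frozenset(t)
        counts.update(c for c in candidates if c <= items)
    threshold = min_support * len(transactions)
    return {c: n for c, n in counts.items() if n >= threshold}
```

A call such as `two_pass_frequent(transactions, 0.5, n_parts=2)` reads the data twice regardless of the largest itemset size, whereas level-wise Apriori needs one full pass per itemset size.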
Keywords/Search Tags: Data stream mining, The mining of association rules, Information recommendation, Parallelization