
Research And Improvement Of Association Rules Algorithm Based On Hadoop

Posted on: 2017-12-29
Degree: Master
Type: Thesis
Country: China
Candidate: X Gao
Full Text: PDF
GTID: 2348330512452841
Subject: Software engineering
Abstract/Summary:
With the development of society, the Internet has become widespread and plays an important role in daily life and work. Intelligent terminals are increasingly common and generate large volumes of social network data. These data hide many valuable relationships, and how to manage and mine them has attracted growing attention. Computer practitioners have found that mining data in depth at scale uncovers regularities in the data, which can then be organized to support users' decisions. As data grow exponentially, the computation required for mining generally exceeds the capacity of a local computer, but increasingly active distributed platforms can provide strong processing capacity. If we combine data mining with cloud computing so that big data is processed in parallel, we can run the Apriori algorithm with high computing efficiency and, at the same time, obtain more of the core data rules.

While describing each stage of data mining, this thesis analyzes the Apriori algorithm. To apply the Apriori algorithm on a distributed platform, we optimized its architecture to some degree so that it meets the needs of the platform and completes all the data mining tasks there. In the design, we partition the data items so that they can be processed in parallel in a distributed environment, and the cluster then computes their frequencies. The Hadoop platform must sort all the transactions: the master allocates the tasks, and the transactions are first distributed across the workers. Each worker sorts its own portion of the data, and the Apriori algorithm is implemented on the cluster as map and reduce operations.

The platform mines the relationships among item sets and needs to find the largest item set of each subset. After each worker has sorted all of its transactions, we obtain the corresponding sub-item sets.
The algorithm requires that the number of sub-item sets be less than or equal to the preset maximum number of item sets. The support of each new sub-item set is computed automatically by the algorithm. On the Hadoop platform, each worker computes the support of its items and organizes the related transactions; in the Reduce stage, identical item sets are merged and their supports are summed. Next, the system compares the support of each item set with the minimum support and deletes those whose support is smaller than the minimum. Finally, the system merges all the frequent item sets and completes the Apriori computation. This thesis implements an Apriori algorithm architecture on the distributed computing framework Hadoop, and the experimental results verify the feasibility of running the Apriori algorithm on a distributed framework, which is of great significance for processing big data on distributed platforms.
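The support-counting flow described in the abstract can be sketched in plain Python. This is only an illustration under stated assumptions, not the implementation from the thesis: it does not use the Hadoop API, each inner list of transactions stands in for one worker's partition, support is an absolute count, and the preset limit is interpreted here as a cap on item-set size (`max_size`). All function names are hypothetical. It also enumerates subsets directly instead of generating candidates level by level as classic Apriori does.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transactions, max_size):
    """Emit (itemset, 1) for every subset of 1..max_size items in each transaction."""
    for transaction in transactions:
        # Sort the distinct items so equal item sets always produce the same key.
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                yield itemset, 1


def reduce_phase(pairs):
    """Sum the counts emitted for identical item sets."""
    support = defaultdict(int)
    for itemset, count in pairs:
        support[itemset] += count
    return dict(support)


def frequent_itemsets(partitions, max_size, min_support):
    """Simulate the workers' map step, then reduce and filter by minimum support."""
    pairs = []
    for partition in partitions:  # each partition plays the role of one worker
        pairs.extend(map_phase(partition, max_size))
    support = reduce_phase(pairs)
    # Delete item sets whose support is smaller than the minimum support.
    return {s: c for s, c in support.items() if c >= min_support}


if __name__ == "__main__":
    # Hypothetical toy data: two workers, five transactions in total.
    partitions = [
        [["a", "b", "c"], ["a", "b"]],
        [["a", "c"], ["b", "c"], ["a", "b", "c"]],
    ]
    for itemset, count in sorted(frequent_itemsets(partitions, 2, 3).items()):
        print(itemset, count)
```

On a real cluster the map and reduce functions would run as separate tasks and the framework would group the emitted pairs by key; here a single dictionary plays that role.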
Keywords/Search Tags: Data mining, association rule, distributed computing, frequent item set