Research On Pattern Mining Based On Sampling In Big Data

Posted on: 2015-12-12  Degree: Master  Type: Thesis
Country: China  Candidate: J Ai  Full Text: PDF
GTID: 2298330422491924  Subject: Computer Science and Technology
Abstract/Summary:
As cloud computing and the mobile Internet reach ever deeper into the lives of ordinary people, big data is becoming increasingly popular. In today's competitive business environment, whoever holds the key to unlocking big data will be able to stay ahead. However, current research on big data algorithms cannot yet meet the need to extract valuable knowledge from massive amounts of information. The study of data mining algorithms for big data is therefore extremely important.

Frequent pattern mining is an extensively studied and highly valuable research subject. Over the past 20 years, a variety of frequent pattern mining algorithms have been proposed. Broadly, they fall into three classes. The first is the candidate-test Apriori algorithm and its extensions. The second is the FP-Growth algorithm and its extensions. The third is vertical mining algorithms. All three classes share common shortcomings: as data volumes increase sharply, these algorithms can no longer meet the needs of large-scale data mining. On the one hand, the data is too large to be stored in memory. On the other hand, the rapid growth in data volume drives up the running time of the algorithms, so they no longer meet practical requirements. The efficiency of mining algorithms still needs to be improved, and research on mining algorithms for big data remains insufficient, so proposing new, effective and efficient pattern mining algorithms is worthwhile. Boley et al. proposed a direct sampling method in the pattern space that greatly improves time complexity, but the effectiveness of the patterns it mines cannot be guaranteed. This thesis improves the direct sampling algorithm by verifying and updating the sampling results.
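The abstract does not spell out the sampling or verification procedure, so the following is only a minimal sketch of one common form of two-step direct sampling, followed by a simple support check. It assumes frequency-proportional sampling: a transaction is drawn with weight 2^|t|, then a uniformly random subset of it is returned, so each itemset is drawn with probability proportional to its support. The function names, the `min_support` parameter, and the verification rule are hypothetical and are not taken from the thesis.

```python
import random

def sample_pattern(transactions, rng):
    """Draw one itemset with probability proportional to its support."""
    # Step 1: pick a transaction with weight 2^|t| (the number of its subsets).
    weights = [2 ** len(t) for t in transactions]
    t = rng.choices(transactions, weights=weights, k=1)[0]
    # Step 2: pick a uniform random subset by keeping each item with probability 1/2.
    return frozenset(item for item in t if rng.random() < 0.5)

def support(pattern, transactions):
    """Count the transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)

def sample_and_verify(transactions, n_samples, min_support, seed=0):
    """Sample patterns, then keep only those whose exact support passes the check.

    The verification rule (a plain minimum-support filter) is an assumption
    made for illustration, not the thesis's actual update step.
    """
    rng = random.Random(seed)
    kept = {}
    for _ in range(n_samples):
        pattern = sample_pattern(transactions, rng)
        if not pattern or pattern in kept:
            continue  # skip the empty itemset and already verified patterns
        s = support(pattern, transactions)
        if s >= min_support:
            kept[pattern] = s
    return kept

if __name__ == "__main__":
    data = [frozenset({1, 2, 3}), frozenset({1, 2}), frozenset({2, 3}), frozenset({4})]
    for pattern, s in sample_and_verify(data, n_samples=200, min_support=2).items():
        print(sorted(pattern), s)
```

The sampling step never scans the pattern space, which is where the speed-up comes from; the exact support count in the verification step is the part that costs extra time.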
In addition, the thesis improves the two-step random procedure. We adjust the length of the mined patterns by controlling a probability threshold, which increases the effectiveness of the mined patterns at a modest cost in time complexity. Experiments show that the enhanced direct sampling method is a good way to improve the effectiveness of mining algorithms.

Meanwhile, we propose a distributed enhanced two-stage random sampling algorithm based on Map-Reduce. The algorithm uses the A-RES/A-ExpJ algorithms to solve the problem of sampling with weights (WAS), and thereby the sampling problem in the Map-Reduce framework. It also uses the lossy counting algorithm to obtain low-frequency itemsets, which facilitates the pattern validation process. As a result, the algorithm migrates well to the Map-Reduce framework.
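To illustrate why A-RES suits a Map-Reduce setting, here is a minimal sketch of weighted reservoir sampling in that style: each item receives the key u^(1/w) for a uniform random u, and the k items with the largest keys form the sample. Because only the keys matter, per-partition reservoirs can be merged by again keeping the k largest keys. The function names and the mapper/reducer split are assumptions for illustration; the abstract does not describe the thesis's actual implementation.

```python
import heapq
import random

def a_res(stream, k, rng):
    """Weighted reservoir sample of size k from (item, weight) pairs, weight > 0.

    Returns a list of (key, index, item) entries; the index only breaks ties
    so that items themselves never need to be comparable.
    """
    heap = []  # min-heap ordered by key; heap[0] holds the smallest kept key
    for idx, (item, weight) in enumerate(stream):
        key = rng.random() ** (1.0 / weight)
        entry = (key, idx, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return heap

def merge_reservoirs(reservoirs, k):
    """Reducer-side merge: keep the k entries with the largest keys overall."""
    all_entries = [entry for reservoir in reservoirs for entry in reservoir]
    return heapq.nlargest(k, all_entries, key=lambda entry: entry[0])

if __name__ == "__main__":
    rng = random.Random(42)
    # Two hypothetical mapper partitions of (item, weight) pairs.
    part1 = a_res([("a", 1.0), ("b", 2.0), ("c", 4.0)], k=2, rng=rng)
    part2 = a_res([("d", 8.0), ("e", 1.0)], k=2, rng=rng)
    print([item for _, _, item in merge_reservoirs([part1, part2], k=2)])
```

Each mapper sends at most k entries to the reducer, so the shuffle cost is independent of the partition size. A-ExpJ produces the same kind of sample but skips over items with exponential jumps to draw fewer random numbers; it is not shown here.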
Keywords/Search Tags: pattern mining, sampling, big data, Map-Reduce