| The huge scale, varied structures and multiple dimensions that characterize big data require data mining algorithms to advance to the next level: effective results, fast computation and easy processing of complex data formats. To address these requirements, this paper combines the traditional Apriori algorithm with the MapReduce framework and proposes a new algorithm, MDR-XHapr, which extracts frequent items from data stored in a NoSQL database. Firstly, this paper introduces the characteristics of MapReduce. MapReduce is a computing framework that can be applied in a Distributed System (DS). By making full use of the computing capacity of the machines, computation in a DS can be made quite efficient with this framework. The traditional data mining algorithm Apriori is analyzed and summarized to identify its problems, namely candidate item screening and database scanning. We also analyze some existing improvements to Apriori. To address the performance bottleneck of frequent-item mining algorithms, four problems are identified: candidate item screening, database scanning, effectiveness of results and design of the algorithm for parallel execution. Secondly, this paper proposes an algorithm named MDR-XHapr, based on Apriori and the MapReduce framework. The algorithm stores data in key-value format, which easily accommodates data of various structures. Invalid data can be screened out and discarded when the data is imported into the database, which saves storage in the distributed system and addresses the problems of redundancy and complex data formats. Under the MapReduce framework, the traditional iterative algorithm is restructured into three procedures: Map (obtaining interest items), Reduce (counting) and Finalize (threshold screening). By operating in parallel on the DS database, MDR-XHapr scans the database only once, avoiding the cost of repeatedly scanning large-scale data. To analyze the final results of frequent item mining, we propose a concept: interest items. 
By using interest item screening, we ensure the effectiveness of the results and reduce the number of candidate items in the algorithm. Finally, MDR-XHapr is tested in the NoSQL database MongoDB. Three datasets are used to test the algorithm: Illness and Adults, both from the UCI datasets, and ocean datasets from 13 stations along the coast of China. Illness and Adults are used to test MDR-XHapr on a single computer and in a distributed system. The experimental results show that MDR-XHapr can reduce the number of candidate items, scan the DS database only once to obtain the interest items, and improve mining performance. The ocean dataset is used to obtain the interest items that are relevant to the coastal industry, and the results are compared with real data observed at weather stations. The results show that MDR-XHapr can be applied to real datasets and has promising application prospects. |
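
The three procedures named in the abstract (Map, Reduce, Finalize) can be illustrated with a minimal single-pass sketch in plain Python. This is not the MDR-XHapr implementation and it does not use MongoDB. It rests on several assumptions that the abstract does not state: interest items are taken to be a user-supplied set of `attribute=value` strings, a candidate itemset is kept only if it contains at least one interest item, and the threshold is a minimum support ratio. The names `map_record`, `reduce_counts`, `finalize`, `mine`, `max_size` and `min_support` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_record(record, interest_items, max_size=2):
    """Map: emit (itemset, 1) for each itemset of one record that
    contains at least one interest item (assumed screening rule)."""
    # Key-value record -> sorted "attribute=value" items; missing values are dropped.
    items = sorted(f"{k}={v}" for k, v in record.items() if v is not None)
    for size in range(1, max_size + 1):
        for itemset in combinations(items, size):
            if interest_items.intersection(itemset):
                yield itemset, 1


def reduce_counts(pairs):
    """Reduce: sum the emitted counts per itemset."""
    counts = Counter()
    for itemset, n in pairs:
        counts[itemset] += n
    return counts


def finalize(counts, total_records, min_support):
    """Finalize: keep itemsets whose support ratio reaches the threshold."""
    return {
        itemset: count / total_records
        for itemset, count in counts.items()
        if count / total_records >= min_support
    }


def mine(records, interest_items, min_support, max_size=2):
    """Run Map, Reduce and Finalize with a single pass over the records."""
    interest_items = set(interest_items)
    counts = Counter()
    total = 0
    for record in records:  # the only scan of the data
        total += 1
        counts.update(reduce_counts(map_record(record, interest_items, max_size)))
    return finalize(counts, total, min_support)
```

In this sketch the candidate set shrinks because itemsets without an interest item are never emitted, and the data is read once because counting happens during the same pass as candidate generation. In a distributed setting the per-record Map calls would run in parallel and the partial counts would be merged in Reduce.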