Font Size: a A A

The Research And Implementation Of Minhash Algorithm Based On Mahout

Posted on:2016-09-09Degree:MasterType:Thesis
Country:ChinaCandidate:X S WuFull Text:PDF
GTID:2298330452966423Subject:Software engineering
Abstract/Summary:PDF Full Text Request
With the rise of big data research field in recent years, the mass and high dimension datausually need to be processed. The Nearest Neighbor Search problem has been widely applied andresearched in many applications fields. Compared to the exact Nearest Neighbor Search, theApproximate Nearest Neighbor Search has great improvement in efficiency in practicalapplications, so it can effectively solve the approximate nearest neighbor search problem. In orderto achieve the balance between search efficiency and search results, the Approximate NearestNeighbor Search is for the lookup efficiency at the expense of finding the improvement ofaccuracy.Locality Sensitive Hashing algorithm (LSH) is mainly to solve the Approximate NearestNeighbor Search problem, and has achieved significant results in the practical application, it is agood method to solve the Curse of Dimensionality. The LSH method can assure the accuracyof the query with probability guarantee, so can be fast in solving the Approximate NearestNeighbor Search. MinHash is originally proposed as an efficient Nearest Neighbor Searchalgorithm which groups similar web pages into the same cluster. MinHash algorithm use theJaccard similarity to measure the similarity degree of the objects.Based on the extensive reading of references at home and abroad, we focus on theimplementations and performance of the MinHash algorithm in the distributed platform, and thengive sufficient analysis and research. Particularly, we first observe a significant deviation betweenthe original MinHash algorithm and the standard MinHash algorithm, and the original MinHashalgorithm is wrong and not complete in the Mahout distributed platform. After careful study theprinciple of MinHash algorithm and the Mahout distributed platform, we rewrite the MinHashalgorithm. To verify the soundness of the revised version, we conduct extensive experiments withseveral real datasets; experimental results confirm the validity and accuracy of ourimplementation. In order to evaluate the accuracy of the approximate results of MinHash algorithm in Mahout, we also devise the Jaccard similarity algorithm to obtain theaccurate Jaccard similarity of any two files. Finally, we make a comparison between the accurateJaccard similarity and the approximate results of MinHash algorithm.
Keywords/Search Tags:MinHash, MapReduce, Mahout, Algorithm Implementation
PDF Full Text Request
Related items