Font Size: a A A

Research On Key Technologies Of High Performance Similarity Search Algorithms And Optimization

Posted on:2018-01-04Degree:DoctorType:Dissertation
Country:ChinaCandidate:H FengFull Text:PDF
GTID:1368330596452849Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Similarity search is a basic subject in computer science.It is widely used in many research fields,including information retrieval,multimedia processing,machine learning and so on.Nearest neighbor search is the main subclasses of similarity search.As in most of the application scenarios,approximate nearest neighbor results can be good enough to meet the applications' needs.In recent years,the academic community has proposed a series of similarity search approaches to approximately solve the nearest neighbor search problem.Currently in the big data era,the applications has to deal with large scaled and high dimensional datasets with fast speed.The high performance similarity search has become one of the most popular topics in both academic and industrial communities.Therefore,this thesis also focuses on the research of high performance similarity search algorithms and optimizations.The main achievements of the thesis include:1.The optimal subspace construction method for approximate nearest neighbors determination.Subspace clustering based kNN algorithms are a category of similarity search approaches proposed in the recent years.This category achieves scalable per-formance but the search precision is very unstable.Aiming at generating stable high precision search results,four different subspace construction schemes are proposed,and the relationships among the subspace construction scheme,search precision and search speed are found through experimental analysis.Based on the findings,the optimal sub-space construction method is proposed.This method solves unstable precision problem,and improves the precision by 26.7%under the premise of keeping the same search speed.2.A new scalable and high precision parallel similarity search algorithm named PCAF is proposed.PCAF uses the estimated ranking to predict the difference in similarity between each data point.At the same time,a double-heap based data filtering mechanism is designed.A fine-grained data parallel scheme is introduced by decoupling the dependence within each search task.The results show that PCAF has the best scalability and fastest speed,and it can achieve a high precision(over 98%)in the shortest time.Compared with four most popular parallel similarity search algorithms,PCAF gets 1.3 to 18.9x speedup,and it is the only one that can perform exact search in a variety of different datasets.3.An execution optimization framework is proposed to solve the parameter ad-justing issue for similarity search algorithms in practical applications.This framework includes definitions on system architecture,module components and logical structure for implementation systems to be build onto it.By setting the required precision,speed or scalability optimization target,the system can automatically adjust the algorithm's pa-rameters by using binary search principle.As the result,the optimized algorithm can satisfy the precision and performance requirements of the practical application.The ex-perimental results show that the algorithm-specific systems built on this framework for RKD,RBC and PCAF algorithms can improve the search performance by 5.87 to 70.46 times when reaching the desired precision over 95%.Among them,the optimized PCAF can complete the search in less than 3 seconds with high precision of 95.15%for a real large-scale dataset named "GIST1M",which contains 1 million 960-dimensional data points.
Keywords/Search Tags:Similarity Search, k Nearest Neighbors Search, kNN, Subspace Clustering, PCAF, Execution Optimization
PDF Full Text Request
Related items