Study On Match Similarity Search

Posted on: 2016-10-01  Degree: Master  Type: Thesis
Country: China  Candidate: J X Ren  Full Text: PDF
GTID: 2308330476452142  Subject: Computer technology
Abstract/Summary:
Similarity search has attracted considerable attention in fields such as data mining, multimedia information retrieval, social network analysis and biology. Most existing methods measure similarity with a distance function that satisfies the metric properties and search over the full-dimensional space. However, research indicates that the attribute values of a data object may contain outliers, that is, values in certain dimensions that differ greatly from those of other objects because of reading errors or noise introduced during reception, transmission and conversion, and these outliers lower the accuracy of the search result. Subsequent studies proposed similarity search over partial attribute dimensions, or subspaces, to reduce this effect. In those approaches, however, the similarity between every object and the query object is computed on the same set of dimensions selected in advance, so the solutions remain inflexible and of limited effectiveness.

Currently, with the growing popularity of cloud computing, programs designed for a centralized environment urgently need to be adapted to big data applications. A single server with a traditional architecture can no longer meet the challenge of handling huge amounts of data, so a stable and mature distributed computing framework such as MapReduce can be used to address the scalability limits of a single-machine algorithm and to ensure the performance of the distributed algorithm.

Furthermore, indexing has long been an important research direction in the content database, and an appropriate index can fundamentally improve the performance of the original algorithm. As the research advances, the centralized scheme in this thesis can be further optimized with an index so that it adapts to different kinds of search queries, for example similarity join queries under Pan data.
In summary, the main contributions of this thesis are as follows.
1. To address the deficiencies of existing subspace similarity search research, this thesis presents a subspace, or matching, similarity search scheme. Because similarity ranking in this setting is based on the matching differences between the attribute values of the data objects and those of the query object over varying subsets of dimensions, this thesis calls it match similarity search. In addition, because a single-machine algorithm cannot meet the demands of large-scale data, a distributed version of the basic scheme is also proposed.
2. Building on the problem and the basic solution of the first contribution, we propose a new proximity graph index structure and a method for constructing it. The index is intended, on the one hand, to improve the search performance of the centralized scheme and, on the other hand, to broaden the range of query types that the basic scheme supports.
Finally, experimental analysis over multiple metrics shows that, for both pieces of work, the proposed solutions are effective and relatively efficient.
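The idea of ranking by matching differences over a subset of dimensions chosen per object can be illustrated with a minimal sketch. The abstract does not give the thesis's exact definition, so the code below assumes one common formalization: the difference between an object and the query is the n-th smallest per-dimension absolute difference, which means each object is compared on its own best-matching n dimensions. The function names `match_difference` and `match_search` and the parameters `n` and `k` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def match_difference(obj: Sequence[float], query: Sequence[float], n: int) -> float:
    """Return the n-th smallest per-dimension absolute difference.

    Assumption (not taken from the thesis): the n best-matching
    dimensions are selected separately for every object.
    """
    if len(obj) != len(query):
        raise ValueError("object and query must have the same dimensionality")
    if not 1 <= n <= len(query):
        raise ValueError("n must be between 1 and the number of dimensions")
    diffs = sorted(abs(a - b) for a, b in zip(obj, query))
    return diffs[n - 1]


def match_search(
    data: Sequence[Sequence[float]],
    query: Sequence[float],
    n: int,
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, difference) for the k objects with the smallest match difference."""
    scored = [(i, match_difference(obj, query, n)) for i, obj in enumerate(data)]
    # Sort by difference, breaking ties by original position.
    scored.sort(key=lambda pair: (pair[1], pair[0]))
    return scored[:k]


data = [
    [1.0, 2.0, 3.0, 100.0],  # matches the query except in one outlier dimension
    [4.0, 5.0, 6.0, 7.0],    # moderately different in every dimension
    [1.5, 2.5, 9.0, 4.5],    # close in three dimensions, far in one
]
query = [1.0, 2.0, 3.0, 4.0]

result = match_search(data, query, n=3, k=2)
print(result)
```

In this toy data set the first object ranks first even though one of its values is an outlier, whereas a Euclidean distance over all four dimensions would rank it last. A brute-force scan like this one is only meant to show the ranking criterion; it says nothing about the distributed or index-based schemes described above.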
Keywords/Search Tags: similarity search, distance metric, adaptive subspace, MapReduce, proximity graph