Font Size: a A A

Studies On Distributed Similarity Queries On Different Types Of Data

Posted on:2016-12-16Degree:DoctorType:Dissertation
Country:ChinaCandidate:M D ZhuFull Text:PDF
GTID:1318330542487066Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
In recent years,with the development and popularization of communication technology and smart mobile equipment,in kinds of applications and web sites more and more people become "producers" and "disseminators" of information rather than just information"consumers",such as microblog,blog,micro video,review website,etc.While data are dramatically growing in the Internet,the types of data are showing a trend of diversification.Based on the diversification users are able to enjoy more convenient services,such as location-based services,voice queries,image queries.Faced with massive data and complex data types,how to effectively manage the database has become a hot research field.This dissertation deeply studies issues related to distributed similarity queries on a variety of data types,and proposes distributed frameworks for queries on various types of data,and further put forward the corresponding NN,kNN,RkNN similarity query algorithms based on the frameworks.The contribution of this dissertation can be summarized as follows:(1)For data types supporting tree-like indices,first of all,through the analysis of the common characteristics of tree-like indices,including B-tree,M-tree,R-tree,a distributed framework which integrates tree indices and Chord topology is proposed,the framework takes into account the contradiction of the query and update operations in a distributed environment,and the framework can dynamically adjust the number of replicas of each index node according to the number of the query and update mode,in order to improve query throughput as much as possible the update cost under the condition of keeping update cost low.Secondly,based on the framework,a range query algorithm and kNN algorithm are proposed,and then theoretical analyses of related computation complexity are presented.Finally,in order to further improve the performance of query and update efficiency,a method of dynamic index separation is integrated into the distributed framework.(2)For the spatio-textual data type,first,by the analysis of features of similarity computation on spatio-textual data,a hybrid indexing method named hybrid-LSH is proposed,hybrid-LSH simultaneously considers objects' space distance and text similarity,and places similar data objects to the same hash bucket to reduce I/O cost with high probability,and then theoretical analyses of the accuracy and effectiveness of the hybrid-LSH are presented.Secondly,adaptive range query algorithms which can handle queries with varying query ranges is provided,based on the adaptive NN query algorithms,a kNN algorithm with theoretical accuracy guarantee is proposed.Finally,combined with cloud computing technology,proposed algorithms are deployed in a distributed environment,because hybrid-LSH avoids the pair-wise comparison as the traditional methods,similarity computation is only needed in each barrel,which results in significant savings in terms of computation complexity and network cost.(3)For the data type with relation information,first,by analyzing the features of the data types with relation information,a decision tree based framework for distributed data management and query processing is proposed.Secondly,based on the proposed framework a similarity query algorithm is presented,the algorithm can obtain query results before comparing all properties of objects,thereby saving computational cost.Finally,because in a distributed environment,the computation complexity of constructing a decision tree is relatively high,as well as the communication cost.By analyzing the characteristics of construction of decision tree,a distributed method for efficiently building decision tree is proposed,the method doesn't require to sort global data objects,and in the method by transmitting only part of data objects,split point can be calculated with a theoretical accuracy guarantee,what's more,a theoretical analysis of the method's computation complexity is presented.(4)For the uncertain textual data type,first,by analyzing the characteristics of uncertain cosine similarity computation,a fast similarity computation method is proposed.Second,because the cosine distance function is not a metric distance function,and it's difficult to build indices on data,and traditional methods are mainly focus on centralized environments.In this dissertation based on conversion of cosine distance and an improvement of MVP-tree index,a distributed framework for processing cosine similarity queries on uncertain data is proposed.Finally,based on the framework an efficient kNN query algorithm and an RkNN query algorithm are presented.
Keywords/Search Tags:distributed methods, similarity queries, complex data types, uncertain data, index structure
PDF Full Text Request
Related items