Font Size: a A A

Research On Batch Processing Of Approximate String Query On Large-scale String Datasets

Posted on:2019-08-21Degree:MasterType:Thesis
Country:ChinaCandidate:Y W ZhouFull Text:PDF
GTID:2428330566496861Subject:Computer technology
Abstract/Summary:PDF Full Text Request
String query is widely applied in data analysis,search engine,biological sequence analysis and many other fields.However,exact string query is not feasible in many situations.For instance,the problems of data quality caused by information factors,technical factors,process factors,and management factors have brought many difficulties for string query.Therefore,it requires to research on approximate string query.Approximate string query refers to finding a string set that are similar to the given string measured by a string similarity function.In recent years,researchers have proposed some methods based on q-gram model and inverted index to answer similarity string query.Current results are only for individual-string query and are lack of batch processing method.Batch processing can improve efficiency of query and resource utilization efficiency.Motivated by the fact,we will focus on the batch processing for approximate string query.This paper is based on the inverted index and uses the Trie-tree to manage the grams in memory.We obtain the candidates set by loading and merging the inverted lists corresponding to the gram,and then uses dynamic programming algorithm to verify the edit distance between candidates and query strings.In order to enhance the filtering effect,the length information and position information can be combined to the inverted index.For batch processing of approximate string query,we establish the gram selection cost model based on the strategy of sharing partial inverted lists among multiple queries and design a string approximate query batch processing algorithm to reduce the query cost.At last,the effectiveness and efficiency of the algorithm is verified from different aspects,including average query time,average wait time and so on.Existing works assume candidates can be fully loaded into memory and they don't take into account the memory constraints.However,in reality,memory-constrained situations need to be considered in many tasks.It benefits to reduce memory consumption in string query tasks.So we will focus on the batch processing under memory-constrained.In order to increase the probability that the same batch query strings share the common grams,we cluster the query strings.We establish a cost model of string query scheduling,and solve it by a dynamic programming algorithm to realize the scheduling of the queries.Under memory constraint,we present the batch processing framework of approximate string query and establish a series of experiments to verify the effectiveness and the efficiency of the algorithm.
Keywords/Search Tags:inverted index, approximate string query, batch processing, memory constrained
PDF Full Text Request
Related items