Font Size: a A A

Study On Similarity Search For Textual And Spatial Data

Posted on:2019-09-21Degree:DoctorType:Dissertation
Country:ChinaCandidate:M H YuFull Text:PDF
GTID:1368330590451417Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
In the information age,applications such as search engine,location-aware ser-vice(LBS)and office automation system become more helpful in daily life.In these systems,the most important part is data processing,especially data searching.As there are many problems including typo errors and multiple attributes data existing in the real datasets,similarity search becomes an efficient strategy to process data.To address this issue,this thesis studies on similarity search for textual and spatial data and provides efficient algorithms,including:1.A Disk-based Textual Similarity Search Algorithm for Large Data:To address the textual similarity search problem on large scale of dataset,this thesis provides a binary tree index based on the length of texts.With the help of this index structure,this thesis designs a filter-verification framework,and based on this framework this thesis presents algorithms for threshold-based and top-k textual similarity searches.In addition,this thesis extends these algorithms and provide a disk-based algorithm to support large dataset.In the experiments,this thesis evaluated the efficient of our index and framework and compared with existing methods.The results showed our algorithms achieved highest performance than state-of-art algorithms.2.A Rt-Tree-based Spatio-Textual Similarity Search Algorithm:This thesis presents an Rt-tree-based algorithm for location-aware publish/subscribe problem which constructs index by integrating tokens from subscriptions into Rt-tree nodes.To improve the power of textual pruning and reduce the size of index,this thesis selects represent token from each subscription and stores them into Rt-tree nodes.In addition,this thesis extends our algorithm for ranking semantics.The experi-mental results showed our algorithm can efficient deliver messages to corresponding subscriptions by pruning a large number of unnecessary subscriptions.3.An Adaptive Textual Spatial Search Algorithm:To efficient support multiple types messages delivery in a location-aware publish/subscribe system,this thesis provides a cost-based algorithm which can adaptive select better strategies for different type of data.To be specific,this thesis first designs a trie-based index with spatial information and a quadtree index which integrates token on each node.And this thesis also provides corresponding algorithms based on these two indices for location-aware publish/subscribe problem.To process multiple types of data,this thesis presents a cost-based algorithm which always adaptive select lower cost algorithms between trie-based and quadtree-based algorithms to filter and verify data.In the experiments,this thesis evaluated the performance of our algorithms and compared them with existing algorithms.The results showed the cost-based algorithm can achieve highest performance for multiple types data as it always selected more adaptive methods in filtering and verification.
Keywords/Search Tags:Similarity Search, Disk-based Algorithm, R~t-Tree, Complexity
PDF Full Text Request
Related items