Font Size: a A A

Study On Keywords-Based Approximate Search Techniques On Relational Databases

Posted on:2009-08-31Degree:DoctorType:Dissertation
Country:ChinaCandidate:B WangFull Text:PDF
GTID:1118360308978813Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
In recent years, with the booming development of Web data, the problem of how to provide effective and efficient search techniques for users becomes a hot issue. In distributed environment, collected data from different data sources are typically stored in relational database systems. Such data might have heterogeneous schema. In current phase, search engine is the primary way to support user queries.According to search interest, users submit queries to search engine. The search engine needs provide efficient way to answer those queries. In many cases, users'queries might not exactly match data in the database. For example, users have limited knowledge about data, data might contain errors, data from different sources might contain inconsistency, and etc. Therefore, how to efficiently support keyword-based approximate search on relational database brings several challenges, such as approximate query processing on relational database, approximate string matching, ranking search results, and etc. This dissertation faces on the above challenges and makes the following contributions.(1) In order to improve the search performance of string data in relational tuples, a novel index structure, called VGRAM (Variable-length GRAM), is proposed. Compared with traditional index structure for approximate string query, VGRAM index not only reduces query time, but also decreases index space by more than one time. Meanwhile, it provides good portability. Any algorithms using gram-based index can adopt VGRAM to improve their query performances. Its main idea is to choose high-quality grams of variable lengths from a collection of strings to support the queries on the collection. The main corresponding research problems inlucde how to select high-quality grams from the collection, how to generate variable-length grams for a string based on the pre-selected grams, and what is the relationship between the similarity of the gram sets of two strings and their edit distance. Extensive experiments on real data sets to evaluate show the significant performance improvements on three existing algorithms.(2) A cost-based gram selectivity technique is proposed. When adopting gram-based inverted lists as index structure, the index item, i.e. gram, determines the index structure and further affects query preormance. According to the relationship between the gram index item set and the performance of queries, a dynamic programming algorithm is proposed for computing a tight lower bound on the number of common grams shared by two similar strings. Then this dissertation analyzes the effects of a set of gram index items on the index structure of the string collection as well as ultimately the performance of queries deeply, designs an algorithm for automatically computing a set of index items of high-quality grams for a workload of queries. Experiments on real data sets show the improvement on query performance achieved by these techniques. To our best knowledge, this study is the first cost-based quantitative approach to deciding good grams for approximate string queries.(3) Limited form query interface requires query expansion to improve the queryability, therefore a query rewriting technique is proposed based on form query interface. Many approximate search engines provide form interface, and users often post queries through form-based interfaces to retrieve data. However, answers to these queries are mostly computed according to the keywords entered into different fields specified in a query interface, and their recall could thus be low. The recall ratio in answering this type of query can be improved by considering closely related previous queries submitted through the same interface, along with the answers. This dissertation presents an approach to enhancing the retrieval of relevant answers, especially recall ratio, to a form based query by adopting the data-mining approach using previous relevant queries and its answers. Experimental results on a randomly selected set of 3,800 documents from various Web sites show that the proposed data-mining and query rewriting approach achieves the average precision and true positive ratios on rewritten queries in the upper 80% range, whereas the average false positive ratio is less than 2.0%.(4) According to the dependencies in between relational table and tuples, a semantice evaluation model is proposed to support keywork-based search on relational database, including computing semantic correlation and semantic score function. Based on the proposed semantic score function, two Top-k search algorithms, BA (Blocking Algorithm) and EBA (Early-stopping Blocking Algorithm), are presented. EBA improves BA by providing a filtering threshold to terminate iterations as early as possible. Experimental results show the semantic ranking function guarantees a search result with high precision and recall and the proposed BA and EBA algorithms have better query performance than the existing approaches.In summary, this dissertation dedicates to study key problems related to keyword-based approximate search on relational databases and proposes several novel solutions. Theoretical analyses and experimental test over real datasets demonstrate the effectives and efficiency of the proposed approaches.
Keywords/Search Tags:keywords-based search, approximate string query, query expansion, gram-based inverted list, index, semantic ranking function
PDF Full Text Request
Related items