Font Size: a A A

Research On Query Log Based Keyword Search Over The Database

Posted on:2013-10-22Degree:MasterType:Thesis
Country:ChinaCandidate:L GaoFull Text:PDF
GTID:2248330374983119Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
The database system is usually treated as the warehouse which is used to organize, store and mange the data. It has been widely used in many fields such as enterprise, departments and even in the personal lives. With the development of the modern Web, the information of the modern world significantly explodes and the data stored in the database system also greatly increases and users search the information more frequently. In order to access the traditional databases, users have to master the structured query language and be familiar with the underlying data schemas, which is very complicated for average users. Inspired by the great success of information retrieval style keyword search on the Web, the keyword search on the databases has been extensively studied by researchers from the database and information retrieval field and emerged as the hot area.Compared to the traditional database access, the keyword search is simple and easy to use and has no fixed format which relives the users from the steep learning curve. Despite its advantage the keyword search brings the great challenges to the developers on how to develop the efficient and robust system. Generally the result returned by the traditional database is a set of tuples. Compared with this, keyword search usually assembles different pieces of data matched with keywords locating in different tables to form the final result, which causes the explosion of the search space. In general the search space is exponential in the number of keywords in the query. Moreover the queries posed by users are often dirty with irrelevant or incorrect words which will affect the accuracy and efficiency of query processing.Lots of candidate network are generated in the query processing of the schema based approach. The relationship represented by some of these candidate networks are less meaningful or rarely accessed by users while the relationship represented by some of these candidate networks are preferred by users. Traditional schema based approach usually evaluates all candidate networks following the order of the size of the candidate network rather than the preference to the candidate network by users, which also affect the quality and efficiency of query processing.In this paper, motivated by the mentioned problem, we propose two query log based approaches to enhance the original query cleaning approach in order to further improve the quality of the query cleaning. We also make use of the tree mining algorithm to mine the query log in order to obtain the user preference which is applied to improve the schema based approach. Our contribution is as follows:1. Based on the original scoring function in the query cleaning algorithm, we propose two query based approaches to enhance the original approach. The original scoring function scores the generated terms based on the database without considering the behavior of terms in the query log. We score the generated terms based on the query log in two different ways to obtain the term score in the query log. Then we combine the original score together with the term score in the query log to form the final score. Our experiments shows that the two enhanced approaches proposed indeed improve the quality of query cleaning.2. We further improve the traditional schema based approach using the query log. Generally, the schema based approach processes queries in two consecutive steps:candidate network generation and candidate network evaluation. The query log used here records the candidate networks selected by users. We introduce the existing tree mining algorithm to mine the query log to obtain the frequent patterns preferred by users. Then we introduce the tree edit distance to define the similarity between the generated candidate networks and the mined frequent patterns. Based on the similarity the candidate networks are sorted and the candidate networks with higher rank are prioritized to evaluate.
Keywords/Search Tags:Search, Schema Graph, Query Log, Query CleaningEffectiveness, Tree Mining, Tree Edit Distance
PDF Full Text Request
Related items