
Research on Query Processing and Result Caching in Search Engines

Posted on: 2017-03-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L B Qian
Full Text: PDF
GTID: 1318330536981060
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of Internet technology, search engines have become an essential tool for obtaining fast and accurate information from massive network resources. Faced with ever-growing web data and diverse user requests, a search engine must answer thousands of queries over billions of web pages within seconds. Search engines therefore face enormous performance challenges, and improving the scalability and performance of distributed search engines has become a focus of current research. Aiming at key technical problems in search engine architecture, query algorithms, and query result caching, this thesis studies how to process queries efficiently and improve the performance of distributed search engines. Building on previous work, documents and indexes are classified by theme, query processing algorithms are optimized, query result caching strategies are improved, and a pre-judgment strategy is proposed. Together, these contributions improve the query performance and system scalability of search engines. The main work is summarized as follows:

1. A theme classification method based on web page structure is proposed for the classical architecture and index partitioning of search engines, and an extensible parallel-task model is built on it. First, the importance of theme information in different parts of a page is distinguished by web page structure. The original documents and indexes are organized by theme and approximate URL classification, and each category of index and document data is then managed according to this classification. Second, a multi-thread task pool is designed to build a parallel retrieval model over the index and documents of each category. Finally, experiments show that the improved model benefits from targeting queries at the relevant categories: it reduces the average query time, increases system throughput, greatly narrows the retrieval scope of each query, and achieves good scalability.

2. An improved inverted index structure with a bitmap component is proposed for the postings-list intersections performed during query processing, together with a parallel query optimization based on the Max-Score strategy. First, on top of the skip-list structure, a bitmap is employed to record the document identifiers of inverted items, which effectively reduces the time complexity of merging inverted lists. Second, based on the improved index structure and the Max-Score strategy, a parallel query algorithm managed by a multi-thread pool is designed, combining a term heap with a result heap. Finally, to avoid process-switching costs, a method that manages threads dynamically is designed to implement the parallel postings-list query algorithm. Experiments show that the bitmap structure and the query algorithms improve average query speed and throughput, and that the dynamic multi-thread method increases the performance of the entire index system.

3. To further enhance search engine performance, several improvements to static and dynamic result caching strategies are presented, and a self-adaptive capacity allocation algorithm is proposed for hybrid result caching. First, by analyzing the query process and the locality of user query logs, a query repeat-distance factor is introduced to improve static result caching. Second, the caching worth of a query is described in terms of its popularity and freshness, and a freshness attenuation mechanism is presented to further improve the dynamic policies. Finally, based on the internal queue-list structure of the cache, an adaptive capacity allocation algorithm adjusts the cache capacity to further improve efficiency in a hybrid cache combining static and dynamic policies. Comparisons of the static, dynamic, and hybrid result caching strategies show that the improved strategies and related algorithms increase overall system performance and significantly reduce the average processing time.

4. Based on the improved static and dynamic cache strategies, a new hybrid result cache is implemented in terms of a result-page cache and a document-identifier cache. As search engines continuously update their indexes, the query results in long-lived cache entries may become stale, so a pre-judgment approach is presented to improve the freshness of the result cache. A Queue-Hash cache structure with a double-pointer skip list is designed to search the result cache quickly, and an incomplete allocation approach is developed to avoid repeated swapping out of entries. Simulations using real data show that the new hybrid cache outperforms both a page-only cache and a docId-only cache, that the pre-judgment approach maintains freshness and increases the accuracy of the query result cache, and that choosing an appropriate incomplete allocation strategy further improves cache performance.

In summary, to address query performance problems in search engines, the model architecture, search algorithms, and query result caching are optimized. Further research will focus on classification or clustering methods for the query model and will consider index pruning strategies, in order to establish more widely applicable distributed caching strategies and predictive models.
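The category-partitioned parallel retrieval of contribution 1 could be sketched with a thread pool that searches each theme's index concurrently and collects per-category results. This is only an illustration: the theme names, the tiny per-theme indexes, and the function names below are all invented, and the thesis's actual task-pool design is not detailed in the abstract.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-theme inverted indexes: term -> sorted doc IDs.
theme_indexes = {
    "news":   {"cache": [1, 4],    "index": [1, 2]},
    "sports": {"cache": [7],       "index": [7, 9]},
    "tech":   {"cache": [11, 12],  "index": [12]},
}

def search_category(index, terms):
    """Intersect the postings of all query terms within one category."""
    postings = [set(index.get(t, [])) for t in terms]
    return set.intersection(*postings) if postings else set()

def parallel_search(terms, categories=None):
    """Submit one search task per (possibly pre-selected) category
    to a thread pool, then gather the per-category hits."""
    targets = categories or list(theme_indexes)
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        futures = {c: pool.submit(search_category, theme_indexes[c], terms)
                   for c in targets}
        return {c: sorted(f.result()) for c, f in futures.items()}

hits = parallel_search(["cache", "index"])
```

Restricting `categories` to the themes a query is classified into is what narrows the retrieval scope, as the abstract describes.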
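The bitmap-assisted postings-list intersection of contribution 2 can be illustrated with Python's arbitrary-precision integers as bitmaps, one bit per document ID: intersecting lists becomes a bitwise AND instead of a list merge. This is a minimal sketch; the thesis combines the bitmap with skip lists and the Max-Score strategy, which are not reproduced here.

```python
def postings_to_bitmap(postings):
    """Encode a sorted postings list of doc IDs as an integer bitmap."""
    bm = 0
    for doc_id in postings:
        bm |= 1 << doc_id
    return bm

def intersect(bitmaps):
    """Intersect postings lists via bitwise AND, then decode doc IDs."""
    result = bitmaps[0]
    for bm in bitmaps[1:]:
        result &= bm
    docs = []
    doc_id = 0
    while result:
        if result & 1:
            docs.append(doc_id)
        result >>= 1
        doc_id += 1
    return docs

# Example: postings lists of three query terms (invented data).
terms = {
    "search": [1, 3, 5, 8, 13],
    "engine": [2, 3, 5, 13, 21],
    "cache":  [3, 4, 5, 13, 34],
}
common = intersect([postings_to_bitmap(p) for p in terms.values()])
# common -> [3, 5, 13]
```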
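The popularity-and-freshness "caching worth" with attenuation from contribution 3 might be scored as below. The abstract does not specify the attenuation function, so an exponential decay with an assumed half-life parameter is used purely for illustration.

```python
import math

def caching_worth(popularity, age_seconds, half_life=3600.0):
    """Worth of keeping a cached result: popularity damped by freshness.

    Freshness decays exponentially with the entry's age; half_life
    (seconds until freshness halves) is an assumed tuning parameter.
    """
    freshness = math.exp(-math.log(2) * age_seconds / half_life)
    return popularity * freshness

# A very popular but stale entry can be worth less than a fresher,
# less popular one, so the stale entry is evicted first.
stale = caching_worth(popularity=100, age_seconds=4 * 3600)
fresh = caching_worth(popularity=20, age_seconds=600)
```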
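The pre-judgment idea of contribution 4 — deciding whether a long-lived cache entry is likely stale before serving it — can be approximated with a time-to-live check, in line with the "time-to-live strategy" keyword. The class, field names, and TTL values here are assumptions, not the thesis's actual structures.

```python
import time

class ResultCacheEntry:
    def __init__(self, results, ttl=300.0, now=None):
        self.results = results
        self.cached_at = now if now is not None else time.time()
        self.ttl = ttl  # assumed per-entry time-to-live in seconds

    def pre_judge_fresh(self, now=None):
        """Pre-judgment: serve from cache only while the TTL holds;
        otherwise the query should be re-evaluated against the index."""
        now = now if now is not None else time.time()
        return (now - self.cached_at) < self.ttl

entry = ResultCacheEntry(results=[3, 5, 13], ttl=300.0, now=0.0)
served_fresh = entry.pre_judge_fresh(now=120.0)  # within TTL
served_stale = entry.pre_judge_fresh(now=600.0)  # TTL expired
```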
Keywords/Search Tags: search engines, query result cache, inverted index, parallel search, query process, time-to-live strategy