
Research On Intersection Cache Based On Frequent Itemset Mining In Search Engines

Posted on: 2016-04-10
Degree: Master
Type: Thesis
Country: China
Candidate: W W Zhou
Full Text: PDF
GTID: 2348330479953428
Subject: Computer application technology

Abstract/Summary:
Modern search engines must store vast amounts of data while serving highly concurrent retrieval requests. Because hard disk drives (HDDs) are inexpensive and offer large capacity, many search engines use HDDs as their main storage medium; however, HDD read and write performance is far lower than that of memory, so the main bottleneck of a search engine is the low I/O speed of the disk. Many caching techniques have been applied to mitigate this problem, but the existing ones have underlying weaknesses: the query result cache and the posting list cache cannot cope well with longer queries, while the intersection cache's data selection strategy is inefficient and adapts poorly to different applications. A new cache data strategy is therefore needed, one that strikes a better balance among retrieval performance, the efficiency of the cache data selection strategy, and flexibility across applications.

To address the problems of existing search engine cache architectures, a three-level in-memory cache architecture, TLMCA, is proposed. It keeps the most frequently accessed query results, posting lists, and intersection data in memory so that retrieval results can be returned to the user as quickly as possible. Compared with the traditional two-level memory cache architecture, TLMCA improves retrieval performance by 27%, and the added intersection cache has little impact on the hit ratios of the query result cache and the posting list cache.

To improve the efficiency of intersection cache data selection and make the intersection data more adaptable to different applications, a new selection strategy based on Top-N frequent itemset mining with the FP-Growth algorithm is proposed, together with a corresponding cache lookup procedure that uses a greedy strategy to reduce system overhead when the intersection cache is hit. The intersection data also exhibit a notable property: retrieval performance is best when the maximum length of an intersection item is 3.

To keep the intersection data valid under a continuous retrieval stream, an intersection cache replacement strategy based on incremental frequent itemset mining is proposed. A Trie (prefix tree) structure reuses the previously built mining model to reduce the cost of incremental frequent itemset mining. In addition, based on a feedback mechanism driven by the cache hit ratio, a criterion is defined for when to replace intersection cache data in the dynamic data stream, aiming for a good balance between the overhead of the offline analysis system and the performance of the online retrieval system.
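The thesis bases its selection strategy on Top-N frequent itemset mining with FP-Growth; the sketch below only illustrates the underlying idea with a simple counting-based miner (not FP-Growth) over a hypothetical query log, selecting the Top-N term combinations of length at most 3 as intersection-cache keys. All names and data here are illustrative assumptions, not taken from the thesis.

```python
from collections import Counter
from itertools import combinations

def select_intersection_cache_items(query_log, max_len=3, top_n=100):
    """Return the top_n most frequent term combinations (length 2..max_len)
    in the query log; these are the candidate keys whose posting-list
    intersections would be precomputed and kept in the intersection cache."""
    counts = Counter()
    for query in query_log:
        terms = sorted(set(query))  # treat each query as an itemset of terms
        for k in range(2, min(max_len, len(terms)) + 1):
            counts.update(combinations(terms, k))
    return [itemset for itemset, _ in counts.most_common(top_n)]

# Toy query log: each query is a list of terms (hypothetical data).
log = [
    ["search", "engine", "cache"],
    ["search", "engine", "index"],
    ["search", "cache"],
    ["posting", "list", "cache"],
    ["search", "engine", "cache"],
]
for itemset in select_intersection_cache_items(log, top_n=5):
    print(itemset)
```

Capping the combination length at 3 mirrors the abstract's observation that retrieval performance peaks when intersection items contain at most three terms; longer combinations add mining and storage cost for diminishing hit-rate gains.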
Keywords/Search Tags: Search Engine, Intersection Cache, Frequent Itemset Mining, Incremental Frequent Itemset Mining