
Research On Intersection Cache Based On Frequent Itemset Mining In Search Engines

Posted on: 2016-04-10
Degree: Master
Type: Thesis
Country: China
Candidate: W W Zhou
Full Text: PDF
GTID: 2348330479953428
Subject: Computer application technology

Abstract/Summary:
Modern search engines must store vast amounts of data while serving highly concurrent retrieval requests. Because hard disk drives (HDDs) are inexpensive and offer large capacity, many search engines use HDDs as their main storage medium; however, HDD read and write performance is far lower than that of memory, so the main bottleneck of a search engine is the low I/O speed of the disk. Many caching techniques have been applied to mitigate this problem, but the existing ones have underlying weaknesses: the query result cache and the posting list cache cannot cope well with longer queries, while the intersection cache's data selection strategy is inefficient and adapts poorly to different applications. A new cache data strategy is therefore needed, one that strikes a better balance among retrieval performance, the efficiency of the cache data selection strategy, and flexibility across applications.

To address the problems of existing search engine cache architectures, a three-level in-memory cache architecture, TLMCA, is proposed. It keeps the most frequently accessed query results, posting lists, and intersection data in memory so that retrieval results can be returned to the user as quickly as possible. Compared with the traditional two-level memory cache architecture, TLMCA improves retrieval performance by 27%, and the added intersection cache has little impact on the hit ratios of the query result cache and the posting list cache.

To improve the efficiency of intersection cache data selection and make the intersection data more adaptable to different applications, a new selection strategy based on Top-N frequent itemset mining with the FP-Growth algorithm is proposed, together with a corresponding cache lookup procedure that uses a greedy strategy to reduce system overhead when the intersection cache is hit. The intersection data also exhibit a notable property: retrieval performance is best when the maximum length of an intersection item is 3.

To keep the intersection data valid under a continuous retrieval stream, an intersection cache replacement strategy based on incremental frequent itemset mining is proposed. A Trie (prefix tree) structure reuses the previously built mining model to reduce the cost of incremental frequent itemset mining. In addition, based on a feedback mechanism driven by the cache hit ratio, a criterion is defined for when to replace intersection cache data in the dynamic data stream, aiming for a good balance between the overhead of the offline analysis system and the performance of the online retrieval system.
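The thesis bases its selection strategy on Top-N frequent itemset mining with FP-Growth; the sketch below only illustrates the underlying idea with a simple counting-based miner (not FP-Growth) over a hypothetical query log, selecting the Top-N term combinations of length at most 3 as intersection-cache keys. All names and data here are illustrative assumptions, not taken from the thesis.

```python
from collections import Counter
from itertools import combinations

def select_intersection_cache_items(query_log, max_len=3, top_n=100):
    """Return the top_n most frequent term combinations (length 2..max_len)
    in the query log; these are the candidate keys whose posting-list
    intersections would be precomputed and kept in the intersection cache."""
    counts = Counter()
    for query in query_log:
        terms = sorted(set(query))  # treat each query as an itemset of terms
        for k in range(2, min(max_len, len(terms)) + 1):
            counts.update(combinations(terms, k))
    return [itemset for itemset, _ in counts.most_common(top_n)]

# Toy query log: each query is a list of terms (hypothetical data).
log = [
    ["search", "engine", "cache"],
    ["search", "engine", "index"],
    ["search", "cache"],
    ["posting", "list", "cache"],
    ["search", "engine", "cache"],
]
for itemset in select_intersection_cache_items(log, top_n=5):
    print(itemset)
```

Capping the combination length at 3 mirrors the abstract's observation that retrieval performance peaks when intersection items contain at most three terms; longer combinations add mining and storage cost for diminishing hit-rate gains.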
Keywords/Search Tags: Search Engine, Intersection Cache, Frequent Itemset Mining, Incremental Frequent Itemset Mining