Crawl Schedule Research For Real-time Vertical Search Engine

Posted on: 2011-11-12
Degree: Master
Type: Thesis
Country: China
Candidate: J Q Zhou
Full Text: PDF
GTID: 2178360302974661
Subject: Computer application technology
Abstract/Summary:
Real-time vertical search engines have emerged to meet the need to search massive volumes of time-sensitive data. Crawl-task scheduling is the key technology of such engines and notably affects the efficiency and user experience of real-time vertical search products. However, crawl-task scheduling has so far not been addressed in the research literature, and real-time vertical search products face problems such as stale data and wasted crawl resources.

This thesis addresses the problems of crawl-task scheduling and surveys and investigates the related areas. First, it analyzes the basic problems of data crawling and summarizes the basic crawl strategies and the algorithms for predicting data change rates. It then proposes a new object cache optimization strategy for vertical search engines, named the OLCO strategy: based on the relationships between objects and their properties, a popular-object prediction model predicts the trend in the distribution of popular objects; and, assuming that data changes follow a Poisson process, a procedure is derived to maximize data freshness, together with an optimal strategy for distributing and balancing resources. Finally, it proposes a new self-adaptive crawl-task scheduling model, named the SACD model: by applying the concept of self-adaptation, the model addresses problems such as complex configuration and high maintenance cost in real-time vertical search products.

Extensive experiments were performed to verify the OLCO strategy and the SACD model using data from production real-time vertical search engine products. The results show that with the new strategy and model, the increase in time complexity is relatively limited, while the average freshness of user query results and the query precision are much better than with traditional strategies, which indicates that the new strategy and model are valuable.
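The freshness argument in the abstract rests on modeling data changes as a Poisson process. As a rough illustration only, and not the OLCO procedure itself (which this abstract does not spell out), the sketch below uses the standard closed form for a copy that is re-crawled at fixed intervals: if the source changes at rate `change_rate` and is refreshed at rate `refresh_rate`, the time-averaged probability that the copy is current is (1 - e^(-r)) / r with r = change_rate / refresh_rate. The function names, the two allocation policies, and the sample rates are all assumptions made for this example.

```python
import math


def expected_freshness(change_rate, refresh_rate):
    """Time-averaged probability that a cached copy is up to date.

    Assumes the source changes as a Poisson process with rate `change_rate`
    and is re-crawled at fixed intervals of 1 / refresh_rate.
    """
    if refresh_rate <= 0:
        return 0.0  # never refreshed: the copy is stale in the long run
    if change_rate <= 0:
        return 1.0  # never changes: the copy is always fresh
    r = change_rate / refresh_rate
    return (1.0 - math.exp(-r)) / r


def average_freshness(change_rates, refresh_rates):
    """Mean expected freshness over all objects."""
    pairs = zip(change_rates, refresh_rates)
    return sum(expected_freshness(c, f) for c, f in pairs) / len(change_rates)


def uniform_allocation(change_rates, budget):
    """Hypothetical policy: split the crawl budget equally."""
    n = len(change_rates)
    return [budget / n] * n


def proportional_allocation(change_rates, budget):
    """Hypothetical policy: split the budget in proportion to change rate."""
    total = sum(change_rates)
    return [budget * c / total for c in change_rates]


# Illustrative rates (changes per day) and a budget of 10 crawls per day.
rates = [1.0, 9.0]
budget = 10.0
uniform = average_freshness(rates, uniform_allocation(rates, budget))
proportional = average_freshness(rates, proportional_allocation(rates, budget))
print(f"uniform: {uniform:.3f}, proportional: {proportional:.3f}")
```

With these sample rates the equal split yields the higher average freshness, which suggests why allocating crawl resources is an optimization problem and not simply a matter of crawling fast-changing objects more often.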
Keywords/Search Tags: data crawl, cache strategy, real-time search, vertical search, vertical search engine