Crawl Technology Research For Real-time Vertical Search Engine

Posted on: 2012-03-19
Degree: Master
Type: Thesis
Country: China
Candidate: F Chen
Full Text: PDF
GTID: 2218330368487855
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, and especially since the arrival of Web 2.0, both the format and the content of websites have changed greatly. More and more sites use dynamic pages, which allow ordinary users to create, modify, and publish content, removing the restriction that content can be published only by the server. Ajax emerged as a technology for building such dynamic pages; it has greatly improved the user experience and reduced response time. In addition, the ways in which Web 2.0 pages obtain information are more diverse, and the timeliness requirements on that information are stricter than before. To address these problems, traditional crawler technology must be improved in two respects: crawling dynamic web pages and timeliness.

For crawling dynamic web pages, the crawler needs to execute dynamic scripts to obtain the page content; transitions between pages are no longer all based on "<a>" tags, and a URL is no longer the unique identifier of a web page. This paper presents a model that uses an embedded browser to execute dynamic scripts, and proposes an efficient approach for crawling the valid pages of websites that use dynamic scripts. First, in a training stage, we identify the elements and the events that, when triggered, lead to valid pages. Then we summarize the XPath features of these elements and the events to be triggered, and in the application stage we trigger only those specific events. Finally, we demonstrate the efficiency and performance of this model through experiments.

On the timeliness side, we focus on the basic problems of data crawling and predict the frequency of data changes by analyzing their change history. In real-time vertical search, changes to objects matter more than in traditional search engines, so we propose a model to predict the trend of the object distribution. Considering both the weight and the change frequency of objects, this paper proposes a crawl strategy based on the Poisson process that improves resource utilization and data freshness.
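The abstract does not give the details of the Poisson-based strategy, so the following is only a minimal sketch of one common way such a scheduler can be set up, not the thesis's actual method. It assumes that changes to an object arrive as a Poisson process with rate λ estimated from its history, so the probability that the object has changed t hours after the last crawl is 1 − exp(−λt), and it ranks objects by weight times that probability. All names (`TrackedObject`, `change_rate`, `stale_probability`, `crawl_priority`, `pick_batch`) and the hour-based units are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class TrackedObject:
    name: str
    weight: float             # importance of the object (assumed scale)
    changes_seen: int         # changes observed in the history window
    window_hours: float       # length of the history window, in hours
    hours_since_crawl: float  # time elapsed since the last crawl


def change_rate(obj: TrackedObject) -> float:
    """Estimate the Poisson rate (changes per hour) from the history."""
    return obj.changes_seen / obj.window_hours


def stale_probability(obj: TrackedObject) -> float:
    """P(at least one change since the last crawl) under a Poisson model."""
    return 1.0 - math.exp(-change_rate(obj) * obj.hours_since_crawl)


def crawl_priority(obj: TrackedObject) -> float:
    """Weight-adjusted staleness: heavier, faster-changing objects rank first."""
    return obj.weight * stale_probability(obj)


def pick_batch(objects: list[TrackedObject], budget: int) -> list[TrackedObject]:
    """Select the `budget` objects with the highest crawl priority."""
    return sorted(objects, key=crawl_priority, reverse=True)[:budget]
```

With a fixed crawl budget, a rule of this kind spends requests on objects that are both important and likely to be stale, which is the trade-off between resource utilization and data freshness that the abstract describes.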
Keywords/Search Tags: Dynamic Script, Embedded Browser, Real-time Search, Data Crawl, Poisson Process