Research on a Personalized Web Crawler Based on a Rules Engine

Posted on: 2011-05-08   Degree: Master   Type: Thesis
Country: China   Candidate: S J Zhao   Full Text: PDF
GTID: 2198330335489809   Subject: Computer technology
Abstract/Summary:
The Internet has become a necessity of everyday life: people rely on it to find information for both work and daily living, and search engines play a very important role in that process. General-purpose search engines, led by Google, help users locate information on the Internet, but their results only indicate which site holds the information. This approach works well for static pages, yet dynamic pages are increasingly common, and what users often need is the structured and unstructured information inside web pages, for example ticketing information from different websites, real estate listings, or product information. Today this information is obtained through the focused crawlers of vertical search engines, which generally follow one of two strategies. The first uses a topical crawler to crawl pages and then analyzes the crawled pages to extract information; the second has the focused crawler extract information while it crawls. The former covers a wider range of pages, but its analysis is slow and many of the pages it collects are of no use, so its efficiency is relatively low. The latter is now the more commonly used approach; it is more accurate and extracts information from pages faster. With either strategy the extracted information is highly relevant, but current topical crawlers suffer from inflexible configuration, insufficient user participation, and other problems. By studying search engine and rule engine technology, this thesis proposes a rule-engine-based configuration mechanism for search engines, so that a topical crawler can be configured in a personalized way. The thesis divides the personalized crawling process of the focused crawler into three parts: a rule editor module, a rule engine module, and a crawler module.
First, the rule editor module is used to create the rule base needed for crawling. Next, the facts (the execution data of the crawling task) and the rule base are submitted to the rule engine module. Finally, the rule engine module governs the operation of the crawler module according to the rules. To simplify rule base setup, the work of the crawler module is split into five small tasks: pre-crawl processing, crawl processing, content extraction processing, write/index processing, and post-processing. Each of these tasks is mapped onto the rule engine's processing model through a conversion algorithm, so users can adjust how the crawler works by editing the rule base. The focused crawler is then combined with per-user control, so that every user can configure a crawler of their own without affecting other users and can also share the rule base they have created. This approach replaces the traditional configuration mode, giving greater configuration flexibility and making the crawler easier to use. Finally, an example demonstrates the feasibility of the approach.
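The flow described in the abstract (a user-edited rule base, facts from the crawling task, and a rule engine that drives the crawler through five stages) can be pictured with the minimal Python sketch below. It is not the thesis's implementation. The names Rule, RuleEngine, STAGES, FAKE_WEB and build_user_rules, the host tickets.example, and the in-memory page table that replaces real HTTP fetching are all illustrative assumptions, and a plain condition/action loop stands in for a real rule engine.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List
from urllib.parse import urlparse

# The five crawler tasks named in the abstract, in execution order.
STAGES = ["pre_crawl", "crawl", "extract", "write_index", "post_process"]

Facts = Dict[str, object]  # execution data of one crawling task


@dataclass
class Rule:
    stage: str                         # which of the five tasks the rule belongs to
    name: str
    condition: Callable[[Facts], bool]
    action: Callable[[Facts], None]


@dataclass
class RuleEngine:
    rules: List[Rule] = field(default_factory=list)

    def add(self, rule: Rule) -> None:
        if rule.stage not in STAGES:
            raise ValueError(f"unknown stage: {rule.stage}")
        self.rules.append(rule)

    def run(self, facts: Facts) -> List[str]:
        """Apply matching rules stage by stage; return the names of fired rules."""
        fired = []
        for stage in STAGES:
            if facts.get("skip"):      # an earlier rule rejected this task
                break
            for rule in self.rules:
                if rule.stage == stage and rule.condition(facts):
                    rule.action(facts)
                    fired.append(rule.name)
        return fired


# Hypothetical stand-in for the network so the sketch runs offline.
FAKE_WEB = {
    "http://tickets.example/a":
        "<html>Price: 120 <a href='http://tickets.example/b'>next</a></html>",
}
PRICE = re.compile(r"Price:\s*(\d+)")


def build_user_rules(store: list) -> RuleEngine:
    """One user's rule base; another user would build a separate engine."""
    engine = RuleEngine()
    engine.add(Rule("pre_crawl", "reject-foreign-host",
                    lambda f: urlparse(f["url"]).hostname != "tickets.example",
                    lambda f: f.update(skip=True)))
    engine.add(Rule("crawl", "fetch-page",
                    lambda f: True,
                    lambda f: f.update(html=FAKE_WEB.get(f["url"], ""))))
    engine.add(Rule("extract", "extract-price",
                    lambda f: PRICE.search(f.get("html", "")) is not None,
                    lambda f: f.update(price=int(PRICE.search(f["html"]).group(1)))))
    engine.add(Rule("write_index", "store-record",
                    lambda f: "price" in f,
                    lambda f: store.append({"url": f["url"], "price": f["price"]})))
    engine.add(Rule("post_process", "queue-links",
                    lambda f: True,
                    lambda f: f.update(links=re.findall(r"href='([^']+)'",
                                                        f.get("html", "")))))
    return engine


if __name__ == "__main__":
    records: list = []
    print(build_user_rules(records).run({"url": "http://tickets.example/a"}))
    print(records)
```

In this sketch, changing the crawler's behavior means adding or removing Rule objects rather than editing the crawl loop, and each user holds a separate RuleEngine instance, which is one simple way to keep one user's rules from affecting another's.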
Keywords/Search Tags: Search engines, topical crawler, rule engines, vertical search