Research On A Personalized Web Crawler Based On A Rules Engine

Posted on:2011-05-08

Degree:Master

Type:Thesis

Country:China

Candidate:S J Zhao

Full Text:PDF

GTID:2198330335489809

Subject:Computer technology

Abstract/Summary:

The Internet has become a necessity of daily life. People rely on it to find information for both work and everyday needs, and search engines play a very important role in that process. General-purpose search engines, led by Google, help users find information on the Internet, but their results only indicate which site holds the information. This approach works well for static pages, but dynamic pages are increasingly common, and what users need is the structured and unstructured information inside web pages, for example ticketing information from different websites, real estate information, and commodity information. Today this information is obtained through the focused crawlers of vertical search engines. These engines generally acquire information with one of two strategies. The first uses a focused crawler to crawl web pages and then analyzes the crawled pages to extract information. The second has the focused crawler extract information while it crawls. The first strategy covers a wider range of the web, but analysis is slow, many of the crawled pages are of no use, and efficiency is relatively low. The second is the one now in general use: it is highly accurate and extracts information from pages faster. Either way, the extracted information is highly relevant, but current focused crawlers suffer from inflexible configuration and insufficient user participation, among other problems. By studying search engine and rule engine technology, this thesis proposes a configuration mechanism for search engines built on a rule engine, so that a focused crawler can be configured in a personalized way. The thesis designs the personalized, rule-driven focused crawler as three parts: a rule editor module, a rule engine module, and a crawler module.
First, the rule editor module is used to build the rule base needed for crawling. Next, the facts (the execution data of the crawl task) and the rule base are submitted to the rule engine module. Finally, the rule engine module governs the operation of the crawler module according to the rules. To simplify rule base setup, the work of the crawler module is divided into five small tasks: pre-crawl processing, crawl processing, content extraction processing, write and index processing, and post-processing. Each small task is mapped to a corresponding processing mode in the rule engine, so users can adjust how the crawler works by editing the rule base. The focused crawler is thereby placed under personalized user control: each user can configure their own crawler without affecting other users, and can also share their own rule base. This approach replaces the traditional configuration mode, giving greater configuration flexibility and making the crawler easier to use. Finally, an example demonstrates the feasibility of the approach.
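
The three-module design described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the names Rule, RuleBase, run_task, STAGES and PAGES are hypothetical, the rule engine is reduced to a simple condition/action loop, and an in-memory dictionary stands in for real HTTP fetching.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# The five small tasks named in the abstract, in execution order.
STAGES = ["pre_crawl", "crawl", "extract", "write_index", "post_process"]

Facts = Dict[str, Any]  # execution data of one crawl task


@dataclass
class Rule:
    """Hypothetical rule: if `condition` holds for the facts, run `action`."""
    stage: str
    condition: Callable[[Facts], Any]
    action: Callable[[Facts], Any]


@dataclass
class RuleBase:
    """A per-user set of rules, as a rule editor module might produce."""
    owner: str
    rules: List[Rule] = field(default_factory=list)

    def add(self, stage: str, condition, action) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.rules.append(Rule(stage, condition, action))


def run_task(rule_base: RuleBase, facts: Facts) -> Facts:
    """Toy rule engine: per stage, fire every rule whose condition matches."""
    for stage in STAGES:
        for rule in rule_base.rules:
            if rule.stage == stage and rule.condition(facts):
                rule.action(facts)
    return facts


# Assumption: a fake page store replaces network access in this sketch.
PAGES = {"http://example.com/tickets": "Ticket A: 120 yuan"}

rules = RuleBase(owner="alice")
# Pre-crawl: only allow URLs under one site.
rules.add("pre_crawl", lambda f: True,
          lambda f: f.update(allowed=f["url"].startswith("http://example.com/")))
# Crawl: fetch the page body if the URL was allowed.
rules.add("crawl", lambda f: f.get("allowed"),
          lambda f: f.update(html=PAGES.get(f["url"], "")))
# Extraction: keep the text after the first colon on ticket pages.
rules.add("extract", lambda f: "Ticket" in f.get("html", ""),
          lambda f: f.update(record=f["html"].split(":", 1)[1].strip()))
# Write and index: append the extracted record to the index.
rules.add("write_index", lambda f: "record" in f,
          lambda f: f["index"].append(f["record"]))
# Post-processing: mark the task as finished.
rules.add("post_process", lambda f: True, lambda f: f.update(done=True))

result = run_task(rules, {"url": "http://example.com/tickets", "index": []})
```

Because each user holds a separate RuleBase, changing one user's rules does not alter another user's crawler, and a rule base can be passed to someone else unchanged, which mirrors the personalization and sharing goals stated in the abstract.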

Keywords/Search Tags:

Search engines, focused crawler, rule engines, vertical search

Related items

1	Research And Optimize On Vertical Search Engine Based On Coreseek
2	Research On Design And Implementation Of The Extensible Distributed Vertical Search Engines
3	Research On Topic-Specific Search Engines
4	The Research And Design On Personalized Search For Meta Search Engines
5	Merging multiple search results approach for meta-search engines
6	Enterprise Information Vertical Search Engines Research And Implementation
7	Research And Realization Of Chinese And English Vertical Search Engines On The Police
8	The Research And Application Of Enterprises Documents Search Engines Based On Lucene
9	The Current Situation And Issues Of Chinese Search Engines And The Countermeasures To Develop Them
10	The Design Of Specific Topic Web Crawler And Its Transmission Group