
The Design And Implementation Of The Complex Rules-Driven Focused Crawler System

Posted on: 2017-12-23
Degree: Master
Type: Thesis
Country: China
Candidate: Q Liu
Full Text: PDF
GTID: 2348330509457566
Subject: Software engineering
Abstract/Summary:
A focused crawler, also known as a topical crawler, crawls with a specific purpose. It collects web information in a defined order, strives to capture all information relevant to its subject, and fetches the most relevant pages first while ignoring pages of low relevance.

This project implements a focused crawler system that monitors a specified range of sites in real time. The system uses compound rules to guide the direction of the crawl. The crawled content is presented to users through a website, where they can tag page content and configure the system's operating parameters.

The system is divided into two modules: a web content acquisition module and a display and query module. The web content acquisition module obtains the web page information that users need from the network, then analyzes and records it. Its operations include text extraction, elimination of similar pages, link analysis, content analysis, storage, warehousing, scheduling, and others. Through the coordination of these components, the system analyzes and processes the information crawled from web pages and concentrates on pages whose content matches the topic. The display and query module presents the stored page content for users to view. It is implemented on the SSH framework, and the data can be displayed as charts in the middle of the page.

The project has been completed and aims to meet users' needs. It has run stably online for six months, has crawled 300,000 distinct pages, and monitors 5,000 domain names.
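The abstract does not give the rule format or the scheduling algorithm, so the following is only a minimal sketch of how rule-guided, most-relevant-first crawling could look. Everything named here is an assumption made for illustration: the `Rule` record (a keyword with a weight, matched against anchor text or the URL), the additive `score` function, the `Frontier` class, the `min_score` threshold, and the `example.com` domain. It shows three ideas from the abstract: restricting the crawl to monitored domains, combining several rules into one relevance score, and fetching higher-scoring pages first.

```python
import heapq
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: a keyword and the weight it adds when it matches."""
    keyword: str
    weight: float
    target: str = "text"  # "text" matches anchor text, "url" matches the URL


def score(url, anchor_text, rules):
    """Combine rules by summing the weights of every rule that matches."""
    total = 0.0
    for rule in rules:
        haystack = url if rule.target == "url" else anchor_text
        if rule.keyword.lower() in haystack.lower():
            total += rule.weight
    return total


class Frontier:
    """Priority queue of URLs to fetch, most relevant first."""

    def __init__(self, rules, allowed_domains, min_score=0.5):
        self.rules = list(rules)
        self.allowed_domains = set(allowed_domains)
        self.min_score = min_score
        self._heap = []      # entries are (-score, insertion order, url)
        self._seen = set()   # URLs already queued, to avoid re-adding them
        self._counter = 0    # tie-breaker so equal scores pop in FIFO order

    def add(self, url, anchor_text=""):
        """Queue a link if it is in a monitored domain and relevant enough."""
        host = urlparse(url).hostname or ""
        if host not in self.allowed_domains or url in self._seen:
            return False
        relevance = score(url, anchor_text, self.rules)
        if relevance < self.min_score:
            return False  # low-relevance pages are ignored
        self._seen.add(url)
        heapq.heappush(self._heap, (-relevance, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Return the queued URL with the highest score, and that score."""
        neg_relevance, _, url = heapq.heappop(self._heap)
        return url, -neg_relevance

    def __len__(self):
        return len(self._heap)
```

A crawler loop built on this sketch would call `pop()` to pick the next page, download and analyze it, and call `add()` for each outgoing link with its anchor text. A real system would likely score page content as well, and would replace the exact-URL `_seen` set with a similarity check to eliminate near-duplicate pages.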
Keywords/Search Tags: Focused crawler, topic, complex rules, subject relevance, page, domain