Research On The Key Technology Of Focused Crawler System

Posted on: 2018-12-27
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Li
GTID: 2348330518495469
Subject: Information and Communication Engineering

Abstract/Summary:
With the rapid development of network technology, the number of pages and the amount of information on the Internet are growing at an explosive rate. To meet people's personalized search needs, vertical search engines came into being. They collect and retrieve only content on relevant topics, and help people find topic-related information more quickly and more accurately. The Web crawler is the core module a search engine uses to collect data. A general spider aims to cover as much of the whole network as possible, while a focused crawler for a vertical search engine aims to collect only data related to the topic.

First, the thesis briefly introduces the general spider model, its basic principle, and its limitations, and then introduces the similarities, differences, and key technologies of the focused crawler and the general spider. The knowledge related to the subject judging module and the topical crawling strategy are then studied in detail. A weakness of earlier work is that the value chain prediction module, the subject judgment, and the index page judgment all operate on the specific downloaded web pages, which wastes crawler resources. In addition, traditional subject judging methods do not achieve high accuracy. To address these problems, the thesis does the following work.

(1) Text features are represented by continuous distributed word vectors, which perform better than the traditional vector space method. A short text classifier based on a convolutional neural network is constructed for subject discrimination.

(2) A classification method based on anchor text information is proposed to judge the web page type (content type or index type). Combining this page type classification with the short text classifier based on a convolutional neural network, a focused crawler system based on anchor text is designed. The focused crawler uses only the anchor text to determine whether a page is related to the subject and what type it is, which streamlines the crawler logic.

(3) The classification algorithm and the spider system are implemented in Python. The convolutional neural network classifier, the web page type classifier, and the focused crawler system are tested separately. The experiments showed that the accuracy of the classifiers and the harvest rate of the crawler reached 90% or more.

Based on this research and the actual requirements of scientific project declaration, we applied the focused crawler designed in this thesis to a scientific research service platform, where it collected scientific research news and related scientific project information.
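The anchor-text idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the keyword test `is_on_topic` is only a stand-in for the CNN short text classifier, `guess_page_type` is a placeholder heuristic for the page type classifier, and every name here (`AnchorExtractor`, `LinkDecision`, `plan_links`, `TOPIC_KEYWORDS`) is hypothetical. `harvest_rate` uses the usual definition, the share of fetched pages that turn out to be relevant.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

# Hypothetical topic vocabulary; a trained classifier would replace this.
TOPIC_KEYWORDS = {"research", "project", "grant"}


class AnchorExtractor(HTMLParser):
    """Collect (absolute_url, anchor_text) pairs from an HTML document."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        # Only text that sits inside an <a href=...> element is kept.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


@dataclass
class LinkDecision:
    url: str
    anchor_text: str
    on_topic: bool
    page_type: str  # "content" or "index"


def is_on_topic(anchor_text: str) -> bool:
    """Stand-in for the short text classifier: simple keyword overlap."""
    return bool(set(anchor_text.lower().split()) & TOPIC_KEYWORDS)


def guess_page_type(anchor_text: str) -> str:
    """Placeholder heuristic: very short anchors are treated as index links."""
    return "index" if len(anchor_text.split()) <= 2 else "content"


def plan_links(
    html: str,
    base_url: str,
    topic_fn: Callable[[str], bool] = is_on_topic,
    type_fn: Callable[[str], str] = guess_page_type,
) -> List[LinkDecision]:
    """Judge every link from its anchor text alone, before any page is fetched."""
    parser = AnchorExtractor(base_url)
    parser.feed(html)
    return [
        LinkDecision(url, text, topic_fn(text), type_fn(text))
        for url, text in parser.links
    ]


def harvest_rate(relevant_fetched: int, total_fetched: int) -> float:
    """Fraction of fetched pages that were relevant to the topic."""
    return relevant_fetched / total_fetched if total_fetched else 0.0
```

A crawler built this way would call `plan_links` on each downloaded index page and queue only the links whose decision is favourable, so off-topic pages are never downloaded. The two classifier arguments are passed in as functions so that trained models can replace the placeholders without changing the crawl loop.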
Keywords/Search Tags:focused crawler, word embedding, convolutional neural network, index page judgement