Design And Implementation Of Multithreading Web Crawler Oriented Topic

Posted on:2018-07-28

Degree:Master

Type:Thesis

Country:China

Candidate:G B Cai

Full Text:PDF

GTID:2348330515994120

Subject:Software engineering

Abstract/Summary:

A web crawler is a program that automatically fetches web pages; it is an important component of search engines, which use it to download pages from the Internet. In recent years, the rapid development of the Internet has led to explosive growth in online information. Because a generic web crawler struggles to retrieve information quickly and accurately from this ocean of data, the topical crawler (also known as the focused crawler) has come into use. A topical crawler filters out topic-irrelevant links using a web page analysis algorithm, retains the useful links, and places them in a queue. The pages the crawler captures are then systematically stored, analyzed, filtered, and indexed for subsequent query and retrieval.

First, this thesis introduces the general development of web crawlers and some related techniques. Second, it focuses on the drawbacks of the generic crawler, analyzes the working principle and related technology of the topical crawler, and presents the workflow and overall architecture design of the topical crawler. It also describes several modules, including the basic functional architecture, web page crawling, front-end display, database design, and system interface design.

Based on an analysis of topic relevance judgment algorithms, this thesis uses the vector space model for page content processing. The vector space model represents the content of a web page as a vector, and a similarity measure is defined on these vectors so that the similarity of content can be determined. The Fish-Search algorithm, based on content evaluation, is adopted to achieve this goal. For URL processing, the PageRank algorithm, based on link analysis, is adopted, and the importance of a web page is evaluated according to the quantity hypothesis and the quality hypothesis.

To summarize, this thesis uses a search strategy that combines the content-based Fish-Search algorithm with the link-analysis-based PageRank algorithm to evaluate topic relevance and ensure that downloaded pages are relevant to the topic, which effectively avoids "topic drift" and improves the precision ratio and recall ratio. For multi-threaded processing, the Python thread pool used in this thesis is well suited to I/O-intensive tasks and can effectively improve efficiency.
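
As a rough illustration of the vector space model and the thread pool mentioned in the abstract, the following minimal Python sketch scores pages against a topic by cosine similarity of term-frequency vectors and fetches the pages through a thread pool. It is not the thesis's implementation: the names `FAKE_PAGES`, `term_vector`, `cosine_similarity`, `fetch`, and `score_pages` are hypothetical, the "download" is simulated from an in-memory dictionary, plain term counts stand in for whatever weighting the thesis uses, and the Fish-Search and PageRank components are not shown.

```python
import math
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web" standing in for real HTTP downloads.
FAKE_PAGES = {
    "http://example.com/a": "web crawler downloads web pages for a search engine",
    "http://example.com/b": "recipes for baking bread at home",
}


def term_vector(text):
    """Represent text as a sparse term-frequency vector (term -> count)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def fetch(url):
    """Placeholder for an I/O-bound download (assumption: a real crawler
    would issue an HTTP request here)."""
    return FAKE_PAGES[url]


def score_pages(urls, topic, max_workers=4):
    """Fetch pages concurrently, then score each against the topic text."""
    topic_vec = term_vector(topic)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(fetch, urls))
    return {
        url: cosine_similarity(term_vector(text), topic_vec)
        for url, text in zip(urls, texts)
    }


if __name__ == "__main__":
    scores = score_pages(list(FAKE_PAGES), "web crawler search engine")
    for url, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{score:.3f}  {url}")
```

In a topical crawler, a score like this could be compared with a threshold to decide whether a page's outgoing links are worth queueing. Threads suit the fetch step because the time is spent waiting on the network rather than computing.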

Keywords/Search Tags:

web crawler, topical crawler, topic relevance, PageRank, Multithreading

Related items

1	Research And Realization Of Topical Crawler Based On Content And Hyperlink
2	Research On The Topical Crawler For The Cultural Fields
3	Research On Topical Web Crawling
4	Design And Implementation Of The Theme Crawler For Procurement Clues In The Automotive Field
5	Research And Implementation Of Scientific Topic Search Engine Crawler Based On Nutch
6	The Research Of Topical Crawler Search Strategy In Web Page
7	Research On Topic Focused Web Crawler And Related Technologies
8	Research And Implementation On Algorithms Of Topical Crawler
9	Optimization And Implement Of The Topic Web Crawler Correlation Algorithms
10	Research On The Topic Crawler Algorithm Based On Vector Space Model