Design And Implementation Of Multithreading Web Crawler Oriented Topic

Posted on: 2018-07-28
Degree: Master
Type: Thesis
Country: China
Candidate: G B Cai
Full Text: PDF
GTID: 2348330515994120
Subject: Software engineering
Abstract/Summary:
A web crawler is a program that automatically crawls web pages. It is an important part of search engines, which use it to download pages from the Internet. In recent years, the rapid development of the Internet has led to explosive growth in online information. Because a generic web crawler struggles to retrieve information quickly and accurately from this ocean of data, the topical crawler (also known as the focused crawler) has come into use. A topical crawler applies a web page analysis algorithm to filter out topic-irrelevant links, retains the useful ones, and places them in a queue of URLs to crawl. Everything the crawler captures is then systematically stored, analyzed, filtered, and indexed for subsequent query and retrieval. First, this thesis introduces the general development of web crawlers and some related techniques. Second, it focuses on the drawbacks of the generic crawler, analyzes the working principle and related technology of the topical crawler, and presents the topical crawler's workflow and overall architecture design. It also describes the system's modules, including the basic functional architecture, web page crawling, front-end display, database design, and system interface design. After analyzing algorithms for judging topic relevance, this thesis uses the vector space model to process page content. The vector space model represents the content of a web page as a vector, and a similarity measure is defined on these vectors so that the similarity of content can be determined. The Fish-Search algorithm, which is based on content evaluation, is adopted to achieve this goal. For URL processing, the PageRank algorithm, which is based on link analysis, is adopted, and the importance of a web page is evaluated according to the quantity hypothesis and the quality hypothesis. In summary, this thesis uses a search strategy that combines the content-based Fish-Search algorithm with the link-based PageRank algorithm to evaluate topic relevance and to keep downloaded pages relevant to the topic, which effectively avoids "topic drift" and improves the precision and recall ratios. For multithreaded processing, the Python thread pool used in this thesis is well suited to I/O-intensive tasks and can effectively improve efficiency.
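As a rough illustration of two ideas named in the abstract, the following minimal sketch scores pages against a topic with a vector space model (term-frequency vectors compared by cosine similarity) and runs the scoring in a Python thread pool. It is not the thesis's implementation: it does not implement Fish-Search or PageRank, and the topic string, the example URLs, the page texts, and the function names `term_vector`, `cosine_similarity`, `fetch`, `score`, and `crawl` are all assumptions made for this example. The `fetch` function is a stand-in that reads from an in-memory dictionary, where a real crawler would perform an I/O-bound HTTP download.

```python
import math
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical topic description; not taken from the thesis.
TOPIC = "web crawler search engine"

# Hypothetical stand-in for downloaded pages, so the sketch runs offline.
PAGES = {
    "http://example.com/a": "A focused web crawler downloads pages for a search engine",
    "http://example.com/b": "Recipes for baking bread at home",
}


def term_vector(text):
    """Represent text as a term-frequency vector (word -> count)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def fetch(url):
    """Placeholder for an I/O-bound download; assumed, not the thesis's fetcher."""
    return PAGES[url]


def score(url):
    """Fetch one page and return (url, topic relevance score)."""
    return url, cosine_similarity(term_vector(TOPIC), term_vector(fetch(url)))


def crawl(urls, max_workers=4):
    """Score all URLs concurrently; threads overlap the waiting time of I/O."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(score, urls))


scores = crawl(list(PAGES))
```

In this sketch, a page that shares no terms with the topic scores 0.0, so a crawler could drop its outgoing links by comparing the score against a threshold. A thread pool fits this workload because each worker spends most of its time waiting on I/O rather than computing.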
Keywords/Search Tags: web crawler, topical crawler, topic relevance, PageRank, multithreading