
The Design And Implementation Of WEB Crawler And Topic Search Engine Based On Nutch

Posted on: 2017-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: R Wang
GTID: 2348330518496235
Subject: Computer technology
Abstract/Summary:
With the development of the Internet and Web 2.0 technology, the amount of information stored online has grown rapidly, and search engines have become increasingly important. One research topic in the field of network information retrieval is the subject-oriented web crawler and its search engine technology. Unlike traditional search engines, which collect webpages indiscriminately, a subject-oriented search engine collects only webpages that are customized by the user and related to particular topics. Because a subject-oriented search engine can improve the accuracy, depth and breadth of a query, it can greatly enhance the efficiency of people's work and life. Nutch is an open source web crawler system based on Lucene; combined with the Solr indexing server, it forms a framework with highly standardized modules. Although Nutch integrates plugins with a variety of functions, it has the weakness that it cannot parse script content or filter pages by topic, which degrades the final search results. This paper designs and implements a subject-oriented search engine based on the open source search engine Nutch. The main work includes the following aspects:

1. This paper surveys research related to subject-oriented search engines and analyzes the working principle of the open source web crawler Nutch. It introduces and analyzes the important components of a topic search engine and Chinese word segmentation.
2. Having found that the original web crawler cannot crawl the dynamic links and content in pages, this paper designs and implements a JS parser plugin based on the Nutch plugin system. While the crawler crawls pages, the plugin parses the scripts in them and extracts dynamic links with regular expressions. In addition, for Ajax requests, the dynamic pages are rendered to static pages through HtmlUnit so that the dynamic content can be crawled.
3. Training documents are used to generate a Bayesian model according to the Bayesian classifier algorithm. The subject of each webpage crawled by the improved crawler is then determined before the page is indexed. If the subject matches the target subject, the webpage is stored; otherwise it is ignored. In this way a subject crawler based on a Bayesian classifier is realized.
4. The dictionary-based IKAnalyzer is used to test and improve Nutch's Chinese word segmentation, which improves the segmentation results.
5. This paper designs and implements a subject-oriented search engine based on Nutch, and experiments are carried out to test the crawler's performance and the system's precision. The results show that the system is effective. Although the JS parser and topic filtering functions lower crawling efficiency, the precision of the search engine system designed in this paper is greatly improved compared with the original Nutch system and the general search engine Baidu.
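
The topic filtering step in item 3 can be illustrated with a minimal sketch. It is not the thesis implementation, which runs inside Nutch. It assumes a multinomial naive Bayes classifier with add-one smoothing, a simple English tokenizer standing in for a real word segmenter such as IKAnalyzer, and a made-up toy training set. The names `NaiveBayesTopicFilter`, `should_index` and `TRAINING_DOCS` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs (stand-in for a real segmenter)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesTopicFilter:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocabulary = set()      # all tokens seen during training

    def train(self, labeled_docs):
        """Build the model from (text, label) pairs."""
        for text, label in labeled_docs:
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)

    def log_score(self, text, label):
        """Return log P(label) + sum of log P(token | label) over the text."""
        counts = self.word_counts[label]
        total_tokens = sum(counts.values())
        vocab_size = len(self.vocabulary)
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for token in tokenize(text):
            # Add-one smoothing keeps unseen tokens from zeroing the probability.
            score += math.log((counts[token] + 1) / (total_tokens + vocab_size))
        return score

    def classify(self, text):
        """Return the label with the highest log score."""
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


def should_index(page_text, classifier, target_topic):
    """Keep a page only if its predicted topic matches the target topic."""
    return classifier.classify(page_text) == target_topic


# Toy training data, invented for illustration only.
TRAINING_DOCS = [
    ("football match goal league striker", "sports"),
    ("basketball team score playoff coach", "sports"),
    ("stock market shares investor bank", "finance"),
    ("interest rate bank loan inflation", "finance"),
]

classifier = NaiveBayesTopicFilter()
classifier.train(TRAINING_DOCS)

for page in ("the striker scored a goal in the league match",
             "bank raises interest rate on loan"):
    decision = "index" if should_index(page, classifier, "sports") else "ignore"
    print(f"{decision}: {page}")
```

In a crawler, a check like `should_index` would sit between parsing and indexing, so that off-topic pages are dropped before any index entry is created for them.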
Keywords/Search Tags: Focused search engine, Ajax, Focus, Bayes, Chinese Word Segmentation