The Design And Implementation Of WEB Crawler And Topic Search Engine Based On Nutch

Posted on:2017-06-26

Degree:Master

Type:Thesis

Country:China

Candidate:R Wang

Full Text:PDF

GTID:2348330518496235

Subject:Computer technology

Abstract/Summary:

With the development of the Internet and Web 2.0 technology, the amount of information stored on the Internet has grown rapidly, and search engines have become increasingly important. One research topic in the field of network information retrieval is the subject-oriented web crawler and its search engine technology. Unlike traditional search engines, which collect webpages indiscriminately, subject-oriented search engines collect only webpages that are customized by the user and related to particular topics. Because a subject-oriented search engine can improve the accuracy, depth and breadth of queries, it can greatly enhance the efficiency of people's work and life. Nutch is an open source web crawler system based on Lucene; combined with the Solr indexing server, it forms a framework with highly standardized modules. Although Nutch integrates plugins with a variety of functions, its weakness is that it cannot parse script content or filter pages by topic, which degrades the final search results. This paper designs and implements a subject-oriented search engine based on the open source search engine Nutch. The main work of this paper includes the following aspects:

1. This paper surveys research related to subject-oriented search engines and analyzes the working principle of the open source web crawler Nutch. It introduces and analyzes the important components of the topic search engine and Chinese word segmentation.

2. Observing that the original web crawler cannot crawl the dynamic links and content in pages, this paper designs and implements a JS parser plugin based on the Nutch plugin system. While the web crawler crawls pages, this parser plugin parses the scripts in them and extracts the dynamic links with regular expressions. In addition, for Ajax requests, dynamic pages are rendered into static pages with HtmlUnit so that the dynamic content can be crawled.

3. Following the Bayesian classifier algorithm, training documents are used to generate the Bayesian model. The subject of each webpage crawled by the improved web crawler is then determined before the page is indexed. If the subject matches the target subject, the webpage is stored; otherwise it is ignored. On this basis, a subject crawler based on the Bayesian classifier is realized.

4. The dictionary-based IKAnalyzer is used to test and improve Nutch's Chinese word segmentation, which improves the segmentation results.

5. This paper designs and implements a subject-oriented search engine based on Nutch, and experiments are carried out to test the web crawler's performance and the precision of the system. The results show that the system is effective: although the JS parser and topic filtering functions lower crawling efficiency, the precision of the search engine system designed in this paper is greatly improved compared with the original Nutch system and the general search engine Baidu.
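
The Bayesian topic filtering step described in the abstract (train a model from labeled documents, classify each crawled page, store it only if its subject matches the target) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not say which naive Bayes variant is used, so a multinomial model with Laplace smoothing is assumed here. The class name `NaiveBayesTopicFilter` and its methods are hypothetical, no Nutch or Lucene API is used, and whitespace tokenization stands in for a real Chinese word segmenter such as IKAnalyzer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: multinomial naive Bayes used as a store/ignore gate. */
final class NaiveBayesTopicFilter {
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> totalTerms = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Adds one labeled training document to the model. */
    void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                termCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String term : tokenize(text)) {
            counts.merge(term, 1, Integer::sum);
            totalTerms.merge(label, 1, Integer::sum);
            vocabulary.add(term);
        }
    }

    /** Returns the label with the highest log posterior, or null if untrained. */
    String classify(String text) {
        List<String> terms = tokenize(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // Log prior: fraction of training documents carrying this label.
            double score = Math.log((double) docCounts.get(label) / totalDocs);
            Map<String, Integer> counts = termCounts.get(label);
            int total = totalTerms.getOrDefault(label, 0);
            for (String term : terms) {
                if (!vocabulary.contains(term)) {
                    continue; // terms never seen in training are skipped
                }
                int c = counts.getOrDefault(term, 0);
                // Laplace (add-one) smoothed log likelihood of the term.
                score += Math.log((c + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    /** The gate applied before indexing: keep the page only if it is on topic. */
    boolean shouldStore(String pageText, String targetTopic) {
        return targetTopic.equals(classify(pageText));
    }

    /** Assumption: lowercase whitespace splitting replaces real segmentation. */
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        NaiveBayesTopicFilter filter = new NaiveBayesTopicFilter();
        filter.train("sports", "football match goal team");
        filter.train("sports", "basketball team score game");
        filter.train("finance", "stock market bank interest");
        filter.train("finance", "bank loan market price");
        System.out.println(filter.shouldStore("the team won the football game", "sports"));
        System.out.println(filter.shouldStore("bank raises interest on loan", "sports"));
    }
}
```

In a crawler, a gate like `shouldStore` would sit between parsing and indexing, which is consistent with the abstract's observation that topic filtering lowers crawling efficiency: every fetched page pays the classification cost, including the pages that end up discarded.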

Keywords/Search Tags:

Focus search engine, Ajax, Focus, Bayes, Chinese Word Segmentation

Related items

1	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
2	Research On Chinese Word Segmentation Of Search Engine
3	Applied Research Of Chinese Word Segmentation In Agricultural Vertical Search Engine
4	The Research And Realization Of Chinese Word Segmentation System Applies In Chemical Professional Search Engine
5	A Design And Application Of Personalized Information Retrieve And User Recommendation On Search Engine
6	The Campus Network Core Search Engine Technology - Chinese Word Segmentation
7	Study And Implementation On Chinese Word Segmentation Algorithm Of Search Engine Based On Nutch
8	The Research And Application Of Chinese Word Segmentation Technology In Search Engine
9	Research And Implementation Of Several Key Technologies In Intelligent Chinese Search Engine
10	Word Segmentation-based Enterprise Document Search Engine Design And Realization