The Application And Research Of Distributed Web Crawler In Agricultural Searching System

Posted on: 2017-01-25; Degree: Master; Type: Thesis
Country: China; Candidate: L T Yuan; Full Text: PDF
GTID: 2348330488478226; Subject: Computer technology
Abstract/Summary:
With Internet information growing rapidly, it is not realistic to cover and collect everything on the Internet. Even search engine companies such as Google and Baidu can collect less than 20% of it [1]. General search engines play an increasingly critical role in people's lives, but as users' needs diversify, the limitations of general engines make it hard to satisfy precise search needs. Commodity information on the Internet is highly varied. Prices of agricultural products vary considerably with origin and production time, and there are many agricultural trading sites, so how to obtain the agricultural information of interest from the Internet is a problem worth studying. Being able to quickly obtain information about related products simply by entering a keyword would be a valuable service. Because the amount of information on the network is huge, even a very narrow product field involves vast amounts of information. For a price-comparison system, acquiring information is the first task, and in the face of such volumes a single-machine crawler is evidently limited. This thesis proposes combining a web crawler with a distributed system: the crawler runs on a multi-machine cluster, thereby improving the efficiency and download rate of information collection. The system is built on the functions and characteristics of the mature search engine architecture Nutch; it builds indexes for agricultural information and provides search and query capabilities. Because the product information studied here belongs to a specific domain, building the search and query functions involves URL filtering and judging relevance to the topic.
We combine HTMLParser and regular expressions to filter URLs, and judge how well a page's content matches the topic by building a vector space model. After the web information is collected, building full-page search and providing search services requires word segmentation. Because Nutch's built-in segmentation does not perform well on Chinese, we chose IKAnalyzer, which supports Chinese better, for Chinese word segmentation. Indexing organizes and optimizes the gathered information in advance so that it can be located quickly. Nutch's default retrieval is full-text search, supported by Lucene. The system builds an index of web page information and supports efficient search. Finally, we deploy the Nutch-based distributed crawler system for agricultural commodity information for experimental verification. The experiments show that the distributed crawler retains a parallelism advantage over an ordinary single-machine crawler. The whole system also performs well at collecting, indexing, and retrieving agricultural product information on the network, providing users with a personalized search service that offers agricultural product information along with some sorting and comparison features.
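
As a rough illustration of the two filtering steps the abstract names (regular-expression URL filtering and topic relevance via a vector space model), the following is a minimal Python sketch, not the thesis's Nutch/Java implementation. The URL patterns, the function names, and the use of raw term frequencies with cosine similarity are all assumptions made for illustration; the abstract does not give the actual rules or term weighting. Tokens are assumed to be already segmented (for Chinese text, by a segmenter such as IKAnalyzer).

```python
import math
import re
from collections import Counter

# Hypothetical URL patterns; the thesis does not list its actual rules.
ALLOW = re.compile(r"^https?://[^/]+/(product|price|market)/", re.IGNORECASE)
DENY = re.compile(r"\.(jpg|png|gif|css|js|zip)(\?.*)?$", re.IGNORECASE)


def is_allowed_url(url: str) -> bool:
    """Keep a URL only if it matches the allow pattern and not the deny pattern."""
    return bool(ALLOW.search(url)) and not DENY.search(url)


def cosine_relevance(page_tokens, topic_tokens) -> float:
    """Cosine similarity between page and topic term-frequency vectors."""
    # Assumption: raw term counts are used as vector weights.
    page, topic = Counter(page_tokens), Counter(topic_tokens)
    dot = sum(page[t] * topic[t] for t in page.keys() & topic.keys())
    norm = math.sqrt(sum(v * v for v in page.values())) * math.sqrt(
        sum(v * v for v in topic.values())
    )
    # An empty page or topic has no direction, so treat it as irrelevant.
    return dot / norm if norm else 0.0
```

In a crawler loop, a page would be fetched only if `is_allowed_url` accepts its URL, and kept for indexing only if `cosine_relevance` against the topic vocabulary exceeds some threshold chosen by experiment.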
Keywords/Search Tags: Nutch, distributed system, web crawler, agricultural product