Research On Agricultural Vertical Search Engine Based On Nutch

Posted on: 2015-05-10
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Zhang
GTID: 2298330467458944
Subject: Computer technology
Abstract/Summary:
With the continuous development of Internet technology, network information resources are growing exponentially, and search engines play a very important role in helping us query information on the Internet. China has a very large rural population, and agriculture is its basic industry. Speeding up the construction of agricultural informatization therefore helps to solve agricultural problems effectively, integrate agricultural information resources, and move the country's agriculture gradually toward information-based agriculture. Effectively solving the problems that agricultural users encounter in production and daily life can in turn promote the development of agricultural informatization. In this paper, a search engine for the agricultural domain is studied and developed.

The system is built on Nutch, an open-source search engine written in Java. Nutch is lightweight and runs stably, with relatively high recall and precision. However, Nutch is weak in web page analysis and abstract extraction, and it cannot meet the search needs of agricultural users. In this paper, we improve these two modules of Nutch and implement query expansion for search terms.

The main work of this paper is as follows:

(1) Web crawling strategy. A breadth-first algorithm is used to crawl web information, and the number of crawl layers is restricted, so that as much information as possible is crawled from agricultural sites.

(2) Web page parsing. We use the STU-DOM tree model: an HTML parser converts each HTML page into a DOM tree with semantic attributes. Through structure filtering and content pruning, the information relevant to the subject is retained, which achieves extraction of the subject information of a web page.

(3) Abstract extraction. A statistics-based method is used, and the extraction process is based on text features. The text is split into sentences and word frequencies are counted; the weights of words and sentences are then calculated, and the abstract sentences are identified. The selected sentences are output in the order in which they appear in the text to form the final abstract.

(4) Query expansion. User query expansion for agriculture is realized by building a domain ontology. An agricultural domain ontology is built first; then, according to the hierarchical relationships among its concepts, the synonyms, hyponyms and instances related to the user's search term are retrieved from the ontology, which implements semantic query expansion.

The agricultural vertical search engine based on Nutch developed in this paper has been applied in Yanshan County, Hebei Province. The results show that the search technology can integrate agricultural information resources and effectively filter out web information unrelated to agriculture. When a user retrieves agricultural information, the system extracts an abstract for each search result, which makes the results easier to browse and saves the user's time. It also suggests related query words to agricultural users, providing a more accurate way to search.
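
The statistics-based abstract extraction in item (3) can be illustrated with a minimal sketch. It is not the thesis's implementation: the stop-word list, the function names, the averaging of word frequencies as the sentence weight, and the English-only tokenizer are all assumptions made for illustration (Chinese text would need a word segmenter instead of the regular expression used here).

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller, domain-specific one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "for", "on", "it"}


def split_sentences(text):
    """Split text into sentences at ., !, ? or their full-width equivalents."""
    parts = re.findall(r"[^.!?。！？]+[.!?。！？]?", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Lower-case the sentence and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=2):
    """Pick the highest-weighted sentences and return them in document order."""
    sentences = split_sentences(text)

    # Word weight (assumption): raw frequency of the word over the whole text.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def sentence_weight(sentence):
        # Sentence weight (assumption): mean frequency of its words.
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by weight, highest first.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sentence_weight(sentences[i]),
                    reverse=True)

    # Keep the top sentences, then restore their original order in the text.
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    sample = ("Wheat rust damages wheat crops. The weather was nice. "
              "Farmers treat wheat rust with fungicide. Prices vary.")
    print(summarize(sample, max_sentences=2))
```

Sorting the chosen indices before joining is what keeps the abstract in the order in which the sentences appear in the source text, as the abstract describes.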
Keywords/Search Tags: search engine, Nutch, agriculture, vertical search