Research And Implementation Of Vertical Search Engine Based On Distribution

Posted on:2012-08-28

Degree:Master

Type:Thesis

Country:China

Candidate:Z J Zhao

Full Text:PDF

GTID:2218330371455081

Subject:Computer application technology

Abstract/Summary:

With the rapid expansion of Internet resources worldwide, search engines give people a convenient interface for finding the information they need. General-purpose search engines, however, struggle with the sheer volume of information and with limited search precision and depth, and vertical search engines emerged in response. A vertical search engine is a newer search engine model that is "specialized, intensive and deep": it covers only specific topics, and in return offers fast retrieval, concentrated information, and a high proportion of usable results. Distributed technology, meanwhile, is a good starting point for handling huge amounts of data. Although it may increase system overhead and design complexity, it can greatly improve the efficiency of web crawling and information retrieval, and research on it has considerable commercial value and broad application prospects.

This thesis surveys current search engine technology and, drawing on existing vertical search engines and distributed technology, designs the system architecture of a distributed vertical search engine. The system consists of web page collection, information retrieval, a back-office management system, and other components. It implements web page crawling and information search, and provides recommendation, export, and secondary-development interface features for information integration. In short, the system is powerful, easy to use, stable, and user-friendly.

On the one hand, the web crawler, also known as a web spider, is the system's main source of data and a highly independent module, and it directly affects the quantity and quality of the information collected. Building on the open source project Nutch, the thesis implements a distributed, RMI-based vertical search spider that extracts specific metadata by dynamically analyzing JavaScript and by using XPath. In addition, it can use programmed web parsers to extract metadata, which is eventually processed and stored in the database.

On the other hand, indexing and retrieval are built on the popular open source project Lucene. Following the Hadoop idea of name nodes and task nodes, the distributed components communicate via RPC, and the name node regularly checks which task nodes are available by means of heartbeat detection. An improved version of Lucene's scoring algorithm is applied to ranking, and HSQLDB, a lightweight embedded database written in pure Java, proves highly effective for removing duplicate data. In addition, a feature-rich, attractive front-end web page for information retrieval has been designed for users.
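
The heartbeat-based availability check mentioned above can be illustrated with a minimal sketch. It is not the thesis's implementation, which is Java-based and communicates over RPC. The class name `NameNode`, the method names, the 10-second timeout, and the injectable clock are all assumptions made for illustration.

```python
import time


class NameNode:
    """Tracks task nodes and treats a node as available only if its
    most recent heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout=10.0, clock=time.monotonic):
        self.timeout = timeout   # hypothetical threshold, not taken from the thesis
        self.clock = clock       # injectable clock so the logic can be tested
        self.last_seen = {}      # task node id -> time of its last heartbeat

    def heartbeat(self, node_id):
        """Record a heartbeat; a real system would receive this call via RPC."""
        self.last_seen[node_id] = self.clock()

    def available_nodes(self):
        """Return the ids of task nodes whose last heartbeat is recent enough."""
        now = self.clock()
        return sorted(
            node_id
            for node_id, seen in self.last_seen.items()
            if now - seen <= self.timeout
        )
```

In this sketch each task node would call `heartbeat` periodically, and the name node would consult `available_nodes` before assigning crawling or indexing work. A node that stops reporting drops out of the list once the timeout elapses.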

Keywords/Search Tags:

vertical search engine, distributed, page crawling, information retrieval

Related items

1	Vertical Search Engine For Crawling The Web Page Design And Implementation
2	Design And Implementation Of Travel Vertical Search System Based On MongoDB
3	Research On Topic Web Page Crawling Strategy For Vertical Search Engine
4	Research On Focused Crawling Technique For Vertical Search Engine
5	The Research Of Vertical Search Engine Based On The Education Information
6	A Vertical Search Engine In The Field Of News
7	Study On Focused Crawling Technique For Vertical Search Engine
8	Research And Implementation Of Vertical Search Engine
9	Research And Implementation Of The Strategy-Extensible Search Engine
10	The Optimization And Achieve For Focused Crawling Algorithm Based On The Website Content Framework