
Research And Implementation Of Vertical Search Engine Based On Distribution

Posted on: 2012-08-28
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Zhao
Full Text: PDF
GTID: 2218330371455081
Subject: Computer application technology
Abstract/Summary:
With the rapid expansion of global Internet resources, search engine technology provides a good interface for people to find the information they need. However, general-purpose search engines face many problems, such as the huge amount of information and limited search precision and depth, and the vertical search engine has emerged in response. It is a new search engine model characterized as "specialized, intensive and deep": it covers only specific topics, and it offers high retrieval speed, concentrated information and a high proportion of usable information. Meanwhile, when facing huge amounts of data, distributed technology is a good starting point. Although it may increase system overhead and design complexity, it can greatly improve the efficiency of web crawling and information retrieval, and research on it has tremendous commercial value and broad application prospects.

This thesis studies current search engine technology and, drawing on related vertical search engines and distributed technology, designs the system architecture of a distributed vertical search engine. The system consists of web page collection, information retrieval, a back-office management system and other components, which implement web page crawling and information search and provide features such as recommendations, export and a secondary development interface for information integration. In short, the system is powerful, easy to use, stable and user-friendly.

On the one hand, the web crawler, also known as a web spider, is the main source of system data and a highly independent module, which directly affects the quantity and quality of the information collected. Building on the open source project Nutch, the thesis implements a distributed, RMI-based vertical search spider, which extracts specific metadata through dynamic script (JavaScript) analysis and XPath.
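As an illustration of XPath-based metadata extraction, here is a minimal sketch using only the JDK's built-in `javax.xml.xpath` API. It is not the thesis's spider: the class name `XPathMetaExtractor`, the field names and the XPath rules are hypothetical, and the input is assumed to be well-formed XHTML (real pages would normally need an HTML-cleaning step first, which is omitted here).

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: applies a map of (field name -> XPath expression)
// rules to one well-formed XHTML page and returns the extracted text values.
class XPathMetaExtractor {

    static Map<String, String> extract(String xhtml, Map<String, String> rules) {
        try {
            // Parse the page into a DOM tree (assumes well-formed markup).
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> meta = new LinkedHashMap<>();
            for (Map.Entry<String, String> rule : rules.entrySet()) {
                // evaluate(String, Object) returns the string value of the
                // first matching node, or "" when nothing matches.
                String value = xpath.evaluate(rule.getValue(), doc);
                meta.put(rule.getKey(), value.trim());
            }
            return meta;
        } catch (Exception e) {
            throw new IllegalStateException("Could not extract metadata", e);
        }
    }
}
```

In a vertical crawler, rules of this kind would typically be configured per site or per topic, so that each page yields the same structured fields before being written to the database.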
In addition, the spider can also use a hand-coded web parser to extract metadata, which is eventually processed and stored in the database.

On the other hand, indexing and retrieval build on the popular open source project Lucene. Borrowing the idea of Hadoop name nodes and task nodes, the distributed components communicate via RPC, and the name node periodically checks which task nodes are available by means of heartbeats. An improved version of Lucene's scoring algorithm is used to rank results, and HSQLDB, a lightweight embedded database written in pure Java, proves highly effective for removing duplicate data. In addition, a feature-rich, attractive front-end web page for information retrieval has been designed for users.
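The heartbeat-based availability check can be sketched as follows. This is an assumption-laden illustration, not the thesis's implementation: the class `HeartbeatRegistry`, the node identifiers and the timeout value are hypothetical, and the RPC transport that would deliver the heartbeats is left out. Timestamps are passed in explicitly so the logic can be exercised without real clocks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical name-node bookkeeping: remembers when each task node last
// reported in and treats a node as available only if that report is recent.
class HeartbeatRegistry {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatRegistry(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat from a task node arrives.
    void heartbeat(String nodeId, long nowMillis) {
        lastSeen.put(nodeId, nowMillis);
    }

    // A node is available if it has reported within the timeout window.
    boolean isAvailable(String nodeId, long nowMillis) {
        Long seen = lastSeen.get(nodeId);
        return seen != null && nowMillis - seen <= timeoutMillis;
    }

    // Sorted list of all nodes currently considered available.
    List<String> availableNodes(long nowMillis) {
        List<String> alive = new ArrayList<>();
        for (String nodeId : lastSeen.keySet()) {
            if (isAvailable(nodeId, nowMillis)) {
                alive.add(nodeId);
            }
        }
        Collections.sort(alive);
        return alive;
    }
}
```

A name node could call `availableNodes(System.currentTimeMillis())` on a timer and dispatch crawl or search tasks only to the nodes returned; choosing the timeout as a small multiple of the heartbeat interval is a common way to tolerate an occasional lost message.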
Keywords/Search Tags: vertical search engine, distributed, page crawling, information retrieval