
Research And Design Of Distributed Vertical Search Engine Based On Hadoop

Posted on: 2013-10-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Jie
Full Text: PDF
GTID: 2268330392965643
Subject: Computer application technology
Abstract/Summary:
With the development of the Internet and the increasing maturity of network technology, more and more websites and ever larger amounts of information have appeared online, and the number of users seeking that information keeps growing as well. Against this background, traditional search engines suffer from many problems, such as limited coverage, overly numerous and heterogeneous results, long update cycles, and query ambiguity.

Meanwhile, as information grows more diverse, the retrieval demands of different users vary so widely that a traditional search engine cannot meet them specifically. Moreover, most commercial search engines operating successfully today use a centralized architecture, which places high demands on the performance of a single server, is prone to failure, and scales poorly. To address these shortcomings, distributed vertical search has emerged, offering better performance, easy expansion, comprehensive and in-depth classification, and timely updates.

"Distributed" means that multiple servers form a cluster and cooperate with one another. Vertical search is specialized search over a particular domain, characterized by being specialized, refined, and deep; it reflects the characteristics of an industry and is a segmentation and extension of the general search engine. This project built a distributed cluster with Hadoop and analyzed the source code of the open-source components Nutch and Solr. It then studied the relevant theories and key technologies of search engines and, drawing on existing academic work, refined the approach to issues such as topic relevance determination and web crawling. It also built a steel domain ontology library using domain ontology knowledge and used it to expand user queries, which made locating and searching more accurate. Finally, the project modified the source code of the open-source components according to the Hadoop-based design and implemented a prototype distributed vertical search engine. Comparison with search results from the commercial search engine Baidu, together with analysis and evaluation of the experimental results, shows that the system has a clear topic orientation and much better precision than general search engines.
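
The abstract mentions expanding user queries with a steel domain ontology but gives no details. The following is only a minimal sketch of the general idea of ontology-based query expansion, not the thesis's actual method: the ontology entries, the function names `expand_query` and `to_boolean_query`, and the `max_terms` parameter are all hypothetical and chosen for illustration.

```python
from typing import Dict, List

# Hypothetical toy ontology: each term maps to related terms
# (synonyms or narrower concepts). These entries are illustrative
# assumptions and are not taken from the thesis.
STEEL_ONTOLOGY: Dict[str, List[str]] = {
    "steel": ["carbon steel", "alloy steel", "stainless steel"],
    "stainless steel": ["austenitic steel", "ferritic steel"],
    "rebar": ["reinforcing bar"],
}


def expand_query(query: str, ontology: Dict[str, List[str]],
                 max_terms: int = 5) -> List[str]:
    """Return the normalized query followed by up to max_terms related terms."""
    key = query.strip().lower()
    expanded = [key]
    for related in ontology.get(key, []):
        # Stop once max_terms related terms have been added.
        if len(expanded) - 1 >= max_terms:
            break
        if related not in expanded:
            expanded.append(related)
    return expanded


def to_boolean_query(terms: List[str]) -> str:
    """Join terms with OR, quoting multi-word phrases (Lucene-style syntax)."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return " OR ".join(quoted)


if __name__ == "__main__":
    terms = expand_query("Steel", STEEL_ONTOLOGY)
    print(to_boolean_query(terms))
```

A query that has no entry in the ontology is passed through unchanged, so expansion under this sketch only ever broadens a query within the domain vocabulary. A real system would likely also weight the added terms lower than the original one.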
Keywords/Search Tags: Distributed, Vertical Search, Domain Ontology, Hadoop, Nutch, Solr