
Research On Key Technologies Of Search Engine Based On Hadoop

Posted on: 2016-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wan
Full Text: PDF
GTID: 2348330476455779
Subject: Software engineering

Abstract/Summary:
With the continuous advance of social informatization, every industry is increasingly connected to the internet, and the volume of online information grows dramatically each day. Because the amount of information is so large, the results returned by a general-purpose search engine are enormous in quantity and poorly organized, making it difficult for users to find the information they are looking for quickly and accurately. The ranking of results in general search engines is also influenced by commercial interests, and the sheer mass of web data cannot reflect the specificity of a query's keywords. Vertical search engines, which target the specific needs of a specific industry, therefore emerged at the right moment. However, most small and medium-sized enterprises, unlike internet companies, lack the resources to build an efficient search product such as Baidu or Google, while the general search engines on the internet cannot fully satisfy their search needs: their existing information cannot be searched, and for security reasons their internal information cannot be submitted to a general search engine for indexing.
For small and medium-sized enterprises that hold data but have limited development capability, providing an effective search and computation service, freeing them from the troubles of lost information and information overload, is a popular and difficult problem in search research.

This paper first reviews the development history of search engines and the popular open-source components Hadoop, Nutch, and Solr. It then studies the PageRank link-analysis algorithm: after analyzing the algebraic principle of the traditional PageRank algorithm and its distributed implementation, and targeting the defect of its uniformly distributed transition values, the paper proposes a PR strategy whose transition probability distribution combines web page content with link relations. After a detailed introduction to the internal structure and functions of the distributed Nutch crawler module, the crawler is combined with the Hadoop platform so that web data is downloaded concurrently via MapReduce, achieving fast crawling of large numbers of web pages.

Data retrieval and indexing are key parts of a search engine, and their performance directly affects the enterprise's data processing and the user experience. Solr, which encapsulates the Lucene search interface, has strong searching and indexing capability. After introducing the architecture and features of Solr, we design a distributed cluster framework that uses Solr as the retrieval tool; with its extensibility and fault tolerance, the framework can handle massive retrieval requests, realizing a distributed vertical search engine. Experiments comparing it with stand-alone search show that once the number of web pages reaches 20 thousand or more, the distributed system spends far less time on data acquisition, data indexing, and data search than the stand-alone system.
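The content-weighted transition idea described above can be sketched as follows. This is a minimal illustrative sketch, not the thesis's actual formula or data: the toy graph, the relevance weights, and the damping factor of 0.85 are all assumptions. Instead of spreading a page's rank uniformly over its out-links (the "average distributed value" the thesis criticizes), each out-link receives a share proportional to a content-relevance weight.

```python
DAMPING = 0.85  # standard PageRank damping factor (assumed here)

def weighted_pagerank(links, weights, iterations=50):
    """Iterative PageRank with content-weighted transition probabilities.

    links:   {page: [out_pages]} adjacency lists
    weights: {(src, dst): relevance} content-relevance weights;
             missing edges default to weight 1.0 (uniform behavior)
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # teleportation term: (1 - d) / n for every page
        new_rank = {p: (1.0 - DAMPING) / n for p in pages}
        for src, outs in links.items():
            total_w = sum(weights.get((src, d), 1.0) for d in outs)
            if not outs or total_w == 0:
                # dangling page: spread its rank uniformly over all pages
                for p in pages:
                    new_rank[p] += DAMPING * rank[src] / n
                continue
            for dst in outs:
                # transition probability proportional to content relevance,
                # not the uniform 1/outdegree of classical PageRank
                share = weights.get((src, dst), 1.0) / total_w
                new_rank[dst] += DAMPING * rank[src] * share
        rank = new_rank
    return rank

# Toy example: page "a" links to "b" and "c", but "b" is judged
# three times more topically relevant, so it receives a larger share.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
weights = {("a", "b"): 3.0, ("a", "c"): 1.0}
ranks = weighted_pagerank(links, weights)
```

Passing an empty `weights` dict makes every edge weight 1.0, recovering classical uniform PageRank, which makes it easy to compare the two rankings on the same graph.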
In addition, the distributed search system is highly customizable and easy to extend, which guarantees system stability and security and satisfies the search requirements of small and medium-sized enterprises well, bringing the enterprise a cost-effective search service.
Keywords/Search Tags: search engine, distributed, Hadoop, Solr, PageRank