
Research And Implementation Of Trusted Search Engine Based On Nutch

Posted on: 2016-05-08
Degree: Master
Type: Thesis
Country: China
Candidate: W B Tian
Full Text: PDF
GTID: 2348330503954613
Subject: Electronic and communication engineering
Abstract/Summary:
With the development of Internet technology and the spread of personal computers, how to obtain needed information quickly, accurately, and easily from such a huge repository has become an important problem for Internet users. The search engine has become the most widely used information retrieval tool. It draws on theory and technology from information retrieval, data mining, distributed processing, and natural language processing, which makes it a comprehensive and challenging field.

In recent years, blogs, WeChat, microblogs, and similar media have developed rapidly, and the threshold for ordinary users to publish information keeps falling. Redundant, obsolete, and false information floods the Internet, yet traditional search engines cannot reliably tell good information from bad. As the volume of information grows, the credibility of search results declines. Improving the reliability of search results is therefore an important direction for the future development of search engines.

This thesis focuses on improving the reliability of search results. Existing search engines suffer from problems such as overemphasis on older (historical) pages and topic drift. I improve the algorithms of the acquisition subsystem and the retrieval subsystem to make results more reliable, and I complete the design and deployment of a trusted search engine. The main work falls into three parts.

1) Trusted data acquisition subsystem. Building on open-source software, this thesis proposes the TS algorithm. The improvement is based on index properties: on top of link-based scoring, it adds three properties, namely webpage creation time, web depth, and average click rate, so that the evaluation factors are diversified. This addresses the shortcomings of the OPIC algorithm, which overemphasizes historical pages and is easy to cheat, and makes search results more reasonable.

2) Trusted retrieval subsystem. Starting from the principle of similarity calculation in the vector space model, the Lucene scoring formula is constructed and combined with the webpage score produced by the TS algorithm. The new retrieval algorithm lets the webpage scores from the data acquisition subsystem be reflected in the search results, which completes the design of the trusted retrieval subsystem.

3) Implementation of the trusted search engine. Based on the trusted data acquisition subsystem and the trusted retrieval subsystem, the trusted search engine is deployed.

Finally, the trusted search engine is tested and the results are analyzed. The analysis shows that the trusted search engine mitigates the original algorithm's overemphasis on historical pages and its poor timeliness, and improves the credibility of search results.
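The abstract does not give the TS formula, so the following is only an illustrative sketch of the general idea: a link-based score (such as one produced by OPIC) is blended with creation time, web depth, and average click rate, and the resulting page score then scales a vector-space similarity value at query time. The names `Page`, `trust_score`, and `final_score`, the weights, the half-life decay, and the multiplicative combination are all assumptions made for illustration; they are not taken from the thesis, Nutch, or Lucene.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical per-page signals; field names are illustrative only."""
    link_score: float   # link-based score from the crawler (e.g. an OPIC-style value), >= 0
    age_days: float     # days since the page was created
    depth: int          # link depth from the crawl seed (0 = seed page)
    click_rate: float   # average click rate, assumed to lie in [0, 1]


def trust_score(page: Page,
                half_life_days: float = 180.0,
                w_link: float = 0.4,
                w_fresh: float = 0.3,
                w_depth: float = 0.1,
                w_click: float = 0.2) -> float:
    """Blend four signals into one score in [0, 1].

    The weights and the decay half-life are assumed values, not the
    parameters used in the thesis.
    """
    link = page.link_score / (1.0 + page.link_score)      # squash to [0, 1)
    freshness = 0.5 ** (page.age_days / half_life_days)   # newer pages score higher
    shallowness = 1.0 / (1.0 + page.depth)                # pages near the seed score higher
    return (w_link * link
            + w_fresh * freshness
            + w_depth * shallowness
            + w_click * page.click_rate)


def final_score(similarity: float, page: Page) -> float:
    """Scale a query-document similarity by the page's trust score (assumed combination)."""
    return similarity * trust_score(page)


if __name__ == "__main__":
    old = Page(link_score=5.0, age_days=3650, depth=1, click_rate=0.1)
    new = Page(link_score=5.0, age_days=30, depth=1, click_rate=0.1)
    # With identical link scores, the newer page ranks above the older one.
    print(final_score(0.8, old) < final_score(0.8, new))
```

Under these assumptions, two pages with the same link-based score are no longer tied: the older page is penalized by the freshness term, which is one simple way to counter a bias toward historical pages.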
Keywords/Search Tags:Trusted search, OPIC algorithm, web crawler, Nutch, TS algorithm