
Design And Implementation Of The Sohu News Search Engine Based On Crawler

Posted on: 2013-03-06  Degree: Master  Type: Thesis
Country: China  Candidate: W Y Lin  Full Text: PDF
GTID: 2268330392453794  Subject: Software engineering
Abstract/Summary:
As the volume of information on the internet grows rapidly, search engine technology, which lets users quickly locate useful information, has become a major concern among internet users. The Sohu News search engine was developed in response to this need.

For general users, commercial search engines are sufficient. For specialized users, such as small businesses and scientific research agencies, commercial search engines fall short: their results are poorly targeted, and they cannot be adapted to a user's particular requirements. Open-source software such as Lucene fills this gap, because developers can use it to build domain-specific search engines tailored to different users. The system described in this thesis is designed and implemented on top of open-source software.

First, the thesis introduces search engines, including their history, development trends, and classification. Second, it presents the system requirements analysis and specifies the functional and non-functional requirements. Third, it completes the design of the system framework and the related system architecture. Finally, it designs and implements each functional module.

Many functions of the Sohu News search engine are customized, including the Heritrix data-fetching module, the HTMLParser data-preprocessing module, the index and database data-generation modules, and the core search module.

To improve the user experience, the thesis develops an improved page-ranking algorithm that builds on the Lucene text-matching algorithm and the PageRank algorithm and also takes into account the time factor that matters for a news search engine. In addition, it works out a scheme for the algorithm based on Lucene and Hadoop distributed storage and distributed computing, so that search results are more reasonable and accurate.
Keywords/Search Tags: Search engine, sorting algorithm, Lucene, Hadoop
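
The abstract does not give the formula of the improved ranking algorithm, so the following is only a minimal sketch of the general idea it describes: blending a text-matching score with a link-based score and then discounting older news. Everything specific here is an assumption made for illustration and is not taken from the thesis: the `NewsHit` fields, the linear blend weight `alpha`, the exponential half-life decay, and the premise that both input scores are already normalized to the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass
class NewsHit:
    url: str
    text_score: float  # text-matching relevance, assumed normalized to [0, 1]
    pagerank: float    # link-based score, assumed normalized to [0, 1]
    age_hours: float   # hours elapsed since the article was published


def combined_score(hit, alpha=0.7, half_life_hours=24.0):
    """Blend relevance and link score, then apply a freshness discount.

    alpha and half_life_hours are hypothetical tuning parameters.
    """
    relevance = alpha * hit.text_score + (1.0 - alpha) * hit.pagerank
    # Exponential decay: the score halves every half_life_hours.
    freshness = 0.5 ** (hit.age_hours / half_life_hours)
    return relevance * freshness


def rank(hits, **params):
    """Return hits ordered from highest to lowest combined score."""
    return sorted(hits, key=lambda h: combined_score(h, **params), reverse=True)


old = NewsHit("http://example.com/old", text_score=0.9, pagerank=0.8, age_hours=72.0)
fresh = NewsHit("http://example.com/fresh", text_score=0.6, pagerank=0.5, age_hours=2.0)
ranked = rank([old, fresh])
```

With these sample values the fresher article ranks first even though the older one matches the query better, which is the kind of behavior a time factor is meant to produce in news search. A multiplicative decay is only one possible design; an additive freshness bonus would be an equally plausible reading of the abstract.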