Topic Search Technology Application For Web Information Resources

Posted on: 2015-01-22
Degree: Master
Type: Thesis
Country: China
Candidate: L J Wang
Full Text: PDF
GTID: 2298330452494405
Subject: Computer application technology
Abstract/Summary:
With the development of network technology, information processing has become more important, and the shortcomings of general search engines have become increasingly obvious. Massive amounts of information are generated on the Internet every day, but general search engines need a long time to index new information, and their personalized services cannot meet growing user needs. Topic search is a good solution to this problem.

This thesis first studies general search engines, including their development, the basic technologies behind them, and some important literature. It then focuses on the key technologies of topic-specific search engines and compares them with those of general search engines. On this basis, it makes an in-depth analysis of topic search engines.

Building on that analysis, this study redesigns several modules of a topic search engine based on Nutch: topic discovery, deletion of duplicated web pages, and Chinese word segmentation. With the redesign, the topic search engine achieves high precision, high recall, and high efficiency. The main work and innovations of the thesis are as follows:

1) PageRank and HITS are mature algorithms in the field of Web crawler link analysis. However, these algorithms have some problems, and this thesis suggests countermeasures. It proposes a new Web crawler strategy, IPR (Improved PageRank), which preserves the important information related to a given topic.

2) Websites of all kinds abound, and the same content may be reported by several of them. This causes the same Web page to be crawled and accessed repeatedly, so deletion of duplicated web pages is a focus of this thesis. The thesis studies Web information extraction and Web page similarity comparison in order to reduce the repetition rate of stored Web pages.

3) Chinese word segmentation is one of the key technologies of a Chinese search engine. Good Chinese word segmentation gives the search engine a high recall rate and high efficiency. This thesis proposes an improved maximum matching word segmentation algorithm, IMMM, which is combined with the topic search engine to improve segmentation accuracy.
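
The abstract does not describe how IMMM works, so it cannot be reproduced here. As background only, the following is a minimal sketch of the classic forward maximum matching baseline that such improvements usually start from. The function name, the toy dictionary, and the `max_len` parameter are illustrative assumptions, not taken from the thesis.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment `text` greedily, always taking the longest dictionary word.

    This is plain forward maximum matching, NOT the thesis's IMMM algorithm.
    `dictionary` is a set of known words; unknown characters are emitted
    as single-character tokens.
    """
    if max_len is None:
        # Longest dictionary entry bounds the window size (1 if empty).
        max_len = max((len(w) for w in dictionary), default=1)

    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    # Hypothetical toy dictionary for illustration.
    words = {"研究", "研究生", "生命", "起源"}
    print(forward_max_match("研究生命起源", words))
```

With this toy dictionary the greedy choice takes "研究生" first and leaves "命" stranded, although "研究 / 生命 / 起源" is the intended reading. Ambiguities of this kind are the usual motivation for improved maximum matching variants.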
Keywords/Search Tags: Topic Search, Web Crawler, Theme Determination, Deletion of Duplicated Web Pages, Chinese Word Segmentation