Research And Implementation On Key Techniques Of Topic Search Engine

Posted on: 2011-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: X Sun
Full Text: PDF
GTID: 2178360305481862
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of information on the network, large numbers of duplicate messages and spam have appeared on the Web, making useful information difficult to find. General search engines face challenges in the scale of information collection, update speed, the need for specialization, and so on. To overcome these limits, topical search engines, which target specific themes and personalized information retrieval, have emerged. The search engine based on a topical crawler (the fourth generation of search engine) has become both a hotspot and a difficult problem in current search engine and Web data mining research, and it is the focus of this paper.

We first briefly introduce the components of general search engines and describe their operating principle in detail. We then elaborate on several key technologies of the topical search engine: the topical crawler, information extraction, text classification, and web page ranking. In studying several text classification algorithms, this paper improves the Naive Bayes algorithm. Because keywords that appear inside certain HTML tags of a Web page reflect the page's theme better, the improved algorithm assigns these words larger weight factors. Experiments and data analysis show that the improved Naive Bayes algorithm considerably improves classification accuracy.

This paper focuses on the search strategies of the topical crawler and discusses content-based and link-based search strategies in turn. To address the topic isolated island problem among web pages, a new URL search strategy based on both content and link analysis is proposed, which lets the web spider pass through tunnels to crawl more topic-relevant pages. In this way, the proposed search strategy increases the coverage of topical resources while better avoiding topic drift.

Finally, we evaluate the proposed URL search algorithm experimentally. Using the ODP category index as the experimental environment, the breadth-first strategy, the best-first search strategy, and the proposed URL search strategy based on content and link analysis are evaluated and compared with each other. The results show that the proposed search algorithm improves target recall, so that the topical search engine returns more topic-relevant pages while maintaining precision.
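The abstract does not spell out the proposed URL search strategy, so the following is only a minimal sketch of the general idea of tunneling in a best-first crawler, not the thesis's algorithm. It assumes an in-memory link graph and precomputed page relevance scores; the names `crawl`, `threshold`, `max_tunnel`, and `decay` are hypothetical, as are all values in the toy data. An off-topic page is still expanded as long as no more than `max_tunnel` off-topic pages occur in a row, which lets the crawler reach a relevant page that sits behind irrelevant ones.

```python
import heapq

def crawl(graph, relevance, seeds, threshold=0.5, max_tunnel=2, decay=0.5):
    """Best-first crawl that tolerates up to `max_tunnel` consecutive
    off-topic pages (illustrative parameters, not taken from the thesis)."""
    # Heap entries: (negative priority, url, off-topic pages in a row before url)
    frontier = [(-1.0, url, 0) for url in seeds]
    heapq.heapify(frontier)
    visited, relevant = set(), []
    while frontier:
        neg_priority, url, off_topic_run = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        score = relevance.get(url, 0.0)
        if score >= threshold:
            relevant.append(url)
            run, child_priority = 0, score
        else:
            run = off_topic_run + 1
            if run > max_tunnel:
                continue  # tunnel limit exceeded: do not expand this page
            # Links found on off-topic pages inherit a decayed priority
            child_priority = -neg_priority * decay
        for child in graph.get(url, []):
            if child not in visited:
                heapq.heappush(frontier, (-child_priority, child, run))
    return relevant

# Toy link graph: "target" is relevant but only reachable through "x" and "y"
graph = {"seed": ["a", "x"], "a": [], "x": ["y"], "y": ["target"], "target": []}
relevance = {"seed": 0.9, "a": 0.8, "x": 0.1, "y": 0.2, "target": 0.95}

print(crawl(graph, relevance, ["seed"], max_tunnel=2))
print(crawl(graph, relevance, ["seed"], max_tunnel=0))
```

With `max_tunnel=0` the sketch behaves like a plain best-first crawler that never expands off-topic pages, so the isolated `target` page is missed; with `max_tunnel=2` it is reached. A larger tunnel limit raises coverage at the cost of fetching more off-topic pages, which is the recall versus topic drift trade-off the abstract describes.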
Keywords/Search Tags: Topical Search Engine, Web Crawler, URL Search Strategy, Topic Isolated Island, Naive Bayes Classifier
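The tag-weighted Naive Bayes idea mentioned in the abstract can be illustrated with a small sketch. This is one possible reading, not the thesis's implementation: a multinomial Naive Bayes classifier with Laplace smoothing in which each word occurrence counts with a weight that depends on the HTML tag it appeared in. The weight values in `TAG_WEIGHTS`, the `(tag, word)` input format, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical weight factors; the abstract does not give the actual values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "body": 1.0}

def weighted_counts(tokens):
    """Sum tag weights per word; `tokens` is a list of (tag, word) pairs."""
    counts = Counter()
    for tag, word in tokens:
        counts[word] += TAG_WEIGHTS.get(tag, 1.0)
    return counts

def train(labeled_pages):
    """Collect per-class document counts and weighted word totals."""
    class_docs = Counter()
    word_totals = defaultdict(Counter)
    vocab = set()
    for label, tokens in labeled_pages:
        class_docs[label] += 1
        counts = weighted_counts(tokens)
        word_totals[label].update(counts)
        vocab.update(counts)
    return class_docs, word_totals, vocab

def classify(model, tokens):
    """Return the class with the highest weighted log-posterior."""
    class_docs, word_totals, vocab = model
    n_docs = sum(class_docs.values())
    counts = weighted_counts(tokens)
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total = sum(word_totals[label].values())
        score = math.log(class_docs[label] / n_docs)
        for word, weight in counts.items():
            # Laplace-smoothed word likelihood from weighted counts
            p = (word_totals[label][word] + 1.0) / (total + len(vocab))
            score += weight * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

pages = [
    ("sports", [("title", "football"), ("body", "match"), ("body", "python")]),
    ("tech", [("title", "python"), ("body", "code"), ("body", "football")]),
]
model = train(pages)
print(classify(model, [("title", "python"), ("body", "football")]))
```

In this toy data both classes contain "python" and "football" once each, so they differ only in which word appears in the title. The larger title weight is what separates the two classes here.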