Research And Implementation On Key Techniques Of Topic Search Engine

Posted on:2011-05-27

Degree:Master

Type:Thesis

Country:China

Candidate:X Sun

GTID:2178360305481862

Subject:Computer application technology

Abstract/Summary:

With the rapid growth of information on the network, large numbers of duplicate messages and spam have appeared on the Web, making useful information hard to find. General search engines face challenges in collection scale, update speed, and the need for specialization. To overcome these limits, topical search engines, which target specific themes and personalized information retrieval, have emerged. The search engine based on a topical crawler (the fourth generation of search engine) has become both a hotspot and a difficult problem in current search engine and Web data mining research, and the research in this paper centers on it.

We first briefly introduce the components of general search engines and describe their operating principles in detail. We then elaborate on several key technologies of the topical search engine: the topical crawler, information extraction, text classification, and web page ranking. In studying several text classification algorithms, this paper improves the Naive Bayes algorithm. Because keywords inside a number of HTML tags in a Web page reflect the theme of the page better, the improved algorithm gives these words larger weight factors. Experiments and data analysis show that the improved Naive Bayes algorithm considerably enhances classification accuracy.

This paper focuses on the search strategies of the topical crawler and discusses content-based and link-based search strategies in turn. To address the topic isolated island problem among web pages, a new URL search strategy based on content and link analysis is proposed, which lets the web spider pass through tunnels to crawl more topic-relevant pages.
In this way, the proposed search strategy increases the coverage of topical resources while better avoiding topic drift.

Finally, we evaluate the proposed URL search algorithm experimentally. Using the ODP category index as the experimental environment, we compare the breadth-first strategy, the best-first search strategy, and the proposed URL search strategy based on content and link analysis. The results show that the proposed search algorithm improves target recall, so the topical search engine returns more topic-relevant pages while maintaining precision.
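
The tag-weighting idea behind the improved Naive Bayes classifier can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract gives neither the tags nor the weight values, so the `TAG_WEIGHTS` table, the `TagWeightedNaiveBayes` class, and the toy documents are all assumptions made for illustration. The sketch uses a multinomial Naive Bayes with Laplace smoothing in which each term occurrence counts as its tag weight instead of as 1.

```python
import math
from collections import defaultdict

# Hypothetical weight factors; the abstract does not state the actual values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "body": 1.0}


class TagWeightedNaiveBayes:
    """Multinomial Naive Bayes whose term counts are scaled by tag weights."""

    def __init__(self, tag_weights=None, alpha=1.0):
        self.tag_weights = tag_weights or TAG_WEIGHTS
        self.alpha = alpha  # Laplace smoothing constant
        self.class_docs = defaultdict(int)  # documents seen per class
        self.term_weight = defaultdict(lambda: defaultdict(float))
        self.total_weight = defaultdict(float)  # summed term weight per class
        self.vocab = set()

    def _weighted_terms(self, doc):
        # doc is a list of (tag, term) pairs; unknown tags get weight 1.0
        for tag, term in doc:
            yield term, self.tag_weights.get(tag, 1.0)

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            self.class_docs[label] += 1
            for term, w in self._weighted_terms(doc):
                self.term_weight[label][term] += w
                self.total_weight[label] += w
                self.vocab.add(term)
        return self

    def predict(self, doc):
        n_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.class_docs.items():
            score = math.log(count / n_docs)  # log prior
            denom = self.total_weight[label] + self.alpha * vocab_size
            for term, w in self._weighted_terms(doc):
                num = self.term_weight[label].get(term, 0.0) + self.alpha
                score += w * math.log(num / denom)  # weighted log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training pages (invented for illustration), given as (tag, term) pairs.
train_docs = [
    [("title", "python"), ("body", "code"), ("body", "football")],
    [("title", "football"), ("body", "match"), ("body", "code")],
]
train_labels = ["programming", "sports"]
clf = TagWeightedNaiveBayes().fit(train_docs, train_labels)
print(clf.predict([("title", "python"), ("body", "football")]))
```

In this toy setup the word in the title dominates the decision: a page titled "python" that merely mentions "football" in its body is labeled `programming`, while swapping the two tags flips the label to `sports`.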
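
The tunneling idea used against the topic isolated island problem can also be sketched in a few lines. This is a generic best-first crawl with bounded tunneling, not the URL search strategy proposed in the thesis, whose scoring formula the abstract does not give. The function name `crawl_with_tunneling`, the parameters `threshold`, `max_tunnel`, and `decay`, and the in-memory `links` and `relevance` dictionaries (stand-ins for page fetching and a topic classifier) are all assumptions.

```python
import heapq


def crawl_with_tunneling(seeds, links, relevance, threshold=0.5,
                         max_tunnel=2, decay=0.5, budget=100):
    """Best-first crawl over an in-memory graph with bounded tunneling.

    links:     dict mapping URL -> list of outgoing URLs (stands in for fetching)
    relevance: dict mapping URL -> topic score in [0, 1] (stands in for a classifier)
    """
    # Heap entries: (negated priority, insertion order, url, off-topic hops so far)
    frontier = []
    counter = 0
    seen = set(seeds)
    for url in seeds:
        heapq.heappush(frontier, (-1.0, counter, url, 0))
        counter += 1

    relevant, visited = [], 0
    while frontier and visited < budget:
        neg_priority, _, url, hops = heapq.heappop(frontier)
        visited += 1
        score = relevance.get(url, 0.0)
        if score >= threshold:
            relevant.append(url)
            hops = 0  # back on topic: reset the tunnel length
            child_priority = score
        else:
            hops += 1
            if hops > max_tunnel:
                continue  # tunnel too long: do not follow this page's links
            # Keep following links, but with a decayed priority
            child_priority = -neg_priority * decay
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                heapq.heappush(frontier, (-child_priority, counter, child, hops))
                counter += 1
    return relevant


# Toy graph (invented): "island" is on topic but reachable only via two off-topic pages.
links = {"seed": ["a", "b"], "b": ["c"], "c": ["island"]}
relevance = {"seed": 1.0, "a": 0.9, "b": 0.1, "c": 0.2, "island": 0.8}
print(crawl_with_tunneling(["seed"], links, relevance, max_tunnel=2))
print(crawl_with_tunneling(["seed"], links, relevance, max_tunnel=0))
```

With `max_tunnel=2` the crawl reaches the isolated page and returns `['seed', 'a', 'island']`; with `max_tunnel=0` it behaves like a plain best-first crawl that abandons off-topic pages and returns `['seed', 'a']`. Capping and decaying the tunnel is one simple way to gain coverage without drifting far off topic.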

Keywords/Search Tags:

Topical Search Engine, Web Crawler, URL Search Strategy, Topic Isolated Island, Naive Bayes Classifier

Related items

1	The Design And Research Of Topic Web Crawler In Vertical Search Engine
2	The Topical Web Crawler Research In Vertical Search Engine
3	The Design And Implementation Of Topical Search Engine
4	Research And Implementation Of Scientific Topic Search Engine Crawler Based On Nutch
5	The Research Of Topical Crawler Search Strategy In Web Page
6	Research On An Algorithm Of Focused Crawler In Vertical Search Engine
7	The Research And Implementation Of Topical Web Crawler Based On Improved Shark-Search Algorithm
8	Research On Topic Search Engine Based On Shark Optimization Algorithm
9	Research On Topical Crawler Combining Web Page Content And Hyperlink
10	The Pests And Insects Topical Search Engine Research Based On Distributed Acquisition Strategy