A Focused Crawler Based On Statistical Machine Translation And Topic Propagation

Posted on:2014-01-31

Degree:Master

Type:Thesis

Country:China

Candidate:X Y Gan

Full Text:PDF

GTID:2268330395989221

Subject:Computer application technology

Abstract/Summary:

As search engines have developed, the Internet has become increasingly convenient to use: a search engine makes it easy to gather a large amount of relevant information. The Internet contains a massive amount of data, however, and modern search engines index only part of it. The crawler is a vital component of a search engine. How to crawl more of the pages that users are interested in with limited resources has become a popular topic in both industry and academic research, and focused crawlers have emerged in response.

This thesis studies and implements a focused crawler based on topic relevance prediction of anchor text and topic propagation in link structures. Because anchor text is usually short, existing focused crawlers that use anchor text introduce the concept of link context, which includes the text surrounding a given anchor text. However, link context may attach relevant context to an irrelevant anchor text. We regard topic relevance prediction for a given anchor text as an encoding process in the sense of the noisy-channel coding theorem, and on that basis propose a prediction method based on statistical machine translation. A crawler that uses only page content such as anchor text is likely to discard pages that are not themselves relevant to the topic but contain a large number of relevant links. Analyzing the web link structure is one way to alleviate this problem. Building on previous work, we propose a focused crawling algorithm based on topic propagation and integrate the anchor text topic relevance prediction into its framework.

Finally, we implemented a focused crawler prototype and compared the proposed algorithm with several other focused crawling algorithms. The experimental results show that the proposed algorithm achieves some improvement.
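
The abstract does not give the details of either component, so the following Python sketch is only an illustration of the two ideas it names, not the thesis's actual method. It assumes a hypothetical word-translation table `t_prob` (in the spirit of IBM Model 1) for scoring anchor text against topic terms, and it stands in for topic propagation with a simple rule: a link's priority blends its own anchor score with the priority its parent page was crawled at. The function names, the `decay` weight, and the in-memory `graph` are all assumptions made for the example.

```python
import heapq
import math


def anchor_score(anchor_tokens, topic_terms, t_prob, floor=1e-4):
    """Geometric-mean probability that the topic terms 'generate' the anchor tokens.

    t_prob[(anchor_word, topic_word)] is a hypothetical translation table;
    unseen word pairs fall back to `floor`.
    """
    if not anchor_tokens or not topic_terms:
        return floor
    total = 0.0
    for word in anchor_tokens:
        # Average over topic terms, i.e. a uniform alignment prior.
        p = sum(t_prob.get((word, t), floor) for t in topic_terms) / len(topic_terms)
        total += math.log(p)
    return math.exp(total / len(anchor_tokens))


def crawl_order(graph, seed, topic_terms, t_prob, decay=0.5, limit=10):
    """Best-first traversal of an in-memory link graph.

    graph maps url -> list of (anchor_text, target_url).
    A link's priority is (1 - decay) * anchor score + decay * parent's priority,
    so relevance inherited from a parent page propagates along its links.
    """
    # Heap entries: (negated priority, url, priority); the seed starts at 1.0.
    frontier = [(-1.0, seed, 1.0)]
    visited, order = set(), []
    while frontier and len(order) < limit:
        _, url, inherited = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for anchor, target in graph.get(url, []):
            if target in visited:
                continue
            score = anchor_score(anchor.lower().split(), topic_terms, t_prob)
            priority = (1 - decay) * score + decay * inherited
            heapq.heappush(frontier, (-priority, target, priority))
    return order


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    topics = ["translation", "smt"]
    table = {("translation", "translation"): 0.9, ("machine", "smt"): 0.5}
    links = {
        "seed": [("machine translation", "a"), ("cheap flights", "b")],
        "a": [("holiday photos", "c")],
        "b": [("weather", "d")],
    }
    print(crawl_order(links, "seed", topics, table))
```

In this toy setup, a page linked from a relevant parent is visited before an equally uninformative link from an irrelevant parent, which is the effect that propagating relevance through the link structure is meant to achieve. A real crawler would replace the in-memory graph with fetched pages and learn the translation table from data.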

Keywords/Search Tags:

Focused Crawler, Anchor Text, Statistical Machine Translation, Topic Propagation, Topic Relevance Prediction, Text Classification

Related items

1	Design And Implementation Of Focused Crawler For Blogs
2	Focused Crawling
3	Based On The Theme Of The Html Tags Crawler Design And Realization
4	The Research And Implement Of Topic-focused Web Crawler Based On SVM Classification Algorithm
5	Research And Implementation Of Emotional Classification Of Microblog Text Based On Topic Relevance
6	Research On English Text Summarization And Machine Translation Based On Machine Learning
7	Research And Design Of Machinery-Text Acquisition And Classification
8	Research And Implementation Of Focused Crawler Based On Word2Vec
9	Research On Topic Focused Web Crawler And Related Technologies
10	Design And Implemention Of Focused Crawler To Application Store