
Research On The Key Technology Of Focused Crawler

Posted on: 2016-11-17
Degree: Master
Type: Thesis
Country: China
Candidate: C R Wang
Full Text: PDF
GTID: 2308330464465775
Subject: Computer application technology
Abstract/Summary:
The Internet is a sharing platform holding a vast amount of information, and users find information on it through search engines. As users increasingly demand high-quality, personalized information, general-purpose search engines, with their coarse and generic query results, can no longer meet that demand, and domain-specific vertical search engines have emerged. The topic crawler is an important part of a vertical search engine: it gathers domain-specific resources for the engine, and its performance directly affects the quality of the engine. This paper focuses on the key technologies involved in topic crawlers. The main work of this paper is as follows:
(1) Conventional keyword-based topic description models use a large number of keywords with no correlation among them, which reduces the accuracy of the topic description. To address this problem, a keyword set is obtained by training on topic pages; the set is then expressed with level (hierarchy) information and merged using a thesaurus, which reduces the dimensionality and improves the accuracy of the topic description.
(2) This paper analyzes the traditional TF-IDF weight calculation method. To address its "equal" treatment of keywords and its poor class discrimination for high-frequency words, a position-information function and an adjustment factor are introduced into the weight calculation, yielding an improved relevance calculation method named M-TFIDF, which mainly improves the precision of term weights. The topic of a page is then judged from the angle between the page vector and the topic vector.
(3) This paper studies Shark-Search, a web search strategy based on page content, and HITS, a web search strategy based on links. In view of the "myopia" of Shark-Search (its lack of a global view) and the topic drift of HITS, this paper proposes a combined web search strategy, M-SH. The new strategy mitigates the limitations of Shark-Search and HITS and adds URL topic relevance prediction and improved anchor text handling, which improves the precision of URL topic relevance prediction.
(4) Finally, experiments are carried out in two settings, offline and online: one evaluates topic relevance judgment, and the other builds on the open-source web crawler NWeb Crawler through secondary development. With term consolidation introduced, M-TFIDF and M-SH are each compared with the original methods; the calculated harvest rate, recall rate, and F value show that the methods proposed in this paper are effective.
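The weighting, relevance judgment, and evaluation steps summarized in items (2) and (4) can be illustrated with a minimal Python sketch. It is not the thesis's M-TFIDF: the abstract does not give the position function or the adjustment factor, so the `POSITION_WEIGHT` values, the smoothed IDF formula, and all function names here are assumptions chosen only for illustration. Relevance is measured by the cosine of the angle between the page vector and the topic vector, and harvest rate and F value follow their usual definitions.

```python
import math
from collections import Counter

# Hypothetical position weights; the abstract does not state the actual values.
POSITION_WEIGHT = {"title": 3.0, "anchor": 2.0, "body": 1.0}


def weighted_tf(fields):
    """Count terms, weighting each occurrence by where it appears on the page.

    `fields` maps a position name ("title", "body", ...) to a list of tokens.
    """
    tf = Counter()
    for position, tokens in fields.items():
        weight = POSITION_WEIGHT.get(position, 1.0)
        for token in tokens:
            tf[token] += weight
    return tf


def idf(term, corpus):
    """Smoothed inverse document frequency over a list of token sets (assumed form)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + df)) + 1.0


def page_vector(fields, corpus):
    """Build a sparse term-weight vector: position-weighted TF times IDF."""
    return {t: c * idf(t, corpus) for t, c in weighted_tf(fields).items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def harvest_rate(relevant_crawled, total_crawled):
    """Fraction of crawled pages that are on topic."""
    return relevant_crawled / total_crawled if total_crawled else 0.0


def f_value(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy corpus and topic vector, invented for this example.
corpus = [{"crawler", "topic", "web"}, {"football", "match"}, {"crawler", "search"}]
topic = {"crawler": 1.0, "topic": 1.0}
on_topic = page_vector({"title": ["topic", "crawler"], "body": ["web", "crawler"]}, corpus)
off_topic = page_vector({"title": ["football"], "body": ["match", "web"]}, corpus)
```

In this sketch a crawler would keep a page when `cosine(topic, page)` exceeds some threshold; the threshold itself is a tuning choice that the abstract does not specify.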
Keywords/Search Tags:topic crawler, topic description, correlation calculation, topic prediction