
Research On The Key Technology Of Theme Crawler

Posted on: 2014-09-17
Degree: Master
Type: Thesis
Country: China
Candidate: Z D Huang
Full Text: PDF
GTID: 2268330425966543
Subject: Computer software and theory
Abstract/Summary:
The rapid development of the Internet has made the dissemination and publication of information ever faster. The volume of information on the network has grown so large that information retrieval has become increasingly difficult. Fortunately, users can rely on search engines for rapid retrieval, and search engines have become an everyday tool. The web crawler, one of the key components of a search engine, is mainly responsible for collecting web pages from the Internet. The quality of a search engine's service depends largely on the crawler's crawling performance and on the quality of the pages it collects, so the crawler system is well worth studying and improving. In recent years, the sheer size of the network has placed an increasing burden on general-purpose crawlers. A theme crawler, by contrast, targets a specific subject area and collects only the information users require, which also lets it run more efficiently. Theme crawlers have therefore attracted widespread attention, and new work in this area has both high research value and practical value. This thesis focuses on the technologies and characteristics involved in theme crawlers.
The main work and results are as follows:
(1) Implemented an improved PageRank algorithm. The improved algorithm partitions the web pages of the Internet into a number of blocks, uses divide-and-conquer to calculate the PageRank value of each block, and then, according to the weight given to each block's relative importance, calculates the PageRank values for the whole set of web pages.
(2) Improved a correlation algorithm. A topic vector of appropriate dimension is established for the theme, each retrieved article is compressed into a vector of the same dimension as the theme reference vector, and a correlation formula is then used to decide whether a crawled page meets the requirements.
(3) Addressed how to eliminate duplicate URLs once the crawler has crawled a very large number of pages. The MD5 algorithm is used to build an index, and the index is organized into a tree structure. The index is kept in memory while the data is stored on hard disk, which reduces space complexity.
(4) Building on the improved algorithms, simulated and briefly implemented a mobile-phone theme crawler system. The code and an analysis of the experimental data demonstrate the validity and rationality of the theory.
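As a rough illustration of items (2) and (3), the following Python sketch shows one way a theme crawler might filter its frontier: URLs are deduplicated by MD5 fingerprint, and page text is scored against a topic vector. This is not the thesis's implementation. The abstract does not give the correlation formula, so cosine similarity is assumed here; a plain in-memory set stands in for the tree index with disk-backed data; and the names `url_fingerprint`, `SeenUrls`, `term_vector`, and `cosine_similarity`, the fragment-stripping normalisation, and the sample vocabulary are all hypothetical.

```python
import hashlib
import math
from collections import Counter


def url_fingerprint(url):
    """Return the 32-character MD5 hex digest used as the URL's index key."""
    # Assumed normalisation: trim whitespace and drop any "#fragment".
    normalised = url.strip().split("#", 1)[0]
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


class SeenUrls:
    """In-memory set of MD5 fingerprints (a stand-in for a tree index)."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        """Record the URL; return True if it was new, False if a duplicate."""
        key = url_fingerprint(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    seen = SeenUrls()
    vocabulary = ["phone", "android", "battery"]  # hypothetical topic terms
    topic = [1, 1, 1]                             # hypothetical reference vector

    for url, text in [
        ("http://example.com/a", "new android phone with large battery"),
        ("http://example.com/a#top", "new android phone with large battery"),
        ("http://example.com/b", "recipes for summer salads"),
    ]:
        if not seen.add_if_new(url):
            print(url, "skipped as duplicate")
            continue
        score = cosine_similarity(term_vector(text, vocabulary), topic)
        print(url, "relevance", round(score, 3))
```

Storing a fixed-length digest rather than the full URL string keeps each index entry the same size regardless of URL length, which is the space saving the MD5 step aims at. A production crawler would also need a relevance threshold and a persistent index, neither of which the abstract specifies.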
Keywords/Search Tags: Theme crawler, PageRank algorithm, Correlation calculation, URL deduplication