Research On The Key Technology Of Theme Crawler

Posted on:2014-09-17

Degree:Master

Type:Thesis

Country:China

Candidate:Z D Huang

Full Text:PDF

GTID:2268330425966543

Subject:Computer software and theory

Abstract/Summary:

The rapid development of the Internet has made the publication and dissemination of information ever faster. The volume of information on the network has grown so large that information retrieval has become increasingly difficult. Fortunately, users can rely on search engines for fast retrieval, and these have become an everyday tool. The web crawler, one of the key components of a search engine, is mainly responsible for collecting web pages from the Internet. The quality of a search engine's service depends largely on the crawler's performance and on the quality of the pages it collects, so the crawler is well worth studying and improving. In recent years, the sheer size of the Web has placed an increasing burden on general-purpose crawlers. A theme crawler (also called a topic-focused crawler), by contrast, targets a specific subject area and retrieves the information users actually need, which also lets it run more efficiently. Theme crawlers have therefore attracted widespread attention, and research on them has both academic and practical value. This thesis focuses on the techniques and characteristics involved in theme crawling.
The main work and results are as follows:
(1) Implemented an improved PageRank algorithm. The algorithm partitions the web pages of the Internet into a number of blocks, applies divide and conquer to compute the PageRank values within each block, and then combines them according to the relative importance weight of each block to obtain the PageRank values for the whole Web.
(2) Improved a relevance (correlation) algorithm. A theme reference vector of suitable dimension is established first; each crawled article is then compressed into a vector of the same dimension as the theme reference vector, and a correlation formula is used to decide whether the crawled page meets the requirements.
(3) Addressed the elimination of duplicate URLs when the crawler has visited a very large number of pages. The thesis uses the MD5 algorithm to build an index, organizes the index as a tree structure held in memory, and stores the data on hard disk, which reduces the space complexity.
(4) Using the improved algorithms, simulated and briefly implemented a theme crawler system for mobile phones. The code and an analysis of the experimental data demonstrate the validity and soundness of the approach.
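
The relevance test in item (2) can be illustrated with a minimal sketch. The abstract does not name the correlation formula, so cosine similarity is assumed here; the function names, the whitespace tokenizer, the example vocabulary, and the 0.5 threshold are all hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Project a text onto the theme vocabulary: one term count per vocabulary entry.

    Assumption: simple lowercase, whitespace-based tokenization.
    """
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(page_text, theme_vector, vocabulary, threshold=0.5):
    """Keep a page only if its similarity to the theme vector reaches the threshold.

    Assumption: the threshold value is illustrative, not from the thesis.
    """
    page_vector = term_vector(page_text, vocabulary)
    return cosine_similarity(page_vector, theme_vector) >= threshold


if __name__ == "__main__":
    vocabulary = ["phone", "battery", "screen"]  # hypothetical theme terms
    theme_vector = [1, 1, 1]                     # hypothetical reference vector
    print(is_relevant("phone battery screen phone", theme_vector, vocabulary))
    print(is_relevant("recipe for vegetable soup", theme_vector, vocabulary))
```

Because every page is projected onto the same fixed vocabulary, the page vector always has the same dimension as the theme reference vector, which is the property item (2) relies on.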

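The URL deduplication idea in item (3) can be sketched as follows. This is only an illustration under stated assumptions: a built-in Python `set` stands in for the in-memory tree index described in the abstract, on-disk storage is omitted, and the class name `UrlSeenSet` and the whitespace-only normalization are hypothetical.

```python
import hashlib


class UrlSeenSet:
    """Remember which URLs have been seen by storing fixed-size MD5 digests.

    Assumption: a Python set replaces the tree-structured index from the
    abstract; the digest itself is the only data kept in memory.
    """

    def __init__(self):
        self._digests = set()

    @staticmethod
    def fingerprint(url):
        """Return the 16-byte MD5 digest of the URL (surrounding whitespace removed)."""
        return hashlib.md5(url.strip().encode("utf-8")).digest()

    def add(self, url):
        """Record the URL; return True if it is new, False if it was already seen."""
        digest = self.fingerprint(url)
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True


if __name__ == "__main__":
    seen = UrlSeenSet()
    for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
        print(url, "new" if seen.add(url) else "duplicate")
```

Storing a 16-byte digest instead of the full URL string keeps the memory cost per URL constant regardless of URL length, which is one plausible reading of the space saving the abstract mentions.
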
Keywords/Search Tags:

Theme crawler, PageRank algorithm, Correlation calculation, URL deduplication

Related items

1	Stock Research Engine Based On Theme Crawler
2	Design And Implementation Of The Theme Crawler For Procurement Clues In The Automotive Field
3	Research On Related Theme Of Search Engines
4	Research And Implement Of The Theme Crawler For Automotive Industry
5	The Design And Implementation Of The Topic-focused Web Crawler System
6	Optimization And Implement Of The Topic Web Crawler Correlation Algorithms
7	Research And Implementation To The Improvement Strategy Of PageRank Algorithm Related Theme
8	Research On Topic Focused Web Crawler And Related Technologies
9	The Design And Implementation Of Web Crawler Based On Pagerank Algorithm In The Project Of Malicious URL Detection
10	Research And Implementation Of Multithreading Web Crawler Based On Theme