
Research And Implementation Of Focused Crawler Based On Distributed Strategy

Posted on: 2019-05-23    Degree: Master    Type: Thesis
Country: China    Candidate: J J Zhang    Full Text: PDF
GTID: 2428330566459511    Subject: Software engineering
Abstract/Summary:
In the era of big data, more and more enterprises and individuals are aware of the value of data, and its importance has become increasingly prominent. As a result, how to share Internet resources has become a research topic in many areas of networking. Internet technology makes it possible to share data along with network bandwidth, computing capacity, and other resources, so that these resources can be integrated and used efficiently. Because data sources are widely distributed, the large volume of crawled data comes in many types, which leaves the massive amount of information disordered. This makes it inconvenient for users who want to search for information in a specialized field and difficult to obtain accurate search results.

Meanwhile, more and more specialized technical websites have appeared. Focused crawlers were developed to meet the needs of users who search within a specific topic. Compared with general-purpose crawlers, focused crawlers are better suited to the current network environment and to the needs of Internet users: they can search for and extract relevant content more accurately from massive amounts of information. At the same time, distributed processing technology can be used to speed up crawling and storage, so that the crawler performs better in a big-data setting.

A general-purpose crawler crawls a website's content broadly, which leads to scattered search results with weak relevance to the topic: the crawler collects a large amount of content, but its correlation with the topic of interest is low. To address this problem, this thesis analyzes and designs an algorithm for computing topic relevance. The algorithm calculates a topic-relevance score by combining the link structure connecting web pages with the content of the pages themselves, and a topic-based focused crawler is implemented on top of this algorithm.

The thesis then addresses how multiple crawlers can work collaboratively. The crawler is built on a distributed architecture, which provides load balancing and information exchange during page crawling and storage. Finally, because some websites use protection strategies that prevent the crawler from fetching their pages, error-recovery mechanisms are studied to work around these protections and retrieve the page data.
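The abstract says the topic-relevance algorithm integrates the link structure connecting web pages with the page contents, but does not give the exact formula. The following is a minimal sketch of one plausible combination, not the thesis's actual algorithm: the relevance of a candidate link is a weighted sum of the similarity between the topic keywords and the linking page's text, and the similarity between the topic keywords and the link's anchor text. The weighting parameter `alpha`, the tokenizer, and all function names here are assumptions for illustration.

```python
# Sketch (assumed, not the thesis's exact algorithm): score a candidate URL by
# combining the content relevance of the page that links to it with the
# relevance of the link's anchor text.
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokenizer; a real crawler would also need Chinese segmentation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two bag-of-words term-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def topic_relevance(topic_keywords, page_text, anchor_text, alpha=0.7):
    """Weighted combination of page-content relevance and link (anchor-text) relevance.

    alpha is an assumed tuning parameter: 1.0 uses only the page content,
    0.0 uses only the anchor text of the link pointing at the candidate URL.
    """
    topic = tokenize(" ".join(topic_keywords))
    content_score = cosine_similarity(topic, tokenize(page_text))
    link_score = cosine_similarity(topic, tokenize(anchor_text))
    return alpha * content_score + (1 - alpha) * link_score

if __name__ == "__main__":
    # A focused crawler would queue the URL only if this score exceeds a threshold.
    score = topic_relevance(
        ["distributed", "crawler", "topic"],
        "This page surveys distributed focused crawler architectures ...",
        "focused crawler tutorial",
    )
    print(round(score, 3))
```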
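The abstract states that a distributed structure gives the multi-crawler system load balancing and information interaction during crawling and storage, without specifying the partitioning scheme. A common, minimal approach, shown here as an assumption rather than the thesis's design, is to assign each discovered URL to a crawler node by hashing its host name, so that every node owns a disjoint share of hosts and the load is spread roughly evenly.

```python
# Sketch (assumed scheme): partition the URL frontier across crawler nodes by
# hashing the host name of each URL.
import hashlib
from urllib.parse import urlparse

def assign_node(url, num_nodes):
    """Map a URL to one of num_nodes crawler workers by hashing its host."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

def dispatch(urls, num_nodes):
    """Group newly discovered URLs into per-node frontiers before distribution."""
    frontiers = {node: [] for node in range(num_nodes)}
    for url in urls:
        frontiers[assign_node(url, num_nodes)].append(url)
    return frontiers

if __name__ == "__main__":
    urls = [
        "https://example.com/a",
        "https://example.org/b",
        "https://example.net/c",
    ]
    print(dispatch(urls, num_nodes=3))
```

Hashing by host (rather than by full URL) also keeps all requests to one site on one node, which makes per-host politeness easier to enforce; this design choice is again an assumption, not something stated in the abstract.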
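Finally, the abstract mentions error-recovery mechanisms for pages protected by anti-crawling strategies, without describing them. One simple recovery pattern, offered purely as an illustrative assumption, is to retry a failed or blocked request with exponential backoff while rotating the User-Agent header.

```python
# Sketch (assumed, not the thesis's exact mechanism): retry a failed request
# with exponential backoff and a rotated User-Agent header.
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def fetch_with_recovery(url, max_retries=3, base_delay=1.0):
    """Fetch a page; on failure wait, switch User-Agent, and try again."""
    for attempt in range(max_retries):
        headers = {"User-Agent": USER_AGENTS[attempt % len(USER_AGENTS)]}
        try:
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code == 200:
                return response.text
        except requests.RequestException:
            pass  # network error: fall through to the backoff below
        time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return None  # give up after max_retries attempts
```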
Keywords/Search Tags: Focused Crawler, Distributed Strategy, Topic Relevance