
Research And Implementation Of Focused Crawler Based On Distributed Strategy

Posted on: 2019-05-23    Degree: Master    Type: Thesis
Country: China    Candidate: J J Zhang    Full Text: PDF
GTID: 2428330566459511    Subject: Software engineering
Abstract/Summary:
In the era of big data, more and more enterprises and individuals are aware of the value of data, and its importance has become increasingly prominent. As a result, how to share Internet resources has become a research topic in many areas of networking. Internet technology makes it possible to share data along with network bandwidth, computing capacity, and other resources, so that these resources can be integrated and used efficiently. Because data sources are widely distributed, the large volume of crawled data comes in many types, which leaves the massive amount of information disordered. This makes it inconvenient for users who want to search for information in a specialized field and difficult to obtain accurate search results.

Meanwhile, more and more specialized technical websites have appeared. Focused crawlers were developed to meet the needs of users who search within a specific topic. Compared with general-purpose crawlers, focused crawlers are better suited to the current network environment and to the needs of Internet users: they can search for and extract relevant content more accurately from massive amounts of information. At the same time, distributed processing technology can be used to speed up crawling and storage, so that the crawler performs better in a big-data setting.

A general-purpose crawler crawls a website's content broadly, which leads to scattered search results with weak relevance to the topic: the crawler collects a large amount of content, but its correlation with the topic of interest is low. To address this problem, this thesis analyzes and designs an algorithm for computing topic relevance. The algorithm calculates a topic-relevance score by combining the link structure connecting web pages with the content of the pages themselves, and a topic-based focused crawler is implemented on top of this algorithm.

The thesis then addresses how multiple crawlers can work collaboratively. The crawler is built on a distributed architecture, which provides load balancing and information exchange during page crawling and storage. Finally, because some websites use protection strategies that prevent the crawler from fetching their pages, error-recovery mechanisms are studied to work around these protections and retrieve the page data.
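The abstract says the topic-relevance algorithm integrates the link structure connecting web pages with the page contents, but does not give the exact formula. The following is a minimal sketch of one plausible combination, not the thesis's actual algorithm: the relevance of a candidate link is a weighted sum of the similarity between the topic keywords and the linking page's text, and the similarity between the topic keywords and the link's anchor text. The weighting parameter `alpha`, the tokenizer, and all function names here are assumptions for illustration.

```python
# Sketch (assumed, not the thesis's exact algorithm): score a candidate URL by
# combining the content relevance of the page that links to it with the
# relevance of the link's anchor text.
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokenizer; a real crawler would also need Chinese segmentation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two bag-of-words term-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def topic_relevance(topic_keywords, page_text, anchor_text, alpha=0.7):
    """Weighted combination of page-content relevance and link (anchor-text) relevance.

    alpha is an assumed tuning parameter: 1.0 uses only the page content,
    0.0 uses only the anchor text of the link pointing at the candidate URL.
    """
    topic = tokenize(" ".join(topic_keywords))
    content_score = cosine_similarity(topic, tokenize(page_text))
    link_score = cosine_similarity(topic, tokenize(anchor_text))
    return alpha * content_score + (1 - alpha) * link_score

if __name__ == "__main__":
    # A focused crawler would queue the URL only if this score exceeds a threshold.
    score = topic_relevance(
        ["distributed", "crawler", "topic"],
        "This page surveys distributed focused crawler architectures ...",
        "focused crawler tutorial",
    )
    print(round(score, 3))
```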
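The abstract states that a distributed structure gives the multi-crawler system load balancing and information interaction during crawling and storage, without specifying the partitioning scheme. A common, minimal approach, shown here as an assumption rather than the thesis's design, is to assign each discovered URL to a crawler node by hashing its host name, so that every node owns a disjoint share of hosts and the load is spread roughly evenly.

```python
# Sketch (assumed scheme): partition the URL frontier across crawler nodes by
# hashing the host name of each URL.
import hashlib
from urllib.parse import urlparse

def assign_node(url, num_nodes):
    """Map a URL to one of num_nodes crawler workers by hashing its host."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

def dispatch(urls, num_nodes):
    """Group newly discovered URLs into per-node frontiers before distribution."""
    frontiers = {node: [] for node in range(num_nodes)}
    for url in urls:
        frontiers[assign_node(url, num_nodes)].append(url)
    return frontiers

if __name__ == "__main__":
    urls = [
        "https://example.com/a",
        "https://example.org/b",
        "https://example.net/c",
    ]
    print(dispatch(urls, num_nodes=3))
```

Hashing by host (rather than by full URL) also keeps all requests to one site on one node, which makes per-host politeness easier to enforce; this design choice is again an assumption, not something stated in the abstract.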
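Finally, the abstract mentions error-recovery mechanisms for pages protected by anti-crawling strategies, without describing them. One simple recovery pattern, offered purely as an illustrative assumption, is to retry a failed or blocked request with exponential backoff while rotating the User-Agent header.

```python
# Sketch (assumed, not the thesis's exact mechanism): retry a failed request
# with exponential backoff and a rotated User-Agent header.
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def fetch_with_recovery(url, max_retries=3, base_delay=1.0):
    """Fetch a page; on failure wait, switch User-Agent, and try again."""
    for attempt in range(max_retries):
        headers = {"User-Agent": USER_AGENTS[attempt % len(USER_AGENTS)]}
        try:
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code == 200:
                return response.text
        except requests.RequestException:
            pass  # network error: fall through to the backoff below
        time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return None  # give up after max_retries attempts
```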
Keywords/Search Tags: Focused Crawler, Distributed Strategy, Topic Relevance