
Design And Research Of Distributed Web Crawler Based On Hadoop

Posted on: 2019-09-09    Degree: Master    Type: Thesis
Country: China    Candidate: Z Cheng    Full Text: PDF
GTID: 2428330548482614    Subject: Electronic and communication engineering
Abstract/Summary:
In this thesis, a Hadoop-based distributed web crawler is designed and studied. With the advent of the Internet era, Internet technology has developed rapidly and the resources available on the Web have grown just as quickly. Faced with this flood of resources, it is no longer easy to find the information one needs quickly and accurately. Search engines perform this retrieval, and search engines are inseparable from crawlers: only by crawling the Web to collect its information resources can a search engine carry out its work. The design of web crawlers is therefore very important, and more and more companies and organizations have begun to build efficient crawler systems to fetch the billions of web resources on the Internet. Because of the sheer volume of data on the Internet, the crawling and parsing throughput of a traditional stand-alone crawler can no longer meet current demand. This thesis therefore builds on a Hadoop cluster and designs and optimizes the crawler in terms of its DNS resolution, URL crawling, file parsing, URL acquisition, URL processing, and judgment modules.

For the DNS resolution module, domain names are resolved in advance: at the start of each round of crawling, a separate thread resolves the URLs to be crawled in the next round on that node and stores the results in a DNS cache. Thanks to this pre-processing and the cache design, the crawler does not have to wait for domain-name resolution at the start of the next round; it reads the corresponding IP address mapping directly from the DNS cache, which effectively improves the efficiency of the crawler system.

In the crawling module, the PageRank algorithm is used to judge the importance of each URL, and URLs are fetched in descending order of their PageRank values. By analyzing the various forms of web pages on the Internet, an initial PageRank value is obtained for each page using an iterative approach; a page's PageRank is then increased according to the links that point to it from other pages. Because a page's PageRank reflects how many pages on the Internet link to it, it also reflects the page's importance, so the fetch order follows the order of page importance. This greatly reduces the chance that irrelevant content from spam sites appears in search results.

In the file parsing module, the external interface provided by Apache Tika is implemented and its parser tools are called to parse and process different file types. URL resources are extracted from the parsed content and matched against a regular expression to determine whether each URL is valid.

In the processing module, to cope with the many highly similar or outright duplicated pages on the Web, the Simhash fingerprinting algorithm is adopted. The page text is segmented into words and each word is assigned a weight; the hash value and weight of each word are combined in a weighted calculation, and by merging the weighted word vectors the text is converted into a binary fingerprint of fixed length. After this dimension reduction, comparing the similarity of two texts reduces to comparing their binary fingerprints: the Hamming distance is obtained by counting the bit positions at which the two fingerprints differ. The crawler system treats pages whose Hamming distance is less than 3 as duplicates and discards them instead of crawling them again.
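To make the fingerprinting step concrete, a minimal Simhash sketch in Python follows; the 64-bit fingerprint width, the MD5-based token hashing, and the assumption that word segmentation and weighting have already produced (token, weight) pairs are illustrative choices, not details taken from the thesis.

    # Minimal Simhash sketch (illustrative only: hash function, bit width, and the
    # pre-weighted token list are assumptions, not the thesis implementation).
    import hashlib

    def simhash(weighted_tokens, bits=64):
        # weighted_tokens: iterable of (token, weight) pairs produced by the
        # word-segmentation and weighting step described above.
        v = [0] * bits
        for token, weight in weighted_tokens:
            h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
            for i in range(bits):
                if (h >> i) & 1:
                    v[i] += weight   # bit is set: add the token's weight
                else:
                    v[i] -= weight   # bit is clear: subtract it
        # Dimension reduction: positive sums become 1-bits, the rest 0-bits.
        fingerprint = 0
        for i in range(bits):
            if v[i] > 0:
                fingerprint |= 1 << i
        return fingerprint

    def hamming_distance(a, b):
        # Number of bit positions at which the two fingerprints differ.
        return bin(a ^ b).count("1")

    def is_duplicate(fp_a, fp_b, threshold=3):
        # Pages whose fingerprints differ in fewer than 3 positions are
        # treated as duplicates and fetched only once.
        return hamming_distance(fp_a, fp_b) < threshold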
The URL acquisition module obtains URL information through MapReduce computations and changes the status of URLs that have already been fetched to the crawled state. Newly captured URLs are passed to the judgment module, which decides how to handle each one by checking whether the URL seed database already contains it: if the seed database already contains the URL, the crawler discards it; if not, the URL is appended to the end of the seed database.

The crawler is tested by adding different numbers of URLs to the URL seed database. The results show that, regardless of how many URLs the seed database contains, the corresponding resources can all be crawled from the Internet, and the crawl strictly follows the order of the PageRank values. When two duplicate URLs are placed in the seed database, the Hamming distance computed by the Simhash algorithm is less than 3, the two pages are judged to be duplicates, and only one fetch is performed. When the crawl is run on clusters with different numbers of nodes, the crawl rate of each cluster is calculated from the number of crawled pages and the crawling time. The rate eventually stabilizes, fluctuating around a roughly fixed value, and the more nodes the cluster has, the higher the crawl rate and the smaller the fluctuation; comparison of the rates shows that although more nodes yield a higher crawl rate, the gain in crawl rate diminishes as the number of nodes continues to grow.
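For reference, the PageRank-based prioritization described above can be sketched as a simple in-memory power iteration; the damping factor of 0.85, the fixed iteration count, and the adjacency-list graph representation are conventional assumptions for illustration and do not reflect the thesis's Hadoop/MapReduce implementation.

    # Minimal in-memory PageRank sketch (illustrative only: damping factor,
    # iteration count, and graph representation are conventional assumptions).
    def pagerank(graph, damping=0.85, iterations=20):
        # graph: dict mapping each URL to the list of URLs it links to.
        n = len(graph)
        rank = {url: 1.0 / n for url in graph}
        for _ in range(iterations):
            new_rank = {url: (1.0 - damping) / n for url in graph}
            for url, outlinks in graph.items():
                if not outlinks:
                    # Dangling page: spread its rank over all pages.
                    share = damping * rank[url] / n
                    for u in new_rank:
                        new_rank[u] += share
                else:
                    share = damping * rank[url] / len(outlinks)
                    for target in outlinks:
                        if target in new_rank:   # ignore links leaving the graph
                            new_rank[target] += share
            rank = new_rank
        return rank

    # Hypothetical usage: fetch URLs in descending order of PageRank.
    graph = {
        "http://a.example/": ["http://b.example/", "http://c.example/"],
        "http://b.example/": ["http://c.example/"],
        "http://c.example/": ["http://a.example/"],
    }
    ranks = pagerank(graph)
    crawl_order = sorted(graph, key=lambda url: ranks[url], reverse=True)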
Keywords/Search Tags: Hadoop, Distributed, Web Crawler, PageRank, Simhash