
Research and Application of a URL De-duplication Algorithm Based on the Bloom Filter

Posted on: 2020-12-26  Degree: Master  Type: Thesis
Country: China  Candidate: H J Meng  Full Text: PDF
GTID: 2428330575492715  Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, all kinds of network information are growing exponentially. While this massive and complex body of information gives people more to draw on, it also makes retrieving the effective information increasingly difficult. What is needed is an information retrieval tool, namely the search engine. As the main tool of information retrieval, a search engine in the broad sense includes every system or tool that can perform retrieval on the network, that is, any system or technology that responds to users' search requests over the Internet and effectively returns the corresponding results. In the narrow sense, a search engine is a kind of software with an automatic search function that retrieves information without manual intervention. Its main functions include collecting information resources on the World Wide Web, indexing and analyzing the collected results, and organizing the indexed information into a database that serves as an information service system, providing effective retrieval to network users through websites. As the core technology of the search engine, the web crawler provides great convenience for users' information retrieval.

The main research content of this paper concerns web crawler technology. It analyzes the research background and significance of the web crawler field as well as the current state and development trends at home and abroad, and it studies the existing modules of a web crawler system. Starting from the initial seed URLs, the crawler extracts the lower-level links in each web page and puts them into the list of URLs to be crawled. It moves down layer by layer until it reaches the maximum depth of the system or the page holding the required information, then parses that page and finally extracts the information the user needs.

In the process of collecting web page data, the URL links obtained at the various levels may contain many duplicates. For example, when crawling a website's books through its classification pages, the same book is likely to carry multiple classification tags, so crawling books under different tags may fetch the same book several times. The system then has to repeatedly acquire and parse the same page, wasting a great deal of time and storage space. In addition, the familiar single-threaded mode of processing URLs is time-consuming and lowers the system's execution efficiency.

In view of the above problems, this paper does the following work from the point of view of improving the retrieval efficiency and accuracy of the crawler system:

1. It studies the factors that affect crawler efficiency. Since the main workload of a crawler system is acquiring and parsing web pages, duplicated URLs cause the same page to be parsed multiple times, wasting CPU resources and storage space and reducing the system's efficiency. To solve this problem, this paper compares various URL de-duplication strategies, then studies the Bloom filter algorithm, which is well suited to URL de-duplication, and improves on its shortcomings by proposing a split Bloom filter algorithm with multi-eigenvalue hash mapping. The validity of the improved algorithm is verified theoretically and experimentally.
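The abstract does not spell out the improved algorithm's details, so the following is only a minimal Python sketch of one standard reading of a "split" (partitioned) Bloom filter with several URL-derived hash values: the bit array is divided into one slice per hash function, and each hash addresses only its own slice, which keeps collisions in different slices independent. All names and parameters here (PartitionedBloomFilter, num_bits, num_hashes) are illustrative assumptions, not the thesis's code.

```python
import hashlib

class PartitionedBloomFilter:
    """Minimal partitioned ("split") Bloom filter sketch.

    The bit array is divided into k equal slices, one per hash
    function; hash i only addresses bits inside slice i. This is a
    generic illustration, not the thesis's exact algorithm.
    """

    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=7):
        self.k = num_hashes
        self.slice_bits = num_bits // num_hashes   # bits per slice
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url):
        # Derive k independent bit indices by salting the URL with
        # the hash function's number before digesting it.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{url}".encode("utf-8")).hexdigest()
            offset = int(digest, 16) % self.slice_bits
            yield i * self.slice_bits + offset     # index inside slice i

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

# Usage: skip URLs that have (probably) been seen before.
seen = PartitionedBloomFilter()
for url in ["http://example.com/book/1", "http://example.com/book/1"]:
    if url in seen:
        print("duplicate, skipping:", url)
    else:
        seen.add(url)
        print("new URL, crawling:", url)
```

A real crawler would size num_bits and num_hashes from the expected URL count and the acceptable false-positive rate; a Bloom filter can mistakenly report a never-seen URL as a duplicate, but never the reverse, which is presumably why the thesis focuses on reducing this error rate.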
2. To improve the operating efficiency of the crawler system, a strategy of parallel dynamic task assignment is proposed for removing duplicate URL links with the improved Bloom filter algorithm. This parallel method differs from the common approach in which threads read and process URLs from a shared message queue under mutual exclusion. Instead, the URL dataset is divided into blocks and each block is assigned its own thread, avoiding the extra data-reading cost and waiting time incurred when multiple threads contend for the same URL dataset. After the data is divided into blocks, a monitoring thread is added to track the number of URLs remaining in each block, so that a drained block can be replenished in time and the amount of data can be adjusted dynamically between threads. This method improves the effect of parallel execution (a sketch of the scheme appears at the end of this summary).

3. Finally, according to the requirements of the project, a web crawler system for the book information of a certain website is designed and implemented. After the URLs of the book detail pages are obtained, the improved Bloom filter algorithm and the parallel dynamic task assignment strategy proposed in this paper are applied to the system. Experiments show that the improved Bloom filter algorithm not only achieves a better URL de-duplication effect but also reduces the de-duplication error rate and improves system performance. The parallel dynamic task adjustment used during URL de-duplication effectively improves CPU utilization, accelerates de-duplication, and improves the system's execution efficiency.
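As above, the abstract gives only the outline of the parallel dynamic task assignment strategy, so this is a hedged Python sketch rather than the author's implementation: the URL set is split into per-thread blocks, and a monitoring thread rebalances URLs from the fullest block into any block that has run dry. The URLs, block count, and process() body are placeholders.

```python
import threading
import time
from collections import deque

NUM_WORKERS = 4

def process(url):
    # Stand-in for fetching/parsing plus Bloom-filter de-duplication.
    time.sleep(0.01)

def worker(block, done):
    # Each thread owns one URL block, so threads do not contend for
    # a shared queue during normal operation.
    while not done.is_set() or block:
        try:
            url = block.popleft()
        except IndexError:
            time.sleep(0.005)          # block drained; wait for a refill
            continue
        process(url)

def monitor(blocks, done):
    # Dynamic adjustment: move URLs from the fullest block into any
    # block that has run dry, keeping all workers busy.
    while not done.is_set():
        fullest = max(blocks, key=len)
        for block in blocks:
            if not block and len(fullest) > 1:
                for _ in range(len(fullest) // 2):
                    block.append(fullest.popleft())
        time.sleep(0.02)

urls = [f"http://example.com/book/{i}" for i in range(100)]   # hypothetical
blocks = [deque(urls[i::NUM_WORKERS]) for i in range(NUM_WORKERS)]
done = threading.Event()
threads = [threading.Thread(target=worker, args=(b, done)) for b in blocks]
threads.append(threading.Thread(target=monitor, args=(blocks, done)))
for t in threads:
    t.start()
while any(blocks):                      # wait until every block is empty
    time.sleep(0.05)
done.set()
for t in threads:
    t.join()
```

The design point this sketch illustrates is the one the abstract makes: ownership of a block replaces lock-protected reads from one shared structure, so synchronization cost is paid only during the monitor's occasional rebalancing rather than on every URL.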
Keywords/Search Tags: Data retrieval, Web crawlers, URL de-duplication, Bloom filters, Parallel