| With the rapid development of modern internet technology, resources on the internet have grown explosively, and a wealth of jujube-related information has accumulated online. A traditional topic crawler can capture web pages that are highly related to a topic, but it cannot meet users' need to obtain jujube information quickly, accurately, and effectively. Existing topic crawler algorithms also suffer from problems such as topic drift, neglect of new pages, and inefficient deduplication of link data. For jujube-related pages, this paper combines the advantages of different algorithms and improves the Hyperlink-Induced Topic Search (HITS) algorithm and the link data deduplication algorithm so that the improved algorithms capture pages more effectively. The main contents of this paper are as follows. First, the theory and technology of general-purpose crawlers are studied, and the differences between general-purpose crawlers and topic crawlers are analyzed. The technologies used to implement a topic crawler are then analyzed, mainly web page processing and the calculation of topic relevance. Second, a study of traditional topic crawler technology identifies two problems: (1) the HITS algorithm ignores new pages and suffers from "topic drift"; (2) for jujube links, traditional in-memory deduplication methods are inefficient. To address these problems, the paper studies topic crawler algorithms for jujube pages and improves them by combining the advantages of different algorithms, so that the improved algorithm performs better when crawling jujube web pages. Third, an algorithm that combines HITS with the shark search algorithm and a time factor is proposed, so that, when crawling, the combined algorithm retrieves pages closely related to the jujube topic. It addresses the neglect of new pages in the traditional algorithm, eliminates its "topic drift", and improves the accuracy and recall of the jujube crawler algorithm. To address the low efficiency of traditional in-memory deduplication, a Bloom filter deduplication method based on Redis is proposed. The Bloom filter maps jujube links into a binary vector stored in Redis, which improves the efficiency of jujube link deduplication. Finally, the overall crawling function of the jujube topic web crawler system is implemented, and the improved algorithms are applied to its key functional modules. Experiments on the resulting system show that the algorithms are feasible and effective in calculating jujube topic relevance and in improving the efficiency of jujube link deduplication. |
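
The abstract does not give the details of the Bloom filter deduplication, so the following is only a minimal sketch of the general technique. It assumes an in-memory `bytearray` standing in for the Redis-backed binary vector; the class name `BloomFilter`, the helper `is_new_link`, the SHA-256-based hashing scheme, and the default sizes are all illustrative choices, not taken from the paper.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; the bytearray stands in for a Redis bitmap."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url):
        # Derive num_hashes bit positions by salting one hash with an index
        # (an assumed scheme; the paper's hash functions are not specified).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def _get_bit(self, pos):
        return (self.bits[pos // 8] >> (pos % 8)) & 1

    def _set_bit(self, pos):
        self.bits[pos // 8] |= 1 << (pos % 8)

    def add(self, url):
        for pos in self._positions(url):
            self._set_bit(pos)

    def __contains__(self, url):
        # True means "probably seen"; False means "definitely not seen".
        return all(self._get_bit(pos) for pos in self._positions(url))


def is_new_link(seen, url):
    """Return True and record the URL if the filter has not seen it."""
    if url in seen:
        return False
    seen.add(url)
    return True
```

In a Redis-backed variant, `_set_bit` and `_get_bit` would presumably be replaced by the Redis `SETBIT` and `GETBIT` commands on a shared key, so that several crawler processes can consult one filter. A Bloom filter can report a false positive (a new link judged as seen) but never a false negative, which is the usual trade-off accepted for its small memory footprint.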
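
For orientation, the standard (unimproved) HITS iteration that the paper takes as its starting point can be sketched as follows. This is the textbook algorithm only: it does not include the shark search combination or the time factor, whose formulas the abstract does not state. The function name `hits`, the adjacency-dictionary input format, and the fixed iteration count are assumptions made for illustration.

```python
def hits(graph, iterations=50):
    """Textbook HITS on a dict mapping each page to the pages it links to.

    Returns (hub, authority) score dictionaries, L2-normalized.
    """
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking to each page.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in graph.items():
            for q in targets:
                new_auth[q] += hub[p]

        # Hub update: sum the authority scores of the pages each page links to.
        new_hub = {p: sum(new_auth[q] for q in graph.get(p, [])) for p in pages}

        # Normalize so the scores do not grow without bound.
        a_norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}

    return hub, auth
```

Because these scores depend only on link structure, a heavily linked but off-topic page can rank highly ("topic drift"), and a newly published page with few inbound links scores low. These are the two weaknesses the abstract says the combined algorithm is meant to address.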