| With the rapid development of information technology, the number of web pages on the Internet has grown tremendously, and a large share of them are near-duplicates of one another. These similar pages have become a major obstacle to retrieving useful information quickly, so the detection and deduplication of similar web pages has become an important research topic worldwide. This paper analyzes, studies, and experimentally verifies deduplication techniques for similar web pages. First, building on the traditional Simhash method, it improves the similarity formula so that document similarity is computed from multiple aspects. It also improves the way Simhash extracts document features: the weight of each word is computed by combining TF-IDF with the topic relevance of the word. In the retrieval step, fingerprints are hashed into buckets; when the distribution is uneven, the elements of a bucket are hashed a second time, which reduces the number of candidate pairs and makes the distribution more uniform. Second, building on the traditional hierarchical clustering algorithm, this paper improves the calculation of the distance between documents and proposes an improved algorithm based on a min-heap. The min-heap stores the entries of the distance matrix ordered by distance, which reduces the amount of computation. Finally, the improved Simhash algorithm, the improved hierarchical clustering algorithm, and a newly proposed scoring algorithm are combined into a deduplication algorithm for technology blog web pages, which is applied to the deduplication of CSDN technology blog pages according to practical needs. Experiments show that the proposed algorithm is sound and effective. |
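
To make the Simhash and bucket-retrieval steps concrete, the following is a minimal Python sketch of the general technique, not the paper's implementation. It assumes that per-word weights are already available (the abstract combines TF-IDF with topic relevance, but gives no formula, so the weights are simply passed in). The function names, the MD5-based token hash, the 64-bit fingerprint length, and the four-band bucketing are illustrative assumptions; the second-level rehash of overfull buckets and the min-heap clustering are not shown.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Set, Tuple


def token_hash(token: str, bits: int = 64) -> int:
    """Map a token to a `bits`-wide integer (MD5 is an illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weights: Dict[str, float], bits: int = 64) -> int:
    """Weighted Simhash: each token votes +w or -w on every bit position."""
    acc = [0.0] * bits
    for token, w in weights.items():
        h = token_hash(token, bits)
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:  # a positive total sets the bit
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def bucket_candidates(
    fingerprints: Dict[str, int], bits: int = 64, bands: int = 4
) -> Set[Tuple[str, str]]:
    """Split each fingerprint into bands and hash each band into a bucket.

    Two documents become a candidate pair only if they share a bucket,
    i.e. they agree exactly on at least one band.
    """
    width = bits // bands
    mask = (1 << width) - 1
    buckets = defaultdict(list)
    for doc_id, fp in fingerprints.items():
        for band in range(bands):
            key = (band, (fp >> (band * width)) & mask)
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs
```

Candidate pairs returned by `bucket_candidates` would then be confirmed with `hamming` against a chosen threshold. With four bands, any two 64-bit fingerprints that differ in at most three bits are guaranteed to share at least one bucket, which is the usual motivation for banding.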