Research On Web Page Deduplication Technology Based On Simhash And Hierarchical Clustering Algorithm

Posted on:2020-05-27

Degree:Master

Type:Thesis

Country:China

Candidate:Y C Wang

Full Text:PDF

GTID:2428330590995515

Subject:Signal and Information Processing

Abstract/Summary:

With the rapid development of information technology, the number of web pages on the Internet has grown tremendously, and a large number of these pages are similar to one another. These similar pages have become the biggest obstacle to finding useful information quickly, so detecting and removing them has become an important research topic worldwide. This thesis analyzes, studies, and verifies deduplication techniques for similar web pages. First, building on the traditional Simhash method, it improves the similarity formula so that document similarity is computed from multiple aspects. It also improves the way the Simhash algorithm obtains document features: the weight of each word is computed by combining TF-IDF with the word's topic relevance. The retrieval step adopts the idea of hashing to buckets; when the distribution is uneven, the elements in a bucket are hashed a second time, which reduces the number of candidate pairs and makes the distribution more uniform. Second, building on the traditional hierarchical clustering algorithm, it improves the calculation of the distance between documents and proposes an improved algorithm based on a min-heap. The min-heap stores the distances from the distance matrix and keeps them ordered by size, which reduces the amount of computation. Finally, the thesis combines the improved Simhash algorithm, the improved hierarchical clustering algorithm, and a newly proposed scoring algorithm into a deduplication algorithm for technology blog web pages, and applies it to the deduplication of CSDN technology blog pages according to practical needs. Experiments show that the algorithm is sound and effective.
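The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it uses the textbook weighted Simhash, a plain band-based bucketing step (no second hashing), Hamming distance in place of the improved distance, and single-linkage merging driven by a min-heap. The word weights are assumed to be precomputed (for example from TF-IDF), and every name and parameter here (`token_hash`, `candidate_pairs`, `cluster`, `bands`, `max_dist`) is hypothetical.

```python
import hashlib
import heapq
from itertools import combinations

BITS = 64  # fingerprint width (assumed; the abstract does not state one)


def token_hash(token):
    """Stable 64-bit hash of a token. MD5 is an arbitrary choice for this sketch."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(weights):
    """Weighted Simhash: each token votes +w or -w on every bit position.

    `weights` maps token -> weight and is assumed to be precomputed.
    """
    votes = [0.0] * BITS
    for token, w in weights.items():
        h = token_hash(token)
        for i in range(BITS):
            votes[i] += w if (h >> i) & 1 else -w
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def candidate_pairs(fps, bands=4):
    """Bucket fingerprints by each band; only co-bucketed documents are compared.

    With `bands` equal-width bands, any two fingerprints that differ in fewer
    than `bands` bits must agree on at least one band (pigeonhole principle).
    """
    width = BITS // bands
    mask = (1 << width) - 1
    buckets = {}
    for idx, fp in enumerate(fps):
        for b in range(bands):
            key = (b, (fp >> (b * width)) & mask)
            buckets.setdefault(key, []).append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))
    return pairs


def cluster(fps, max_dist=3):
    """Single-linkage merging: pop the closest pair from a min-heap until
    the smallest remaining distance exceeds `max_dist`."""
    heap = [(hamming(fps[i], fps[j]), i, j) for i, j in candidate_pairs(fps)]
    heapq.heapify(heap)
    parent = list(range(len(fps)))  # union-find forest over document indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    while heap:
        d, i, j = heapq.heappop(heap)
        if d > max_dist:
            break
        parent[find(i)] = find(j)

    groups = {}
    for i in range(len(fps)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

In this sketch the default `max_dist=3` matches the four-band bucketing, so no pair within the threshold is missed; a larger threshold would need more bands. Each resulting group is a set of near-duplicate documents from which one representative could be kept.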

Keywords/Search Tags:

web page deduplication, Simhash, hierarchical clustering algorithm, scoring algorithm, technology blog

Related items

1	Research On Technology Blog Webpage De-duplication Technology Based On Simhash And CNN
2	Research On Deduplication Algorithm Based On Similarity And Chunking
3	Research In Data-deduplication Based On Storage System
4	Micro-blog Hot Topics Detection Method Based On Hybrid Clustering
5	Research On Clustering Algorithm And Its Application In Page Clustering
6	A Kind Of Hierarchical Data Deduplication Technology Research
7	Research And Implementation Of Network Scanning Technology Based On Intelligent Crawling Algorithm
8	Clustering Algorithm In Data Mining Research
9	An Enhanced Clustering Algorithm With Parallelization Improvement And Its Application In Micro-blog User Clustering
10	Research And Application Based On K-means Algorithm And Hierarchical Clustering Algorithm