Research On Detection Of Similar Web Pages Based On Text Structure Tree

Posted on:2017-01-24

Degree:Master

Type:Thesis

Country:China

Candidate:Y Ma

GTID:2348330503982761

Subject:Computer technology

Abstract/Summary:

With the continuous development of the Internet, the volume of online information is growing exponentially. On the one hand, this abundance provides more channels for news; on the other hand, it makes retrieval harder because of duplicated web pages. Duplicated web pages waste users' time, reduce the quality of the index, and can degrade the ranking of search results. Removing duplicated web pages quickly and accurately from the growing body of web information has therefore become an important problem for the Internet.

First, when large websites reprint or copy pages from one another, the reprinted pages usually only add or delete content and rarely change the text structure of the page. Based on this observation, this thesis analyzes the causes of duplicated web pages and discusses the advantages and disadvantages of traditional algorithms. On this basis, it proposes a duplicate web page detection algorithm based on the text structure tree, which can effectively improve precision and recall.

Second, drawing on the characteristics of the text structure tree, the thesis puts forward a duplicate detection algorithm based on the text structure tree and keywords. For preprocessed web pages, it applies a prefix filtering algorithm to remove duplicated pages in a preliminary pass and builds a text structure tree for the pages that remain. When extracting keywords with the statistical TF-IDF method, it introduces a tag weight. To prevent an overly large tag weight from reducing the comparability of the final word weights, the tag weight is normalized. Key sentences are extracted in proportion to paragraph length. The MD5 algorithm is then used to generate the "fingerprints" that are compared for similarity.

Third, again drawing on the characteristics of the text structure tree, the thesis proposes a Bloom Filter based duplicate detection algorithm built on the text structure tree. After preprocessing, the text structure tree of the web page is established. Feature strings are extracted with a method that starts with one character and ends with two characters. The Bloom Filter algorithm is then used to compute and compare the "fingerprint" similarity of the nodes at each layer of the text structure tree. At the cost of a certain error rate, the Bloom Filter algorithm effectively reduces time and space complexity.

Finally, the proposed algorithms are analyzed and validated on real data, in terms of both the duplicate detection results and the running time.
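
The Bloom Filter comparison described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract does not specify the feature string extraction rule in enough detail to reproduce it, so the hypothetical feature_strings helper below substitutes overlapping character n-grams. The filter size, the number of hash functions, the use of MD5 with double hashing to pick bit positions, and the Jaccard-style bit overlap used as the similarity score are all assumptions made for illustration. In the thesis the comparison is applied to the nodes at each layer of the text structure tree; here it is applied to two plain strings.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter that stores its bit array in a Python int."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Double hashing: derive k bit positions from the two halves of one MD5 digest.
        digest = hashlib.md5(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force odd so the step is nonzero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def feature_strings(text, size=3):
    """Hypothetical stand-in extractor: overlapping character n-grams."""
    text = "".join(text.split())  # drop all whitespace
    if len(text) <= size:
        return {text} if text else set()
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def fingerprint(text, num_bits=1024, num_hashes=3):
    """Build a Bloom filter 'fingerprint' from the feature strings of a text."""
    bloom = BloomFilter(num_bits, num_hashes)
    for feature in feature_strings(text):
        bloom.add(feature)
    return bloom


def similarity(a, b):
    """Jaccard-style overlap of set bits between two equally sized filters."""
    union = bin(a.bits | b.bits).count("1")
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return bin(a.bits & b.bits).count("1") / union


if __name__ == "__main__":
    page = "Bloom filters trade a small false positive rate for large savings in memory"
    reprint = page + " and time"
    other = "zzzz qqqq xxxx wwww 1234 5678"
    print(similarity(fingerprint(page), fingerprint(reprint)))
    print(similarity(fingerprint(page), fingerprint(other)))
```

In this sketch a reprinted page that only adds a few words keeps most of its bits in common with the original, so it scores much higher than an unrelated page. A threshold on the score would then decide whether two nodes count as duplicates; the threshold value is not given in the abstract.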

Keywords/Search Tags:

detection of web pages, prefix filtering, text structure tree, web fingerprint similarity, Bloom Filter

Related items

1	Detection Of Near-replicas Of Web Pages Based On Text Structure
2	Research On Detection And Elimination Of Similar Web Pages Based On Text Structure
3	The Research On Bloom Filter Based On The Tree Structure
4	Research On Elimination Of Similar Web Pages Based On Text Structure And Extraction Of Long Sentences
5	A Fast IP Lookup Algorithm Based On Pivot-pushing And Bloom Filter
6	The Implementation And Application Of Removing Duplicated Web Pages Based On Bloom Filter
7	K-Prefix Tree Full-Text Search Method And Application
8	Bad Text Filtering System Research And Implementation
9	Content Synchronization In Distributed Systems
10	The Design Of Bloom Filter Algorithm For Key-value Storage