
Research On Detection Of Similar Web Pages Based On Text Structure Tree

Posted on: 2017-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y Ma
Full Text: PDF
GTID: 2348330503982761
Subject: Computer technology
Abstract/Summary:
With the continuous development of the Internet, the amount of information on the network is growing exponentially. On the one hand, this abundance provides more channels for news; on the other hand, it makes retrieval harder because of duplicated web pages. Duplicated web pages waste time, reduce index quality, and can degrade the ranking of results. Faced with the growing volume of web information, removing duplicated web pages rapidly and accurately has therefore become an important problem for the Internet.

Firstly, when large web sites reprint or copy each other's pages, the reprinted pages typically only add or delete content and rarely adjust the text structure of the page. In light of this phenomenon, the thesis analyzes the causes of duplicated web pages and discusses the advantages and disadvantages of traditional algorithms. On this basis, it proposes a duplicate web page detection algorithm based on the text structure tree, which can effectively improve precision and recall.

Secondly, drawing on the characteristics of the text structure tree, the thesis puts forward a duplicate detection algorithm based on the text structure tree and keywords. For preprocessed web pages, it applies a prefix filtering algorithm to remove duplicated pages preliminarily and then builds a text structure tree for the remaining pages. Keywords are analyzed with TF-IDF, a statistical method, extended with a tag weight. To prevent an overly large tag weight from reducing the comparability of the final word weights, the tag weight is normalized. Key sentences are extracted in proportion to paragraph length. The MD5 algorithm is then used to compare the similarity of the resulting "fingerprints".

Then, again drawing on the characteristics of the text structure tree, the thesis proposes a Bloom filter based duplicate detection algorithm built on the text structure tree. After preprocessing, the text structure tree of the web page is established. Feature strings are extracted from the page using a method that starts with one character and ends with two characters. The Bloom Filter algorithm is then used to compute and compare the "fingerprint" similarity of the nodes at each layer of the text structure tree. At the cost of a certain error rate, the Bloom Filter algorithm can effectively reduce time and space complexity.

Finally, the proposed algorithms are analyzed and validated on real data with respect to the duplicate detection results and the running time.
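
The Bloom filter comparison described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract does not specify the filter size, the hash functions, or the similarity measure, and its feature-string extraction rule is only named, so the sketch makes these assumptions: overlapping character n-grams (`shingles`) stand in for the feature strings, bit positions come from double hashing over MD5 and SHA-1, and similarity is estimated as the bitwise Jaccard ratio of two filters. All names and parameters (`BloomFilter`, `fingerprint`, `similarity`, `num_bits`, `num_hashes`) are hypothetical.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits      # assumed size, not from the thesis
        self.num_hashes = num_hashes  # assumed hash count, not from the thesis
        self.bits = 0

    def _positions(self, item):
        # Double hashing: derive num_hashes bit positions from two base hashes.
        data = item.encode("utf-8")
        h1 = int.from_bytes(hashlib.md5(data).digest(), "big")
        h2 = int.from_bytes(hashlib.sha1(data).digest(), "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def shingles(text, size=3):
    """Stand-in feature strings: overlapping character n-grams (an assumption)."""
    if len(text) <= size:
        return {text}
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def fingerprint(text, num_bits=1024, num_hashes=3):
    """Build a Bloom filter 'fingerprint' for the text of one tree node."""
    bf = BloomFilter(num_bits, num_hashes)
    for feature in shingles(text):
        bf.add(feature)
    return bf


def similarity(a, b):
    """Bitwise Jaccard estimate between two filters of the same size."""
    union = bin(a.bits | b.bits).count("1")
    if union == 0:
        return 1.0
    return bin(a.bits & b.bits).count("1") / union


original = fingerprint("the quick brown fox jumps over the lazy dog")
reprint = fingerprint("the quick brown fox jumps over the lazy cat")
unrelated = fingerprint("zzzz 12345 @@@@ qqqq")
print(similarity(original, reprint) > similarity(original, unrelated))
```

In a per-layer comparison, one such fingerprint would be built for each node of the text structure tree and nodes at the same depth would be compared pairwise. The fixed-size bit array is what trades a small false-positive rate for lower time and space cost.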
Keywords/Search Tags: detection of web pages, prefix filtering, text structure tree, web fingerprint similarity, Bloom Filter