Detection Of Near-replicas Of Web Pages Based On Text Structure

Posted on:2009-11-11

Degree:Master

Type:Thesis

Country:China

Candidate:L X Wei

GTID:2178360272963564

Subject:Computer software and theory

Abstract/Summary:

When Google appeared in 1998, it won over most web users with its superior practicality. With Google, people can often find the information they want right away. In recent years, however, as the volume of information on the web has grown explosively, users have begun to suffer from the shortcomings of existing search engines (SEs), the worst being that the returned results include many replicas. Some replicas are entirely identical; others are only partly similar. The main reason replicas exist on the web is illegal copying. For the search engine itself, these replicas waste scarce resources and lower indexing efficiency. For users, replicas carry no value at all, yet users still have to browse through them. Their presence on the web also infringes intellectual property rights. Removing replicas quickly and accurately is therefore necessary for the development of search engines and also helps protect intellectual property rights.

In recent years, scholars have proposed several methods for detecting replicas on the web. These methods achieve good results on replicas that are entirely identical but poor results on replicas that are only partly similar.

In this thesis, we propose a dynamic method for detecting and removing replicas, based on the features of web pages themselves and the ways in which they are replicated. The method first analyzes and classifies the text style of each kind of web page. It then represents the text of each page as a text structure tree according to its text style. Finally, it calculates a similarity degree by dynamically extracting features from the text structure tree, and uses that degree to detect replicas. Based on a large amount of real data and experiments, we obtained the following results and conclusions:

1. Through manual analysis, the text styles of web pages are sorted into four categories, each containing several finer-grained subcategories. For each category, a corresponding algorithm for assigning a value to each paragraph is proposed.
2. The texts of web pages are represented as text structure trees, and an algorithm for constructing this representation is presented.
3. A method for detecting replicas of web pages by extracting features from the text structure tree is presented, together with a layer fingerprint algorithm for calculating the similarity degree.
4. The method is evaluated on a large number of web pages and compared with other existing methods on the same data.

We manually collected 12,000 web pages as test data. The experiments show that the method detects both entirely identical replicas and partly similar replicas better.
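
The abstract does not give the details of the text structure tree or the layer fingerprint algorithm, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes a hypothetical three-layer tree (document, paragraphs, sentences), uses MD5 hashes as fingerprints, and scores partial overlap with Jaccard similarity. All function names are illustrative, and the text style classification and per-paragraph value assignment described above are omitted.

```python
import hashlib
import re


def fingerprint(text):
    """Hash lowercased, whitespace-normalized text (an assumed fingerprint)."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def build_tree(text):
    """Hypothetical tree: a list of paragraphs, each a list of sentences."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        [s.strip() for s in re.split(r"(?<=[.!?])\s+", p) if s.strip()]
        for p in paragraphs
    ]


def layer_fingerprints(tree):
    """Return one fingerprint set per layer: document, paragraph, sentence."""
    doc = {fingerprint(" ".join(" ".join(p) for p in tree))}
    paras = {fingerprint(" ".join(p)) for p in tree}
    sents = {fingerprint(s) for p in tree for s in p}
    return [doc, paras, sents]


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(text_a, text_b):
    """Compare two texts layer by layer, from the coarsest layer down."""
    layers_a = layer_fingerprints(build_tree(text_a))
    layers_b = layer_fingerprints(build_tree(text_b))
    # Same top-layer fingerprint: an entirely identical replica.
    if layers_a[0] == layers_b[0]:
        return 1.0
    # Otherwise average the overlap of the paragraph and sentence layers
    # (an assumed scoring rule for partly similar replicas).
    para_score = jaccard(layers_a[1], layers_b[1])
    sent_score = jaccard(layers_a[2], layers_b[2])
    return (para_score + sent_score) / 2
```

In this sketch, an exact match at the top layer settles entirely identical replicas at once, while the finer layers give partly similar pages a score between 0 and 1 that can be compared against a chosen threshold.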

Keywords/Search Tags:

Replicas of web pages, Text structure tree, Layer fingerprint, Detection and removal of replicas of web pages

Related items

1	Research On Finding Near-Replicas Of Documents On The Web
2	Research On Detection And Elimination Of Similar Web Pages Based On Text Structure
3	Research On Detection Of Similar Web Pages Based On Text Structure Tree
4	Features Extraction And Duplicate Pattern Detection Of Web Pages
5	Research On Elimination Of Similar Web Pages Based On Text Structure And Extraction Of Long Sentences
6	Research On D2D Caching Algorithm Based On Replicas And Social
7	Research On The Technology Of Incremental Web Pages Crawler
8	Research Of Web Page Purification And Replicas Detection In Search Engine
9	Research On Replicas Placement And Cache Optimization Of HDFS
10	Mining The Link Spamming And Malicious Web Pages Based On Topology Structure Of Massive Internet Web Pages