
Feature Extraction and Duplicate Pattern Detection of Web Pages

Posted on: 2012-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: J J Li
Full Text: PDF
GTID: 2218330368982076
Subject: Computer application technology
Abstract/Summary:
The development of the Internet has brought new ways of obtaining information and communicating, and the number of sites and web pages is growing at an alarming rate. Search engines help people reach this information, but the many duplicate web pages produced by copying cause large numbers of identical pages to be stored in search engine indexes. These duplicate pages not only waste storage but also damage the user experience, so duplicate web page detection has become an essential step in search engines.

Duplicate web page detection consists of text extraction, feature extraction, and duplicate pattern detection; this thesis studies the last two problems.

The first problem is feature extraction. Extracted text that contains interfering information reduces the precision of detection approaches, because web pages are built from HTML tags and carry advertisements, introductions, and other noise. Considering the relations between paragraphs, sentences, and keywords, this thesis proposes an approach that combines the structure of the text with the importance of its content through layered information filtering (a rough sketch appears after this abstract). The method is efficient, counteracts page noise, and selects features that cover the original text evenly.

The second problem is duplicate pattern detection. Existing methods fall into two types: methods based on sets and methods based on feature-code strings. The set-based methods treat feature codes as elements of a set, store them with their attribute values, and detect duplicates by computing the similarity of the two sets (see the second sketch below); because they ignore the order of the feature codes, they are poorly suited to detecting duplicate web pages. The string-based methods treat the whole sequence of feature codes as a single string, so they cannot handle duplicate pages whose copied content is not contiguous. To address both problems, this thesis uses the LCS (Longest Common Subsequence), treating each sentence as one feature code (third sketch below); this reduces the dimensionality of the computation and guarantees that a sentence is never split. The thesis also improves the LCS algorithm to make it, and the whole system, more efficient.

Finally, the experimental results show good performance for this method, including high precision and a high removal rate.
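A minimal sketch of the layered filtering idea, in Python. The thesis does not publish its algorithm, so the scoring below is an assumption: sentences are weighted by simple global term frequency (standing in for the "importance of content" layer), and the top sentences of each paragraph are kept (the "structure" layer), so that the selected features spread evenly over the text. The names `extract_features` and `per_paragraph` are hypothetical.

```python
import re
from collections import Counter

def extract_features(text, per_paragraph=2):
    """Layered filtering sketch: paragraph structure plus content importance.
    A hypothetical simplification, not the thesis's actual algorithm."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    # "Content importance" layer: weight words by global term frequency.
    weight = Counter(re.findall(r"\w+", text.lower()))
    features = []
    for para in paragraphs:
        sentences = [s.strip() for s in re.split(r"[.!?]+", para) if s.strip()]
        # Score each sentence by the summed weight of its words.
        scored = sorted(
            sentences,
            key=lambda s: sum(weight[w] for w in re.findall(r"\w+", s.lower())),
            reverse=True,
        )
        # "Structure" layer: keep the top sentences of every paragraph,
        # so the selected features cover the whole text evenly.
        features.extend(scored[:per_paragraph])
    return features
```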
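For comparison, the set-based family that the abstract criticizes can be illustrated with Jaccard similarity over sets of feature codes. The abstract does not name the similarity measure those methods use, so Jaccard is an assumption here; the point of the example is that set comparison discards the order of the codes.

```python
def jaccard_similarity(codes_a, codes_b):
    """Set-based detection: feature codes become set elements, and two
    pages are compared by Jaccard similarity. The order of the codes is
    lost, which is the weakness the abstract points out."""
    a, b = set(codes_a), set(codes_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical feature codes (e.g. hashes of selected sentences).
page1 = [101, 202, 303, 404]
page2 = [404, 303, 202, 505]            # mostly the same codes, reordered
print(jaccard_similarity(page1, page2))  # 0.6 -- ordering is invisible here
```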
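The sentence-level LCS itself is the standard dynamic program; only the normalization into a duplicate score (`duplicate_score`, dividing by the shorter sequence) is an assumption, since the abstract does not give the thesis's scoring formula.

```python
def lcs_length(a, b):
    """Classic O(len(a) * len(b)) dynamic program for the longest common
    subsequence. Each element is one sentence's feature code, so a
    sentence is never split and the sequences stay short."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def duplicate_score(codes_a, codes_b):
    # Hypothetical normalization: divide by the shorter sequence so a
    # page copied wholesale into a longer page still scores high.
    return lcs_length(codes_a, codes_b) / min(len(codes_a), len(codes_b))
```

Because each element is a whole sentence's code, the DP table has tens of rows rather than thousands of characters, which is the dimensionality reduction the abstract describes; the thesis's further efficiency improvements (compare the inverted list and balanced binary tree in the keywords) are not reproduced here.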
Keywords/Search Tags: duplicate web page removal, layer filtering, longest common subsequence, inverted list, balanced binary tree