
Feature Extraction and Duplicate Pattern Detection of Web Pages

Posted on: 2012-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: J J Li
Full Text: PDF
GTID: 2218330368982076
Subject: Computer application technology
Abstract/Summary:
The development of the Internet has brought new ways of obtaining information and communicating, and the number of sites and web pages is growing at an alarming rate. Search engines help people reach this information, but the many duplicate web pages produced by copying cause large numbers of identical pages to be stored in search engine indexes. These duplicate pages not only waste storage but also damage the user experience, so duplicate web page detection has become an essential step in search engines.

Duplicate web page detection consists of text extraction, feature extraction, and duplicate pattern detection; this thesis studies the last two problems.

The first problem is feature extraction. Extracted text that contains interfering information reduces the precision of detection approaches, because web pages are built from HTML tags and carry advertisements, introductions, and other noise. Considering the relations between paragraphs, sentences, and keywords, this thesis proposes an approach that combines the structure of the text with the importance of its content through layered information filtering (a rough sketch appears after this abstract). The method is efficient, counteracts page noise, and selects features that cover the original text evenly.

The second problem is duplicate pattern detection. Existing methods fall into two types: methods based on sets and methods based on feature-code strings. The set-based methods treat feature codes as elements of a set, store them with their attribute values, and detect duplicates by computing the similarity of the two sets (see the second sketch below); because they ignore the order of the feature codes, they are poorly suited to detecting duplicate web pages. The string-based methods treat the whole sequence of feature codes as a single string, so they cannot handle duplicate pages whose copied content is not contiguous. To address both problems, this thesis uses the LCS (Longest Common Subsequence), treating each sentence as one feature code (third sketch below); this reduces the dimensionality of the computation and guarantees that a sentence is never split. The thesis also improves the LCS algorithm to make it, and the whole system, more efficient.

Finally, the experimental results show good performance for this method, including high precision and a high removal rate.
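A minimal sketch of the layered filtering idea, in Python. The thesis does not publish its algorithm, so the scoring below is an assumption: sentences are weighted by simple global term frequency (standing in for the "importance of content" layer), and the top sentences of each paragraph are kept (the "structure" layer), so that the selected features spread evenly over the text. The names `extract_features` and `per_paragraph` are hypothetical.

```python
import re
from collections import Counter

def extract_features(text, per_paragraph=2):
    """Layered filtering sketch: paragraph structure plus content importance.
    A hypothetical simplification, not the thesis's actual algorithm."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    # "Content importance" layer: weight words by global term frequency.
    weight = Counter(re.findall(r"\w+", text.lower()))
    features = []
    for para in paragraphs:
        sentences = [s.strip() for s in re.split(r"[.!?]+", para) if s.strip()]
        # Score each sentence by the summed weight of its words.
        scored = sorted(
            sentences,
            key=lambda s: sum(weight[w] for w in re.findall(r"\w+", s.lower())),
            reverse=True,
        )
        # "Structure" layer: keep the top sentences of every paragraph,
        # so the selected features cover the whole text evenly.
        features.extend(scored[:per_paragraph])
    return features
```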
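For comparison, the set-based family that the abstract criticizes can be illustrated with Jaccard similarity over sets of feature codes. The abstract does not name the similarity measure those methods use, so Jaccard is an assumption here; the point of the example is that set comparison discards the order of the codes.

```python
def jaccard_similarity(codes_a, codes_b):
    """Set-based detection: feature codes become set elements, and two
    pages are compared by Jaccard similarity. The order of the codes is
    lost, which is the weakness the abstract points out."""
    a, b = set(codes_a), set(codes_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical feature codes (e.g. hashes of selected sentences).
page1 = [101, 202, 303, 404]
page2 = [404, 303, 202, 505]            # mostly the same codes, reordered
print(jaccard_similarity(page1, page2))  # 0.6 -- ordering is invisible here
```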
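The sentence-level LCS itself is the standard dynamic program; only the normalization into a duplicate score (`duplicate_score`, dividing by the shorter sequence) is an assumption, since the abstract does not give the thesis's scoring formula.

```python
def lcs_length(a, b):
    """Classic O(len(a) * len(b)) dynamic program for the longest common
    subsequence. Each element is one sentence's feature code, so a
    sentence is never split and the sequences stay short."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def duplicate_score(codes_a, codes_b):
    # Hypothetical normalization: divide by the shorter sequence so a
    # page copied wholesale into a longer page still scores high.
    return lcs_length(codes_a, codes_b) / min(len(codes_a), len(codes_b))
```

Because each element is a whole sentence's code, the DP table has tens of rows rather than thousands of characters, which is the dimensionality reduction the abstract describes; the thesis's further efficiency improvements (compare the inverted list and balanced binary tree in the keywords) are not reproduced here.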
Keywords/Search Tags: duplicate web page removal, layer filtering, longest common subsequence, inverted list, balanced binary tree