Research On Mobile Search Oriented WAP Duplicate Data Detection

Posted on: 2011-09-01
Degree: Master
Type: Thesis
Country: China
Candidate: M Cai
Full Text: PDF
GTID: 2178360305960304
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technology and the rapid growth of the World Wide Web, electronic information resources have grown explosively, and data volumes have increased from the GB level to the TB level and even to the PB level. Because resources are so diverse, information with similar content may appear on many kinds of pages in various forms, resulting in a large amount of redundant information. Duplicate information seriously degrades the user experience and also increases Internet costs. While existing duplicate data detection methods are mostly designed for web pages viewed on personal computers, methods for handling duplicated WAP pages on mobile phones remain largely unexplored.

This paper proposes a WAP-oriented duplicate data detection method, comprising a feature extraction method and a WAP-oriented SimHash algorithm. The method was validated on a real-world data set, and excellent performance was observed and analyzed. The work and contributions are as follows:

1. This paper proposes a category-wise feature extraction method consisting of two parts: a feature extraction step, which draws features according to the category of the WAP page, and a feature filtering step, which selects features based on the Vision-based Page Segmentation (VIPS) algorithm. The method takes into account not only the different definitions of duplication in different categories of WAP pages, but also the semantic information of WAP pages. The resulting features have an appropriate size, low computational complexity, and good representational power.

2. This paper presents a WAP-oriented duplicate data detection method that combines the category-wise feature extraction method with the SimHash algorithm. We also propose a measure for evaluating the performance of the detection method, which helps in setting the similarity threshold for WAP pages.

3. The duplicate data detection method was validated on real-world data, and excellent performance was observed in our experiments. This demonstrates the effectiveness of the WAP-oriented duplicate data detection method.
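
To make the SimHash-based comparison mentioned in the abstract more concrete, here is a minimal sketch of generic SimHash fingerprinting with a Hamming-distance threshold. It is not the thesis's WAP-oriented variant: the whitespace tokenizer stands in for the category-wise feature extraction and VIPS-based filtering, and the function names, the use of MD5 as the per-feature hash, the 64-bit width, and the threshold of 3 are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter


def simhash(features, bits=64):
    """Build a SimHash fingerprint from a {feature: weight} mapping."""
    totals = [0] * bits
    for feature, weight in features.items():
        # Assumption: MD5 is used only as a convenient, stable hash.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each feature votes +weight or -weight on every bit position.
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i in range(bits):
        if totals[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_duplicate(text_a, text_b, threshold=3):
    """Hypothetical helper: treat pages as duplicates when their
    fingerprints differ in at most `threshold` bits."""
    # Placeholder feature extraction: lowercase word counts.
    fa = simhash(Counter(text_a.lower().split()))
    fb = simhash(Counter(text_b.lower().split()))
    return hamming_distance(fa, fb) <= threshold
```

In a setup like the one the abstract describes, the word-count step would be replaced by features chosen per page category, and the threshold would be tuned with an evaluation measure rather than fixed in advance.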
Keywords/Search Tags: Duplicate Detection, Feature Extraction, WAP page, SimHash, VIPS