Research On NLP-Based Duplicated Web Pages Deletion Algorithm

Posted on: 2010-08-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Weng
Full Text: PDF
GTID: 2178360278466399
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Because the web contains many duplicated pages, search engines need to find and delete them, both to save processing time and hardware resources and to ensure that users get results without many replicas. In this paper, we propose a method based on a "semantic fingerprint" for finding and deleting duplicated Chinese web pages. Existing algorithms for duplicated web page deletion are highly efficient but unsatisfactory in effectiveness, while classical copy detection algorithms for textual documents are much more precise but far less efficient. Our algorithm strikes a good balance between the two measures by combining traditional Information Retrieval with copy detection technology. Experimental results from the prototype system show that the algorithm works well under proper settings and that it is well suited to application in a Competitive Intelligence System (CIS).
Keywords/Search Tags: duplicated web page deletion, semantic fingerprint, information retrieval, copy detection
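
The abstract does not say how the semantic fingerprint is computed, so the following is only a minimal sketch of one plausible reading: keep the most frequent non-stop-word terms of a page (an Information Retrieval step) and hash them into a compact signature that can be compared cheaply (a copy detection step). The function names, the stop-word list, the top_k parameter, and the use of MD5 are all assumptions made for illustration, not details from the thesis. Chinese pages would first need word segmentation; the sketch assumes each page has already been split into tokens.

```python
import hashlib
from collections import Counter

# Hypothetical stop-word list; a real system would use a full list
# for the target language.
STOP_WORDS = frozenset({"the", "a", "of", "and", "to", "in", "is"})


def semantic_fingerprint(tokens, top_k=5):
    """Hash the top_k most frequent non-stop-word terms of a page.

    Assumption: the page is already tokenized (for Chinese text this
    means word segmentation has been done upstream).
    """
    counts = Counter(t.lower() for t in tokens if t.lower() not in STOP_WORDS)
    # Rank by descending frequency, then alphabetically, so ties are
    # broken deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Sort the selected terms so word order on the page does not matter.
    key_terms = sorted(term for term, _ in ranked[:top_k])
    return hashlib.md5(" ".join(key_terms).encode("utf-8")).hexdigest()


def deduplicate(pages, top_k=5):
    """Keep the first page seen for each fingerprint.

    `pages` maps a page id to its token list; the result lists the ids
    of the pages that are kept, in input order.
    """
    first_seen = {}
    for page_id, tokens in pages.items():
        first_seen.setdefault(semantic_fingerprint(tokens, top_k), page_id)
    return list(first_seen.values())
```

Under these assumptions, two pages that share the same key terms in a different order receive the same fingerprint, so only one of them is kept. Exact fingerprint matching is the cheapest comparison; a system aiming for higher recall would likely compare fingerprints with a similarity threshold instead.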