
Research On Web News Extraction And Duplicates Elimination

Posted on: 2012-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: J D Hu
Full Text: PDF
GTID: 2178330332976007
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the World Wide Web, more and more organizations release their information through the Internet, and search engines have become the most important tool for information retrieval. However, useless content such as navigation bars, advertisements, and duplicated pages places an extra burden on search engines. Web news extraction and duplicate elimination are therefore two important problems in the search engine field. In this thesis, we propose an algorithm for web news extraction based on the Maximum Subsequence Sum problem. The algorithm does not rely on the conventional XML DOM, which improves its robustness and efficiency. We also propose an algorithm for web duplicate elimination, which we call Long Keyword Sentence, that combines the merits of Shingling and I-Match. This algorithm reduces the number of tokens while improving the accuracy of the document sketch. In addition, we propose a method for pre-clustering duplicate documents based on document length. Experiments on about 150,000 web pages collected from 20 portals show that our algorithms improve both precision and recall compared with current state-of-the-art approaches.
Keywords/Search Tags: Search Engine, Web Extraction, Duplicates Elimination, Maximum Subsequence Sum, Long Keyword Sentence, Pre-Clustering
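
The abstract does not give the details of the extraction algorithm, so the following is only a minimal sketch of how a Maximum Subsequence Sum formulation can locate the main text of a page without building a DOM tree. It treats the page as a flat token sequence, rewards text words, penalizes markup tags, and keeps the contiguous run with the highest total score. The scoring constants, the tokenizer, and the function names (`tokenize`, `extract_main_text`) are assumptions made for illustration and are not taken from the thesis.

```python
import re

# Hypothetical scores: these values are assumptions for illustration only.
TAG_SCORE = -3.0   # penalty for each markup tag
WORD_SCORE = 1.0   # reward for each text word

# A token is either a whole tag ("<...>") or a run of non-space text.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html):
    """Split raw HTML into a flat list of tag tokens and word tokens."""
    return TOKEN_RE.findall(html)


def extract_main_text(html):
    """Return the words inside the maximum-sum contiguous token run."""
    tokens = tokenize(html)
    best_sum, best_start, best_end = 0.0, 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, tok in enumerate(tokens):
        score = TAG_SCORE if tok.startswith("<") else WORD_SCORE
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart the run here.
            cur_sum, cur_start = score, i
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    words = [t for t in tokens[best_start:best_end] if not t.startswith("<")]
    return " ".join(words)


if __name__ == "__main__":
    page = (
        '<div><a href="/">Home</a> <a href="/ads">Ads</a></div>'
        "<p>The mayor opened the new bridge on Monday "
        "after two years of work</p>"
        '<div><a href="/about">About</a></div>'
    )
    print(extract_main_text(page))
```

The scan is a single linear pass over the tokens, which is consistent with the efficiency motivation stated in the abstract. A practical system would presumably use learned or tuned token scores rather than the two fixed constants assumed here.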