Research On Web News Extraction And Duplicates Elimination

Posted on:2012-03-26

Degree:Master

Type:Thesis

Country:China

Candidate:J D Hu

Full Text:PDF

GTID:2178330332976007

Subject:Computer application technology

Abstract/Summary:


With the rapid development of the World Wide Web, more and more organizations publish their information through the Internet, and search engines have become the most important tool for information retrieval. However, useless content such as navigation bars, advertisements, and duplicated pages places an extra burden on search engines. Web news extraction and duplicate elimination are therefore two important components of a search engine. In this thesis, we propose an algorithm for web news extraction based on the Maximum Subsequence Sum problem. Because it does not depend on conventional XML DOM parsing, it is more robust and efficient. We also propose an algorithm for eliminating duplicate web pages, called Long Keyword Sentence, which combines the merits of Shingling and I-Match. The algorithm reduces the number of tokens while improving the accuracy of the document sketch. In addition, we propose a method for pre-clustering duplicate documents based on document length. Experiments on about 150,000 web pages collected from 20 portals show that our algorithms improve both precision and recall compared with current state-of-the-art approaches.
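The abstract does not give the details of the extraction algorithm, so the following is only a minimal sketch of the general idea of casting content extraction as a Maximum Subsequence Sum problem. It assumes a simple scoring scheme that is not taken from the thesis: each HTML tag token receives a negative score and each text token a positive score, and the contiguous run of tokens with the largest total score is taken as the main text. The names `tokenize`, `score`, `max_subsequence`, `extract_main_text`, and the penalty and reward values are all hypothetical.

```python
import re

# A tag token is anything between '<' and '>'; a text token is a run of
# characters containing no whitespace and no '<'.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def tokenize(html):
    """Split an HTML string into tag tokens and text tokens (no DOM needed)."""
    return TOKEN_RE.findall(html)


def score(token, tag_penalty=-3.0, text_reward=1.0):
    """Assumed scoring: tags are penalized, text tokens are rewarded."""
    return tag_penalty if token.startswith("<") else text_reward


def max_subsequence(scores):
    """Kadane's algorithm.

    Returns (best_sum, start, end) for the contiguous slice scores[start:end]
    with the largest sum. An empty slice (sum 0) is returned if no score is
    positive.
    """
    best_sum, best_start, best_end = 0, 0, 0
    cur_sum, cur_start = 0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart at this token.
            cur_sum, cur_start = s, i
        else:
            cur_sum += s
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_sum, best_start, best_end


def extract_main_text(html):
    """Return the text tokens inside the highest-scoring token run."""
    tokens = tokenize(html)
    _, start, end = max_subsequence([score(t) for t in tokens])
    return " ".join(t for t in tokens[start:end] if not t.startswith("<"))


page = (
    '<div><a href="/">Home</a><a href="/ad">Ads</a></div>'
    "<p>The quick brown fox jumps over the lazy dog near the river bank today</p>"
    "<div><a>Next</a></div>"
)
print(extract_main_text(page))
```

On this toy page the link-dense navigation blocks score negatively and the paragraph is selected. The scan is a single linear pass over the token stream, which is consistent with the robustness and efficiency argument made in the abstract, although the scoring actually used in the thesis may differ.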

Keywords/Search Tags:

Search Engine, Web Extraction, Duplicates Elimination, Maximum Subsequence Sum, Long Keyword Sentence, Pre-Clustering


Related items

1	Research And Application Of An Elimination Algorithm For Redundant Information On Search Engine's Result
2	Study And Applications Of Duplicate Web Page's Elimination And Clustering Algorithm In Search Engine System Of Colleges And Universities
3	The Design And Implementation Of Chinese Personal Name Search Engine
4	Research On Text Extraction Method Based On Key Sentence And Keyword Association
5	Research On Near-Duplicates Detection Algorithm Of Search Engine
6	The Study Of Key Technologies For Chinese Domain-Oriented Search Engine
7	Implementation And Optimization Of A Large-scale Enterprise Search Engine
8	Research Of Chinese Meta Search Engine Based On Clustering
9	The Design And Implementation Of Vertical Search Engine Based On Duplicated Web Pages Elimination
10	Research On Search Results Clustering And Label Extraction