Web News Extraction Based on Structural and Visual Consistency

Posted on: 2011-07-31 | Degree: Master | Type: Thesis
Country: China | Candidate: J F Wang | Full Text: PDF
GTID: 2178360302974612 | Subject: Computer application technology
Abstract/Summary:
With the massive popularity of the Internet, thousands of news sites have emerged, and they continuously publish large numbers of news pages. However, computer programs cannot directly identify the news headlines and news bodies in Web news pages. Therefore, in the context of Web information retrieval, there is a huge demand for automatic extraction techniques. We propose two kinds of Web news extraction algorithms, and then design and implement a Web news extraction system that effectively combines these two extraction algorithms. 1) Based on the structural consistency of news pages, we propose a template-dependent news extraction algorithm. It induces templates by making full use of the structural consistency property: dynamic pages are produced by filling predefined templates with structured data. We also introduce a small number of user labels to help distinguish the important parts of the templates from the useless ones. 2) Based on the visual consistency of news pages, we propose a template-independent news extraction algorithm. Relying on the generalization ability of machine learning, we exploit the visual consistency property: news headlines and news bodies are always designed with consistent styles and layouts. Special features and models are designed for identifying news headlines and news bodies. Finally, experiments on 7594 Web news pages from 24 news sites show that our news extraction system achieves high extraction accuracy.
Keywords/Search Tags:Automatic information extraction, classification, Support Vector Machine, Web mining, wrapper
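
The structural consistency idea in the abstract can be illustrated with a minimal sketch. This is not the thesis's algorithm; it is a simplified stand-in that assumes pages from one site share a template, so text that is identical across pages belongs to the template and text that varies is a candidate data slot. The names `TextNodeCollector`, `text_nodes`, `induce_slots`, and `extract` are hypothetical, and the user-labeling step that separates important slots from useless ones is left out.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TextNodeCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    """Parse one page and return its (tag path, text) pairs in document order."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def induce_slots(pages):
    """Return the tag paths whose text differs across pages (candidate data slots)."""
    texts_by_path = defaultdict(set)
    for html in pages:
        for path, text in text_nodes(html):
            texts_by_path[path].add(text)
    return {path for path, texts in texts_by_path.items() if len(texts) > 1}


def extract(html, slots):
    """Keep only the text nodes of a page that sit in an induced data slot."""
    return [(path, text) for path, text in text_nodes(html) if path in slots]


# Hypothetical sample pages that share one template.
PAGE_A = ("<html><body><h1>Storm hits coast</h1><div><p>Heavy rain fell.</p></div>"
          "<span>Copyright Example News</span></body></html>")
PAGE_B = ("<html><body><h1>Markets rally</h1><div><p>Stocks rose sharply.</p></div>"
          "<span>Copyright Example News</span></body></html>")

if __name__ == "__main__":
    slots = induce_slots([PAGE_A, PAGE_B])
    print(extract(PAGE_A, slots))
```

In this toy case the copyright line is identical on both pages and is treated as template text, while the headline and body paths vary and are kept. Real pages have repeated paths, ads, and related-link blocks that also vary, which is presumably where the user labels mentioned in the abstract come in.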