Research On Web Article Automatic Extraction Method Based On Page Segmentation

Posted on: 2013-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: J W Ou
Full Text: PDF
GTID: 2298330392969042
Subject: Computer Science and Technology
Abstract/Summary:
Since the 1990s, Internet technology has developed rapidly, and web page styles are now more diverse than before. Web pages are filled with unrelated information such as navigational links and advertisements. This irrelevant information makes it difficult to locate the useful content of a page correctly. At present, the only option is to save the whole page in order to keep the information that is actually needed. To solve this problem, we did the following work:

By analyzing a large number of web pages, we proposed a page segmentation algorithm based on the structural features of the DOM tree and on visual features. The algorithm segments pages into small particles, which serve as the basic processing units for the recognition algorithms. Experimental results showed that the page segmentation algorithm produces basic processing units that are suitable for the recognition algorithms to identify the body of an article.

After segmenting the pages, we extracted their structural and visual features and proposed a method to identify the body of a web article. The method uses a clustering algorithm and heuristic rules to produce an automatic wrapper.

After extracting the body of the web article, we analyzed the structural and visual features of the title, summary, illustrations, illustration titles and related links of web articles, and proposed methods to identify them.

Finally, we implemented an automatic wrapper that extracts web articles automatically. The wrapper system includes four modules: a feature extraction module, a page segmentation module, a semantic segmentation recognition module and an information extraction module.
Keywords/Search Tags: visual information, page segmentation, information extraction, classification, clustering
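
The abstract describes segmenting a page into small blocks and then applying heuristic rules to find the article body. The following is a minimal sketch of that general idea, not the thesis's actual algorithm: it uses only structural cues (block-level tags and link density), since visual features require a rendering engine, and it omits the clustering step. The names `BlockSegmenter`, `segment`, `pick_body` and the `max_link_density` threshold are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit "basic processing units"; the thesis may use other rules.
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


def _visible_chars(s):
    """Count non-whitespace characters."""
    return len("".join(s.split()))


class BlockSegmenter(HTMLParser):
    """Split a page into text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._parts = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it holds any visible text.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append({
                "text": text,
                "chars": _visible_chars(text),
                "link_chars": self._link_chars,
            })
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        if self._skip:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += _visible_chars(data)

    def close(self):
        super().close()
        self._flush()


def segment(html):
    """Return the list of text blocks found in an HTML string."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return parser.blocks


def pick_body(blocks, max_link_density=0.3):
    """Heuristic: the longest block whose text is mostly not link text."""
    candidates = [b for b in blocks
                  if b["link_chars"] / b["chars"] <= max_link_density]
    return max(candidates, key=lambda b: b["chars"], default=None)
```

In this sketch, navigation bars and advertisement blocks tend to be rejected because nearly all of their text sits inside links, while the article body is long and link-poor. A fuller system along the lines of the abstract would add rendered visual features (position, size, font) and cluster the blocks rather than rely on a single threshold.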