Research And Implementation Of Web Page Segmentation Algorithm MFPS Based On Multi-feature

Posted on: 2009-10-28
Degree: Master
Type: Thesis
Country: China
Candidate: J J Yu
Full Text: PDF
GTID: 2198360308977801
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the Web has become an important source of information. To meet the growing demand for obtaining information from the Internet, web information extraction has become an active research topic. Web pages are diverse, complex and semi-structured, and these characteristics make information extraction difficult. Extracting accurate information from web pages despite these characteristics is therefore a pressing problem.
Web page segmentation is one current approach to extracting information from complicated web pages. However, most web page segmentation algorithms rely on a single feature, so they cannot properly handle complex types of web pages. To address this, this thesis proposes MFPS, a new web page segmentation algorithm based on multiple features. The algorithm first divides the web page into independent semantic blocks and then extracts the blocks that the application needs.
This thesis first analyzes multiple features of web pages, including the layout feature, the visual feature, the semantic feature and the document structure feature, and proposes a multi-feature model of web page semantic blocks. On that basis, it proposes the web page segmentation algorithm MFPS and explains its basic idea and process. The thesis then focuses on the implementation of MFPS: it analyzes and solves the problem of identifying similar blocks, describes the node sequence merging approach for the single-line type, multi-line type, multi-block type and line-block cross type, and describes methods for identifying a block's semantic type, segmentation type and multi-feature information. On that basis, the thesis gives a formal description of MFPS and an experimental analysis.
Finally, building on MFPS, this thesis proposes PTIBID, a page type identification algorithm based on the importance degree of blocks. By analyzing the block structure and multi-feature information produced by MFPS, this algorithm can effectively identify page types and extract information attributes to meet the practical needs of web information extraction. The experimental results show that, compared with existing web page segmentation algorithms, MFPS segments pages more accurately, produces a more reasonable block structure and adapts better to different pages. This shows that MFPS can provide effective support for web information extraction.
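The abstract does not give the details of how MFPS identifies similar blocks, so the following is only a minimal illustrative sketch of the general idea of merging runs of structurally similar sibling nodes into one candidate block. It uses the document structure feature alone, whereas MFPS is described as combining layout, visual, semantic and structural features. The names `Node`, `signature` and `merge_similar_siblings` are hypothetical and do not come from the thesis.

```python
from dataclasses import dataclass, field
from itertools import groupby
from typing import List


@dataclass
class Node:
    """A simplified DOM node (hypothetical model, not the thesis's data structure)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def signature(node: Node) -> tuple:
    """Structural fingerprint: the tag plus the fingerprints of all children."""
    return (node.tag, tuple(signature(child) for child in node.children))


def merge_similar_siblings(parent: Node) -> List[List[Node]]:
    """Group consecutive children with the same fingerprint into one candidate block."""
    return [list(run) for _, run in groupby(parent.children, key=signature)]


# Assumed example page: a heading, three repeated list entries, and a footer.
page = Node("body", children=[
    Node("h1", "Headlines"),
    Node("div", children=[Node("a", "Story 1"), Node("span", "10:00")]),
    Node("div", children=[Node("a", "Story 2"), Node("span", "11:30")]),
    Node("div", children=[Node("a", "Story 3"), Node("span", "12:15")]),
    Node("p", "Footer"),
])

blocks = merge_similar_siblings(page)
print([[node.tag for node in block] for block in blocks])
# → [['h1'], ['div', 'div', 'div'], ['p']]
```

An exact-match fingerprint is the simplest possible similarity test. A multi-feature approach such as the one the abstract describes would presumably also weigh layout, visual and semantic cues, which this sketch does not attempt.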
Keywords/Search Tags: web information extraction, page segmentation algorithm, multi-feature analysis, similar block identification, page type identification