
Research On Vision-based Web Page Information Extraction Technology

Posted on: 2010-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: P Du
Full Text: PDF
GTID: 2178360278497049
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the web has become the largest information source in the world. As a carrier of information, it has become an important tool in people's work, study, daily life and entertainment. The web brings great convenience: it transcends the limits of time and space and lets people share a vast amount of information, and the volume of that information is growing exponentially. How to use this information effectively has therefore become an important research topic. Many technologies and applications treat the web as an information source, and in recent years web information extraction has attracted the attention of a growing number of researchers.

Because the data in web pages is semi-structured and lacks a strict syntactic structure, traditional natural language processing techniques are not well suited to web information extraction. Web pages are parsed, interpreted and rendered by the browser and then viewed by users, so they carry many visual features. If this visual information can be used for extraction, complex linguistic knowledge can be avoided. The focus of this study is therefore to use natural language processing and visual information together, so that each compensates for the shortcomings of the other, to extract information from web pages.

This thesis combines natural language processing with the visual characteristics of HTML pages and carries out the following work. First, it surveys the development of web information extraction technology, analyzes existing research, and summarizes the deficiencies of current techniques. Second, it studies the visual features of semi-structured web page information and uses heuristic rules based on these features to segment pages into blocks. To address the conversion of coarse-grained page blocks into fine-grained ones and the reorganization of the smallest sub-blocks, it proposes a vision-based web page information extraction algorithm, VWDREA (Vision-based Web Page Data Region Extraction Algorithm), which applies visual rules, analyzes the visual factors of page blocks, and determines the subject data region to be extracted. It also studies information collection on the semantic blocks of a web page, together with a topic search algorithm and a topic extraction algorithm. Finally, the thesis summarizes the work and gives an outlook on web information extraction technology.
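The abstract does not give the rules that VWDREA uses, so the following is only a minimal sketch of the general idea of choosing a data region from visual factors. It is not the thesis's algorithm. The `Block` fields, the scoring formula in `visual_score`, and the helper `pick_data_region` are all hypothetical, and the sketch assumes that large, horizontally centred, text-rich blocks with few links are the likeliest data regions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A rendered page block with a few visual features (hypothetical fields)."""
    name: str
    x: float            # left edge in pixels
    y: float            # top edge in pixels
    width: float
    height: float
    text_len: int       # number of visible text characters
    link_text_len: int  # characters that sit inside hyperlinks


def visual_score(block: Block, page_width: float, page_height: float) -> float:
    """Assumed heuristic: large, centred, link-poor blocks score higher."""
    area_ratio = (block.width * block.height) / (page_width * page_height)
    # Horizontal distance of the block centre from the page centre, clamped to [0, 1].
    centre_x = block.x + block.width / 2
    centre_offset = min(1.0, abs(centre_x - page_width / 2) / (page_width / 2))
    # A block with no text is treated as pure navigation (link ratio 1).
    link_ratio = block.link_text_len / block.text_len if block.text_len else 1.0
    return area_ratio * (1.0 - centre_offset) * (1.0 - link_ratio)


def pick_data_region(blocks: List[Block], page_width: float, page_height: float) -> Block:
    """Return the highest-scoring block; raises ValueError on an empty list."""
    return max(blocks, key=lambda b: visual_score(b, page_width, page_height))


# Made-up layout of a 1000 x 1000 pixel page, for illustration only.
blocks = [
    Block("nav", 0, 0, 1000, 80, text_len=200, link_text_len=200),
    Block("sidebar", 0, 80, 200, 900, text_len=500, link_text_len=400),
    Block("main", 200, 80, 800, 900, text_len=3000, link_text_len=150),
]
region = pick_data_region(blocks, page_width=1000, page_height=1000)
print(region.name)
```

In this made-up layout the navigation bar consists entirely of link text and the sidebar is narrow and off-centre, so the wide central block is selected. A real system would take block geometry from a rendering engine and would need rules for splitting and regrouping blocks, which this sketch leaves out.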
Keywords/Search Tags:Information Extraction, Vision Character, Vision-based Web Page Data Region Extraction Algorithm, Data Region, Topic Extraction