Research On Vision-based Web Page Information Extraction Technology

Posted on:2010-05-27

Degree:Master

Type:Thesis

Country:China

Candidate:P Du

Full Text:PDF

GTID:2178360278497049

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, the web has become the largest information source in the world. As a carrier of information, the web has become an important tool in people's work, study, daily life and entertainment. It has brought great convenience to human life, because it allows large amounts of information to be shared across the limits of time and space. This information is growing exponentially, so how to use it effectively has become an important research topic. Many technologies and applications treat the web as a source of information, and in recent years web information extraction has attracted the attention of a growing number of researchers.

Because the data in web pages is semi-structured and lacks a strict syntactic structure, traditional natural language processing techniques are not well suited to web information extraction. Web pages are parsed, interpreted and rendered by the browser and then viewed by users, so they carry many visual features. If the visual information in web pages can be used for information extraction, complex linguistic knowledge can be avoided. The focus of this study is therefore to use natural language processing and visual information together, so that each compensates for the shortcomings of the other, in order to extract information from web pages.

This thesis combines natural language processing with the visual characteristics of HTML pages to extract information from web pages, and carries out the following research work. First, it reviews the development of web information extraction technology, analyzes existing research, and summarizes the deficiencies of current techniques. Second, it studies the visual features of semi-structured web page information and uses heuristic rules based on those features to divide pages into blocks. To address the conversion of coarse-grained page blocks into fine-grained ones and the reorganization of the smallest sub-blocks, it proposes a web page information extraction algorithm based on visual characteristics (VWDREA, Vision-based Web Page Data Region Extraction Algorithm), which applies visual-feature rules, analyzes the visual factors of page blocks, and determines the subject data region to be extracted. It also studies information collection over the semantic blocks of a web page, together with a topic search algorithm and a topic extraction algorithm. Finally, the thesis summarizes the work and gives an outlook on web information extraction technology.
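
The abstract does not give the rules used by VWDREA, so the following is only a minimal Python sketch of the general idea of choosing a data region from visual factors. It is not the thesis's algorithm. The `Block` fields, the three factors (relative area, closeness to the page center, link density) and the way they are multiplied together are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A rendered page block (hypothetical structure, not from the thesis)."""
    name: str
    x: float          # left edge in pixels
    y: float          # top edge in pixels
    width: float
    height: float
    text_chars: int   # total characters of text in the block
    link_chars: int   # characters that sit inside hyperlinks


def score(block: Block, page_width: float, page_height: float) -> float:
    """Combine three assumed visual factors into one score in [0, 1]."""
    # Factor 1: share of the page area covered by the block.
    area_ratio = (block.width * block.height) / (page_width * page_height)

    # Factor 2: how close the block center is to the page center.
    cx = block.x + block.width / 2
    cy = block.y + block.height / 2
    dx = abs(cx - page_width / 2) / (page_width / 2)
    dy = abs(cy - page_height / 2) / (page_height / 2)
    centrality = 1.0 - (dx + dy) / 2

    # Factor 3: fraction of text that is link text (navigation-like blocks
    # score high here). A block with no text counts as fully link-like.
    link_density = block.link_chars / block.text_chars if block.text_chars else 1.0

    return area_ratio * centrality * (1.0 - link_density)


def pick_data_region(blocks, page_width, page_height):
    """Return the block with the highest score as the candidate data region."""
    return max(blocks, key=lambda b: score(b, page_width, page_height))


# Made-up layout of a 1000 x 1000 pixel page.
blocks = [
    Block("header", 0, 0, 1000, 100, text_chars=50, link_chars=40),
    Block("nav", 0, 100, 200, 800, text_chars=300, link_chars=300),
    Block("main", 200, 100, 800, 800, text_chars=5000, link_chars=200),
    Block("footer", 0, 900, 1000, 100, text_chars=100, link_chars=80),
]

best = pick_data_region(blocks, 1000, 1000)
print(best.name)  # prints: main
```

In this sketch, multiplying the factors means a block that is purely navigation (link density of 1) scores zero regardless of its size. A weighted sum would be an equally plausible choice.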

Keywords/Search Tags:

Information Extraction, Vision Character, Vision-based Web Page Data Region Extraction Algorithm, Data Region, Topic Extraction

Related items

1	Extraction Algorithm, Based On Visual Features Of The Web Page
2	Research On Region Of Interest Extraction
3	Subject-oriented Mode Of The Xml Page And Data Extraction
4	Research On Page Segmentation Based On CEF
5	Research On Text Region Extraction Based On Edge Information
6	Robust region extraction: Extracting model and domain parameters in the presence of noise and multiple populations
7	Structure Information Extraction- Study And Implementation On Semi-auto Wrapper
8	Image Linear Feature Extraction Used In Road Detection Of Vision Navigation
9	Research On Web Information Extraction Technology Based On Deep Web
10	Research On Feature Extraction And Matching Method Of Spherical Stereo Vision