
Research And Implementation On Web-based Information Extraction Using Vision Characters

Posted on: 2009-10-28
Degree: Master
Type: Thesis
Country: China
Candidate: W Zhang
GTID: 2178360242966530
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Internet technology, the web has become the largest virtual database in the world, and how to use web information effectively has become an important research topic. More and more web-based technologies and applications have appeared, among them web information extraction, which has attracted much attention from researchers in recent years.

Because the information in web pages, especially semi-structured pages, lacks syntactic structure, traditional natural language processing techniques do not apply well to web information extraction. Web pages, however, are parsed and rendered by browsers and viewed by users, so they carry many vision characters. If the vision information in web pages can be used for information extraction, complex linguistic knowledge can be avoided. The focus of this study is therefore to use natural language processing and vision information together, so that each compensates for the shortcomings of the other when extracting information from web pages.

The study combines natural language processing techniques with the vision characters of HTML pages to extract information from web pages. The work in this thesis can be summarized as follows:

1. A method of Data Region Extraction based on Vision Characters (DREV) is proposed. Named entity tagging from natural language processing is first used to provide simple semantic information. The data region is then identified by applying vision character rules and analyzing the Entity Density (ED) of each block.
2. An algorithm of Records Extraction based on Constraint Satisfaction (RECS) is proposed, which analyzes the entities in the data region. Following the idea of the Constraint Satisfaction Problem (CSP), constraint rules are defined to group and extract records from semi-structured web pages.
3. A prototype system, VisionWebIE, is designed and implemented on the GATE framework, an open source project of the University of Sheffield. The prototype system is shown to improve the recall rate, the precision rate, and the ability to adapt to changes.
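
The first two contributions above can be illustrated with a minimal sketch. The abstract does not give the formula for Entity Density or the actual constraint rules, so everything here is an assumption for illustration: ED is taken to be tagged entities per token, the data region is taken to be the block with the highest ED, and record grouping uses a simple greedy pass that enforces one hypothetical constraint (a record holds at most one entity of each type). It is not a general CSP solver and not the thesis's DREV or RECS implementation; the names `Block`, `entity_density`, `pick_data_region`, and `group_records` are invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (entity type, surface text), e.g. ("Person", "Ann Lee")
Entity = Tuple[str, str]


@dataclass
class Block:
    """A visually delimited page block with its tagged entities (hypothetical)."""
    name: str
    token_count: int
    entities: List[Entity]


def entity_density(block: Block) -> float:
    # Assumed definition: number of tagged entities divided by number of tokens.
    if block.token_count == 0:
        return 0.0
    return len(block.entities) / block.token_count


def pick_data_region(blocks: List[Block]) -> Block:
    # Assumed rule: the block with the highest entity density is the data region.
    return max(blocks, key=entity_density)


def group_records(entities: List[Entity]) -> List[Dict[str, str]]:
    # Greedy grouping under one assumed constraint: a record may contain
    # at most one entity of each type. Seeing a repeated type starts a new record.
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for etype, text in entities:
        if etype in current:
            records.append(current)
            current = {}
        current[etype] = text
    if current:
        records.append(current)
    return records


# Made-up sample data: a navigation block and a contact list block.
blocks = [
    Block("nav", 40, [("Organization", "ACME")]),
    Block("list", 20, [
        ("Person", "Ann Lee"), ("Phone", "555-0101"),
        ("Person", "Bo Chen"), ("Phone", "555-0102"),
    ]),
]

region = pick_data_region(blocks)
records = group_records(region.entities)
print(region.name)
print(records)
```

In this made-up input the navigation block has one entity in 40 tokens and the list block has four entities in 20 tokens, so the list block is selected and its entities are split into two records. A real system would add further constraints (ordering, visual alignment, required fields) and search over groupings rather than committing greedily.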
Keywords/Search Tags: Information Extraction, Vision Characters, Entity Recognition, Constraint Satisfaction, Ontology