A Gate-based Information Extraction System: Research And Implementation

Posted on: 2008-02-21
Degree: Master
Type: Thesis
Country: China
Candidate: D X Xu
GTID: 2178360212990706
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, the web has become the largest virtual database in the world, and how to use web information effectively has become an important research topic. A growing number of web-based technologies and applications have appeared as a result, among them web information extraction, which has attracted much attention from researchers in recent years.

Because web pages lack a standardized structure, traditional natural language processing techniques do not apply well to web information extraction. However, most of the content of web pages is presented as lists of attributes, and this structure can be exploited during extraction to avoid relying on complex linguistic knowledge. How to combine several methods for information extraction has therefore become one focus of study.

This thesis combines natural language processing with the structural characteristics of HTML pages to extract information from the web. The main work is as follows:

1. Propose a method for analyzing the DOM tree for information extraction. The method is based on named entity tagging, and the extraction rules are expressed in XPath.
2. Propose an algorithm for locating Blocks of Interest, based on competitive classification. The algorithm significantly reduces the impact of noise on the results.
3. Design and implement a prototype system on the Gate framework, an open source project of the University of Sheffield. The prototype system improves recall, extraction efficiency, and adaptability to change.

The system developed in this thesis conforms to the Gate standards, and the experimental results are satisfactory. It can be deployed as a component in information systems and is worth further study.
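To make the idea of XPath-based extraction rules over attribute lists concrete, here is a minimal sketch. It is not the thesis's rule format or its Gate component: the sample page, the `RULE` dictionary, and the `extract_attributes` function are all hypothetical. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page: a navigation block (noise)
# and a table that lists attributes as name/value pairs.
PAGE = """
<html>
  <body>
    <div id="nav"><a href="/">Home</a></div>
    <table class="spec">
      <tr><td>Brand</td><td>Acme</td></tr>
      <tr><td>Weight</td><td>1.2 kg</td></tr>
    </table>
  </body>
</html>
""".strip()

# Hypothetical extraction rule: one XPath selects the rows of the block
# of interest, two relative XPaths select the name and value cells.
RULE = {
    "rows": ".//table[@class='spec']/tr",
    "name": "td[1]",
    "value": "td[2]",
}


def extract_attributes(page, rule):
    """Apply an XPath rule to a page and return {attribute name: value}."""
    root = ET.fromstring(page)
    record = {}
    for row in root.findall(rule["rows"]):
        name = row.find(rule["name"])
        value = row.find(rule["value"])
        if name is None or value is None:
            continue  # skip rows that do not match the name/value layout
        key = "".join(name.itertext()).strip()
        record[key] = "".join(value.itertext()).strip()
    return record


record = extract_attributes(PAGE, RULE)
print(record)
```

Because the rule addresses only the attribute table, the navigation block never reaches the output. Real web pages are rarely well-formed, so a practical system would first need a tolerant HTML parser to build the DOM tree.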
Keywords/Search Tags:Information Extraction, Gate Framework, Ontology, Named Entity Recognition, XML