Research And Implementation Of A Generic Web Information Extraction System

Posted on: 2008-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: L Liu
GTID: 2208360242958369
Subject: Computer application technology
Abstract:
With the growth of the Internet, the World Wide Web has become the world's largest space for disseminating and sharing information and one of the most important knowledge repositories. Efficient information extraction is therefore highly desirable, and how to deliver relevant information from the Internet to users automatically has become an important research problem. The information extracted by IE (Information Extraction) systems can be presented directly to end users, and it also serves as a rich data source on which intelligent query systems and data mining systems can be built. IE systems therefore have good prospects.

This thesis reviews the development, key technologies, difficulties, and evaluation criteria of information extraction, and compares and analyzes several kinds of Web information extraction technology. It then describes rule-based Web information extraction and its applications.

Based on this theoretical analysis, the thesis designs and implements GSIES (General Information Extract System) and gives a detailed account of its rule definition, information collection, and extraction. The system keeps rule definition and information collection independent of each other, and provides a unified, user-friendly interface together with the core IE module. It also includes an extended information library that collects accurate information to improve matching. A Bloom filter is used so that the same URL is not processed more than once, and the extracted data are stored in a local database. Finally, the system is tested and experimental results are reported.
Keywords/Search Tags:Information Extraction, DOM, XML, Regular Expression, Bloom Filters
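
The abstract mentions using a Bloom filter so that a URL is not processed twice. The listing does not show how GSIES implements this, so the following is only a minimal sketch of the general technique. The class name `BloomFilter`, the helper `unseen_urls`, the bit-array size, the number of hash functions, and the SHA-256 double-hashing scheme are all assumptions made for illustration, not details from the thesis.

```python
import hashlib


class BloomFilter:
    """Illustrative fixed-size Bloom filter (hypothetical, not the GSIES code)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, so never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def unseen_urls(urls, seen: BloomFilter):
    """Yield each URL the filter has not recorded yet, recording it as we go."""
    for url in urls:
        if url in seen:
            continue  # probably handled already; false positives are possible
        seen.add(url)
        yield url


if __name__ == "__main__":
    seen = BloomFilter()
    queue = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
    for url in unseen_urls(queue, seen):
        print("fetch", url)
```

A Bloom filter never reports a stored URL as unseen, but it can occasionally report a new URL as already seen. A crawler that uses one therefore trades a small chance of skipping a page for a memory footprint far smaller than a full set of URLs.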