Research And Implementation Of A Generic Web Information Extraction System

Posted on:2008-09-27

Degree:Master

Type:Thesis

Country:China

Candidate:L Liu

Full Text:PDF

GTID:2208360242958369

Subject:Computer application technology

Abstract/Summary:

With the development of the Internet, the WWW has become the world's largest space for disseminating and sharing information and one of the most important knowledge repositories. Efficient information extraction is therefore highly desirable, and how to automatically deliver relevant information from the Internet to users has become an important research issue. The information extracted by IE (Information Extraction) systems can be provided directly to end users, and it also serves as a rich data source on which intelligent query systems and data mining systems can be built. IE systems therefore have good prospects.

This thesis reviews the development, key technologies, difficulties, and evaluation criteria of information extraction, and compares and analyzes several kinds of Web information extraction technology. It then describes rule-based Web information extraction and its applications.

Based on this theoretical analysis, the thesis designs and implements GSIES (General Information Extract System) and describes in detail its rule definition, information collection, and extraction components. The system keeps rule definition and information collection independent of each other, and combines them with a unified, user-friendly interface and a core IE module. It also includes an extended information library intended to collect accurate information for better matching. A Bloom filter is used to ensure that the same URL is not processed twice, and the extracted data are stored in a local database. Finally, the thesis tests the system and reports experimental results.
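The abstract says only that Bloom filters keep the same URL from being processed twice; it gives no implementation details. The following is a minimal sketch of that idea, not the thesis's code. The class and function names (`BloomFilter`, `filter_new_urls`), the bit-array size, the number of hash functions, and the SHA-256-based double hashing are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter. Sizes are illustrative, not taken from the thesis."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive the bit positions from two 64-bit slices of a single
        # SHA-256 digest (double hashing); this scheme is an assumption.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def filter_new_urls(urls, seen: BloomFilter):
    """Return the URLs not yet recorded in `seen`, recording each as it is accepted."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


if __name__ == "__main__":
    seen = BloomFilter()
    batch = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
    print(filter_new_urls(batch, seen))
```

A Bloom filter can report a false positive, so a crawler built this way may occasionally skip a URL it has never visited, but it never revisits one it has recorded. The trade-off buys a small, fixed memory footprint compared with storing every URL in a set.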

Keywords/Search Tags:

Information Extraction, DOM, XML, Regular Expression, Bloom Filters

Related items

1	The Research And Implementation Of Web Information Extraction System Based On The Regular Expression
2	Research On WEB Entity Information Extraction Algorithm And Its Application
3	The Research Of Web Information Extraction Technique And Application Based On NFA Regular Matching
4	The Application And Research Of Regular Expression In Webpage Extraction
5	Research On Multi-dimensional Regular Expression Matching Algorithm For Network Security
6	The Design And Implementation Of Regular Expression Engines Based On Deterministic Finite Automata
7	Research And System Realization Of Key Technology Of Information Extraction Optimization
8	The Research On Web Information Extraction Technology
9	A Web-based News And Information Extraction System Design And Realization
10	The Application Research Of Regular Expression In Telecommunication Services Processing