
Research Of Internet-Based Information Extraction Technology

Posted on: 2006-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Li
Full Text: PDF
GTID: 2168360152975691
Subject: Computer application technology
Abstract/Summary:
With the growth of the Internet, the Web has become one of the most important knowledge repositories, and delivering relevant information from it to users automatically and efficiently has become an important research problem. The output of an information extraction (IE) system can be presented directly to end users, and it is also the first step toward building intelligent query systems and data mining systems. IE systems therefore have good prospects, and IE techniques have become a focus of natural language processing research internationally.
This paper presents the history, key technologies, difficulties, and evaluation standards of information extraction, reviews the state of Internet information extraction, and comprehensively compares existing Internet information extraction techniques.
A new technique for supervised wrapper generation is proposed in this paper. It helps the user create wrapper programs semi-automatically through a fully visual, interactive user interface. Neither manual fine-tuning nor knowledge of the internal wrapper language is required, yet highly expressive extraction programs can be created. The user works directly and solely on example pages displayed in the browser. The system can extract target patterns based on surrounding landmarks, on the content itself, on HTML attributes, on the order of appearance, and on semantic and syntactic concepts.
The paper also proposes using a Maximum Entropy (ME) model for Chinese chunk parsing. It first defines Chinese chunks and lists all chunk categories and tags used in the model, then discusses how to select useful features, and finally introduces the procedure and algorithms for feature selection.
This paper uses a set of extraction patterns to automatically locate specific information items and the relations among them. It combines XML with database technology to build the information database, which further improves system performance. It also describes in detail how Web information is represented.
Based on this theoretical analysis, the paper designs and implements a practical system, SBIES (the Sham Battle Information Extraction System), and describes it in detail. Finally, it tests the model and reports experimental results.
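As a rough illustration of extraction based on surrounding landmarks, followed by XML output, the sketch below may help. It is not the thesis's wrapper language or the SBIES implementation: the sample page, the field names, the `PATTERNS` table, and the `extract` and `to_xml` helpers are all hypothetical, chosen only to show the general idea.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical example page; the markup and field names are illustrative only.
PAGE = """
<table>
<tr><td><b>Unit:</b></td><td>3rd Brigade</td></tr>
<tr><td><b>Location:</b></td><td>Hill 402</td></tr>
</table>
"""

# Each extraction pattern pairs a field name with a left and a right landmark
# (literal strings that surround the target text on the page).
PATTERNS = {
    "unit": ("<b>Unit:</b></td><td>", "</td>"),
    "location": ("<b>Location:</b></td><td>", "</td>"),
}


def extract(page, patterns):
    """Return a dict of field -> text found between each pair of landmarks."""
    record = {}
    for field, (left, right) in patterns.items():
        # Landmarks are escaped so they match literally; the target is the
        # shortest text between them.
        regex = re.escape(left) + r"(.*?)" + re.escape(right)
        match = re.search(regex, page, re.DOTALL)
        if match:
            record[field] = match.group(1).strip()
    return record


def to_xml(record):
    """Serialize one extracted record as a small XML fragment."""
    root = ET.Element("record")
    for field, value in record.items():
        ET.SubElement(root, field).text = value
    return ET.tostring(root, encoding="unicode")


record = extract(PAGE, PATTERNS)
xml_text = to_xml(record)
print(xml_text)
```

In a setup like this, the XML fragment could then be loaded into a database table with one column per field; how the thesis actually maps XML to its database is not stated in the abstract.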
Keywords/Search Tags: Internet Information Processing, Information Extraction, Maximum Entropy Principle, Pattern Matching