Research On Rule-Based Information Extraction Technology For Web Text

Posted on:2012-02-23

Degree:Master

Type:Thesis

Country:China

Candidate:X C Li

Full Text:PDF

GTID:2218330368497726

Subject:Software engineering

Abstract/Summary:

With the development of Internet technology, the amount of information on the Internet is increasing exponentially, and how to process this large volume of online documents automatically has become an important research topic. Text information extraction is a natural language processing task that automatically extracts specific types of information from text, such as events and facts, forms structured data, and then populates database slots for queries.

This thesis studies Named Entity Recognition and an Entity Relation Extraction system. Named Entity Recognition is the basis of Entity Relation Extraction, and Entity Relation Extraction is one of the important tasks in Information Extraction (IE). Its major task is to find and determine the particular relations that hold between named entities. As basic research, Entity Relation Extraction is of great significance for Information Retrieval, Q&A systems, Information Filtering, Automatic Summarization, Machine Translation and the construction of digital libraries. At present, the major approaches to Entity Relation Extraction are repository-based algorithms, feature-based machine learning algorithms, kernel-based machine learning algorithms and pattern-based bootstrapping algorithms.

This thesis presents a Chinese named entity recognition and relation extraction system. Chinese named entity recognition integrates Hidden Markov Models (HMM) with rules that are automatically extracted from the training corpus. The recognition process is divided into two steps: first, the hidden Markov model is used for part-of-speech tagging; then, match rules are used to amend and convert the result of the HMM step. Combining the two models greatly improves the system's efficiency.

Two machine learning algorithms, Winnow and Support Vector Machine (SVM), were used to extract entity relations automatically from the training data of the ACE (Automatic Content Extraction) evaluation. Both algorithms need appropriate feature selection. The performance of both algorithms peaked when two words around an entity were selected as features. The average weighted F-scores of the Winnow and SVM algorithms were 73.08% and 73.27%, respectively. We conclude that, when the same feature vector is used, different machine learning algorithms differ little in performance. More attention should therefore be paid to finding better features when automatic learning methods are used to extract entity relations.
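To make the feature-based approach concrete, the following is a minimal sketch of context-window features fed to a Winnow classifier. It is not the thesis implementation: the abstract does not give the feature templates or the Winnow variant, so the function `window_features`, the class `Winnow`, the reading of "two words around an entity" as two words on each side, the promotion factor `alpha`, and the threshold value are all assumptions made for illustration. The classifier shown is the textbook binary Winnow with multiplicative updates; covering several ACE relation types would need one such classifier per type.

```python
def window_features(tokens, e1, e2, k=2):
    """Binary context features: up to k words left and right of each entity.

    tokens: list of words; e1, e2: (start, end) token spans, end exclusive.
    Taking k words on EACH side is an assumption, not stated in the abstract.
    """
    feats = set()
    for name, (start, end) in (("e1", e1), ("e2", e2)):
        for i in range(1, k + 1):
            if start - i >= 0:
                feats.add(f"{name}_L{i}={tokens[start - i]}")
            if end - 1 + i < len(tokens):
                feats.add(f"{name}_R{i}={tokens[end - 1 + i]}")
    return feats


class Winnow:
    """Textbook Winnow for binary labels over sparse binary features."""

    def __init__(self, alpha=2.0, threshold=4.0):
        self.alpha = alpha          # multiplicative update factor (assumed)
        self.threshold = threshold  # decision threshold (assumed)
        self.w = {}                 # unseen features implicitly weigh 1.0

    def score(self, feats):
        return sum(self.w.get(f, 1.0) for f in feats)

    def predict(self, feats):
        return self.score(feats) > self.threshold

    def update(self, feats, label):
        if self.predict(feats) == label:
            return  # Winnow is mistake-driven: no change when correct
        # Promote active weights on a missed positive, demote on a false positive.
        factor = self.alpha if label else 1.0 / self.alpha
        for f in feats:
            self.w[f] = self.w.get(f, 1.0) * factor

    def fit(self, data, epochs=5):
        for _ in range(epochs):
            for feats, label in data:
                self.update(feats, label)
        return self


# Toy data (hypothetical): does an employment relation hold between e1 and e2?
train = [
    (window_features(["Li", "works", "for", "Acme"], (0, 1), (3, 4)), True),
    (window_features(["Li", "visited", "near", "Acme"], (0, 1), (3, 4)), False),
]
model = Winnow().fit(train)
unseen = window_features(["Wang", "works", "for", "Initech"], (0, 1), (3, 4))
print(model.predict(unseen))
```

Because the features record only the surrounding words and not the entity strings themselves, the unseen sentence shares its features with the positive training example. Swapping `Winnow` for an SVM over the same feature sets changes only the learner, which is the comparison the abstract reports.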

Keywords/Search Tags:

Information Extraction, Chinese named entity, HMM, match rule

Related items

1	The Research Of Chinese Named Entity Recognition And Information Extraction
2	Research On Chinese Named Entity Recognition
3	Research On Chinese Named Entity And Entity Relationship Extraction
4	Research On Chinese Named Entity Recognition For Information Extraction
5	Chinese Named Entity Recognition Based On Feature Fusion And Span Encodin
6	Research On Automatic Extraction Of Chinese Named Entities And Entity Relations
7	The Research On Named Entity Recognition In Chinese Information Processing
8	Web Chinese Information Extraction, Named Entity Recognition And Application
9	The Research Of Chinese Named Entity Recognition And Its Relation Extraction
10	Chinese Named Entity Recognition Based On Neural Network