
Research on Rule-Based Information Extraction Technology for Web Text

Posted on: 2012-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: X C Li
GTID: 2218330368497726
Subject: Software engineering
Abstract/Summary:
With the development of Internet technology, the amount of information on the Internet is increasing exponentially. An important line of research focuses on how to process this large volume of online documents automatically. Text information extraction is a natural language processing task that automatically extracts specific types of information from text, such as events and facts, forms structured data from it, and then populates database slots for queries.

This thesis studies Named Entity Recognition and an Entity Relation Extraction system. Named Entity Recognition is the basis of Entity Relation Extraction, and Entity Relation Extraction is one of the important tasks in information extraction (IE). Its main task is to find and determine the particular relations that hold between named entities. As basic research, Entity Relation Extraction is of great significance for Information Retrieval, Q&A systems, Information Filtering, Automatic Summarization, Machine Translation, and the construction of digital libraries. At present, the major approaches to Entity Relation Extraction are repository-based algorithms, feature-based machine learning algorithms, kernel-based machine learning algorithms, and pattern-based bootstrapping algorithms.

This thesis presents a Chinese named entity recognition and relation extraction system. Chinese named entity recognition combines Hidden Markov Models (HMM) with rules that are automatically extracted from the training corpus. The recognition process is divided into two steps: first, a hidden Markov model is used for part-of-speech tagging; then, match rules are applied to amend and convert the output of the HMM step. Combining the two models substantially improved the system's efficiency.

Two machine learning algorithms, Winnow and Support Vector Machine (SVM), were used to automatically extract entity relations from the training data of the ACE (Automatic Content Extraction) evaluation. Both algorithms need appropriate feature selection. Both reached their peak performance when the two words around an entity were selected as features. The average weighted F-scores of the Winnow and SVM algorithms were 73.08% and 73.27%, respectively. We conclude that, when the same feature vector is used, different machine learning algorithms differ little in performance. More attention should therefore be paid to finding better features when automatic learning methods are used to extract entity relations.
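As an illustration of the feature-based approach described above, the following is a minimal sketch of context-window features fed to a Winnow classifier. It is not the thesis's implementation. The names `pair_features` and `Winnow`, the English toy sentences, the binary "employment" label, the parameter values, and the reading of "two words around an entity" as two words on each side are all assumptions made for this example. The thesis itself works on Chinese ACE data with multiple relation types.

```python
def pair_features(tokens, e1, e2, size=2):
    """Binary context features: up to `size` words on each side of both entities.

    e1 and e2 are (start, end) token spans with an exclusive end (assumed format).
    """
    feats = set()
    for name, (start, end) in (("E1", e1), ("E2", e2)):
        for k in range(1, size + 1):
            if start - k >= 0:
                feats.add(f"{name}:L{k}={tokens[start - k]}")
            if end - 1 + k < len(tokens):
                feats.add(f"{name}:R{k}={tokens[end - 1 + k]}")
    return feats


class Winnow:
    """Positive Winnow: multiplicative updates on the weights of active features."""

    def __init__(self, alpha=2.0, threshold=4.0):
        self.alpha = alpha          # promotion/demotion factor (illustrative value)
        self.threshold = threshold  # decision threshold (illustrative value)
        self.weights = {}           # features not seen yet count as weight 1.0

    def score(self, feats):
        return sum(self.weights.get(f, 1.0) for f in feats)

    def predict(self, feats):
        return self.score(feats) > self.threshold

    def fit(self, examples, epochs=5):
        for _ in range(epochs):
            for feats, label in examples:
                if self.predict(feats) == label:
                    continue
                # Promote on a missed positive, demote on a false positive.
                factor = self.alpha if label else 1.0 / self.alpha
                for f in feats:
                    self.weights[f] = self.weights.get(f, 1.0) * factor
        return self


# Hypothetical toy data: True means the entity pair is in an employment relation.
train = [
    ("Li works for Huawei".split(), (0, 1), (3, 4), True),
    ("Wang works for Lenovo".split(), (0, 1), (3, 4), True),
    ("Li visited Beijing yesterday".split(), (0, 1), (2, 3), False),
    ("Zhang visited Shanghai today".split(), (0, 1), (2, 3), False),
]
examples = [(pair_features(t, a, b), y) for t, a, b, y in train]
clf = Winnow().fit(examples)

print(clf.predict(pair_features("Chen works for Baidu".split(), (0, 1), (3, 4))))
print(clf.predict(pair_features("Chen visited Nanjing yesterday".split(), (0, 1), (2, 3))))
```

Because the classifier only ever sees the feature set, replacing Winnow with another linear learner such as an SVM changes the update rule but not the input. This is consistent with the abstract's observation that feature choice matters more than the choice of algorithm.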
Keywords/Search Tags: Information Extraction, Chinese named entity, HMM, match rule