Research On Entity Relation Recognition In Information Extraction

Posted on: 2011-03-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Zhang
Full Text: PDF
GTID: 1118360305966706
Subject: Computer application technology
Abstract/Summary:
Research on information extraction (IE) is becoming increasingly important with the rapid growth of the Internet. A typical information extraction system aims to mine useful information from unstructured or semi-structured documents and to save it in a structured form, such as a relational database or XML. Information extraction has many applications, for example scholar search, product search, text mining and ontology construction. Because of this broad applicability, research on information extraction algorithms and techniques has become a hot topic.

Information extraction can be divided into many sub-tasks, such as event tracking and coreference resolution. Named entity recognition and relation extraction, however, can be seen as its two main sub-tasks because of their prominence in the literature. Named entity recognition aims to recognize named entities of all kinds, such as PERSON, LOCATION and ORGANIZATION. Relation extraction aims to find the relations among different entities. Named entity recognition can therefore be seen as a preliminary step in the relation extraction process.

There is already much solid work in the information extraction area, and its applications have become increasingly accessible to end users. But unsolved problems remain. For systems that apply pattern matching techniques, extending to new entity types or new relation types is rarely easy. Approaches based on statistical learning rely too heavily on specific and usually small training corpora, which also limits their extensibility.

In this dissertation, we study and summarize the achievements in the information extraction literature, point out the key open problems, and propose solutions to some of them, mainly in named entity recognition and in the extraction of relations and their temporal properties. We first introduce the history of the development of IE systems and their achievements. We also analyze the key techniques, related work and existing problems of named entity recognition and relation extraction respectively.

Boundary detection has always been a hard problem in Chinese named entity recognition. We propose a candidate entity generation algorithm based on web page structure and treat recognition as a classification problem. We also design an entity association technique based on two principles in the DOM tree, the principle of proximity measured as tree distance and the principle of no conflict in context, which increases the association accuracy.

The use of deep linguistic features is a key issue in relation extraction. We propose a relation extraction algorithm based on link grammar that uses the dependency relations between words as recognition features. During relation extraction, we also consider the temporal property of every relation instance.

Ontology construction is one of the main applications of information extraction systems. We consider a new ontology model with temporal attributes. To gather relation data, we propose a pattern-based approach for semi-structured data and a statistical learning based approach for free text. For the scenario where temporal information is missing, we propose a two-level inference technique: temporal information inference at the page level and at the ontology level.

Finally, to support systems in which relation types are not defined in advance, we propose a dynamic relation type recognition process based on a semantic role labeling tool. A conditional random field model is used as the labeling tool to recognize relation instances in single sentences.

The contributions of this dissertation can be summarized as follows:
1. A Chinese entity recognition and association algorithm. A candidate generation algorithm is given based on the web page structure; an entity association approach is designed based on two principles: the principle of proximity measured as tree distance and the principle of no conflict in context.
2. A relation extraction algorithm based on deep linguistic features. The dependency relations between words are taken as recognition cues; the temporal property of relation instances is also considered.
3. A temporal ontology model. Extraction approaches are given for both unstructured and semi-structured data; temporal inference is applied when temporal information is missing.
4. A new relation extraction algorithm for undefined relation types. The relation type is recognized dynamically by a semantic role labeling tool; the relation instances are labeled by conditional random fields.
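
The abstract does not spell out how the principle of proximity as tree distance is computed, so the following is only a minimal sketch of one plausible reading, not the dissertation's algorithm. It assumes each DOM node is identified by its path of child indices from the root, and that a mention is linked to whichever candidate entity is fewest edges away in the tree. The names `tree_distance`, `associate_by_proximity`, `anchors` and `mentions` are hypothetical, and the principle of no conflict in context is not modeled.

```python
from typing import Dict, Tuple

# Assumption: a DOM node is identified by the child indices on the way
# down from the root, e.g. (0, 1, 0) = root's child 0 -> child 1 -> child 0.
Path = Tuple[int, ...]


def tree_distance(a: Path, b: Path) -> int:
    """Number of edges on the path between two nodes of the same tree."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Climb from each node up to the lowest common ancestor.
    return (len(a) - common) + (len(b) - common)


def associate_by_proximity(
    anchors: Dict[str, Path], mentions: Dict[str, Path]
) -> Dict[str, str]:
    """Link each mention to the anchor entity nearest to it in the tree."""
    return {
        mention: min(anchors, key=lambda name: tree_distance(anchors[name], path))
        for mention, path in mentions.items()
    }


# Hypothetical page: two person names, each followed by an e-mail address
# inside the same container element.
anchors = {"Alice": (0, 1, 0), "Bob": (0, 2, 0)}
mentions = {"alice@example.org": (0, 1, 1), "bob@example.org": (0, 2, 3)}
links = associate_by_proximity(anchors, mentions)
```

In this toy page each address ends up linked to the name that shares its container, because siblings are two edges apart while nodes in different containers are four edges apart. Ties are broken by dictionary order here, which a real system would presumably resolve with further context.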
Keywords/Search Tags:Information Extraction, Named Entity Recognition, Relation Extraction, Link Grammar, Conditional Random Fields