
Research On Open Relation Extraction And Classification Based On Word Embeddings

Posted on: 2020-10-03    Degree: Doctor    Type: Dissertation
Country: China    Candidate: P Q Liu    Full Text: PDF
GTID: 1368330605481279    Subject: Computer Science and Technology
Abstract/Summary:
Relation extraction plays an important role in information extraction (IE), which turns unstructured information expressed in natural language text into a structured representation. Traditional approaches to IE focus mostly on a predefined set of target relations. Consequently, extracting relations that are not predefined requires the user to manually define new extraction rules or to annotate new training data by hand. Since large-scale real-world text contains rich relations, it is difficult to define all of them in advance. To acquire this information from text, Open Relation Extraction (ORE) was proposed by Banko et al. in 2007. A multitude of ORE systems have been developed in the last decade. Nevertheless, these systems still have performance problems: some extracted tuples are incorrect, and their efficiency cannot satisfy the requirements of large-scale text processing. On the other hand, although ORE systems can output all kinds of relations, a specific downstream NLP task is interested in only a few of these relation types. Thus open relation classification is quite important to downstream applications, yet there is no report on the classification of open tuples in the literature. Based on previous research, the main contributions of this thesis are as follows.

(1) An ORE algorithm based on word vectors is proposed. To improve accuracy, recently presented ORE systems rely mainly on supervised learning or dependency parsing. These systems require not only annotated training data but also external NLP tools, which may lead to a decline in efficiency and to error propagation. This thesis presents a novel method for web-scale open information extraction that employs a cosine distance over Skip-gram word vectors as the confidence score of an extraction. The thesis also gives a mathematical analysis of the new method using Bayesian inference and machine learning theory, showing that the proposed algorithm approximates the maximum likelihood estimate of the joint probability distribution over the elements of a candidate extraction. Experiments on three benchmark datasets show that the distance-based method improves on recently presented ORE systems in terms of both effectiveness and efficiency, achieving the best F-measure of 67.0% on the WEB-500 & NYT-500 datasets.
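
The following is a minimal sketch of how such a cosine-based confidence score over Skip-gram vectors could be computed for a candidate tuple (arg1, relation phrase, arg2). The abstract does not specify the exact scoring function, so the averaging of word vectors into phrase vectors and the way the two argument-relation similarities are combined are illustrative assumptions; word_vectors stands for any pre-trained Skip-gram embedding table.

# Sketch only, not the thesis' exact formula: score a candidate open tuple
# (arg1, relation phrase, arg2) by cosine similarity of Skip-gram vectors.
# `word_vectors` maps tokens to pre-trained numpy vectors.
from typing import Dict, List
import numpy as np

def phrase_vector(tokens: List[str], word_vectors: Dict[str, np.ndarray]) -> np.ndarray:
    """Average the vectors of the in-vocabulary tokens of a phrase."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros_like(next(iter(word_vectors.values())))
    return np.mean(vecs, axis=0)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = float(np.linalg.norm(u) * np.linalg.norm(v))
    return float(u @ v) / denom if denom else 0.0

def tuple_confidence(arg1: List[str], rel: List[str], arg2: List[str],
                     word_vectors: Dict[str, np.ndarray]) -> float:
    """Confidence of a candidate extraction: mean cosine similarity between
    the relation phrase and each of its two arguments (one plausible choice)."""
    r = phrase_vector(rel, word_vectors)
    return 0.5 * (cosine(phrase_vector(arg1, word_vectors), r)
                  + cosine(phrase_vector(arg2, word_vectors), r))

# Hypothetical usage:
# tuple_confidence(["Obama"], ["was", "born", "in"], ["Hawaii"], word_vectors)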
(2) A semi-eager learning approach for open relation classification is presented. Although ORE systems can extract all kinds of relations from text, a downstream NLP task is interested in only a few of these relation types, so open relation classification is quite important to downstream applications. This thesis proposes a novel semi-eager learning algorithm (SemiE) to tackle this problem. SemiE is a vector-based algorithm that stores only the "class center" of each category. Its training time complexity is O(n), and both its time and space complexity for prediction are O(k), where k is the number of categories and n is the number of training examples. Experimental results on three datasets show that SemiE outperforms state-of-the-art methods on open relation classification, obtaining an F1-score of 84.6% on the SemEval-2010 Task 8 dataset at much lower computational cost.

(3) To further improve the performance of extraction and classification, this thesis presents an information-quantity-based model that transforms relation phrases into vectors using word embeddings. Since the entities and relations in open tuples are mostly expressed as phrases, the thesis proposes a novel approach based on information quantity to encode phrases into embeddings. The model computes a phrase representation from the information quantity of the words within the phrase; that is, it computes a weighted distribution over the word sequence that emphasizes the encoding of "important" words and de-emphasizes the encoding of "unimportant" words (a combined sketch of this encoding and the SemiE classifier is given after contribution (4) below). Finally, the new model is applied to the open information extraction and classification methods above, achieving a superior F1-score of 69.0% on the WEB-500 & NYT-500 datasets and the best F1-score of 85.1% on the SemEval-2010 Task 8 dataset.

(4) A prototype system for Open Relation Extraction based on word embeddings is designed and developed. Following the models proposed in this thesis, a prototype ORE system is implemented on Linux. The system takes HTML or plain text as input and outputs a set of open tuples, each associated with a confidence score computed from Skip-gram word vectors.
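
The sketch below combines contributions (2) and (3): relation phrases are encoded as information-quantity-weighted averages of word vectors, and a SemiE-style classifier stores only one class-center vector per category. The abstract does not give the exact formulas, so the self-information weight -log p(w) and the cosine decision rule are assumptions made for illustration.

# Sketch of a SemiE-style classifier over information-quantity-weighted
# phrase embeddings; weighting scheme and decision rule are assumptions.
from collections import Counter, defaultdict
from typing import Dict, List
import math
import numpy as np

def info_weights(corpus: List[List[str]]) -> Dict[str, float]:
    """Estimate a self-information weight -log p(w) for every word in the corpus."""
    counts = Counter(tok for sent in corpus for tok in sent)
    total = sum(counts.values())
    return {w: -math.log(c / total) for w, c in counts.items()}

def encode_phrase(tokens: List[str],
                  word_vectors: Dict[str, np.ndarray],
                  weights: Dict[str, float]) -> np.ndarray:
    """Information-quantity-weighted average of the word vectors of a phrase."""
    dim = len(next(iter(word_vectors.values())))
    num, den = np.zeros(dim), 0.0
    for t in tokens:
        if t in word_vectors:
            w = weights.get(t, 1.0)
            num += w * word_vectors[t]
            den += w
    return num / den if den else num

class SemiE:
    """Semi-eager classifier: one center vector per relation class
    (O(n) training, O(k) prediction, with k the number of classes)."""

    def fit(self, phrases, labels, word_vectors, weights):
        sums = defaultdict(float)
        counts = Counter(labels)
        for toks, y in zip(phrases, labels):
            sums[y] = sums[y] + encode_phrase(toks, word_vectors, weights)
        self.centers = {y: sums[y] / counts[y] for y in counts}
        return self

    def predict(self, tokens, word_vectors, weights):
        v = encode_phrase(tokens, word_vectors, weights)
        def sim(y):
            c = self.centers[y]
            denom = float(np.linalg.norm(v) * np.linalg.norm(c))
            return float(v @ c) / denom if denom else 0.0
        return max(self.centers, key=sim)

# Hypothetical usage:
# clf = SemiE().fit(train_phrases, train_labels, word_vectors, weights)
# clf.predict(["was", "acquired", "by"], word_vectors, weights)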
Keywords/Search Tags: Open Relation Extraction, Open Relation Classification, word embeddings, semi-eager learning, phrase embeddings