Sample Denoising And Model Optimization In Distant Supervision For Relation Extraction

Posted on: 2017-05-23
Degree: Master
Type: Thesis
Country: China
Candidate: J F Qu
Full Text: PDF
GTID: 2308330482995038
Subject: Computer software and theory
Abstract/Summary:
Nowadays, there is a large amount of unstructured data on the network, and helping people understand these data has become a problem that needs to be solved. Researchers have proposed transforming unstructured data into structured form, and relation extraction is one of the essential steps in doing so. Traditional relation extraction methods are divided by the source of their training data into supervised, semi-supervised and unsupervised methods. However, in the era of big data, the development of these methods has encountered a bottleneck.

In response to the characteristics of current data, researchers put forward the concept of distant supervision for relation extraction. Distant supervision methods use an existing knowledge base and corpus to generate training data through heuristic matching. The matching is based on the following assumption: if a sentence in the corpus contains an entity pair from the knowledge base, then the sentence must express the relation labels that the entity pair has in the knowledge base. Obviously, this assumption is so strong that it introduces a large amount of noisy data.

We summarize the noisy data problem as two issues, multi-label and multi-instance. Multi-label: the same entity pair may have multiple relation labels in the knowledge base, and after heuristic matching it is unclear which label a given sentence expresses. Multi-instance: some sentences do not express any of the relation labels in the knowledge base. To deal with these problems, this thesis offers solutions from two aspects:

(1) Sample redirection based on clustering: first, we determine the sets of relation label candidates by constructing an undirected graph. The vertices of the graph represent relation labels, and an edge between two vertices means that the two relation labels appeared together in the training data. After the graph is constructed, we find its connected components to obtain the sets of relation label candidates of the same type. Then, we divide the sentences that belong to the same candidate set into clusters by applying K-means clustering to their feature vectors. Lastly, we adopt a majority vote, which uses the information given in the knowledge base, to determine the relation label of every cluster and therefore the relation label of each sentence. This approach not only solves the multi-label problem, but also finds missing and potential relation labels of entity pairs in the knowledge base.

(2) Adaptive model training process: this thesis redefines the relation extraction model. To address the multi-instance problem, we gradually release the relation labels given by redirection. For those mentions that may belong to the NA label, we do not update the parameters strictly. These measures relieve the multi-instance problem to a certain extent. In addition, we optimize the training process by using the stochastic gradient descent algorithm and other approximate solutions. These practices greatly improve the efficiency of the algorithm.

The results of comparative experiments show that the two methods proposed in this thesis handle the multi-label and multi-instance problems well. The precision of the model obtained with this training method is higher than that of earlier methods.
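The graph construction and majority vote in the sample redirection step can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes that two relation labels "appear together" when the knowledge base assigns both to the same entity pair, that cluster ids come from a separate K-means step (not shown), and that ties in the vote are broken alphabetically. The function names, input formats and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def candidate_label_sets(entity_pair_labels):
    """Return the connected components of the label co-occurrence graph.

    entity_pair_labels maps an entity pair to the set of relation labels the
    knowledge base gives it (assumed input format). Labels sharing an entity
    pair are treated as connected by an edge.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for labels in entity_pair_labels.values():
        ordered = sorted(labels)
        if not ordered:
            continue
        root = find(ordered[0])
        for other in ordered[1:]:
            parent[find(other)] = root  # merge the two components

    components = defaultdict(set)
    for label in list(parent):
        components[find(label)].add(label)
    return sorted(sorted(c) for c in components.values())


def label_clusters(cluster_ids, kb_labels):
    """Assign each sentence the majority knowledge-base label of its cluster.

    cluster_ids[i] is the cluster of sentence i (assumed to come from K-means);
    kb_labels[i] is the set of labels the knowledge base gives sentence i's
    entity pair. Ties are broken alphabetically (an assumption).
    """
    votes = defaultdict(Counter)
    for cid, labels in zip(cluster_ids, kb_labels):
        votes[cid].update(labels)
    winner = {
        cid: min(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for cid, counts in votes.items()
    }
    return [winner[cid] for cid in cluster_ids]


# Toy data, invented for illustration only.
kb = {
    ("e1", "e2"): {"born_in", "lived_in"},
    ("e3", "e4"): {"founder_of", "ceo_of"},
    ("e5", "e6"): {"capital_of"},
    ("e7", "e8"): {"ceo_of"},
}
label_sets = candidate_label_sets(kb)

sentence_clusters = [0, 0, 0, 1, 1]
sentence_kb_labels = [
    {"born_in", "lived_in"}, {"born_in"}, {"born_in"},
    {"lived_in"}, {"born_in", "lived_in"},
]
redirected = label_clusters(sentence_clusters, sentence_kb_labels)
```

On the toy data, `label_sets` contains three candidate sets, and every sentence in cluster 0 is redirected to `born_in` while every sentence in cluster 1 is redirected to `lived_in`, even where the knowledge base listed both labels for the entity pair.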
Keywords/Search Tags: Unstructured Data, Relation Extraction, Distant Supervision for Relation Extraction, Multi-instance, Multi-label