
Research on Denoising Mechanism and Sample Distribution Imbalance in Relation Extraction Task

Posted on: 2021-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: M T Li
Full Text: PDF
GTID: 2428330620968127
Subject: Computer technology
Abstract/Summary:
In natural language processing, relation extraction detects the relations between entity pairs in unstructured data and produces structured data in the form of entity-relation triples. This greatly improves the efficiency of downstream information tasks, so the task has drawn extensive research attention. Traditional supervised models rely on high-quality training sets, which are difficult to obtain. To address this, remote (distant) supervision is widely used: starting from a small amount of labeled data, it can quickly generate relation labels for a large-scale corpus. However, compared with a supervised dataset, the resulting dataset contains noisy data, called Unknown Unknowns (UUs), caused by the sample distribution bias between the knowledge base and the real corpus. The defining feature of UUs is that the model cannot locate them from evaluation metrics, even though, from a common-sense point of view, their labels may be completely wrong. UUs in the training set inevitably interfere with the final results of deep learning methods. Meanwhile, imbalanced data in the real corpus causes the neural network to underfit relation types with few samples, so it cannot learn the real feature information of each relation. Current research on relation extraction focuses on analyzing the context of the sentence that contains the entity pair, and uses strategies such as reinforcement learning and generative adversarial learning to help models identify noise and thus reduce its impact. However, because the test data is also noisy, models cannot verify whether the selected data are real UUs. Moreover, natural language understanding is highly abstract, and models tend to ignore ambiguity in the text. As for imbalanced data in the text domain, the discussion is still at an active exploratory stage. To solve the above problems, this thesis carries out the following work:
1. A context-based attention mechanism is proposed to detect coarse-grained UUs. We observe that a word embedding is a single vector that has no semantic relation to the surrounding content. We therefore propose an entity-pair embedding, which is combined with the hidden features to form sentence-level weight information. This raises the contribution weight of keywords in the sentence and reduces the impact of noisy data on the classification result.
2. A human-in-the-loop denoising framework is designed to clean noisy data semi-automatically. To identify UUs at low cost and with high quality, the framework consists of three modules: coarse-grained location of potential UUs, fine-grained UU cleaning, and a deep learning module. The coarse-grained module first detects potential UUs, and fine-grained crowdsourcing then cleans them. The cooperation of the coarse- and fine-grained stages realizes semi-automatic interaction between human and machine and balances cost against quality.
3. Few-shot-learning-based relation extraction is implemented to improve performance on few-shot types. To address the unavoidable data imbalance problem, we propose a classification method based on few-shot learning. Relation types with few samples are defined as few-shot types. A prototype for each type is learned from the feature information of its samples, which reduces the weight of other types and eliminates the impact of distribution bias on the classification result.

Overall, the denoising framework addresses the problem of hidden noisy data in the dataset and provides data support for the relation extraction task. Because the strategy and model in each module are independent, the framework is portable and can be used for other tasks. The few-shot learning method reduces the impact of the imbalanced distribution of real data and offers ideas for the imbalance problem in the text domain.
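The prototype idea in item 3 can be illustrated with a minimal sketch. The abstract does not specify the encoder, the distance measure, or how prototypes are trained, so everything here is an assumption made for illustration: feature vectors are given as plain lists, a prototype is taken to be the mean of a type's support vectors, and a query is assigned to the nearest prototype by Euclidean distance. The function names, relation labels, and numbers are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available since Python 3.8


def build_prototypes(support):
    """Average the feature vectors of each relation type into one prototype.

    `support` is a list of (relation_label, feature_vector) pairs.
    Assumption: the prototype is the per-dimension mean of a type's vectors.
    """
    grouped = defaultdict(list)
    for label, vector in support:
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(query, prototypes):
    """Return the relation type whose prototype is nearest to `query`."""
    return min(prototypes, key=lambda label: dist(query, prototypes[label]))


# Toy, hand-made 2-D features; a real system would use learned sentence encodings.
support_set = [
    ("founder_of", [0.9, 0.1]),
    ("founder_of", [0.7, 0.3]),
    ("born_in", [0.1, 0.8]),
    ("born_in", [0.3, 1.0]),
]
prototypes = build_prototypes(support_set)
prediction = classify([0.6, 0.2], prototypes)
```

In this sketch every relation type contributes exactly one prototype regardless of how many samples it has, which is one plausible reading of how a prototype-based classifier limits the influence of frequent types on the decision.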
Keywords/Search Tags:Relation Extraction, Human-in-the-Loop, Crowdsourcing, Attention Mechanism, Few-shot Learning