Event Argument Extraction Methods For Low-resource Languages

Posted on: 2023-12-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X W Liao
GTID: 1528306824951999
Subject: Information and Communication Engineering
Abstract/Summary:
An event is a change in the state of things, comprising the occurrence of an action and the consequences that follow from it. An event has several arguments that describe its occurrence and the resulting change of state, such as the time, the place, the participants, and the consequences. Identifying events in text and extracting their arguments to generate event objects has a wide range of applications in social governance, public opinion monitoring, and intelligence collection. As cooperation between China and ASEAN grows closer, using computers to extract event arguments from massive volumes of ASEAN-language text in a timely and accurate manner, and to build event databases from them, is of great significance for promoting bilateral cultural exchange and for understanding the social dynamics of ASEAN countries promptly and comprehensively.

Because natural language is flexible, and because machines lack the world knowledge, awareness of mental states, and shared understanding that humans rely on to interpret the same text, correctly extracting event arguments is a very challenging task. At present, research on event argument extraction focuses mainly on resource-rich languages such as English and Chinese, and there are few studies on ASEAN low-resource languages. Beyond the problems common to event argument extraction in general, low-resource event argument extraction is further constrained by the difficulty of acquiring data, the shortage of labeled data, and the lack of processing tools. Building on theoretical results and technical frameworks from natural language processing and deep learning, this dissertation studies event argument extraction under low-resource conditions and proposes a set of methods for domain dataset construction and event argument extraction suited to low-resource scenarios. Specifically, the research covers the following four aspects:

1. The construction of domain datasets in low-resource scenarios. The assumption that examples are independent and identically distributed (IID) is basic to machine learning, and a model's ability to generalize is only guaranteed when inference examples and training examples satisfy it. In current practice, however, domain datasets are constructed by simply assuming that the examples are IID, without any effective measure to ensure it. To make the event examples satisfy the IID assumption, we first propose a sentence representation model based on the difference semantics model to obtain more accurate sentence representations, then construct a domain discriminator on top of these representations, and finally use the domain discriminator to filter examples, both when constructing the domain dataset and before examples are fed to the event argument extractor at inference time. The proposed dataset construction method is of considerable practical reference value. Using it, we constructed a Vietnamese COVID-19 themed dataset containing 7761 event sentences and 9387 events, which is the first known event dataset for an ASEAN low-resource language.

2. An event argument extraction method based on span selection under low-resource conditions. Starting from the theoretical analysis that knowledge injection can effectively improve model performance, we propose a distant supervision framework for event argument extraction based on lexical knowledge injection. In this framework, lexical knowledge such as word boundaries and parts of speech is encoded and injected through vector concatenation. We use the framework to extend two state-of-the-art span-selection-based event argument extraction models, and the experimental results show that injecting lexical knowledge generally improves performance. For extremely low-resource situations in which no lexical resources exist, we propose unsupervised statistical language models. Taking Vietnamese as an example of the extremely low-resource condition, our unsupervised statistical language model achieves a word segmentation accuracy of 66%.

3. An event argument extraction method based on text generation frameworks under low-resource conditions. Text-generation-based event argument extraction has the advantages of a unified model and framework, good integrity, and simple data labeling. It handles sentences containing a single event and sentences containing several events in the same way. Considering how event arguments are distributed in context and the characteristics of low-resource settings, we improve the current text-generation-based event argument extraction model. Specifically, the improvements are: injecting lexical knowledge into event sentences, inserting encodings of the surrounding context and of external knowledge into the model input, and injecting event structural knowledge into the model during both training and inference. Experiments on the Vietnamese event dataset constructed in this dissertation show that these improvements significantly improve model performance.

4. An application of the results above. Based on the achievements of our research, namely the construction of event datasets based on domain discrimination, the inference example selection strategy, and the improved text-generation-based event argument extraction model, we collected news reports from Vietnamese news websites and performed event argument extraction on them. The result is a Vietnamese COVID-19 themed event database containing 39,833 events and 181,695 event arguments. The database can provide high-quality data for other information processing systems, and it is the first known Vietnamese thematic database in China.
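
The abstract states that lexical knowledge such as word boundaries and parts of speech is encoded and injected through vector concatenation, but it does not give the encoding itself. The following is a minimal sketch of that idea under stated assumptions: the one-hot encoding, the boundary tag set (B/I/E/S), the small part-of-speech tag set, the toy two-dimensional token vectors, and the function name `inject_lexical_knowledge` are all illustrative choices, not the dissertation's actual implementation.

```python
from typing import List

# Assumed tag inventories (illustrative only, not taken from the dissertation).
BOUNDARY_TAGS = ["B", "I", "E", "S"]   # begin / inside / end of word, single-syllable word
POS_TAGS = ["NOUN", "VERB", "NUM", "OTHER"]


def one_hot(label: str, vocab: List[str]) -> List[float]:
    """Encode a label as a one-hot vector over the given tag inventory."""
    vec = [0.0] * len(vocab)
    vec[vocab.index(label)] = 1.0
    return vec


def inject_lexical_knowledge(
    token_vectors: List[List[float]],
    boundary_tags: List[str],
    pos_tags: List[str],
) -> List[List[float]]:
    """Concatenate boundary and part-of-speech encodings onto each token vector."""
    if not (len(token_vectors) == len(boundary_tags) == len(pos_tags)):
        raise ValueError("token vectors and tag sequences must have equal length")
    return [
        tok + one_hot(b, BOUNDARY_TAGS) + one_hot(p, POS_TAGS)
        for tok, b, p in zip(token_vectors, boundary_tags, pos_tags)
    ]


# Toy example: six Vietnamese syllables, each with a made-up 2-dimensional vector.
syllables = ["Hà", "Nội", "ghi", "nhận", "5", "ca"]
token_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]
boundary_tags = ["B", "E", "B", "E", "S", "S"]          # "Hà Nội" and "ghi nhận" are two-syllable words
pos_tags = ["NOUN", "NOUN", "VERB", "VERB", "NUM", "NOUN"]

enriched = inject_lexical_knowledge(token_vectors, boundary_tags, pos_tags)
```

Each enriched vector has the original token dimensions followed by four boundary dimensions and four part-of-speech dimensions, so a downstream span-selection model sees the lexical cues alongside the token representation. In a distant supervision setting the tags would come from an external lexicon or tagger rather than from manual annotation; here they are written by hand for the example.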
Keywords: low-resource language, event argument extraction, text generation, Vietnamese