
Research On Sequence-to-Text Inference And Generation Based On Matching And Transformation

Posted on: 2022-04-26    Degree: Doctor    Type: Dissertation
Country: China    Candidate: C Q Duan    Full Text: PDF
GTID: 1528306839478184    Subject: Computer application technology
Abstract/Summary:
Semantic relationship modeling among different sequences is one of the core problems in natural language processing. Depending on the application scenario, this problem can be cast as text inference or text generation. Specifically, text inference studies how to predict the semantic relation between an input sequence and a target sequence when both are given, while text generation studies how to generate the target sequence from the input sequence when only the input sequence is known. In this paper, we take natural language inference and machine translation, typical tasks of text inference and text generation respectively, as the target tasks for exploring inference and generation technologies. Furthermore, we extend both to multimodal scenarios and apply them to live comment matching and generation. The research content is as follows:

1. Attention-fused deep matching network for text-to-text inference. This paper first studies text-to-text inference. Existing works usually adopt a matching model consisting of an RNN-based encoder layer and a shallow attention-based matching layer. When encoding a sequence, an RNN-based model transmits information word by word and is, in theory, unaffected by distance. In practice, however, as the target sequence grows longer, the quality of the representation it produces deteriorates; we refer to this as the long-term context dependency problem. Meanwhile, a shallow attention mechanism is not sufficient to model the complex semantic relationship between sequences; we refer to this as insufficient model complexity. This paper takes natural language inference, a typical text-to-text inference task that aims to predict whether a premise sentence entails a hypothesis sentence, to study how to alleviate these problems. We propose an attention-fused deep matching network (AF-DMN) that introduces a self-attention mechanism to exploit long-term context dependency and a matching layer stacked with multiple computational blocks to imitate multi-turn interaction (a minimal sketch of one such block follows this item). Experimental results show that AF-DMN outperforms the baselines and achieves a significant advantage on long sentence pairs.
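Below is a minimal PyTorch sketch of the kind of matching block described above: self-attention within each sentence, cross-attention against the other sentence, and several blocks stacked to imitate multi-turn interaction. The class names, dimensions, pooling, and classifier head are illustrative assumptions, not the dissertation's exact AF-DMN.

```python
import torch
import torch.nn as nn

class MatchingBlock(nn.Module):
    """One computational block: self-attention inside a sentence plus
    cross-attention against the other sentence (illustrative design)."""
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, other):
        # Self-attention relates every position to every other one,
        # so representation quality does not decay with distance.
        s, _ = self.self_attn(x, x, x)
        x = self.norm1(x + s)
        # Cross-attention matches this sequence against the other sentence.
        c, _ = self.cross_attn(x, other, other)
        return self.norm2(x + c)

class DeepMatcher(nn.Module):
    """Stack several blocks to imitate multi-turn interaction, then
    classify the pair (entailment / neutral / contradiction)."""
    def __init__(self, d_model: int = 128, n_blocks: int = 3, n_classes: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList([MatchingBlock(d_model) for _ in range(n_blocks)])
        self.cls = nn.Linear(2 * d_model, n_classes)

    def forward(self, premise, hypothesis):
        p, h = premise, hypothesis
        for blk in self.blocks:
            # Update both sides from the previous round's representations.
            p, h = blk(p, h), blk(h, p)
        # Mean-pool each sequence and classify the concatenated pair.
        return self.cls(torch.cat([p.mean(dim=1), h.mean(dim=1)], dim=-1))
```

Inputs here are pre-embedded (batch, length, d_model) tensors; a full model would also include word embeddings and feed-forward sublayers.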
2. Feature fusion for multimodal-to-text inference. In the real world, information is often expressed in multiple modalities. This paper therefore extends text-to-text inference to multimodal-to-text inference, whose input involves multiple sequences belonging to different modalities; effectively integrating these sequences is the main difficulty of the task. We take live comment matching, which aims to select a target comment from a candidate comment set based on an input video clip, to study multimodal feature fusion. To tackle this task, we propose a multimodal-feature-based matching network. It adopts an architecture similar to the Transformer and builds its encoder solely from self-attention, so that it can model comments, vision, and audio jointly. To model cross-modal interactions among the modalities, we design a multi-head cross-attention mechanism, and we adopt a matching layer consisting of matching blocks that iteratively learn an attention-aware representation for each modality (see the fusion sketch after this abstract). Experiments show that the proposed model outperforms state-of-the-art methods.

3. Future cost mechanism enhanced text-to-text generation model. Text-to-text generation is another subtask of semantic relationship modeling among sequences. Existing works typically train a sequence-to-sequence model by computing a cross-entropy loss from the prediction distribution at each position of the target sequence and combining these losses into the final objective. The trained model therefore focuses on the accuracy of the generated word at the current time-step and ignores its future cost, i.e., the expected cost of generating the subsequent target translation (the next target word). Taking machine translation, a typical text-to-text generation task, as the target, we propose a simple and effective method to model the future cost of each target word for NMT systems. In detail, a future cost representation is learned from the currently generated target word and its contextual information; it is used to compute an additional loss that guides the training of the NMT model, and the representation learned at the current time-step also helps generate the next target word during decoding (see the future-cost sketch below). Experimental results show that the proposed method significantly outperforms the Transformer model.

4. Joint learning for multimodal-to-text generation and inference. This paper further extends text-to-text generation to multimodal-to-text generation, taking live comment generation, which aims to automatically provide viewers with comments they are interested in, as the study task. Since the valid comments for a video can be diverse, it is intractable to enumerate all possible references to compare against model outputs; based on the hypothesis that a good generation model is capable of discriminating valid from invalid comments, existing works evaluate this task with a ranking metric. Although existing generation models achieve remarkable performance in terms of this metric, a gap remains between them and the matching model. To improve the generation model's ability to discriminate valid and invalid comments, we propose a joint framework based on the Transformer model: by sharing the encoders of the generation model and the matching model, our method learns live comment generation and matching jointly (see the joint-training sketch below). Results show that, through joint learning, both the generation model and the matching model achieve significant improvements on this task.
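Three sketches follow, one for each of the remaining items; each is a hedged PyTorch illustration, not the thesis implementation. First, for item 2, multi-head cross-attention fusion: the comment representation queries pre-extracted vision and audio feature sequences. Module names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse text with vision and audio via multi-head cross attention
    (illustrative; assumes all modalities are projected to d_model)."""
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.text_to_vision = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text, vision, audio):
        # Each text position queries the non-text modalities, gathering
        # the visual and audio evidence relevant to that comment token.
        v, _ = self.text_to_vision(text, vision, vision)
        a, _ = self.text_to_audio(text, audio, audio)
        return self.norm(text + v + a)

# Toy usage with random features: batch of 2 clips, 20 comment tokens,
# 16 video frames, 50 audio frames, all projected to 256 dimensions.
fusion = CrossModalFusion()
text, vision, audio = torch.randn(2, 20, 256), torch.randn(2, 16, 256), torch.randn(2, 50, 256)
print(fusion(text, vision, audio).shape)  # torch.Size([2, 20, 256])
```

Stacking several such fusion layers would correspond to the matching blocks that iteratively refine each modality's attention-aware representation.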
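For item 3, a minimal sketch of the future-cost idea: besides the standard per-position cross-entropy, a representation fused from the decoder state and the current target word is additionally trained to predict the next target word. The head design and the unweighted sum of the two losses are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FutureCostHead(nn.Module):
    """Learn a future cost representation from the decoder state and the
    current target word's embedding (illustrative design)."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, decoder_state, cur_word_emb):
        future_repr = torch.tanh(self.fuse(torch.cat([decoder_state, cur_word_emb], dim=-1)))
        # In the full method, future_repr would also be fed forward to help
        # generate the next target word during decoding.
        return future_repr, self.proj(future_repr)

def training_losses(logits, future_logits, targets):
    """logits, future_logits: (B, T, V); targets: (B, T).
    The auxiliary loss asks position t to anticipate the word at t+1."""
    ce = F.cross_entropy(logits.transpose(1, 2), targets)
    # Shift by one so step t is supervised by the *next* target word.
    future_ce = F.cross_entropy(future_logits[:, :-1].transpose(1, 2), targets[:, 1:])
    return ce + future_ce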
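For item 4, a minimal sketch of the joint objective: the generation loss and a matching loss share one encoder and are summed, so one backward pass updates both branches. Here `encoder`, `decoder`, and `match_head` are hypothetical stand-ins, and the hinge-style negative-sampling matching loss is one plausible choice rather than the thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_step(encoder, decoder, match_head, video, comment, neg_comment):
    ctx = encoder(video)                      # shared multimodal context

    # Generation branch: teacher-forced cross-entropy on the gold comment.
    logits = decoder(comment[:, :-1], ctx)    # (B, T-1, V)
    gen_loss = F.cross_entropy(logits.transpose(1, 2), comment[:, 1:])

    # Matching branch: score the gold comment above a sampled negative.
    pos = match_head(ctx, comment)
    neg = match_head(ctx, neg_comment)
    match_loss = F.relu(1.0 - pos + neg).mean()

    return gen_loss + match_loss              # one backward pass trains both
```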
Keywords/Search Tags: Text inference, text generation, natural language inference, machine translation, multimodal learning, live comment