The Study On Data Augmentation In Chinese Parsing

Posted on: 2022-09-05
Degree: Master
Type: Thesis
Country: China
Candidate: H B Chen
Full Text: PDF
GTID: 2518306560992029
Subject: Computer technology
Abstract/Summary:
Parsing is a fundamental technology in natural language processing. Its goal is to automatically identify and analyze the relationships among the components of a sentence according to a given grammar, and ultimately to construct the parse tree. Parsing is widely used in natural language tasks such as machine translation. Current mainstream neural parsing models rely on large-scale annotated treebanks. However, unlike English, Chinese has few surface cues, such as changes in word form that mark part of speech, to express syntactic structure. As a result, Chinese treebanks are small, manual annotation is very expensive, and the treebanks are difficult to expand in the short term. How to use existing treebanks for data augmentation has therefore become a focus of research.

In data augmentation for Chinese parsing, the sentences generated from a given annotated tree must meet three requirements. First, the generated sentences should have diverse syntactic structures while keeping the parse tree complete. Second, each generated sentence must carry a correct parse annotation. Third, the semantics of each generated sentence should be reasonable, that is, the sentence should conform to the habits of language expression. In view of these three requirements, the main work and contributions of this thesis are summarized as follows.

(1) We propose a data augmentation method based on lexicalized tree adjoining grammar. Lexicalized tree adjoining grammar is an important grammar formalism in computational linguistics. Using its "adjoining" and "substitution" operations, we can generate new parse trees by combining different annotated trees. The linguistic knowledge encoded in the grammar ensures that the generated parse trees have correct syntactic structure and obey the grammar rules, so the method meets the first two requirements of data augmentation in parsing. Accordingly, we design and implement a lexicalized tree extraction algorithm and a parse tree synthesis algorithm based on lexicalized tree adjoining grammar. We also analyze and summarize the expressions and annotation conventions that are specific to the Chinese treebank compared with the English treebank, and we apply a "pruning" optimization to the lexicalized tree adjoining grammar at the algorithmic level to avoid excessive noise in the generated sentences. We performed data augmentation on the public data set CTB5.1, constructing 338K augmented samples from the 18K samples of the original training set. Finally, we conducted small-sample and robustness experiments. In the small-sample experiment, the augmented data obtained by this method improves the accuracy of dependency parsing and constituency parsing by 1.4% and 2.12%, respectively. In the robustness experiment, we manually selected 86 generated sentences to construct an extended test set. The results show that data augmentation improves the accuracy of dependency parsing and constituency parsing by 1.02% and 0.5%, respectively, indicating that the proposed method can effectively improve the robustness of Chinese parsing models.

(2) We propose a semantic rationality evaluation method based on language models. A language model is a probability-based discriminant model that judges the semantic rationality of a sentence by its probability. To address the third requirement of data augmentation in parsing, we therefore use language models to evaluate the semantic rationality of the generated sentences and select the semantically reasonable ones as the final augmented data. We design and implement an N-gram language model and an RNN language model. The language models filter the 338K sentences generated with lexicalized tree adjoining grammar down to 105K and 94K sentences. Finally, we conducted small-sample and robustness experiments. In the small-sample experiments, the augmented data obtained by this method improves the accuracy of dependency parsing and constituency parsing by 1.6% and 2.14%. In the robustness experiment, the accuracy of dependency parsing and constituency parsing on the extended test set increases by 1.43% and 0.44%, respectively, showing better robustness.

In summary, to address the scarcity of Chinese treebank data and the demands of data augmentation in parsing, we propose a data augmentation method based on lexicalized tree adjoining grammar. Combined with language models, we construct augmented data sets of 338K, 105K, and 94K samples from the existing 18K training set. Finally, we conduct a comparative experimental analysis on public data sets. The results show that the proposed method can effectively improve the performance and robustness of current neural Chinese parsing models.
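
The language-model filtering step in contribution (2) can be illustrated with a minimal sketch. The abstract does not state the N-gram order, the smoothing scheme, or the selection rule, so everything below is an assumption made for illustration: a bigram model with add-one smoothing, an average per-token log-probability score, and a fixed threshold. The names `BigramLM`, `avg_logprob`, and `filter_generated` are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (assumed convention)


class BigramLM:
    """Add-one smoothed bigram language model over pre-tokenized sentences.

    Illustrative only: the thesis's actual N-gram order and smoothing
    are not given in the abstract.
    """

    def __init__(self, corpus):
        corpus = [list(sentence) for sentence in corpus]
        self.context_counts = Counter()  # how often each token appears as a context
        self.bigram_counts = Counter()   # how often each (previous, current) pair appears
        for tokens in corpus:
            padded = [BOS] + tokens + [EOS]
            self.context_counts.update(padded[:-1])
            self.bigram_counts.update(zip(padded[:-1], padded[1:]))
        # Possible next tokens: every seen word, EOS, and one slot for unseen words.
        vocab = {tok for tokens in corpus for tok in tokens} | {EOS}
        self.vocab_size = len(vocab) + 1

    def avg_logprob(self, tokens):
        """Average log-probability per predicted token (higher means more plausible)."""
        padded = [BOS] + list(tokens) + [EOS]
        total = 0.0
        for prev, cur in zip(padded[:-1], padded[1:]):
            numerator = self.bigram_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total / (len(padded) - 1)


def filter_generated(lm, candidates, threshold):
    """Keep only the generated sentences whose score reaches the threshold."""
    return [sent for sent in candidates if lm.avg_logprob(sent) >= threshold]


if __name__ == "__main__":
    # Toy stand-in for the original treebank sentences.
    treebank = [
        ["我", "喜欢", "苹果"],
        ["他", "喜欢", "香蕉"],
        ["我", "吃", "苹果"],
    ]
    lm = BigramLM(treebank)
    # Toy stand-in for sentences produced by tree synthesis.
    generated = [["他", "喜欢", "苹果"], ["苹果", "喜欢", "他"]]
    for sent in generated:
        print("".join(sent), round(lm.avg_logprob(sent), 3))
    print(filter_generated(lm, generated, threshold=-2.0))
```

Normalizing by length keeps longer generated sentences from being penalized merely for having more tokens. In practice the threshold (or a top-k cutoff) would be tuned on held-out data, and an RNN language model could replace `BigramLM` behind the same `avg_logprob` interface.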
Keywords/Search Tags: Dependency parsing, Constituency parsing, Data augmentation, Lexicalized tree adjoining grammar, Language model