| With the development of machine translation and the growing demand for cross-language communication, machine translation is used more and more widely. Machine translation quality estimation evaluates the quality of a translation from the source text and the machine translation alone, without relying on a reference translation. The "Predictor-Estimator" quality estimation model, which uses neural language models to automatically extract features from the source and the translation, is currently the mainstream approach. This model relies on a powerful machine translation model to obtain language features, but, like ordinary machine translation models, it does not take context into account in document-level machine translation evaluation. This paper therefore draws on linguistic theory to incorporate discourse relation information and to evaluate local fluency. We introduce the concept of the preferred center from centering theory, a computational linguistic theory, to capture relation information between sentences and improve the quality estimation model. We also construct a Chinese-English document-level machine translation quality estimation data set to address the shortage of such data. The first part of the work is the extraction of the preferred center. The preferred center is usually extracted with a set of rules based on syntactic analysis. However, such rules are difficult to apply automatically to syntactic components that are only vaguely defined. In this paper, a sequence labeling model based on a pre-trained word representation model is used to obtain the preferred center, and semi-supervised pseudo-label learning combined with a small number of manual labels is used to compensate for the lack of labeled preferred center data. The second part of the work is the construction of a machine translation quality estimation model that incorporates discourse relation information between sentences, together with the construction of a document-level quality estimation data set. The proposed model is based on the predictor-estimator architecture and adds a discourse relation predictor to obtain discourse relation features, which the quality estimator then uses as an additional basis for evaluation. The discourse relation features between sentences include the preferred-center word representation of the text preceding the sentence under evaluation, and the degree of coherence between that sentence and the preceding text, computed from their similarity, together with the difference in this coherence between the source and the machine translation. This paper then constructs a document-level Chinese-English quality estimation data set, following the established construction method for quality estimation data sets and the standard process of manual annotation. We conducted experiments on the proposed model and on the preferred center extraction model. Compared with the context-free quality estimation model, the proposed model improves significantly on the document-level quality estimation test set. Compared with the rule-based extraction method, the machine-learning-based preferred center extraction model also improves considerably in extraction accuracy and recall. |