
A Study Of Monolingual And Cross-lingual Textual Entailment Relationship Recognition

Posted on: 2016-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhao
Full Text: PDF
GTID: 2308330461474017
Subject: Computer application technology
Abstract/Summary:
Textual entailment recognition aims to detect, given two topic-related text fragments, whether one fragment can be semantically inferred from the other. Recognizing textual entailment benefits other natural language processing applications that involve semantic inference, and it serves as a building block for many of them, such as text summarization and question answering.

The task is divided into monolingual and cross-lingual textual entailment according to the languages in which the text fragments are written; in the cross-lingual task, the two fragments of a pair are written in different languages. Most existing work addresses the task with machine learning approaches, proposing many similarity measurements from different perspectives, such as surface strings and external corpora. Overall accuracy remains relatively low, e.g., 41%-45% for the cross-lingual task. We speculate that there are two possible reasons: (1) the features may not be very discriminative, and only a few kinds of features have been used so far; (2) the amount of available labeled data is quite small.

To overcome the insufficient-feature problem, we propose two novel feature types: sentence difference features, which measure the differences between two sentences instead of their similarity, and word-embedding-based similarity features. As semantic distributed representations, word embeddings are widely used in many NLP tasks. Besides these two new kinds of features, we also combine various features used in previous work and build the first supervised classification model based on these heterogeneous features.

To overcome the shortage of labeled data, we divide the existing features into similarity features and difference features and regard them as two sufficient and redundant views, which allows natural integration with the co-training framework. We therefore propose a co-training-based classification model. Co-training initially trains two classifiers on the two views and iteratively adds the confident predictions of each classifier to the other classifier's training pool. Consequently, co-training alleviates the data shortage problem by exploiting test instances. To further exploit other large datasets, we also propose an alignment-model-based classification model, which automatically labels data using an alignment model and then uses these labeled data to aid classification.

To validate the effectiveness of the three proposed models, we adopted seven benchmark datasets from monolingual and cross-lingual textual entailment tasks and conducted extensive experiments. The results on the test datasets demonstrate that our models significantly improve accuracy compared with baseline systems and the best results previously reported by other researchers. In addition, we apply our heterogeneous-feature-based model to other related natural language processing tasks, where it also outperforms the corresponding best previous results. In summary, the experimental results show that our models are effective and generalize well.
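As a concrete illustration of the word-embedding-based similarity features, the Python sketch below scores a sentence pair by the cosine similarity of their averaged word vectors. This is a minimal sketch, not the thesis's implementation: the `embeddings` dictionary, the vector dimension, and plain averaging are all illustrative assumptions.

```python
import numpy as np

def sentence_vector(tokens, embeddings, dim=100):
    """Average the embeddings of the in-vocabulary tokens.

    `embeddings` is assumed to be a dict mapping words to numpy
    vectors (e.g., loaded from pre-trained word2vec or GloVe files).
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

def embedding_similarity(tokens_a, tokens_b, embeddings, dim=100):
    """Cosine similarity between two averaged sentence vectors."""
    va = sentence_vector(tokens_a, embeddings, dim)
    vb = sentence_vector(tokens_b, embeddings, dim)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom else 0.0
```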
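The co-training loop over the two views can likewise be sketched as follows, assuming numpy feature matrices and scikit-learn classifiers. The shared labeled pool, confidence threshold, and per-round cap are illustrative choices, not necessarily the thesis's exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X_sim, X_diff, y, U_sim, U_diff,
             rounds=10, per_round=5, threshold=0.9):
    """Co-training over a similarity view and a difference view.

    (X_sim, X_diff, y): labeled features in the two views plus labels;
    (U_sim, U_diff): the same unlabeled instances in each view.
    Hyper-parameters here are illustrative assumptions.
    """
    L_sim, L_diff, L_y = X_sim.copy(), X_diff.copy(), y.copy()
    pool = list(range(len(U_sim)))          # indices still unlabeled
    clf_a, clf_b = LogisticRegression(), LogisticRegression()
    for _ in range(rounds):
        if not pool:
            break
        clf_a.fit(L_sim, L_y)
        clf_b.fit(L_diff, L_y)
        moved = set()
        # Each classifier proposes its most confident unlabeled instances.
        for clf, U in ((clf_a, U_sim), (clf_b, U_diff)):
            proba = clf.predict_proba(U[pool])
            conf = proba.max(axis=1)
            for i in np.argsort(-conf)[:per_round]:
                idx = pool[i]
                if conf[i] < threshold or idx in moved:
                    continue
                # Add the pseudo-labeled instance to the shared pool,
                # so the other view's classifier also sees it.
                moved.add(idx)
                L_sim = np.vstack([L_sim, U_sim[idx]])
                L_diff = np.vstack([L_diff, U_diff[idx]])
                L_y = np.append(L_y, clf.classes_[proba[i].argmax()])
        pool = [idx for idx in pool if idx not in moved]
    return clf_a, clf_b
```

In classic co-training each classifier's confident predictions are added only to the other view's training set; the shared pool above is a common simplification that achieves the same cross-view transfer.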
Keywords/Search Tags: textual entailment recognition, heterogeneous features, word embedding, co-training, alignment model, automatic labeling