
Research On Statistical Paraphrase Acquisition And Generation

Posted on: 2010-05-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Q Zhao
GTID: 1118360332457819
Subject: Computer application technology
Abstract/Summary:
Paraphrases are different expressions of the same meaning, a very common phenomenon in natural language. Paraphrases reflect the flexibility and diversity of human languages. However, they also pose a challenge to Natural Language Processing (NLP) research. In recent years, many researchers have studied paraphrasing and tried to apply it in other areas, such as Machine Translation (MT), Question Answering (QA), Information Retrieval (IR), Information Extraction (IE), and Natural Language Generation (NLG).

Research on paraphrasing can be divided into two main directions. The first is paraphrase acquisition, which aims at extracting paraphrases of different granularities and forms (such as paraphrase sentences, phrases, and patterns) from various corpora or resources using different methods. The second is paraphrase generation (which generally means sentence-level paraphrase generation), which aims to generate paraphrases for given sentences. Our research covers both directions. First, we propose effective methods for extracting fine-grained paraphrases from a variety of resources and corpora. The extracted paraphrases include paraphrase phrases, patterns, and collocations. Second, we apply the extracted paraphrases in statistical paraphrase generation. The main contents of our research can be summarized as follows:

1. Extracting paraphrase phrases based on multiple methods. Paraphrase phrase extraction is a hot topic in paraphrasing research, and many methods have been proposed for it. This work follows and improves the existing mainstream methods and extracts paraphrase phrases from multiple resources, including monolingual parallel corpora, monolingual comparable corpora, bilingual parallel corpora, thesaurus synonyms, dictionary definitions, and search engine user queries. This work extracts a large volume of paraphrase phrases. More importantly, it combines, compares, and analyzes the different resources and extraction methods, which reveals the advantages and disadvantages of each method and resource, as well as the characteristics of the extracted paraphrase phrases.

2. Extracting paraphrase patterns based on a pivot approach. Compared with paraphrase phrases, paraphrase patterns generally have higher coverage in both paraphrase recognition and generation, since paraphrase patterns contain slots that can be filled with different contents and thereby form different paraphrase instances. This dissertation proposes a pivot approach to extracting paraphrase patterns from a large bilingual parallel corpus. The approach first extracts English patterns and Chinese patterns separately from the bilingual corpus after word alignment and dependency parsing. It then extracts English paraphrase patterns by using the Chinese patterns as pivots. The approach uses a log-linear model to compute the paraphrasing likelihood between two English patterns, with feature functions based on Maximum Likelihood Estimation (MLE) and Lexical Weighting (LW). Experimental results show that the approach extracts a large volume of paraphrase patterns with high precision, which are useful in the subsequent paraphrase generation task.

3. Extracting paraphrase collocations based on binary classification. Paraphrase collocations are collocations that convey the same meaning using different surface words. They are important in various NLP applications, but they have not been widely studied. This dissertation addresses the problem of paraphrase collocation extraction, using collocations with the "OBJ" relationship as a case study. Specifically, the proposed method recasts paraphrase collocation extraction as a binary classification problem and combines multiple features based on translation, thesaurus, polarity words, and web mining. Experimental results show that the binary classification based method is effective for paraphrase collocation extraction. In particular, all of the features used help improve extraction performance.

4. Proposing an application-driven statistical paraphrase generation method. Paraphrase generation is critical in many NLP applications, yet research on it remains far from sufficient. This dissertation proposes a statistical paraphrase generation method based on a comparison between paraphrase generation and other research topics (especially machine translation). To our knowledge, this is the first statistical method designed specifically for paraphrase generation. It has two distinguishing features. First, it uses a uniform statistical model to generate paraphrase sentences for distinct applications, so as to satisfy the different requirements of those applications. Second, it can easily combine the multiple paraphrase resources extracted above to improve paraphrase generation performance.

In conclusion, this dissertation not only focuses on paraphrase resource extraction, but also applies the extracted paraphrases in the paraphrase generation task. This research has achieved some preliminary results, which we hope will be helpful to other researchers in this area. We believe that paraphrasing research can make a major breakthrough as foundational NLP techniques and the capability to process large-scale data improve. In turn, progress in paraphrasing techniques can promote the development of other related research.
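
As an illustration of the pivot approach in item 2, the following is a minimal Python sketch of how a log-linear score between two English patterns might be computed through shared Chinese pivot patterns. It is not the dissertation's implementation. The abstract only states that the model uses MLE and LW feature functions, so the specific four features, the uniform weights, the toy counts, the romanized pivot patterns, and the fallback lexical weight of 1.0 are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def mle_tables(pair_counts):
    """Relative-frequency (MLE) estimates p(c|e) and p(e|c) from pair counts."""
    e_total = defaultdict(float)
    c_total = defaultdict(float)
    for (e, c), n in pair_counts.items():
        e_total[e] += n
        c_total[c] += n
    p_c_given_e = {(e, c): n / e_total[e] for (e, c), n in pair_counts.items()}
    p_e_given_c = {(e, c): n / c_total[c] for (e, c), n in pair_counts.items()}
    return p_c_given_e, p_e_given_c


def pivot_scores(pair_counts, lex_c_given_e=None, lex_e_given_c=None,
                 weights=(1.0, 1.0, 1.0, 1.0)):
    """Score ordered English pattern pairs (e1, e2) through shared pivots c.

    Assumed model: score(e2 | e1) = sum over c of exp(sum_i w_i * h_i), with
    h_1 = log p_MLE(c|e1), h_2 = log p_MLE(e2|c),
    h_3 = log p_LW(c|e1),  h_4 = log p_LW(e2|c).
    Missing lexical weights fall back to 1.0, which contributes log(1) = 0.
    """
    lex_c_given_e = lex_c_given_e or {}
    lex_e_given_c = lex_e_given_c or {}
    p_ce, p_ec = mle_tables(pair_counts)

    # Group English patterns by the Chinese pivot pattern they align with.
    by_pivot = defaultdict(list)
    for (e, c) in pair_counts:
        by_pivot[c].append(e)

    scores = defaultdict(float)
    for c, english in by_pivot.items():
        for e1 in english:
            for e2 in english:
                if e1 == e2:
                    continue
                feats = (
                    math.log(p_ce[(e1, c)]),
                    math.log(p_ec[(e2, c)]),
                    math.log(lex_c_given_e.get((e1, c), 1.0)),
                    math.log(lex_e_given_c.get((e2, c), 1.0)),
                )
                scores[(e1, e2)] += math.exp(
                    sum(w * h for w, h in zip(weights, feats)))
    return dict(scores)


# Hypothetical counts of aligned (English pattern, Chinese pivot pattern) pairs.
# The pivot strings are romanized placeholders, not data from the dissertation.
toy_counts = {
    ("X solves Y", "X jiejue Y"): 8,
    ("X resolves Y", "X jiejue Y"): 2,
    ("X solves Y", "X chuli Y"): 2,
    ("X addresses Y", "X chuli Y"): 6,
}

scores = pivot_scores(toy_counts)
for (e1, e2), s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{e1} -> {e2}: {s:.3f}")
```

With neutral lexical weights, each score reduces to the product p(c|e1) * p(e2|c) summed over the shared pivots. Pattern pairs that share no pivot, such as "X resolves Y" and "X addresses Y" in the toy data, receive no score at all.
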
Keywords/Search Tags: paraphrase phrases, paraphrase patterns, paraphrase collocations, paraphrase generation