
The Construction Of Large-scale Chinese-English Comparable Corpora

Posted on: 2011-12-07
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhao
GTID: 2178330332961309
Subject: Computer application technology
Abstract/Summary:
With the development of statistical techniques, large-scale corpora have become indispensable to natural language processing research. Parallel corpora, whose contents are translations of one another, are a key resource for cross-linguistic comparative studies, translation disambiguation, machine translation, and translation aids. Compared with parallel corpora, comparable corpora are more accessible, more abundant, and more up to date, and research on them is becoming increasingly extensive. This thesis constructs large-scale Chinese-English comparable corpora against the background of the task "Mining named entity translation pairs from comparable corpora".

After analyzing previous methods, we propose an approach to comparable corpus construction that combines cross-language information retrieval with feature filtering. Our aim is to obtain large-scale, high-quality Chinese-English comparable corpora. First, keywords are extracted from the source document (Chinese); the extracted keywords are then translated and combined into query words according to certain criteria. Next, a retrieval system is used to search for relevant target documents (English), and the relevant target documents are paired with the initial source document to form comparable document pairs. Finally, features such as publication date and similarity are used to filter the aligned document pairs.

The contributions of our work fall into three main aspects:

(1) For keyword extraction, we propose a method that selects both key phrases and single words. Different approaches are used to build the phrase candidate set and the single-word candidate set, and each set is then ranked separately. In addition, we use symmetric conditional probability and the local maximum algorithm to adjust the word segmentation results, which improves the effectiveness of keyword extraction.

(2) We evaluate two methods for filtering the comparable document pairs, each based on a different feature set. The first method relies on two features: publication date, and the similarity between the query and the target document as calculated by the retrieval system. Building on this, the second method introduces a new feature, KSD, which takes into account the number and weight of translated keywords between the document pairs. Experimental results show that filtering based on date, similarity, and KSD is more effective than the first method: the percentage of highly relevant document pairs increases by 17.6% compared with the first method.

(3) We evaluate the quality of the comparable corpora by random sampling, using five levels of relevance. A comparison with other construction methods shows that our approach is effective.
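The symmetric conditional probability (SCP) named in contribution (1) can be illustrated with a minimal sketch. The abstract does not give the thesis's formulation, so the code below assumes the commonly used definition, in which the squared probability of an n-gram is divided by the average product of the probabilities of its two parts over every split point. The function names `ngram_counts` and `scp`, the count-over-token-total probability estimate, and the English toy tokens are all illustrative assumptions, not the author's implementation. The local maximum step is not shown.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def scp(ngram, counts, total):
    """Symmetric conditional probability of an n-gram (n >= 2).

    Assumed definition: p(w1..wn)^2 divided by the average, over all
    n-1 split points, of p(w1..wi) * p(wi+1..wn). Probabilities are
    estimated simply as count / total number of tokens.
    """
    n = len(ngram)
    if n < 2:
        raise ValueError("SCP is defined for n-grams with n >= 2")
    p_whole = counts[ngram] / total
    avg_split = sum(
        (counts[ngram[:i]] / total) * (counts[ngram[i:]] / total)
        for i in range(1, n)
    ) / (n - 1)
    if avg_split == 0:
        return 0.0
    return p_whole * p_whole / avg_split


# Hypothetical toy data; the thesis applies the measure to Chinese text.
tokens = ["machine", "translation", "is", "fun",
          "machine", "translation", "works"]
counts = ngram_counts(tokens, 2)
print(scp(("machine", "translation"), counts, len(tokens)))
print(scp(("translation", "is"), counts, len(tokens)))
```

In this toy data "machine" and "translation" only ever occur together, so the pair receives a higher score than "translation is", whose first word also appears in other contexts. A score of this kind is what allows adjacent segments to be merged into a phrase candidate.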
Keywords/Search Tags: Comparable Corpora, Cross Language Information Retrieval, Feature Filtering, Keyword Extraction