
Corresponding Units In Chinese-English Parallel Texts--Corpus-driven Approach

Posted on: 2008-07-12
Degree: Master
Type: Thesis
Country: China
Candidate: J S Wu
Full Text: PDF
GTID: 2155360278962518
Subject: Foreign Linguistics and Applied Linguistics
Abstract/Summary:
In recent years, parallel corpus research has become a new trend in corpus linguistics. More and more researchers have been convinced of the great value of parallel corpora in fields such as natural language processing (NLP), lexicon compilation and translation studies. In this thesis we develop the concept of the Corresponding Unit (CU), which offers a new approach to research on parallel corpora. It is innovative because it is based on real source-language and target-language texts and makes use of the practice of translators.

Corresponding Units (CUs) are defined as "any identifiable chunks of texts or segments of texts that correspond to each other in the TL and SL, which encapsulate the completeness and the sameness of meaning of their counterparts in a syntagmatic construction." (李文中, 2006) This definition describes the CU from a macro perspective; for practical work with CUs, a more workable definition needs to be provided. In this thesis, we develop the concept of the CU on the basis of parallel texts.

For convenience, we divide a CU into two parts: the unit in the language from which we translate is called the CU in the source language (CUS), and the unit in the language into which we translate is called the CU in the target language (CUT).

We propose that a CU is a dualistic pair consisting of a CUS and a CUT, which must satisfy the following conditions:

1) The CUS is, in form, a word or a group of words in the source language with the properties of non-ambiguity, internal grammaticality and dynamic boundary.

Non-ambiguity means that the CUS must be monosemous from the perspective of the source language. It should have only one interpretation in the target language.
However, whether this interpretation is itself ambiguous is not considered.

Internal grammaticality means that all words in the CUS constitute a valid syntactic structure.

By dynamic boundary we mean that a CU can be extended to form larger ones, and a larger CU can be further analyzed into smaller ones. During identification, we extract the smaller CUs first and then extend them to extract larger ones. During application, we match the larger ones first.

2) The CUT is the translation of the CUS. This translation should be the unique translation of the CUS. If the CUS has more than one translation, these translations should be synonymous and interchangeable.

In this thesis we develop the theoretical framework and the methodology for dealing with CUs. The identification criteria, major properties and applications of CUs are all studied on the basis of the pilot parallel corpus we compiled for this research.

A corpus-driven study has been conducted on this self-compiled pilot parallel corpus. The task is to identify the monosemous CUSs and their CUTs, and to reuse them by creating a translation database of CUs (CUbase). The study aims to uncover the properties of CUs. All sample data are counted; none are ignored.

Our research questions are as follows:
1) What are CUs, and at which levels do they exist?
2) How is the corresponding relationship between a CUS and a CUT established and maintained?

The research objectives are:
1) To compile a pilot Chinese-English parallel corpus and construct a CUbase from it.
2) To develop the theoretical framework and the methodology for dealing with CUs.
3) To apply the findings to related linguistic research.

Through identification, we obtained a CUbase consisting of 1064 pairs of CUs. We classified the CUs into four groups according to the form of the CUSs: CUs at the word level (CU-Ws), CUs at the multi-word unit (MWU) level (CU-MWUs), CUs at the clause level (CU-Cs) and CUs at the sentence level (CU-Ss).
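As an illustration only, the CUS/CUT pairing and the four-level grouping described above could be held in a structure like the following. This is a minimal Python sketch, not the thesis's implementation: the names `CU`, `Level`, `CUBase` and all method names are hypothetical, the level of each CU is assumed to be assigned by a human annotator, and the example pairs are invented rather than taken from the 1064 pairs in the CUbase.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """The four groups named in the abstract."""
    WORD = "CU-W"
    MWU = "CU-MWU"
    CLAUSE = "CU-C"
    SENTENCE = "CU-S"


@dataclass(frozen=True)
class CU:
    cus: str      # unit in the source language
    cut: str      # its translation in the target language
    level: Level  # assumed to be assigned by the annotator, not computed


class CUBase:
    """Hypothetical store mapping each CUS to its CUT(s)."""

    def __init__(self):
        self._cuts = defaultdict(set)  # CUS -> set of CUTs
        self._level = {}               # CUS -> Level

    def add(self, cu):
        self._cuts[cu.cus].add(cu.cut)
        self._level[cu.cus] = cu.level

    def translations(self, cus):
        return sorted(self._cuts.get(cus, ()))

    def needs_synonymy_check(self):
        # A CUS with several CUTs is acceptable only if the CUTs are
        # synonymous; that judgment is left to a human reviewer here.
        return sorted(c for c, t in self._cuts.items() if len(t) > 1)

    def count_by_level(self):
        counts = {lvl: 0 for lvl in Level}
        for lvl in self._level.values():
            counts[lvl] += 1
        return counts


# Invented example pairs, for illustration only.
base = CUBase()
base.add(CU("语料库", "corpus", Level.WORD))
base.add(CU("平行语料库", "parallel corpus", Level.MWU))
base.add(CU("然而", "however", Level.WORD))
base.add(CU("然而", "nevertheless", Level.WORD))
```

Keeping the CUTs of one CUS in a set makes the one-translation condition easy to audit: any CUS returned by `needs_synonymy_check()` has to be reviewed for interchangeability before it can stay in the base.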
The corresponding relationships between the CUS and the CUT are also examined in this thesis. The main relationships are: 1) symmetrical and asymmetrical correspondence; 2) equivalent correspondence and correspondence with grammatical variations; and 3) one-to-one correspondence and one-to-many correspondence.

The concept of the CU is oriented toward application. The CUbase, a collection of CUs of different ranks, is intended as the basis for a new generation of Chinese-English dictionaries, in both printed and electronic form, which would make traditional bilingual dictionaries redundant. It would not only facilitate, improve and speed up human translation but also make possible the machine translation of real, natural language in a special domain; if the base is large enough, it would work in unrestricted domains. It could also support further Chinese-English language technology, including word sense disambiguation and quality assurance of translations, as well as language learning.

This thesis examines CUs from the perspective of corpus linguistics. Its central theme is that CUs extracted from parallel corpora can be used in a bilingual lexicon or translation database for the benefit of translators and linguistic researchers. Nevertheless, more work needs to be done in this pioneering area. The methodology and results need to be tested on a larger-scale general corpus, and software should be developed to identify CUs so that parallel corpora can be explored more fully and more automatically.
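The application step, in which larger CUs are matched before smaller ones, can be pictured as a greedy longest-match lookup. The sketch below is an assumption about how such a lookup might work, not the thesis's procedure: the function name `match_cus` is hypothetical, the CUbase is reduced to a plain dictionary from CUS to CUT, matching is done on raw characters without word segmentation, and the entries are invented examples.

```python
def match_cus(text, cubase):
    """Scan text left to right, always trying the longest CUS first.

    Returns a list of (chunk, cut) pairs; cut is None where no CUS
    in the base covers the character.
    """
    max_len = max((len(k) for k in cubase), default=0)
    i, out = 0, []
    while i < len(text):
        # Try the largest candidate first, then shrink.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in cubase:
                out.append((chunk, cubase[chunk]))
                i += n
                break
        else:
            # No CUS starts at this position; emit the character unmatched.
            out.append((text[i], None))
            i += 1
    return out


# Invented entries, for illustration only.
cubase = {
    "语料库": "corpus",
    "平行语料库": "parallel corpus",
    "研究": "research",
}
result = match_cus("平行语料库的研究", cubase)
```

Because the longer entry is tried first, "平行语料库" is matched as a whole and the shorter entry "语料库" nested inside it is not used, which mirrors the "larger ones first" rule.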
Keywords/Search Tags: parallel corpus, Corresponding Unit, corpus-driven, corresponding relationship