
Research On The Automatic Construction Of Chinese-Japanese Parallel Corpus

Posted on: 2013-01-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Y Yin
Full Text: PDF
GTID: 1108330482473169
Subject: Computer applications
Abstract/Summary:
Bilingual corpora are the basis of cross-lingual natural language processing research. The size, coverage and quality of a bilingual corpus directly affect the results of statistical models and algorithms. In addition, bilingual translation knowledge has important practical value in cross-language studies, and the automatic acquisition of translation knowledge has become a bottleneck in the development of such research.

At present, bilingual corpus construction focuses mainly on the Chinese-English language pair. Chinese-Japanese parallel corpora are relatively scarce, and those that have been published are relatively small. This situation has restricted the development of Chinese-Japanese statistical natural language processing research. Because Japanese and English have different linguistic characteristics, existing methods for constructing Chinese-English bilingual corpora may not be suitable for constructing Chinese-Japanese bilingual corpora.

This thesis presents a method for automatically constructing a Chinese-Japanese parallel corpus and a method for automatically acquiring translation knowledge from that corpus. It first studies how to mine Chinese-Japanese bilingual information from the Internet and how to align that information at multiple levels. Then, on the basis of the aligned bilingual corpus, several types of translation knowledge are extracted automatically.

In detail, this thesis makes contributions in the following areas:

1. Internet-based bilingual information mining is studied. After analyzing the structural characteristics of implicit bilingual parallel web pages, a scheme based on document alignment is proposed. The scheme uses title alignment and paragraph alignment to find implicit parallel web pages that may not be found by calculating URL similarity or analyzing the DOM tree alone. Moreover, these alignment steps do not need a bilingual dictionary. For bilingual web pages, a scheme for mining bilingual alignment information is presented. This scheme separates mixed bilingual text by its typesetting features and uses alignment technology to obtain aligned bilingual text. Integrating these two schemes, an Internet-based bilingual information mining system is implemented. The output of the system contains a bilingual dictionary, bilingual sentence pairs and bilingual text.

2. An approach to paragraph alignment and sentence alignment is proposed. With this approach, the bilingual information mined from the web is transformed into parallel corpora. A general paragraph alignment method is proposed, which uses the ratio of document information quantity to align paragraphs. Compared with traditional alignment methods, it is simple and effective. For sentence alignment, a method based on the information quantity ratio and kanji-Chinese character mapping is presented for Chinese-Japanese bilingual news text. Chinese-to-Japanese sentence alignments can be one-to-one, one-to-many, many-to-one or many-to-many. For the case where one Chinese sentence aligns to many Japanese sentences, a Chinese sentence segmentation method is suggested. Based on the ratio of information quantity, a long Chinese sentence is segmented so that the one-to-many alignment becomes a set of one-to-one alignments. This method can increase the proportion of short sentence pairs in a Chinese-Japanese parallel corpus.

3. Automatic word alignment of the Chinese-Japanese parallel corpus is studied. Because the current corpus is small, a hybrid method based on the IBM statistical word alignment model is proposed. The method provides rules for adjusting the output of the IBM model and improves the precision and recall of word alignment. The impact of word segmentation on word alignment is analyzed. A scheme for segmenting long sentence pairs is proposed to increase the proportion of short sentence pairs.

4. Methods for automatic translation knowledge acquisition are studied. Based on the small Chinese-Japanese parallel corpus, a method for automatically constructing a bilingual glossary is proposed. Using an iterative algorithm, the method addresses the problem of indirect association in the statistical process. Named entity translations can be extracted at the same time. The extraction strategy focuses on the translation of person names, place names and organization names.

Experiments show the feasibility and significance of the proposed methods for automatic Chinese-Japanese parallel corpus construction and automatic translation knowledge acquisition.
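
For readers unfamiliar with the statistical word alignment mentioned in the third contribution, the following is a minimal sketch of IBM Model 1 trained with EM. It is not the thesis's hybrid method: the abstract does not say which IBM model is used, and its adjustment rules are not described here, so none are shown. The function names (`train_ibm_model1`, `align`) and the three-sentence toy corpus are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM (IBM Model 1, no NULL word).

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict-like table keyed by (target_word, source_word).
    """
    target_vocab = {tw for _, tgt in pairs for tw in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of each source word
        # E-step: spread each target word over the source words of its sentence
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalise the expected counts into probabilities
        t = defaultdict(
            float,
            {(tw, sw): c / total[sw] for (tw, sw), c in count.items()},
        )
    return t


def align(src, tgt, t):
    """Link each target word to its most probable source word."""
    return [(max(src, key=lambda sw: t[(tw, sw)]), tw) for tw in tgt]


# Hypothetical, pre-segmented toy corpus (Chinese source, Japanese target).
toy_pairs = [
    (["红", "花"], ["赤い", "花"]),
    (["白", "花"], ["白い", "花"]),
    (["红", "车"], ["赤い", "車"]),
]

model = train_ibm_model1(toy_pairs)
print(align(["红", "花"], ["赤い", "花"], model))
```

On a corpus this small the model relies entirely on co-occurrence, which is one reason a small-corpus setting such as the one described above might call for additional rules on top of the statistical output.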
Keywords/Search Tags: parallel corpus, information mining, paragraph alignment, sentence alignment, word alignment