A Bilingual-Constrained Approach For Detecting Chinese Abbreviations

Posted on: 2013-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Liu
GTID: 2298330434475688
Subject: Computer technology
Abstract/Summary:
Abbreviations are widely used in modern Chinese text and are one of the main sources of unknown words, so research on them is important for Chinese information processing. Current work, however, is hampered by the lack of an abbreviation dictionary, which makes the automatic extraction of abbreviations an important research topic.

Traditional methods try to extract abbreviations from raw Chinese text. Because such text offers little information to exploit, abbreviation candidates are poorly matched with their source phrases, and the extracted abbreviation dictionaries have low accuracy. This thesis studies Chinese abbreviation detection based on a parallel corpus. The main contributions are as follows:

1. We propose an approach to extracting Chinese abbreviations from a parallel corpus. We first extract Chinese-English translation phrase pairs from the parallel corpus and take Chinese phrases that share the same English translation as abbreviation candidates, which ensures a good correspondence between abbreviations and their source phrases. We then extract an abbreviation dictionary from the candidates according to linguistic rules. Experiments show that the approach extracts abbreviations with high accuracy and could be an effective way to obtain pairs of Chinese abbreviations and source phrases.

2. For Chinese-English statistical machine translation, we propose an approach to recovering unknown words that are abbreviations, using the abbreviation dictionary extracted from the translation system's training corpus. The recovered forms improve translation because the translation system can then recognize them. To avoid recovery mistakes and to ensure that the recovered words can be recognized by the system, we search for the source phrases in the system's phrase table, constrained by the recovery information obtained from the abbreviation dictionary. Experiments show that the approach recovers abbreviations with high accuracy and improves the translation system.
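
The two contributions above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the toy phrase pairs, and the single filtering rule (the abbreviation's characters must appear in the full form in the same order) are assumptions standing in for the linguistic rules and the phrase table, which the abstract does not specify.

```python
from collections import defaultdict


def is_subsequence(short, full):
    """True if every character of `short` appears in `full` in the same order."""
    remaining = iter(full)
    return all(ch in remaining for ch in short)


def extract_abbreviations(phrase_pairs):
    """Build {abbreviation: set of full forms} from (chinese, english) pairs.

    Step 1: group Chinese phrases that share the same English translation.
    Step 2: keep (short, full) pairs that pass a character-order rule.
    The rule is an assumed stand-in for the thesis's linguistic rules.
    """
    by_english = defaultdict(set)
    for zh, en in phrase_pairs:
        by_english[en.strip().lower()].add(zh)

    dictionary = defaultdict(set)
    for zh_phrases in by_english.values():
        for short in zh_phrases:
            for full in zh_phrases:
                if 2 <= len(short) < len(full) and is_subsequence(short, full):
                    dictionary[short].add(full)
    return dict(dictionary)


def recover(unknown_word, dictionary, phrase_table_sources):
    """Return full forms of `unknown_word` that the phrase table can translate.

    `phrase_table_sources` is a hypothetical set of Chinese source phrases
    present in the translation system's phrase table.
    """
    candidates = dictionary.get(unknown_word, set())
    return sorted(full for full in candidates if full in phrase_table_sources)


if __name__ == "__main__":
    # Toy phrase pairs, invented for illustration only.
    pairs = [
        ("北京大学", "Peking University"),
        ("北大", "Peking University"),
        ("清华大学", "Tsinghua University"),
        ("清华", "Tsinghua University"),
        ("北京", "Beijing"),
    ]
    abbreviations = extract_abbreviations(pairs)
    print(abbreviations)
    print(recover("北大", abbreviations, {"北京大学", "清华大学"}))
```

Restricting recovery to full forms found in the phrase table mirrors the constraint described in the second contribution: a recovered phrase is only useful if the translation system can actually translate it.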
Keywords/Search Tags: abbreviation, automatic extraction, recovery, parallel corpus, machine translation