
Research On Bilingual Alignment For Biomedicine

Posted on: 2010-11-11
Degree: Master
Type: Thesis
Country: China
Candidate: X Chen
Full Text: PDF
GTID: 2178360275458369
Subject: Computer application technology
Abstract/Summary:
In many areas of natural language processing research, bilingual corpora are increasingly important. Different applications call for bilingual corpora aligned at different granularities: article, paragraph, sentence, phrase, and word level. For practical applications, sentence-level and word-level aligned corpora are the most useful. Example-based machine translation, knowledge acquisition, cross-language information retrieval, and similar applications depend on bilingual corpora, and the quality and quantity of the sentence-level and word-level alignments directly determine the quality of the corpus. Bilingual alignment therefore has a strong effect on the performance of these applications. This thesis is part of the 863 Project "Semantics-based cross-language information retrieval platform". We aim to build a biomedical bilingual dictionary of terms in order to improve query translation, which is the first step of cross-language information retrieval. The main work consists of two steps: bilingual sentence alignment and terminology extraction. After a thorough review of bilingual corpus alignment techniques, we obtain the following results:
(1) The second chapter describes a sentence alignment model based on maximum weight matching on a bipartite graph. A length-based method is combined with location information and anchor information, which classify the paragraphs and sentences, to improve the similarity function. In experiments, this gives a better alignment result.
(2) The third chapter aligns sentences using a Gaussian mixture model (GMM) and transfer learning. We treat sentence alignment as a classification problem that can be solved with GMM classifiers and anchor information; this method gives a better result. We also train the alignment model with transfer learning, which makes the alignment model perform better.
(3) By analyzing the biomedical corpus with statistical methods, we extract a bilingual glossary using an iterative re-evaluation algorithm. Taking the characteristics of the biomedical corpus into account, we set the maximum term length to three words and obtain a higher recall rate.
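The sentence alignment idea in result (1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the `ratio` and `alpha` parameters, and the exact form of both similarity scores are assumptions made for illustration, anchor information is left out, and the maximum weight matching is found by exhaustive search, which only works for very small inputs (a real system would use an assignment algorithm such as the Hungarian method).

```python
import math
from itertools import permutations


def length_similarity(src, tgt, ratio=1.0):
    """Score in (0, 1]; 1.0 when len(tgt) equals ratio * len(src).

    `ratio` is a hypothetical expected target/source length ratio.
    """
    expected = len(src) * ratio
    scale = max(expected, len(tgt), 1)
    return math.exp(-abs(len(tgt) - expected) / scale)


def position_similarity(i, j, n_src, n_tgt):
    """Score in [0, 1]; higher when both sentences sit at similar relative positions."""
    rel_i = i / max(n_src - 1, 1)
    rel_j = j / max(n_tgt - 1, 1)
    return 1.0 - abs(rel_i - rel_j)


def align(src_sents, tgt_sents, ratio=1.0, alpha=0.5):
    """Return the 1-to-1 alignment [(src_index, tgt_index), ...] of maximum total weight.

    `alpha` (an assumed parameter) balances length against position information.
    """
    n, m = len(src_sents), len(tgt_sents)
    if n > m:
        raise ValueError("this sketch assumes len(src_sents) <= len(tgt_sents)")
    # Edge weights of the bipartite graph: one weight per (source, target) pair.
    weights = [
        [alpha * length_similarity(s, t, ratio)
         + (1 - alpha) * position_similarity(i, j, n, m)
         for j, t in enumerate(tgt_sents)]
        for i, s in enumerate(src_sents)
    ]
    # Exhaustive search over all injective assignments; for toy inputs only.
    best = max(
        permutations(range(m), n),
        key=lambda perm: sum(weights[i][j] for i, j in enumerate(perm)),
    )
    return list(enumerate(best))
```

Calling `align(source_sentences, target_sentences)` returns index pairs; setting `alpha=1.0` uses length information alone, which shows how much the position term contributes on a given text.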
Keywords/Search Tags:Cross Language Information, Bilingual Corpus, Bilingual Alignment, GMM, Transfer Learning, Iterative re-evaluation algorithm