
Bilingual Term Extraction Based On Parallel Corpus

Posted on: 2016-04-23 | Degree: Master | Type: Thesis
Country: China | Candidate: F J He | Full Text: PDF
GTID: 2308330476454948 | Subject: Computer technology
Abstract/Summary:
Automatic term extraction is an important subject in natural language processing. It is widely used in machine translation, cross-language information retrieval, terminology dictionary construction and so on. In this thesis, we work with a parallel bilingual corpus, and our method builds on monolingual term extraction. A co-occurrence matrix is generated statistically, and a similarity function is used to produce an aligned dictionary. The final results are pairs of translated terms.
This thesis analyzes in detail the technology of bilingual term extraction based on parallel corpora and introduces the word alignment methods in common use today. We treat each candidate term from the monolingual extraction as a single word and then encode the corpus accordingly, thus transforming phrase alignment into word alignment. We also introduce four similarity functions, namely the Dice coefficient, the χ2 test, the log-likelihood ratio (LLR) and mutual information, together with their advantages and disadvantages.
Our work focuses mainly on using a similarity function to align words, generating the aligned dictionary, and extracting bilingual terms. To improve accuracy, (1) we use a monolingual term extractor based on the same principle and method, which reduces the imbalance to a certain extent; (2) we add the alignment results of an HMM model, which improves accuracy in cases where a Chinese multi-word term aligns with an English single-word term.
On this basis, we design an automatic bilingual term extraction system and carry out three experiments: the effect of different similarity functions, the influence of corpus scale on the results, and the improved term extraction method.
We find that the χ2 test is the best-performing function on our corpus; that accuracy rises sharply at first as the corpus grows but levels off as the scale continues to increase; and that accuracy improves by 2.5% when the HMM model is added for word alignment. The automatic term extraction system has been put into use in the Huajian IAT assistant translation system.
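As a rough illustration of the pipeline the abstract describes, the sketch below counts co-occurrences over sentence-aligned pairs and scores each source/target pair with the four similarity functions. It is not the thesis's implementation. The function names (`cooccurrence_counts`, `scores`, `align`), the sentence-level definition of co-occurrence, the use of pointwise mutual information for "mutual information", and the underscore-joined tokens standing in for encoded multi-word terms are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import product


def cooccurrence_counts(pairs):
    """Count term frequencies and co-occurrences over aligned sentence pairs.

    `pairs` is a list of (source_tokens, target_tokens). Multi-word candidate
    terms are assumed to be already encoded as single tokens.
    """
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s, t = set(src), set(tgt)      # count each term once per sentence
        src_freq.update(s)
        tgt_freq.update(t)
        joint.update(product(s, t))    # every source/target pair co-occurs
    return src_freq, tgt_freq, joint, len(pairs)


def scores(a, fs, ft, n):
    """Score one pair from its 2x2 contingency table.

    a: sentence pairs containing both terms; fs, ft: pairs containing the
    source / target term; n: total number of sentence pairs.
    """
    b, c = fs - a, ft - a              # only source term / only target term
    d = n - a - b - c                  # neither term
    dice = 2 * a / (fs + ft)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    # Log-likelihood ratio: 2 * sum(observed * ln(observed / expected)).
    llr = 0.0
    for obs, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)):
        if obs > 0:
            llr += obs * math.log(obs * n / (row * col))
    llr *= 2
    # Pointwise mutual information (assumed variant of "mutual information").
    pmi = math.log2(a * n / (fs * ft)) if a > 0 else float("-inf")
    return {"dice": dice, "chi2": chi2, "llr": llr, "pmi": pmi}


def align(pairs, measure="chi2"):
    """Build a toy aligned dictionary: best-scoring target for each source."""
    src_freq, tgt_freq, joint, n = cooccurrence_counts(pairs)
    best = {}
    for (s, t), a in joint.items():
        value = scores(a, src_freq[s], tgt_freq[t], n)[measure]
        if s not in best or value > best[s][1]:
            best[s] = (t, value)
    return {s: t for s, (t, _) in best.items()}


if __name__ == "__main__":
    # Pinyin stand-ins for Chinese tokens; "jiqi_fanyi" is an encoded term.
    corpus = [
        (["jiqi_fanyi", "xitong"], ["machine_translation", "system"]),
        (["jiqi_fanyi", "moxing"], ["machine_translation", "model"]),
        (["yuliaoku", "xitong"], ["corpus", "system"]),
        (["yuliaoku", "moxing"], ["corpus", "model"]),
    ]
    print(align(corpus, measure="chi2"))
```

Picking the single best-scoring target per source term is the simplest possible decision rule. A real system would more likely apply a score threshold and, as the abstract notes, combine these scores with HMM alignment results.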
Keywords/Search Tags:Parallel corpus, word alignment, co-occurrence matrix, similarity function, bilingual term extraction