Bilingual Dictionary Extraction For Special Domain Based On Web Text Data

Posted on: 2006-06-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Zhang
GTID: 2168360152487482
Subject: Computer software and theory
Abstract/Summary:
Bilingual dictionaries are a basic resource for NLP applications such as cross-language information retrieval (CLIR) and machine translation (MT). As the economy and science develop, more and more new terms emerge, and manual compilation of bilingual dictionaries cannot keep up with demand. Extracting bilingual dictionaries from parallel corpora is an important research direction. However, compared with the large amount of available monolingual text, the amount of available bilingual parallel corpora is still relatively small, which limits the applicability of this approach. Monolingual resources on the Internet, on the other hand, are abundant. Extracting bilingual dictionaries from large-scale, domain-specific monolingual corpora collected from the Internet is therefore both challenging and of practical value. This thesis builds a statistical model to extract a bilingual dictionary from domain-specific text on the Internet.

This thesis proposes a new method for extracting a bilingual dictionary from web data. A rule-based method is applied to mixed multi-language documents and achieves a precision of 82%. A word-relation-matrix method is applied to monolingual documents and achieves a precision of 47%.

This thesis also analyzes, through extensive experiments, the influence of the seed word pairs on bilingual dictionary extraction. The experiments show that both the number and the average frequency of the seed word pairs contribute to the quality of the result.

Finally, this thesis evaluates the extracted dictionary in CLIR. The results show that the dictionary improves the precision and recall of CLIR for domain-specific queries and has no effect on queries outside the domain.
Keywords/Search Tags: Non-Parallel Corpus, Bilingual Dictionary, Vector Space Model (VSM), Seed Word
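
The abstract does not spell out how the word-relation matrix is built, so the following is only a minimal sketch of one common seed-based vector space approach, not the thesis's actual model. It assumes that each word is represented by how often it co-occurs with the seed words of its own language, and that a source word and a target word are likely translations when those two vectors are similar. The function names, the window size, the cosine measure, and the toy romanized corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def context_vector(word, sentences, seed_words, window=2):
    """Count how often each seed word occurs within `window` tokens of `word`.

    The result is one row of a (hypothetical) word-relation matrix whose
    columns are indexed by the seed words, in the order given.
    """
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in seed_words:
                    counts[tokens[j]] += 1
    return [counts[s] for s in seed_words]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_translation(src_word, candidates, src_sents, tgt_sents, seed_pairs, window=2):
    """Pick the candidate whose seed-context vector is closest to that of src_word.

    Seed pairs align the two vector spaces: column k of a source vector and
    column k of a target vector refer to the two halves of the same seed pair.
    """
    src_seeds = [s for s, _ in seed_pairs]
    tgt_seeds = [t for _, t in seed_pairs]
    src_vec = context_vector(src_word, src_sents, src_seeds, window)
    scored = [
        (cosine(src_vec, context_vector(c, tgt_sents, tgt_seeds, window)), c)
        for c in candidates
    ]
    return max(scored)[1]


# Toy, non-parallel monolingual corpora (invented data, romanized target side).
seed_pairs = [("network", "wangluo"), ("data", "shuju"), ("bank", "yinhang")]
src_sents = [["router", "network", "data"], ["router", "network"], ["loan", "bank"]]
tgt_sents = [["luyouqi", "wangluo", "shuju"], ["luyouqi", "wangluo"], ["daikuan", "yinhang"]]

print(best_translation("router", ["luyouqi", "daikuan"], src_sents, tgt_sents, seed_pairs))
```

In a sketch like this, adding more seed pairs adds dimensions to the vectors, and more frequent seed words give denser counts, which is consistent with the abstract's finding that the quantity and average frequency of the seed word pairs affect the result.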