
The Extraction Of Bilingual MWEs From Comparable Corpora

Posted on: 2012-11-10
Degree: Master
Type: Thesis
Country: China
Candidate: Q Liu
Full Text: PDF
GTID: 2218330368988278
Subject: Computer technology
Abstract/Summary:
Extracting and aligning multi-word expressions (MWEs) is one of the most important topics in natural language processing. As a basic resource, bilingual MWEs are crucial to machine translation, information extraction, and information retrieval. This thesis mines comparable corpora. Compared with parallel corpora, comparable corpora are much cheaper to obtain and draw on a broader range of sources. Data mined from Internet resources can be built into a large-scale, high-quality comparable corpus. However, less information can be mined from the source and target documents of a comparable corpus, because fewer resources are available. Further extraction of lexical information from comparable corpora therefore has significant research and application value.

The thesis first introduces background on comparable corpora, their characteristics, and the definition of multi-word expressions. It then describes the sources of comparable corpora and the methods for building them. Second, it describes document preprocessing, in which clustering algorithms enrich the context of documents by increasing the number of documents on the same theme. It then reviews, in turn, existing methods for extracting Chinese and English MWEs and, building on them, proposes its own extraction method. Experiments show that the MWE extraction is effective. The next step aligns the MWE pairs extracted from the comparable corpora. Finally, a modified reordering algorithm is proposed for the candidate translations of Chinese multi-word expressions, and an experiment is designed to test the effectiveness of this method.

We construct a prototype system that automatically extracts bilingual MWEs from comparable corpora and perform three experiments on it: (1) clustering; (2) Chinese and English MWE extraction; (3) Chinese and English MWE alignment. From 30 pairs of corpus documents we extract 685 Chinese MWEs and 769 English MWEs. Aligning these MWEs yields MWE pairs at rates of 24.1%, 37.9%, and 56.6% for Top-5, Top-10, and Top-30, respectively.
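The abstract does not define how the Top-N rates are computed. A common reading, assumed here, is that a source MWE counts as a hit at Top-N when its reference translation appears among the first N ranked candidates. The sketch below illustrates that reading only; the function name `top_n_hit_rate`, the toy candidate lists, and the reference translations are all hypothetical and are not taken from the thesis.

```python
def top_n_hit_rate(ranked_candidates, gold, n):
    """Share of source MWEs whose reference translation is in the first n candidates.

    ranked_candidates: dict mapping a source MWE to its candidate translations,
                       best-ranked first (hypothetical structure).
    gold:              dict mapping a source MWE to its reference translation.
    """
    if not ranked_candidates:
        return 0.0
    hits = 0
    for source, candidates in ranked_candidates.items():
        reference = gold.get(source)
        # A hit requires a known reference that appears within the first n candidates.
        if reference is not None and reference in candidates[:n]:
            hits += 1
    return hits / len(ranked_candidates)


# Toy data, invented for illustration only.
ranked = {
    "信息检索": ["information retrieval", "information extraction", "search engine"],
    "机器翻译": ["machine learning", "machine translation", "translation model"],
    "自然语言处理": ["language model", "speech processing", "natural language processing"],
    "可比语料库": ["parallel corpus", "bilingual dictionary", "text corpus"],
}
gold = {
    "信息检索": "information retrieval",
    "机器翻译": "machine translation",
    "自然语言处理": "natural language processing",
    "可比语料库": "comparable corpus",
}

for n in (1, 2, 3):
    print(f"Top-{n}: {top_n_hit_rate(ranked, gold, n):.2f}")
```

Under this reading the rate can only stay the same or rise as N grows, which is consistent with the reported increase from Top-5 to Top-30.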
Keywords/Search Tags: comparable corpora, Chinese multi-word expressions, English multi-word expressions, multi-word expression alignment, heterogeneous context