A Word Distributed Representation Approach For Bilingual Lexicon Extraction From Comparable Corpora

Posted on: 2018-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: P Chen
GTID: 2348330518987199
Subject: Computer application technology
Abstract/Summary:
Bilingual lexicons are a basic resource for natural language processing tasks such as cross-language information retrieval and machine translation, so bilingual lexicon extraction has long been a focus of research. So far, the performance of algorithms for bilingual lexicon extraction from comparable corpora is still unsatisfactory, and most research concentrates on terminology extraction in specific domains. Recently, neural networks have achieved good results in machine learning and related fields. The distributed representation of words is one of the representative achievements of neural networks in natural language processing, and it has been widely used in sub-areas such as semantic extension and sentiment analysis. A distributed representation not only quantifies words as vectors; within a single language it can also be used directly to calculate the similarity between two words, and it provides built-in smoothing, which makes it well suited to bilingual lexicon extraction from comparable corpora. This thesis applies these advantages of word distributed representations to bilingual lexicon extraction from comparable corpora. The main work consists of two parts.

First, we design and implement a bilingual lexicon extraction algorithm that uses word distributed representations to quantify the correlation between words. Within a single language, distributed representations quantify the correlation between words effectively, and the correlations between a word and other words reflect part of that word's semantic information. Previous studies have shown that the correlations between words have a certain stability. This thesis therefore treats a word's correlations as an important distinguishing feature and uses them to construct the interrelationship matrix between the source language and the target language. The correlation vectors of source-language and target-language words are then mapped into the same vector space by means of a seed dictionary. Finally, bilingual lexicon extraction is completed by calculating the similarity of the words' correlation vectors. The experimental results show that, compared with the classical method based on the vector space model, lexicon extraction based on word distributed representations and word correlations achieves a remarkable improvement in accuracy, especially for high-frequency words.

Second, building on the word correlation model, we propose a bilingual lexicon extraction method based on word co-occurrence to further improve extraction performance. In the multilingual setting, word co-occurrence is an important reflection of a word's semantic information, so we use it as a distinguishing feature of the word when extracting the bilingual lexicon from comparable corpora. Based on this idea, this thesis proposes a method for quantifying word co-occurrence between different languages and integrates co-occurrence, as another important feature of a word, into the word correlation extraction model to form a new evaluation index for word translation. Experiments show that the model integrating word co-occurrence further improves accuracy over the word correlation model.
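
The first method (correlation vectors aligned through a seed dictionary) can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract does not name the similarity measure, so cosine similarity is assumed here, and the function names, the toy two-dimensional embeddings, and the English/French word pairs are all hypothetical. Each word is described by its similarity to the seed words of its own language; because the seed pairs are translations of each other, position i of a source profile and position i of a target profile refer to the same concept, which places both profiles in one shared space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def correlation_vector(word, embeddings, anchors):
    """Describe `word` by its similarity to each anchor word of its own language."""
    return [cosine(embeddings[word], embeddings[a]) for a in anchors]


def extract_translations(src_emb, tgt_emb, seed_pairs, top_k=1):
    """Rank target words for every non-seed source word by profile similarity."""
    src_anchors = [s for s, _ in seed_pairs]
    tgt_anchors = [t for _, t in seed_pairs]
    src_seed, tgt_seed = set(src_anchors), set(tgt_anchors)

    # Target-side correlation profiles, indexed by the seed dictionary order.
    tgt_profiles = {
        w: correlation_vector(w, tgt_emb, tgt_anchors)
        for w in tgt_emb
        if w not in tgt_seed
    }

    result = {}
    for word in src_emb:
        if word in src_seed:
            continue
        profile = correlation_vector(word, src_emb, src_anchors)
        ranked = sorted(
            tgt_profiles,
            key=lambda t: cosine(profile, tgt_profiles[t]),
            reverse=True,
        )
        result[word] = ranked[:top_k]
    return result


# Toy monolingual embeddings (hypothetical). The two spaces are deliberately
# not aligned with each other: only the seed dictionary links them.
SRC_EMB = {
    "dog": [1.0, 0.1],
    "car": [0.1, 1.0],
    "cat": [0.9, 0.2],
    "truck": [0.2, 0.9],
}
TGT_EMB = {
    "chien": [0.1, 1.0],
    "voiture": [1.0, 0.1],
    "chat": [0.2, 0.9],
    "camion": [0.9, 0.2],
}
SEED_PAIRS = [("dog", "chien"), ("car", "voiture")]

if __name__ == "__main__":
    print(extract_translations(SRC_EMB, TGT_EMB, SEED_PAIRS))
    # {'cat': ['chat'], 'truck': ['camion']}
```

In this toy setting "cat" is close to "dog" and far from "car", and "chat" shows the same pattern relative to "chien" and "voiture", so their profiles match even though the raw embedding spaces differ. In practice the embeddings would be trained separately on each side of the comparable corpus and the seed dictionary would be far larger.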
Keywords: Lexicon extraction, Comparable corpus, Word distributed representation, Co-occurrence information