Research On Entity Translation Extraction From Comparable Corpora

Posted on: 2015-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: W Wang
GTID: 2298330422490921
Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the information age, cross-language natural language processing plays an increasingly important role in people's daily lives. Among NLP tasks, entity translation is vital. This paper studies entity translation extraction from comparable corpora. The context-based approach is the most popular method in this research area, and the bilingual seed lexicon is one of its most important resources. However, researchers have paid little attention to the quality of this lexicon. In this paper, we focus on three main problems with the bilingual seed lexicon and put forward corresponding solutions.

First, to address the difference in granularity between the lexicon and the corpora, we propose a new self-adaptive model. We use a word segmentation technique to adapt the segmented corpora, and then propose two weight-allocation strategies and a corresponding filter. Second, we give a compression method to solve the dispersion problem of the bilingual lexicon. We use distributed word representations based on the LDA model and a neural language model, and use the information in the bilingual lexicon to mine the semantic relations between words. A simple and efficient bottom-up hierarchical clustering method then completes the compression task. This method scales well because it does not require external resources and applies to all types of named entities and out-of-vocabulary words (OOVs). Finally, to address the insufficient coverage of the bilingual seed lexicon, we substitute each uncovered word with a list of lexicon words that are highly correlated with it, which expands the lexicon's coverage.

Experimental results show that the proposed methods greatly improve the quality of the bilingual seed lexicon, so that entity translation extraction outperforms the standard approach by approximately 7 percentage points in MRR.
Keywords/Search Tags: entity translation, entity extraction, comparable corpora, distributed word representation, LDA model, neural language model
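
The abstract reports its results in MRR (mean reciprocal rank), the usual metric for ranked translation candidates. As a minimal sketch of how such a score is typically computed, and not the thesis's own evaluation code, the example below averages the reciprocal rank of the first correct candidate over all source entities. The function name, the data layout, and the toy entities are assumptions made for illustration only.

```python
from typing import Dict, List, Set


def mean_reciprocal_rank(
    ranked: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
) -> float:
    """Average of 1/rank of the first correct candidate per source entity.

    ranked: source entity -> candidate translations, best first (assumed layout).
    gold:   source entity -> set of acceptable translations (assumed layout).
    Entities with no correct candidate in their list contribute 0.
    """
    if not gold:
        return 0.0
    total = 0.0
    for entity, answers in gold.items():
        for rank, candidate in enumerate(ranked.get(entity, []), start=1):
            if candidate in answers:
                total += 1.0 / rank
                break  # only the first correct candidate counts
    return total / len(gold)


# Hypothetical toy data: three source entities with ranked candidates.
ranked_candidates = {
    "e1": ["a", "b", "c"],  # correct answer at rank 2
    "e2": ["x", "y"],       # correct answer at rank 1
    "e3": ["p", "q"],       # no correct answer in the list
}
gold_translations = {"e1": {"b"}, "e2": {"x"}, "e3": {"z"}}

mrr = mean_reciprocal_rank(ranked_candidates, gold_translations)
print(f"MRR = {mrr:.3f}")
```

Under this definition, an improvement of about 7 points means the correct translation appears noticeably closer to the top of the candidate list on average.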