Research On Sentence Alignment Method Based On Cross-lingual Word Embeddings

Posted on: 2021-04-27  Degree: Master  Type: Thesis
Country: China  Candidate: Q Lu  Full Text: PDF
GTID: 2428330605468399  Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Parallel corpora are core resources for natural language processing (NLP) tasks such as machine translation, cross-lingual retrieval, and cross-lingual automatic question answering. The scale and quality of bilingual parallel corpora determine the upper limit of the performance of these systems. However, building massive parallel corpora manually is very expensive and time-consuming. Fortunately, large quantities of comparable corpora are available online (Wikipedia, multilingual subtitle websites, etc.). If parallel sentences can be detected and acquired from them automatically, this will greatly enrich the sources of parallel data and improve the performance of natural language processing systems such as machine translation. The research of this thesis therefore has both scientific significance and practical value.

In recent years, researchers have used neural network-based methods to develop new, efficient sentence alignment methods. These methods use neural frameworks to learn sentence representations and then compare the representations to identify parallel sentence pairs. However, the representations they produce are fixed-length embeddings, which cannot fully capture properties of a sentence such as its length. In addition, when comparing vectors, a single measure such as cosine similarity or Manhattan distance cannot fully capture the similarity relation between them.

In this thesis, we introduce a two-level sentence alignment method based on cross-lingual word embeddings, which can be used to extract parallel sentence pairs from comparable corpora under different noise distributions. The two levels are the word level and the sentence level. At the word level, we improve the word similarity measure by combining the advantages of cosine similarity and Manhattan distance. At the sentence level, building on the proposed word similarity measure, we propose a sentence similarity measure based on an aggregation model, which combines word-granularity and sentence-granularity information to calculate sentence similarity. To further improve this technique, we add a margin-based scoring step to match more potential parallel sentence pairs from the corpus.

To evaluate the effectiveness of our method, we conduct experiments on three tasks. First, we evaluate the sentence similarity measure under different configurations. Second, we apply the method to comparable corpora with different noise distributions to evaluate its performance on sentence alignment. Finally, we apply the method to filter a large-scale corpus and evaluate the effect of the filtered corpus on downstream machine translation. The experimental results demonstrate that the proposed method significantly improves sentence similarity measurement and collects high-quality parallel data from comparable corpora under different noise conditions. Moreover, compared with competitive baseline systems, the proposed approach also significantly improves machine translation performance.
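
The abstract does not give the formulas for the two levels or for the margin-based step, so the following is only a minimal sketch of how such a pipeline might be put together, not the thesis's actual method. Everything specific here is an assumption made for illustration: the linear interpolation weight `alpha`, the `1 / (1 + distance)` mapping of Manhattan distance to a similarity, the best-match averaging used as the aggregation, and the ratio-style margin over the `k` nearest neighbours. Word embeddings are assumed to be plain lists of floats in a shared cross-lingual space.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def manhattan_similarity(u, v):
    """Map Manhattan distance into (0, 1]. This mapping is an assumption."""
    return 1.0 / (1.0 + sum(abs(a - b) for a, b in zip(u, v)))


def word_similarity(u, v, alpha=0.5):
    """Hypothetical word-level measure: interpolate the two similarities."""
    return alpha * cosine(u, v) + (1.0 - alpha) * manhattan_similarity(u, v)


def sentence_similarity(src, tgt, alpha=0.5):
    """Hypothetical aggregation: average best word match in both directions.

    src and tgt are lists of word vectors for one sentence each.
    """
    if not src or not tgt:
        return 0.0
    fwd = sum(max(word_similarity(s, t, alpha) for t in tgt) for s in src) / len(src)
    bwd = sum(max(word_similarity(s, t, alpha) for s in src) for t in tgt) / len(tgt)
    return (fwd + bwd) / 2.0


def margin_scores(sim, k=2):
    """Ratio-style margin over a similarity matrix sim[i][j].

    Each score is divided by the mean of the average top-k similarity of its
    source row and of its target column, so a pair scores highly only if it
    stands out from its neighbours. Assumes positive similarities.
    """
    n_src, n_tgt = len(sim), len(sim[0])
    row_avg = [sum(sorted(row, reverse=True)[:k]) / min(k, n_tgt) for row in sim]
    cols = [[sim[i][j] for i in range(n_src)] for j in range(n_tgt)]
    col_avg = [sum(sorted(col, reverse=True)[:k]) / min(k, n_src) for col in cols]
    return [
        [sim[i][j] / ((row_avg[i] + col_avg[j]) / 2.0) for j in range(n_tgt)]
        for i in range(n_src)
    ]
```

In use, one would compute `sentence_similarity` for every candidate source-target pair, pass the resulting matrix to `margin_scores`, and keep pairs whose margin score exceeds a threshold tuned on held-out data; the threshold itself is likewise not specified in the abstract.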
Keywords/Search Tags: Sentence Alignment, Parallel Corpus, Comparable Corpus, Cross-lingual Word Embedding, Machine Translation