
A Study of Document Representation and Bilingual Word Embeddings

Posted on: 2019-08-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y W J Ou
Full Text: PDF
GTID: 2428330542994219
Subject: Computer application technology

Abstract/Summary:
Document representation and bilingual word embeddings are two important text representation learning techniques in natural language processing; they provide good feature representations for other natural language processing tasks. These two directions are the main research content of this dissertation.

Document representation encodes a document as a fixed-length vector. Existing work simply treats a document as a flat sequence of text: it neither considers the hierarchical structure within the document nor accounts for the differing importance of its different parts. This dissertation proposes a document representation model based on a hierarchical attention mechanism (HADR), which takes into account both the differences in importance among the sentences of a document and among the words within each sentence. The experimental results show that modeling the differing importance of words and sentences yields better document representations: on document sentiment classification, the HADR model outperforms the Doc2vec and word2vec models.

With the successful application of representation learning to single languages, and driven by the needs of cross-lingual natural language processing tasks, some methods have begun to study cross-lingual text representation and to build bilingual word embedding models. Bilingual word embeddings are a technique that both represents different languages in the same latent vector space and enables knowledge transfer across languages. To learn such representations, most existing works require parallel sentences with word-level alignments and assume that aligned words have similar Bag-of-Words (BoW) contexts. However, due to differences in grammatical structure among languages, the contexts of aligned words in different languages may appear at different positions in the sentence. To address this cross-lingual syntactic divergence, we propose a bilingual word embedding model integrating syntactic dependencies (DepBiWE), which uses dependency parse trees to encode the accurate relative positions of the contexts of aligned words. In addition, a new method is proposed to learn bilingual word embeddings jointly from dependency-based contexts and BoW contexts. Extensive experimental results on a real-world dataset clearly validate the superiority of the proposed DepBiWE model on various natural language processing tasks.
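The two-level attention idea behind HADR (words weighted within each sentence, sentences weighted within the document) can be sketched as follows. This is a minimal NumPy illustration, not the thesis's actual HADR implementation: the toy dimensions, random embeddings, and the learned context (query) vectors `word_context` and `sent_context` are all hypothetical stand-ins for trained parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(vectors, context):
    # Score each vector against a (normally learned) context vector,
    # then combine the vectors using the resulting attention weights.
    scores = vectors @ context
    weights = softmax(scores)
    return weights @ vectors, weights

rng = np.random.default_rng(0)
d = 4
word_context = rng.normal(size=d)  # hypothetical learned word-level query
sent_context = rng.normal(size=d)  # hypothetical learned sentence-level query

# A toy document: 3 sentences, each a (num_words, d) matrix of word embeddings.
doc = [rng.normal(size=(5, d)) for _ in range(3)]

# Level 1: attention over words produces one vector per sentence.
sent_vecs = np.stack([attend(words, word_context)[0] for words in doc])
# Level 2: attention over sentence vectors produces the document vector.
doc_vec, sent_weights = attend(sent_vecs, sent_context)

print(doc_vec.shape)       # (4,) -- a fixed-length document representation
print(sent_weights.sum())  # attention weights sum to 1 (up to float error)
```

The point of the hierarchy is that `sent_weights` exposes which sentences dominate the document vector, just as the word-level weights expose which words dominate each sentence vector.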
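The contrast between BoW contexts and dependency-based contexts that motivates DepBiWE can be illustrated on a single toy sentence. The hand-written dependency parse below (head indices and relation labels) is a hypothetical example, not output of the thesis's system; it only shows why a word's syntactic neighbors need not be its linear neighbors.

```python
# Toy sentence with a hand-written dependency parse (hypothetical).
sentence = ["she", "reads", "books", "quickly"]
heads = [1, -1, 1, 1]  # head index per token; -1 marks the root
rels = ["nsubj", "root", "obj", "advmod"]

def bow_contexts(tokens, window=2):
    # Bag-of-Words contexts: the tokens within a fixed linear window.
    ctx = {}
    for i, tok in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        ctx[tok] = [tokens[j] for j in range(lo, hi) if j != i]
    return ctx

def dep_contexts(tokens, heads, rels):
    # Dependency-based contexts: each word's head (marked with an inverse
    # relation) and its dependents, labelled with the dependency relation.
    ctx = {tok: [] for tok in tokens}
    for i, h in enumerate(heads):
        if h < 0:
            continue
        ctx[tokens[i]].append(f"{tokens[h]}/{rels[i]}^-1")
        ctx[tokens[h]].append(f"{tokens[i]}/{rels[i]}")
    return ctx

print(bow_contexts(sentence)["reads"])
# ['she', 'books', 'quickly']
print(dep_contexts(sentence, heads, rels)["reads"])
# ['she/nsubj', 'books/obj', 'quickly/advmod']
```

In a language with different word order, the BoW window around the word aligned to "reads" would capture different neighbors, while its dependency contexts (subject, object, modifier) stay comparable, which is why joint training over both context types can help align the two embedding spaces.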
Keywords/Search Tags:Document representation, Attention, Unsupervised learning, Bilingual word embeddings, Syntactic dependencies