
The Study And Application Of Text Embeddings With Deep Learning Technique

Posted on: 2017-05-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Yu
Full Text: PDF
GTID: 1108330485472911
Subject: Software engineering
Abstract/Summary:
Text embedding is an approach that represents texts as dense, low-dimensional, real-valued vectors. With the development of deep learning, using neural networks to learn text embeddings has become a research focus in natural language processing, especially for word embeddings. As a basic semantic unit, the word plays a very important role in understanding sentences and documents. Much work has been proposed to learn word embeddings, and such representations are greatly helpful in most natural language processing tasks. Traditional word representations, such as the one-hot representation and matrix-based representations, usually suffer from data sparsity and high dimensionality. Word embeddings, by contrast, have two distinct advantages: i) low dimensionality, and ii) semantic similarity, i.e., semantically similar words have similar embeddings, and their similarity can be measured by the distance between the embeddings.

We first investigate and analyze several popular models for learning word embeddings, and then propose a novel neural network that successfully encodes isA relationships into a new kind of word embedding. Moreover, to extend the vectorization technique to short texts and long documents, we further design and implement comprehensive methods to encode the semantics of short texts and documents into embeddings. Finally, we investigate the practical value of all these methods in various NLP tasks. The main contributions are listed below:

1. IsA word embeddings. The isA relationship has strong generalization ability and plays an important role in text understanding and relationship inference. In this work, we propose a novel neural network that effectively encodes isA relationships into word embeddings. Using the isA embeddings as features, we further propose two prediction models for relationship identification.
Specifically, given any two words, the models are used to identify whether they hold a hypernymy relationship or a head-modifier relationship, respectively.

2. The vectorization of short texts. This work consists of semantic enrichment and semantic hashing. Many applications need to handle short texts, such as tweets, search queries and news titles. However, measuring the similarity between any two short texts is difficult, because short texts usually lack sufficient statistical information and do not follow grammar rules. To solve these problems, this thesis proposes a method that combines a mechanism for enriching short texts with a deep neural network for semantic hashing, in order to effectively understand short texts. Concretely, based on a probabilistic semantic network named Probase, we first propose a mechanism to enrich every word in a short text with its concepts and co-occurring words. Then we design a deep neural network to map the short text to a compact binary code, which can be regarded as the embedding of that short text. Finally, the similarity between any two short texts can be measured by the Hamming distance between their embeddings.

3. The vectorization of documents. Based on pre-trained word embeddings, we further investigate how to effectively represent documents as embeddings, and use them for document clustering and classification. Unlike traditional document representation methods, our approach aims to encode the most typical semantics of a document into the embedding, rather than its whole contents. The basic idea is that, given a document, we pick the most representative clusters from its word clusters, and then generate the embedding for the document.

Finally, we perform comprehensive experiments to show the reliability and effectiveness of our proposed methods for learning embedding representations of words, short texts and documents.
These embedding representations are also shown to help many text-related tasks, including classification, clustering, information retrieval and relationship identification, achieve better results.
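The two similarity measures described above can be sketched in a few lines of Python. This is only an illustration, not the thesis's implementation: the embedding values and binary codes below are invented toy examples, and real embeddings learned by the proposed models would be much higher-dimensional.

```python
import numpy as np

# Hypothetical 4-dimensional word embeddings (illustrative values only).
emb = {
    "king":  np.array([0.8, 0.1, 0.6, 0.2]),
    "queen": np.array([0.7, 0.2, 0.6, 0.3]),
    "apple": np.array([0.1, 0.9, 0.2, 0.8]),
}

def cosine_similarity(a, b):
    """Semantic similarity between two real-valued embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hamming_distance(a, b):
    """Number of differing bits between two binary short-text codes."""
    return int(np.sum(a != b))

# Semantically similar words should yield a higher cosine similarity.
assert cosine_similarity(emb["king"], emb["queen"]) > \
       cosine_similarity(emb["king"], emb["apple"])

# Hypothetical 8-bit binary codes for two short texts; they differ in one bit.
code_a = np.array([1, 0, 1, 1, 0, 0, 1, 0])
code_b = np.array([1, 0, 1, 0, 0, 0, 1, 0])
print(hamming_distance(code_a, code_b))  # -> 1
```

Because the binary codes are fixed-length bit vectors, comparing two short texts reduces to a cheap bitwise comparison, which is what makes the Hamming-distance formulation attractive for large-scale retrieval.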
Keywords/Search Tags:Natural Language Processing, Deep Learning, Neural Networks, Word Embeddings, Text Embeddings