
Research on Text Vector Representations and Modelling Based on Neural Networks

Posted on: 2017-02-06  Degree: Master  Type: Thesis
Country: China  Candidate: L Q Niu  Full Text: PDF
GTID: 2308330485962281  Subject: Computer technology
Abstract/Summary:
Text representation and modelling are fundamental tasks in the Natural Language Processing (NLP) area. Traditional text representation methods based on Bag-of-Words (BOW) models are simple, efficient, and scalable, but they suffer from the curse of dimensionality, data sparsity, and an inability to capture semantic information. Recently, with the rapid progress of big data and deep learning technologies in speech, image, and bioinformatics applications, researchers have begun applying deep neural networks (DNNs) to NLP. In particular, Collobert and Weston applied DNN-based word vector representations to a wide range of NLP tasks in 2008, Google researchers learned distributed word representations with neural network language models (NNLMs) in 2013, and an increasing number of neural text embedding methods have followed.

This thesis studies NNLM-based text vector representations and topic models. We first briefly introduce traditional N-gram statistical language models and neural network language models, and review traditional word representation methods as well as the Word2Vec model, which learns distributed word representations. The thesis then extends these basic models and methods in the following aspects:

1. Latent Dirichlet Allocation (LDA), which mines the thematic structure of documents, plays an important role in natural language processing and machine learning. However, the probability distributions produced by LDA only describe statistical co-occurrence relationships in the corpus, and in practice probabilities are often not the best choice for feature representations. Recently, embedding methods such as Word2Vec and Doc2Vec have been proposed to represent words and documents by learning their essential concepts, and these embedded representations have proved more effective than LDA-style representations in many tasks. We therefore propose Topic2Vec, an approach that learns topic representations in the same semantic vector space as words, as an alternative to probabilities (a code sketch of this idea follows the list below). Experimental results show that Topic2Vec models topics better.

2. Distributed word representations have achieved great success in NLP. However, most distributed models focus on local context and learn task-specific representations individually, and therefore cannot fuse multiple attributes or learn jointly. We propose a unified framework that jointly learns distributed representations of words and their attributes (characteristics of a word), considering three types of attributes: topic, lemma, and document. Besides learning distributed attribute representations, we find that the additional attributes also help improve the word representations themselves. Experiments evaluating the learned topic representations, document representations, and improved word representations show that our models achieve competitive results.

3. While perception tasks such as visual object recognition and text understanding play an important role in human intelligence, the subsequent tasks involving inference, reasoning, and decision-making require an even higher level of intelligence. Recent years have seen major advances in many perception tasks using deep learning models. For higher-level inference, however, probabilistic graphical models with their Bayesian nature remain more powerful and flexible. To achieve integrated intelligence that covers both perception and inference, it is natural to tightly integrate deep learning and Bayesian models. This thesis therefore considers fusing neural word representations with Latent Dirichlet Allocation (LDA). In particular, we apply word embeddings to LDA to improve topic models, proposing word embedding cluster prior LDA, context-aware LDA, and word embedding enhanced LDA models (the second sketch below illustrates the cluster-prior idea). Experimental results show that using word representations improves LDA's performance.
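The Topic2Vec idea from item 1 — placing topic vectors and word vectors in one semantic space — can be illustrated with a minimal sketch. The code below is only an illustration under assumed tooling (gensim's LdaModel and Word2Vec): it assigns each word its dominant LDA topic, interleaves pseudo-tokens such as TOPIC_0 with the words, and trains an ordinary skip-gram model over the mixed stream. The thesis's actual Topic2Vec objective and implementation may differ.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

# Toy corpus; in practice these would be full tokenized documents.
docs = [
    ["neural", "network", "language", "model", "word", "embedding"],
    ["latent", "dirichlet", "allocation", "topic", "model", "corpus"],
    ["word", "embedding", "semantic", "vector", "representation"],
]

dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(bow, id2word=dictionary, num_topics=2, passes=20, random_state=0)

def dominant_topic(word):
    """Topic with the highest P(topic | word) under the fitted LDA model."""
    scores = lda.get_term_topics(dictionary.token2id[word], minimum_probability=0.0)
    return max(scores, key=lambda t: t[1])[0] if scores else 0

# Interleave a pseudo-token such as "TOPIC_1" before every word so that
# skip-gram learns vectors for topics and words in the same space.
mixed_stream = [
    [tok for w in doc for tok in (f"TOPIC_{dominant_topic(w)}", w)]
    for doc in docs
]

w2v = Word2Vec(mixed_stream, vector_size=50, window=3, min_count=1, sg=1, epochs=100)
print(w2v.wv.most_similar("TOPIC_0", topn=3))  # words nearest to one topic vector
```

After training, topic pseudo-tokens and ordinary words share a single vector space, so nearest-neighbour queries between topics and words are meaningful — the property item 1 exploits as an alternative to probability-based topic representations.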
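For item 3, one possible reading of the "word embedding cluster prior" idea is sketched below, reusing the variables from the previous sketch: word vectors are clustered with k-means, and each cluster is turned into an asymmetric Dirichlet prior (eta) over the corresponding topic's word distribution, nudging every topic toward one semantically coherent cluster. This construction is an assumption made for illustration, not the thesis's exact model; the context-aware and word-embedding-enhanced LDA variants are not shown.

```python
import numpy as np
from sklearn.cluster import KMeans
from gensim.models import LdaModel

# Reuse `dictionary`, `bow`, and the trained `w2v` model from the sketch above.
num_topics = 2
vocab = [dictionary[i] for i in range(len(dictionary))]
X = np.stack([w2v.wv[w] for w in vocab])  # one embedding per vocabulary word

# Cluster the word vectors; cluster k will seed topic k.
clusters = KMeans(n_clusters=num_topics, n_init=10, random_state=0).fit_predict(X)

# Asymmetric topic-word prior: small base value everywhere, boosted where a
# word's embedding cluster matches the topic index.
eta = np.full((num_topics, len(vocab)), 0.01)
for word_idx, cluster_id in enumerate(clusters):
    eta[cluster_id, word_idx] += 1.0

lda_prior = LdaModel(bow, id2word=dictionary, num_topics=num_topics, eta=eta,
                     passes=20, random_state=0)
for k in range(num_topics):
    print(lda_prior.print_topic(k, topn=5))  # topics biased toward their clusters
```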
Keywords/Search Tags: Natural Language Processing (NLP), Text Representation, Deep Learning, Neural Networks, Text Modelling, Topic Models, Word Embeddings, Topic, Document, Framework, Latent Dirichlet Allocation (LDA)