
Research on Semantic Vector Representations and Modeling Approaches for Text

Posted on: 2019-08-07
Degree: Master
Type: Thesis
Country: China
Candidate: P X Chen
Full Text: PDF
GTID: 2428330542494084
Subject: Information and Communication Engineering
Abstract/Summary:
Text embedding is a technique for discovering and learning the semantic information contained in texts and representing texts as real-valued vectors, so that computers can carry out subsequent natural language processing (NLP) tasks. The most naive but common approach to text representation is bag-of-words (BOW), which is simple, efficient, and often achieves satisfactory performance. Although BOW makes effective use of word frequency information, it fails to capture the sequential relationships between words or the topical relevance of different words, and it suffers from high dimensionality and data sparsity. In recent years, researchers have proposed word embedding approaches, which learn the semantic information of words from large amounts of unlabeled data and efficiently estimate distributed representations of words in a continuous vector space, where similar words tend to lie close to each other. Word embeddings form the foundation of neural-network-based text representation: by combining the word vectors in a sentence or document, neural networks can learn better semantic representations of texts, which benefits many NLP tasks such as text classification, text clustering, sentiment analysis, sentence semantic matching, and automatic question answering (QA).

Focusing on sentence semantic matching, document classification, and document clustering, this thesis studies both sentence representation and document representation. In sentence semantic matching, the semantic contents of two sentences are first encoded as vectors by neural networks, and the two vectors are then fed into a multi-layer perceptron to predict the semantic relationship between them. In state-of-the-art matching models, sentences are usually encoded by Long Short-Term Memory (LSTM) networks. Despite the powerful sequence modeling ability of LSTM, the serial computation of its recurrent structure is highly time-consuming. This thesis instead uses a Convolutional Neural Network (CNN) to encode sentences within the matching framework, taking advantage of the parallel computation of CNNs. To further control the flow of information, and inspired by the gating mechanism of LSTM, we introduce an output gate, a forget gate, and memory cells into the CNN: the contextual information produced by previous convolution layers is modulated by the forget gate and stored in the memory cells, while the output gate modulates the outputs of the current convolution layer. Experiments and analysis show that the gating mechanism effectively improves the semantic modeling ability of the CNN. A sketch of such a gated convolutional matcher is given after this section.

In document classification, classical probabilistic topic models are commonly used for text modeling: they represent documents as low-dimensional vectors in a latent topic space by analyzing word co-occurrences within documents. Recently, text classification based on neural networks has achieved remarkable performance and become the mainstream. Exploiting the discriminative ability of supervised neural networks, this thesis uses them to extract distributed semantic features of documents, and then combines the semantic features learned by different neural networks, or the latent topic features inferred by topic models, into a more discriminative representation. An SVM classifier is then used to predict the topic category.
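To make the bag-of-words versus word-embedding contrast from the opening paragraph concrete, here is a minimal sketch using scikit-learn's CountVectorizer for BOW and gensim's Word2Vec for distributed word vectors. The toy corpus and all parameter values are illustrative assumptions, not settings from the thesis.

```python
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec

corpus = [
    "the cat sat on the mat",
    "the dog lay on the rug",
]

# Bag-of-words: one dimension per vocabulary word, counts only;
# no word order, no topical relatedness, dimensionality grows with the vocabulary.
bow = CountVectorizer()
X = bow.fit_transform(corpus)             # sparse (n_docs, vocab_size) matrix
print(bow.get_feature_names_out())
print(X.toarray())

# Word embeddings: dense low-dimensional vectors learned from context,
# so words used in similar contexts end up close in the vector space.
tokens = [doc.split() for doc in corpus]
w2v = Word2Vec(sentences=tokens, vector_size=16, window=2, min_count=1, epochs=100)
print(w2v.wv["cat"][:4])                  # a dense real-valued word vector
print(w2v.wv.similarity("cat", "dog"))    # cosine similarity between two words
```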
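The following is a minimal sketch of one way to realize the gated convolutional sentence encoder and MLP matcher described above, assuming a particular cell-update rule, layer count, pooling choice, and feature combination; the thesis's exact architecture and hyperparameters may differ.

```python
import torch
import torch.nn as nn

class GatedConvLayer(nn.Module):
    """1-D convolution with LSTM-style forget/output gates and a memory cell."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.candidate = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.forget_gate = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.output_gate = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, x, cell):
        f = torch.sigmoid(self.forget_gate(x))   # modulates what the cell keeps
        o = torch.sigmoid(self.output_gate(x))   # modulates this layer's output
        cell = f * cell + (1 - f) * torch.tanh(self.candidate(x))
        return o * torch.tanh(cell), cell

class CNNMatcher(nn.Module):
    """Encode two sentences with stacked gated convolutions, then predict
    their semantic relationship with a small multi-layer perceptron."""
    def __init__(self, dim=128, layers=3, classes=3):
        super().__init__()
        self.layers = nn.ModuleList(GatedConvLayer(dim) for _ in range(layers))
        self.mlp = nn.Sequential(nn.Linear(4 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, classes))

    def encode(self, x):                  # x: (batch, dim, seq_len) word vectors
        cell = torch.zeros_like(x)
        for layer in self.layers:
            x, cell = layer(x, cell)
        return x.max(dim=2).values        # max-pool over positions -> (batch, dim)

    def forward(self, a, b):
        va, vb = self.encode(a), self.encode(b)
        feats = torch.cat([va, vb, torch.abs(va - vb), va * vb], dim=1)
        return self.mlp(feats)            # logits over relationship classes

model = CNNMatcher()
s1 = torch.randn(2, 128, 20)              # stand-in embedded sentence pair
s2 = torch.randn(2, 128, 24)
logits = model(s1, s2)                    # shape (2, 3)
```

Because every position of a convolution layer is computed at once, both sentences can be encoded in parallel, unlike the token-by-token recurrence of an LSTM.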
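As a sketch of the feature-combination idea for classification, the snippet below concatenates two document feature views before a linear SVM; the random matrices stand in for features from a neural encoder and topic proportions from a topic model, and are not real data.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_docs = 400
cnn_feats = rng.normal(size=(n_docs, 128))           # stand-in: neural document features
lda_feats = rng.dirichlet(np.ones(50), size=n_docs)  # stand-in: latent topic proportions
labels = rng.integers(0, 4, size=n_docs)             # stand-in: topic categories

# Concatenate the two views so the SVM sees both semantic and topical evidence.
combined = np.hstack([cnn_feats, lda_feats])
clf = make_pipeline(StandardScaler(), LinearSVC())
print(cross_val_score(clf, combined, labels, cv=5).mean())
```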
The experimental results show that different document feature representations have complementary strengths, and combining them improves the accuracy and robustness of the topic classification system.

Document clustering is a typical unsupervised task: documents are first represented as fixed-dimensional feature vectors, and a conventional clustering algorithm then partitions them into groups. Compared with unsupervised models, supervised models can encode more discriminative features. In view of this, we propose a pseudo-supervised semantic vector learning approach based on consensus analysis. In this approach, a neural network is trained in a supervised fashion with pseudo-labels, which are obtained by pre-clustering unsupervised document representations. To improve the quality of the pseudo-labels, consensus analysis is employed to select reliable training samples for the network. The experimental results demonstrate that the proposed approach generates more discriminative semantic vectors and significantly improves clustering performance.
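Below is a minimal sketch of the pseudo-supervised idea, assuming consensus is taken over several k-means runs with different seeds and that cluster labels are aligned by maximum-overlap matching; the thesis may instead take consensus over clusterings of different representations, and it trains a neural network where a linear model stands in here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def align(reference, labels, k):
    """Relabel `labels` so its cluster ids best match `reference` (max overlap)."""
    overlap = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            overlap[i, j] = np.sum((reference == i) & (labels == j))
    rows, cols = linear_sum_assignment(-overlap)
    mapping = {c: r for r, c in zip(rows, cols)}
    return np.array([mapping[l] for l in labels])

X = np.random.RandomState(0).randn(600, 64)   # stand-in unsupervised doc vectors
k = 5
runs = [KMeans(n_clusters=k, n_init=10, random_state=s).fit_predict(X)
        for s in range(3)]                    # several pre-clustering runs
aligned = [align(runs[0], r, k) for r in runs]
agree = np.all(np.stack(aligned) == aligned[0], axis=0)  # consensus samples

# Train a supervised model only on samples whose pseudo-labels all runs agree on.
pseudo = aligned[0]
clf = LogisticRegression(max_iter=1000).fit(X[agree], pseudo[agree])
new_vectors = clf.decision_function(X)        # more discriminative representation
```

Filtering to the consensus subset trades training-set size for pseudo-label quality, which is what lets the supervised model learn sharper decision boundaries than the original unsupervised vectors provide.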
Keywords/Search Tags: Text Representation, Artificial Neural Network, Semantic Matching, Document Classification, Document Clustering, Consensus Analysis, Pseudo-Supervised