
Research on Local Smoothness Preservation in Manifold-Regularized Autoencoders for Text Representation

Posted on: 2018-06-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Wei
Full Text: PDF
GTID: 1318330566955931
Subject: Information and Communication Engineering
Abstract/Summary:
With the development of information technology and the accumulation of massive unstructured text data, text mining has become increasingly important. Text representation transforms the large volume of unstructured or semi-structured character information in a corpus into a simple, concise structured form; it is a key step in text mining and is widely used in text classification, clustering, and retrieval. The high-dimensional, sparse, and lexically correlated nature of text data hinders the development of text representation theory and technology. Most existing methods assume that the feature space of text is Euclidean, that is, that words are mutually independent, and therefore cannot make full use of the semantic features of the text. If richer semantic information can be extracted, such as the local Euclidean structure of neighboring texts, the feature smoothness of the low-dimensional manifold space of neighboring texts can be improved, and the quality of the text representation raised more effectively.

Based on manifold learning theory, this thesis studies low-dimensional dense-vector text representation by preserving the local smoothness of the text representation vectors. First, a text similarity measure is developed by combining the distributed semantic features of word embeddings. Then, based on this similarity, a text proximity graph is constructed to identify locally neighboring texts. Finally, a parametric topic encoding function is built by preserving the locally weighted embedding of the topics of neighboring texts (text topic modeling), and a smooth affine map for text embedding is built via an approximation of the probabilistic generative structure of the subspace (text embedding representation). The main contributions and innovations of this thesis are as follows:

1. A text similarity measure is proposed that combines the syntagmatic and paradigmatic structure of words. It makes full use of the semantic relations between words and efficiently improves similarity accuracy for texts with few co-occurring words. Existing text similarity measures fail to consider the semantic relations of words and perform poorly on texts with low word co-occurrence; to address this, the proposed measure combines the syntagmatic and paradigmatic structure of distributed semantic features. First, an autoencoder is trained to combine the syntagmatic and paradigmatic structure of words, and the word embedding encoder network is obtained from this training. Then, text similarity is calculated as the maximum weighted matching distance between word embeddings (a sketch of this matching step follows this item). Experiments on word embeddings and on text similarity are carried out on the Wikipedia 2010, 20 Newsgroups, and RCV1 corpora. In the word analogy experiment, the accuracy of the word embeddings is 73.95%; in the word similarity experiment, their Spearman rank correlation is 74.12. These results show that combining the syntagmatic and paradigmatic structure of words expresses richer distributed semantic information. In the text similarity experiments, text clustering with the maximum weighted matching distance reaches an NMI of 63.1%, and text classification reaches 71.59%. The results show that the maximum weighted matching distance between word embeddings can effectively exploit the semantic relations between words and thus further improve the accuracy of the text similarity measure.
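The following is a minimal sketch of that matching step, assuming pretrained word embeddings are available as a word-to-vector lookup and using cosine similarity as the edge weight; the thesis's own encoder network (trained on syntagmatic and paradigmatic word structure) and its exact weighting are not reproduced here.

    # Sketch: text similarity as a maximum-weight bipartite matching between
    # the word embeddings of two texts. `emb` (a dict of word -> numpy vector)
    # and the cosine weighting are illustrative assumptions.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def cosine(u, v):
        # Cosine similarity between two embedding vectors.
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    def matching_similarity(tokens_a, tokens_b, emb):
        # Keep only in-vocabulary words of each text.
        A = [emb[w] for w in tokens_a if w in emb]
        B = [emb[w] for w in tokens_b if w in emb]
        if not A or not B:
            return 0.0
        # Pairwise similarity matrix between the two word sets.
        S = np.array([[cosine(u, v) for v in B] for u in A])
        # linear_sum_assignment minimizes cost, so negate to maximize weight.
        rows, cols = linear_sum_assignment(-S)
        # Average matched similarity, normalized by the longer text.
        return S[rows, cols].sum() / max(len(A), len(B))

Because every word is paired with its semantically closest counterpart rather than requiring exact overlap, two texts that share few or no words can still obtain a high similarity, which is the behavior the measure above targets.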
2. A topic modeling framework based on autoencoders is proposed that preserves the locally weighted embedding of the topics of neighboring texts. It builds a parametric topic encoder network for out-of-sample (OOS) topic modeling, and further improves text classification and clustering by preserving the geometric structure of the local text neighborhood (a sketch of the regularizer follows this item). Manifold-based text modeling methods cannot build a parametric topic encoder for out-of-sample texts, and existing OOS methods fail to effectively preserve the smoothness of the topic probability generative structure of locally neighboring texts; to address this, a method called the locally weighted embedding topic model (LWE-TM) is proposed. LWE-TM first uses the conditional visit probabilities of a low-rank approximate random walk to compute the weighting coefficients of the locally neighboring texts. It then regularizes the training of the autoencoders so that the topic encoding explicitly preserves local geometric smoothness, and builds a parameterized text topic encoder network. Text modeling, clustering, and classification experiments with out-of-sample topic encoding are performed on the 20 Newsgroups and RCV1 text sets. In the text modeling experiment, the perplexity is 679 and 1800 on the two data sets, respectively. In the text clustering experiment, the NMI of LWE-TM improves to close to 74%; in the text classification experiment, LWE-TM achieves 86.59%. The experimental results show that LWE-TM can use the learned parametric topic encoder network to perform OOS topic modeling effectively, and that preserving the smoothness of the local geometric structure of neighboring texts improves the smoothness of the topic coding, yielding gains in topic modeling, text clustering, and classification.
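As a minimal sketch of the locality-preserving regularizer in LWE-TM, the loss below adds a weighted pairwise penalty on the topic codes of neighboring texts to the usual reconstruction term, assuming PyTorch; `encoder`, `decoder`, the weight matrix `W`, and the coefficient `lam` are illustrative stand-ins, not the thesis's published implementation.

    # Sketch: autoencoder loss with a locally weighted embedding penalty.
    # W[i, j] stands for the neighbor weight of texts i and j, e.g. the
    # conditional visit probability of the low-rank random walk above.
    import torch
    import torch.nn.functional as F

    def lwe_loss(encoder, decoder, X, W, lam=0.1):
        Z = encoder(X)                     # topic codes for a batch of texts
        recon = F.mse_loss(decoder(Z), X)  # plain autoencoder reconstruction
        # Smoothness term: sum_ij W_ij * ||z_i - z_j||^2 pulls the codes
        # of strongly connected neighbors together.
        D = torch.cdist(Z, Z).pow(2)
        smooth = (W * D).sum() / W.sum().clamp(min=1e-12)
        return recon + lam * smooth

Minimizing this joint loss keeps reconstruction quality while making the learned topic coding vary smoothly over the text proximity graph, which is what allows the trained encoder to be applied directly to out-of-sample texts.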
3. A regularized text embedding autoencoder is proposed that preserves the probabilistic generative structure of topics around a subspace, improving text classification and clustering by preserving the smoothness of the topic structure of neighboring texts. Existing text embedding methods cannot effectively preserve the smoothness of the topic probability generative structure of locally neighboring texts; to address this, Discriminative Locally Document Embedding (Disc-LDE) is proposed. The method first constructs a text neighbor graph from the text similarity measure. Then, subspaces are constructed by a transductive multi-agent random walk on the text graph. Finally, a pseudo-text is generated for each text using the LDA model of its subspace, and the training of the autoencoders (AEs) is regularized to jointly recover the input text and its pseudo-text (a sketch of this joint objective follows below). The regularized training builds a smooth affine mapping function for out-of-sample texts. Clustering and classification experiments with the text embeddings are performed on three text sets: 20 Newsgroups, RCV1, and Amazon reviews. Disc-LDE achieves a clustering NMI of nearly 71% and classification results of up to 83.91%. The results show that subspaces with a high overlap ratio can effectively preserve the smoothness of the probabilistic generative structure of locally neighboring texts and construct a smooth affine mapping function, which further improves the effect of text classification and clustering.
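A minimal sketch of the joint recovery objective in Disc-LDE, assuming PyTorch; generating the pseudo-texts (multi-agent random walk plus per-subspace LDA sampling) is taken as given here, and `mu`, `encoder`, and `decoder` are illustrative assumptions.

    # Sketch: regularized autoencoder that recovers both the input text and
    # its subspace-LDA pseudo-text from the same embedding.
    import torch
    import torch.nn.functional as F

    def disc_lde_loss(encoder, decoder, X, X_pseudo, mu=0.5):
        Z = encoder(X)          # text embeddings
        X_hat = decoder(Z)
        # Standard reconstruction of the input itself...
        loss_self = F.mse_loss(X_hat, X)
        # ...plus recovery of the pseudo-text drawn from the subspace LDA
        # model, which ties the embedding to the local topic structure.
        loss_pseudo = F.mse_loss(X_hat, X_pseudo)
        return (1 - mu) * loss_self + mu * loss_pseudo

Weighting the pseudo-text term by `mu` trades plain reconstruction against smoothness toward the local topic generative structure; the trained encoder then provides the smooth mapping applied to out-of-sample texts.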
Keywords/Search Tags:text representation, manifold learning, topic modeling, text embedding, autoencoder, text clustering, text classification