
The Study Of Diachronic Analysis Based On Word Embeddings And Evaluation

Posted on: 2021-05-29 | Degree: Master | Type: Thesis
Country: China | Candidate: X F Xu | Full Text: PDF
GTID: 2428330611964283 | Subject: Computer application technology
Abstract/Summary:
With the vigorous development of information technology, the digitization of traditional documents is steadily advancing. At the same time, the explosive growth of social networks and news media in recent years has produced a large amount of cross-time data. How to mine useful information from these cross-time data has become a focus of current academic and industrial research. The wide application and rapid development of deep learning have made it possible to mine such data, thanks in particular to the strong representation learning ability of deep learning on text, which underlies almost all deep-learning-based work in natural language processing. As the cornerstone of natural language processing tasks, word representation learning (word embedding) has evolved from traditional statistics-based methods, such as simple counts and the bag-of-words model, to deep-learning-based methods that learn the co-occurrence relationships of words and the word-order information in sentences.

However, existing temporal representation learning methods still have problems. For example, some studies apply these models to each time slice separately and then align the learned representations across time slices with an alignment algorithm. The quality of the resulting evolution analysis then depends not only on the representation learning algorithm itself but also on the alignment algorithm, and existing alignment algorithms are prone to over-alignment.

This thesis focuses on improving cross-time representation learning from the perspective of alignment, combining deep-learning-based algorithms with our proposed framework to achieve alignment-free word vector learning. We first introduce an alignment-free idea that distinguishes the input words by time slice in the low-dimensional representation learning algorithm. Then, when learning the hidden-layer word representations, the representations of all words across time slices are trained jointly, so that they reflect both the characteristics of the same word in different periods and the cross-temporal correlations between different words. Applying this idea to existing word representation learning algorithms yields two models: Tagged-SVD and Tagged-SGNS. For the task of diachronic word analysis, we also propose an evaluation method that reflects the smoothness of a cross-time word representation learning model. In addition, after further analyzing the characteristics of these models, we propose a word representation learning model named sentence-based word embedding (SWE), which improves existing cross-time word representation learning algorithms in two respects. First, it takes the whole sentence as the input of word representation learning, so as to learn relationships between distant words. Second, it mines the context deeply, incorporating cascaded contextual features into the model.

The main research results of this thesis are as follows:

(1) SGNS has shown promise in diachronic analysis by embedding words as vectors (word embeddings) in low-dimensional dense vector spaces for different time periods; probing the evolution of a word over time is then transformed into measuring the distance between its embeddings across time. As a prerequisite of distance measurement, the vector spaces must be aligned properly. In recent years, various alignment methods have been proposed under the assumption that most words remain unchanged over time. However, none of them ensures that the alignment is smooth, i.e., if a word has similar co-occurrence statistics over time, its embeddings should be similar; otherwise, they should be dissimilar. We propose Tagged-SGNS (TSGNS), which guarantees a smooth alignment of the vector spaces of different time periods and thereby enhances diachronic analysis. Beyond this analysis, we evaluated TSGNS on a 105 GB dataset from the Google Books N-gram corpus; the test results show the unique advantage of our method over the current state of the art.

(2) It is well recognized that word semantics are implicit in co-occurrence relationships: based on co-occurrence statistics in context, words can be embedded into feature spaces where the semantic similarity between words can be approximated. One research field aims to capture the semantic shift of words over time, and various methods have been proposed in the past decade. However, existing studies neglect two fundamental issues. First, word embeddings are based on the co-occurrence statistics of context words within a fixed window in the sentences of the corpus, while the rich information of the whole sentence is neglected. Second, existing studies capture the semantic shift of a word from the change of its context words over time but disregard changes in the deep context, i.e., the contexts of the context words may themselves change and thus imply semantic shift of the context words over time. To fill this gap, this thesis proposes sentence-based word embedding (SWE), tackling challenges that include the varying lengths of sentences and the arbitrary positions of target words within them, and develops a similarity metric to probe the deep context. The unique advantages over the state of the art in this research topic have been verified by extensive tests on large corpora under various settings.

(3) Studies applying machine-learning-supported diachronic word analysis to Chinese are still few. Compared with English, diachronic word analysis in Chinese is further affected by word segmentation in Chinese natural language processing. We use a large amount of data provided by Sogou, a Chinese Internet search engine provider. After preprocessing, a cross-time Chinese corpus is obtained, on which we train and compare the three word representation learning methods above. Finally, we build a system based on the resulting models for Chinese diachronic word analysis, which displays online the semantically similar words (neighbor words) of a queried word in different periods. By distinguishing the neighbor words of different periods, we can infer the direction of semantic change of the queried word between those periods.
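The alignment-free idea described above, distinguishing input words by time slice while letting them share one context space, can be sketched in plain Python. This is an illustrative toy, not the thesis's actual Tagged-SGNS implementation: it uses raw co-occurrence count vectors rather than learned SGNS embeddings, and the `period#word` tagging scheme is an assumption introduced here for illustration.

```python
import math
from collections import Counter, defaultdict

def tag_tokens(sentences, slice_label):
    # Prefix every token with its time slice, so the same surface word
    # in different periods becomes a distinct vocabulary entry.
    return [[f"{slice_label}#{w}" for w in s] for s in sentences]

def cooc_vectors(sentences, window=2):
    # Count context words in a fixed window around each tagged token.
    vecs = defaultdict(Counter)
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    # Context words keep their bare (untagged) form, so
                    # vectors from different slices live in one shared
                    # feature space and need no post-hoc alignment.
                    vecs[w][s[j].split("#", 1)[1]] += 1
    return vecs

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Comparing `cosine(vecs["t1#w"], vecs["t2#w"])` directly across periods is what the tagging buys: a word whose co-occurrence statistics stay stable scores high, while a word whose contexts change scores low, which is also the intuition behind the smoothness criterion mentioned in result (1).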
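The query system described in result (3) amounts to a per-period nearest-neighbor lookup over the trained vectors. A minimal sketch, assuming a hypothetical `period#word` key scheme and sparse vectors stored as plain dicts (the real system would query trained embedding vectors instead):

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def neighbors(vecs, query, period, k=3):
    # Rank the words of one period by similarity to the query word in
    # that same period; comparing the lists returned for two periods
    # hints at the direction of the query word's semantic change.
    key = f"{period}#{query}"
    q = vecs[key]
    scored = [(cosine(q, v), w) for w, v in vecs.items()
              if w.startswith(f"{period}#") and w != key]
    return [w.split("#", 1)[1] for _, w in sorted(scored, reverse=True)[:k]]
```

If "apple" neighbors "pear" in an early period but "phone" in a later one, the shift toward the technology sense is visible from the neighbor lists alone, which is how the online system lets users speculate about semantic change.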
Keywords/Search Tags:Deep Learning, Word Representation Learning, Word Embedding, Diachronic Analysis