
Improving Word Embeddings And Applying Them In Literature Style Recognition

Posted on: 2019-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: H Wang
GTID: 2428330566984187
Subject: Computer Science and Technology
Abstract/Summary:
Word embedding techniques represent words as low-dimensional continuous vectors. Among the techniques for training word embeddings, word2vec is one of the most notable: it is effective and easy to use. Although word2vec works well at producing semantically informative word embeddings, the quality of the resulting embeddings can still be improved. On the one hand, word2vec is limited in its ability to capture the inter-sentence structure information contained in the training corpus. On the other hand, it fails to learn much valuable information about the similarity between words. To address these two points, this thesis proposes two methods to retrofit word embeddings: a syntactic placeholder based method and a feedback information based method. To help word2vec capture more inter-sentence structure information, the thesis proposes a special structure called the “syntactic placeholder”. When syntactic placeholders are properly added to the training corpus, word2vec captures more inter-sentence structure information, which makes the resulting word embeddings more informative. The feedback information based method uses word similarity information generated by the model itself to improve the structural design of the hierarchical softmax, so the model can exploit word similarity information without relying on extra lexical resources or knowledge bases. Experimental results show that these retrofitting methods significantly improve the quality of the word embeddings, making them more semantically and syntactically informative.

Because word embeddings represent the semantic and syntactic information of language well, they are widely used to improve the performance of many downstream natural language processing tasks. In most cases, however, word embeddings serve only as input features for existing tasks. There has been little research on effective ways to make the most of word embeddings for solving practical problems. Therefore, this thesis proposes to analyze text style based on word embeddings. In particular, period style is investigated in depth and is represented, analyzed, and recognized based on word embeddings. Specifically, the thesis first systematically expounds the concept of period style. Then a structure called the “period style vector” is proposed as a precise definition of period style. Based on the period style vector, the difference in period style between texts completed in different periods can be quantified. Finally, a practical application scenario is presented: determining the completion period of literary works. The experiments demonstrate that the method can effectively determine the completion time of a literary work whose time of writing is unknown. Compared with traditional methods, the proposed method is more effective and easier to use.
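
The abstract does not give the actual definition of the period style vector, so the following is only an illustrative sketch of the general idea of quantifying period-style differences with word embeddings. It assumes, purely for illustration, that a style vector is the average of a text's word embeddings and that the difference between two styles is measured by cosine distance. The function names (`style_vector`, `nearest_period`) and the toy two-dimensional embeddings are hypothetical and are not taken from the thesis.

```python
import math


def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def style_vector(tokens, embeddings):
    """ASSUMPTION: a style vector is the mean embedding of the known tokens."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no token has an embedding")
    return mean_vector(known)


def nearest_period(tokens, period_vectors, embeddings):
    """Return the period whose style vector is closest to the text's vector."""
    target = style_vector(tokens, embeddings)
    return min(
        period_vectors,
        key=lambda period: cosine_distance(target, period_vectors[period]),
    )


# Toy, hand-made embeddings (hypothetical; real ones would come from word2vec).
embeddings = {
    "thou": [1.0, 0.0],
    "thee": [0.9, 0.1],
    "phone": [0.0, 1.0],
    "email": [0.1, 0.9],
}

# One reference token list per period, used to build the period style vectors.
period_texts = {
    "early": ["thou", "thee"],
    "modern": ["phone", "email"],
}
period_vectors = {
    period: style_vector(tokens, embeddings)
    for period, tokens in period_texts.items()
}

print(nearest_period(["thee", "thou", "phone"], period_vectors, embeddings))
```

In this toy setup a text is assigned to the period with the smallest cosine distance; the thesis may define and compare period style vectors differently.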
Keywords/Search Tags: Word Embeddings, Word2vec, Hierarchical Softmax, Text Style Analysis