Research On Word Sense Disambiguation Based On K-means Cluster And LSTM

Posted on: 2021-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: X S Zhou
GTID: 2428330605972932
Subject: Computer Science and Technology
Abstract/Summary:
Chinese contains many ambiguous words whose meanings differ with context. Word sense disambiguation (WSD) was proposed so that computers can handle such words in natural language processing (NLP): with the help of an algorithm, a computer should be able to interpret the context and automatically select the correct sense of an ambiguous word. WSD helps computers understand and use natural language accurately, and it is widely applied in machine translation, text classification, and other tasks. It remains an important problem in NLP that urgently needs to be solved.

This thesis presents a WSD method based on K-means clustering and long short-term memory (LSTM). Instances from an unlabeled corpus are assigned to classes by semi-supervised K-means clustering and then added to the training corpus to optimize the LSTM model, whose performance is evaluated on a test corpus. The work covers three aspects.

First, the current state of research at home and abroad is reviewed by analyzing the literature on WSD. The objective and significance of WSD are clarified, and its difficulties and likely future trends are summarized.

Second, the synonym thesaurus ("synonym word forest") and the corpora required for the experiments are introduced. Based on a study of feature engineering for WSD, the extraction procedures for clustering features and disambiguation features are defined. The disambiguation processes of the Bayesian classifier and the LSTM classifier are described in detail.

Finally, the procedure by which semi-supervised K-means clustering merges the unlabeled corpus is introduced. Several cluster centers are selected from the labeled corpus. An unlabeled instance is then taken and its distance to each cluster center is calculated. If its distance to some cluster center is less than a threshold, the instance is removed from the unlabeled corpus and placed in the class to which that center belongs. After the distances from every unlabeled instance to every cluster center have been calculated, the cluster centers in the labeled corpus are updated. This process is repeated until the cluster centers no longer change. Adding the clustered instances to the training corpus yields an extended training corpus, which is used to train the LSTM model. The resulting optimized LSTM classifier is then evaluated on the test corpus. Experimental results show that the proposed method achieves better disambiguation performance than an LSTM classifier alone, a deep belief network (DBN) classifier, and a Bayesian classifier.
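
The semi-supervised clustering procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation. Several details are assumptions that the abstract does not specify: the initial centers are taken to be the mean vector of each labeled class, distance is Euclidean, an instance joins only its nearest center when that distance is below the threshold, and the names `semi_supervised_kmeans`, `mean`, `threshold`, and `max_iter` are hypothetical.

```python
import math


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def semi_supervised_kmeans(labeled, unlabeled, threshold, max_iter=100):
    """Extend a labeled corpus with unlabeled vectors that lie near a class center.

    labeled:   dict mapping a sense label to a list of feature vectors
    unlabeled: list of feature vectors without a sense label
    Returns (extended, remaining): the enlarged labeled corpus and the
    vectors that never came within `threshold` of any center.
    """
    extended = {label: list(vecs) for label, vecs in labeled.items()}
    remaining = list(unlabeled)
    # Assumption: each initial center is the mean of one labeled class.
    centers = {label: mean(vecs) for label, vecs in extended.items()}

    for _ in range(max_iter):
        still_unlabeled = []
        for x in remaining:
            # Find the nearest current center (Euclidean distance, an assumption).
            label, dist = min(
                ((lab, math.dist(x, c)) for lab, c in centers.items()),
                key=lambda pair: pair[1],
            )
            if dist < threshold:
                extended[label].append(x)   # move into that center's class
            else:
                still_unlabeled.append(x)   # keep for a later pass
        remaining = still_unlabeled

        # Update the centers only after every unlabeled vector has been checked.
        new_centers = {label: mean(vecs) for label, vecs in extended.items()}
        if new_centers == centers:          # centers stopped changing
            break
        centers = new_centers

    return extended, remaining
```

In this sketch the returned `extended` dictionary plays the role of the extended training corpus that would then be passed to the LSTM classifier. Vectors left in `remaining` are simply not used for training.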
Keywords/Search Tags:word sense disambiguation, K-means cluster, LSTM classifier, disambiguation features