Word Clustering Based On Self-Organizing Map

Posted on: 2005-11-30
Degree: Master
Type: Thesis
Country: China
Candidate: T Chen
Full Text: PDF
GTID: 2168360152967686
Subject: Computer application technology
Abstract/Summary:
Word classification is a major and pivotal problem in linguistics and natural language processing. There are two ways to classify words semantically: the first relies on linguists' subjective judgment; the second is automatic clustering. The thesis focuses on the latter.

Based on a large-scale corpus, the thesis applies the Self-Organizing Map (SOM) to unsupervised clustering of Chinese words. By introducing perplexity, a concept from language modeling, it proposes a perplexity-based evaluation model for automatic word clustering. The procedure is as follows: first, the words appearing in the context windows of the words to be clustered are extracted from the corpus; second, features are selected by Information Gain (IG) to form feature vectors; third, the features within each vector are weighted using the well-known TFIDF criterion; fourth, the weighted feature vectors are fed into a SOM, which, after many learning iterations, maps the words to different points of the output grid according to their semantics; finally, C-means and a Genetic C-means Algorithm, which combines the Genetic Algorithm with the C-means Algorithm, are used to perform the clustering.

Using 85 manually selected words as a reference, the thesis discusses the factors that affect clustering performance, including the size of the context window, the dimensionality of the feature vectors, the learning rate, and the side length of the SOM output grid. Furthermore, using 4638 high-frequency words extracted from the corpus, it runs automatic word clustering experiments with different side lengths of the SOM output grid. Experimental results show that the clustering perplexity falls significantly, from 1005.72 (random classification) to 247.37 (after 1000 learning iterations); the results for C-means and Genetic C-means are 353.68 and 337.27, respectively.
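The weighting and SOM steps described above can be illustrated with a minimal pure-Python sketch. It is not the thesis's implementation: the function names (`tfidf_weight`, `best_matching_unit`, `train_som`), the toy count matrix, the linear decay schedules, and the Gaussian neighbourhood are all assumptions made for illustration. Here TFIDF treats each target word's context as a "document". IG feature selection, C-means, Genetic C-means, and the perplexity evaluation are omitted.

```python
import math
import random

def tfidf_weight(counts):
    """Weight raw context-feature counts (one row per target word) by TF-IDF."""
    n_rows = len(counts)
    n_feats = len(counts[0])
    # Document frequency: number of target words whose context contains feature j.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_feats)]
    weighted = []
    for row in counts:
        total = sum(row) or 1  # avoid division by zero for empty rows
        weighted.append([
            (row[j] / total) * math.log(n_rows / df[j]) if df[j] else 0.0
            for j in range(n_feats)
        ])
    return weighted

def best_matching_unit(weights, x):
    """Return (row, col) of the grid node whose weight vector is closest to x."""
    best, best_d = None, float("inf")
    for i, grid_row in enumerate(weights):
        for j, w in enumerate(grid_row):
            d = sum((a - b) ** 2 for a, b in zip(w, x))
            if d < best_d:
                best, best_d = (i, j), d
    return best

def train_som(data, side, n_iter=1000, lr0=0.5, seed=0):
    """Train a side x side SOM with linearly decaying learning rate and radius."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(side)]
               for _ in range(side)]
    sigma0 = max(side / 2.0, 1.0)
    for t in range(n_iter):
        decay = 1.0 - t / n_iter
        lr = lr0 * decay
        sigma = max(sigma0 * decay, 0.5)  # keep the neighbourhood from vanishing
        x = rng.choice(data)
        bi, bj = best_matching_unit(weights, x)
        for i in range(side):
            for j in range(side):
                # Gaussian neighbourhood centred on the best-matching unit.
                d2 = (i - bi) ** 2 + (j - bj) ** 2
                h = math.exp(-d2 / (2 * sigma ** 2))
                w = weights[i][j]
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights

# Toy data (hypothetical): 3 target words, 3 context features.
counts = [[3, 0, 1], [3, 0, 1], [0, 4, 1]]
vectors = tfidf_weight(counts)
som = train_som(vectors, side=2, n_iter=200)
positions = [best_matching_unit(som, v) for v in vectors]
```

In this toy run, `positions` gives the output-grid node assigned to each word. Words with identical context vectors necessarily land on the same node, and a feature that occurs in every word's context receives zero TFIDF weight. The `side` and `lr0` parameters correspond to the grid side length and learning rate that the abstract names as factors affecting performance.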
Keywords/Search Tags:Clustering, Self-Organizing Map, Genetic C-means Algorithm, Information Gain, Perplexity