
Research On Semantic Text Clustering Algorithm

Posted on: 2018-11-19 | Degree: Master | Type: Thesis
Country: China | Candidate: Q Q Ma | Full Text: PDF
GTID: 2348330512996735 | Subject: Software engineering
Abstract/Summary:
With the rapid development of information technology, the volume of network data is growing exponentially. How to retrieve target information from the network quickly and accurately has become an important problem. Text clustering, a text mining technique that draws on data mining, machine learning, natural language processing, and other fields, has been introduced to address it. The vector space model (VSM) is widely used in text clustering research because of its simplicity and efficiency in text representation. However, the traditional VSM, which directly takes the words in a text as its features, does not take into account the semantic relations among different words, so its text representations are not accurate enough.

To overcome this problem, word sense disambiguation approaches have been proposed that recognize polysemous words and synonyms by mapping the words in a text to their corresponding synsets in WordNet. Our analysis of these methods finds problems in their disambiguation strategies. We therefore propose a word sense disambiguation algorithm based on continuous word vectors (WSD-CWV), which improves disambiguation accuracy by using a neural network language model to measure the semantic similarity between synsets and contexts. We implement a text clustering algorithm by applying WSD-CWV to text clustering.

WordNet contains a large amount of semantic information organized in a structured form, and several WordNet-based text representation approaches have been proposed to improve text representation in text clustering. However, because the semantics of textual data are complex and diverse, and WordNet contains more than one hundred thousand synsets, the dimensionality of a WordNet-based text representation can be extremely high. To solve this problem, we propose a dimensionality reduction algorithm based on synset clusters (DR-SC), which extracts coarse-grained features from text by clustering synsets, thereby reducing the dimensionality of the text representation. The most difficult and crucial part of DR-SC is obtaining semantic representations of synsets for synset clustering. Building on the efficiency of neural network language models in semantic feature extraction, we encode the gloss relations among synsets in WordNet into a synset-based corpus and use a neural network language model to learn synset vector representations from the co-occurrence of synsets in this corpus.

We implement a text clustering algorithm based on WSD-CWV and DR-SC that improves the accuracy of text clustering while reducing the computational complexity of the clustering algorithm. We evaluate the proposed text clustering algorithms experimentally, and the results show that our approaches significantly improve text clustering performance compared with several classic methods.
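
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the toy 2-dimensional word vectors, the sense inventory, the synset-to-cluster table, and the function names `disambiguate` and `coarse_features` are all hypothetical, and cosine similarity between averaged vectors is an assumed similarity measure. In the thesis the vectors would be learned by a neural network language model and the synsets would come from WordNet.

```python
import math
from collections import Counter

# Hypothetical toy word vectors (a real system would learn these with an NNLM).
WORD_VECS = {
    "money": [1.0, 0.0], "deposit": [0.9, 0.1], "loan": [0.8, 0.2],
    "river": [0.0, 1.0], "water": [0.1, 0.9], "shore": [0.2, 0.8],
}

# Hypothetical sense inventory: candidate synsets with their gloss words
# (stands in for WordNet synsets and glosses).
SENSES = {
    "bank": {
        "bank.financial": ["money", "deposit", "loan"],
        "bank.riverside": ["river", "water", "shore"],
    }
}

# Hypothetical synset -> cluster table (stands in for the result of
# clustering learned synset vectors).
SYNSET_CLUSTER = {"bank.financial": "finance", "bank.riverside": "nature"}


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(words, dim=2):
    """Average the vectors of the known words; zero vector if none are known."""
    vecs = [WORD_VECS[w] for w in words if w in WORD_VECS]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def disambiguate(word, context):
    """Pick the candidate synset whose gloss vector is closest to the context."""
    ctx = mean_vector(context)
    return max(SENSES[word], key=lambda s: cosine(ctx, mean_vector(SENSES[word][s])))


def coarse_features(tokens):
    """Count synset clusters instead of words, giving a low-dimensional text vector."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in SENSES:
            synset = disambiguate(tok, tokens[:i] + tokens[i + 1:])
            counts[SYNSET_CLUSTER[synset]] += 1
    return counts


print(disambiguate("bank", ["deposit", "money"]))
print(coarse_features(["river", "water", "bank"]))
```

The point of the sketch is the shape of the pipeline: an ambiguous word is first mapped to one synset using its context, and the text is then represented by counts over synset clusters rather than over individual words or synsets, which keeps the feature space small.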
Keywords/Search Tags: text clustering, continuous word vectors, synset clusters, NNLM, WordNet