Research On Thesis Text Clustering Based On Semantic Similarity

Posted on: 2010-10-06  Degree: Master  Type: Thesis
Country: China  Candidate: R Yin  Full Text: PDF
GTID: 2178360275458277  Subject: Computer application technology
Abstract/Summary:
With the number of theses available online growing steadily, retrieving the theses a user needs quickly and efficiently is a difficult problem for thesis retrieval. The method in common use today is keyword matching. It answers queries quickly, but it does not address the problems caused by synonymy, polysemy, and hypernym/hyponym (broader and narrower concept) relations, so retrieval quality often falls short of expectations. Applying text clustering as a further processing step on the retrieval results can help: the results are divided into clusters by subject, redundant items are removed, and the user is given clear guidance. This makes it much easier for users to find the relevant theses and so improves the quality of retrieval.

In this thesis we improve a text clustering algorithm based on semantic similarity (the TCUSS algorithm) and apply it to clustering thesis texts. The improvements concern feature selection and cluster description, and they adapt the mathematical representation of text and the clustering procedure of TCUSS to thesis texts. For feature selection, because keywords alone cannot adequately express the theme of a text, we incorporate the WordNet semantic dictionary and extract features according to the concepts that the keywords express. Using the synonym sets (synsets) in WordNet and comparing the similarity between feature words addresses the synonym and polysemy problems. For text representation, each text is represented as a concept list. For similarity calculation, we replace each keyword with the synset that contains it, compute the semantic distance between synsets in WordNet, and derive word similarity from that distance; text similarity is then obtained from the similarity between feature words. For clustering, we use a text clustering algorithm based on semantic similarity that clusters texts by graph analysis, so the result does not depend on the shape of the clusters. To describe each cluster, feature words are weighted by their frequency in the cluster and by the information they carry in WordNet, and a suitable subset of the feature words is selected as the description. Finally, we designed a thesis text clustering system based on semantic similarity to test the validity of the algorithm. We compared the algorithm of this thesis, the TCUSS algorithm, and the K-Means algorithm on a self-built thesis data set. The results show that the proposed algorithm improves the correctness of text clustering and has a certain degree of practical usability.
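
The similarity and clustering steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the actual formulas, so everything concrete here is an assumption made for illustration: a tiny hand-written taxonomy (`SYNSET_OF`, `PARENT`) stands in for WordNet, word similarity is taken as 1 / (1 + distance), text similarity is a symmetric average of best matches between concept lists, and the graph analysis is reduced to connected components of a thresholded similarity graph. It is not the TCUSS algorithm or the thesis's improved version.

```python
from collections import deque

# Hypothetical toy taxonomy standing in for WordNet (not real WordNet data).
# Words that share a synset are treated as synonyms.
SYNSET_OF = {
    "car": "car.n", "automobile": "car.n",
    "engine": "engine.n", "motor": "engine.n",
    "cluster": "cluster.n", "algorithm": "algorithm.n", "graph": "graph.n",
}
# Child synset -> parent (hypernym) synset.
PARENT = {
    "car.n": "vehicle.n", "vehicle.n": "artifact.n", "engine.n": "artifact.n",
    "artifact.n": "entity.n",
    "cluster.n": "abstraction.n", "algorithm.n": "abstraction.n",
    "graph.n": "abstraction.n", "abstraction.n": "entity.n",
}

def ancestors(synset):
    """Map each ancestor (including the synset itself) to its distance in links."""
    out, depth = {}, 0
    while synset is not None:
        out[synset] = depth
        synset = PARENT.get(synset)
        depth += 1
    return out

def semantic_distance(s1, s2):
    """Shortest path length between two synsets through a common ancestor."""
    a1, a2 = ancestors(s1), ancestors(s2)
    common = set(a1) & set(a2)
    if not common:
        return None
    return min(a1[c] + a2[c] for c in common)

def word_similarity(w1, w2):
    """Assumed formula: 1 / (1 + distance); unknown words match only themselves."""
    s1, s2 = SYNSET_OF.get(w1), SYNSET_OF.get(w2)
    if s1 is None or s2 is None:
        return 1.0 if w1 == w2 else 0.0
    d = semantic_distance(s1, s2)
    return 0.0 if d is None else 1.0 / (1.0 + d)

def text_similarity(concepts_a, concepts_b):
    """Assumed formula: symmetric average of each concept's best match."""
    if not concepts_a or not concepts_b:
        return 0.0
    def directed(src, dst):
        return sum(max(word_similarity(x, y) for y in dst) for x in src) / len(src)
    return (directed(concepts_a, concepts_b) + directed(concepts_b, concepts_a)) / 2

def cluster(texts, threshold):
    """Link texts whose similarity reaches the threshold; return connected components."""
    n = len(texts)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if text_similarity(texts[i], texts[j]) >= threshold:
                adj[i].append(j)
                adj[j].append(i)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(sorted(component))
    return clusters

# Each text is represented as a concept list, as in the abstract.
docs = [
    ["car", "engine"],
    ["automobile", "motor"],
    ["graph", "algorithm"],
    ["cluster", "algorithm"],
]
print(cluster(docs, threshold=0.5))  # [[0, 1], [2, 3]]
```

In this toy data the first two texts share no keywords, yet they end up in the same cluster because their keywords map to the same synsets, which is the synonym problem that plain keyword matching does not handle. Connected components make no assumption about cluster shape, which is one simple way to read the abstract's remark about graph-based clustering; the threshold value 0.5 is arbitrary.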
Keywords/Search Tags: Text Clustering, Semantic Similarity, Concept List, Theses Retrieval