
Research On Text Clustering Based On Semantic Similarity

Posted on: 2008-12-12
Degree: Master
Type: Thesis
Country: China
Candidate: S Sun
GTID: 2178360215997642
Subject: Computer applications
Abstract/Summary:
Text document clustering plays an important role in text mining and information retrieval systems. It can improve query results, provide intuitive navigation and browsing mechanisms, and find similar texts. In text clustering applications, a text or document is usually represented with the Vector Space Model. This representation is very simple, but it raises one severe problem: the high dimensionality of the feature space and the inherent sparsity of the data. In addition, it cannot handle polysemy or synonymy in text data. These problems interfere greatly with classification and clustering and sharply degrade their performance. The main techniques used to address them are weight adjustment and dimensionality reduction, but both have their own defects. Weight adjustment does not solve the problems effectively, so it improves clustering quality only slightly. Dimensionality reduction does reduce the dimensionality, but at a high cost. Moreover, although many clustering algorithms exist, they address neither high dimensionality nor the need for understandable cluster descriptions. To solve these problems, this thesis proposes a new method for text clustering based on semantic similarity: TCUSS (Text Clustering Using Semantic Similarity). The method represents each text as a concept list. This representation not only reduces the feature dimension but is also convenient for calculating semantic similarity. TCUSS uses WordNet to calculate the semantic similarity between the concepts in two concept lists. Semantic similarity addresses the polysemy and synonymy problems and also reflects the content similarity between two texts. TCUSS clusters texts using graph analysis, so the result does not depend on the shape of the clusters. Experimental results show that TCUSS improves the accuracy of text clustering.
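The pipeline outlined in the abstract (concept lists, concept-level semantic similarity, graph-based clustering) can be illustrated with a minimal sketch. It is not the TCUSS implementation, whose details the abstract does not give. Everything named here is an assumption made for illustration: the toy `PARENT` taxonomy stands in for WordNet, the path-based `concept_similarity` measure, the best-match averaging in `text_similarity`, and the threshold-plus-connected-components step in `cluster` are all hypothetical choices.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy is-a taxonomy (child -> parent), standing in for WordNet.
PARENT = {
    "dog": "mammal", "cat": "mammal", "mammal": "animal", "animal": "entity",
    "car": "vehicle", "truck": "vehicle", "vehicle": "artifact",
    "artifact": "entity",
}


def path_to_root(concept):
    """Return the chain of concepts from `concept` up to the taxonomy root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def concept_similarity(a, b):
    """Assumed measure: 1 / (1 + length of the is-a path between a and b)."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    steps_in_b = {c: i for i, c in enumerate(path_b)}
    for steps_in_a, c in enumerate(path_a):
        if c in steps_in_b:  # lowest common ancestor
            return 1.0 / (1 + steps_in_a + steps_in_b[c])
    return 0.0  # no common ancestor


def text_similarity(list_a, list_b):
    """Assumed aggregation: symmetric mean of best-match concept similarities.

    Both concept lists must be non-empty.
    """
    def directed(xs, ys):
        return sum(max(concept_similarity(x, y) for y in ys) for x in xs) / len(xs)
    return 0.5 * (directed(list_a, list_b) + directed(list_b, list_a))


def cluster(texts, threshold):
    """Link texts whose similarity reaches `threshold`, then return the
    connected components of that graph as clusters."""
    names = list(texts)
    adjacent = {n: set() for n in names}
    for a, b in combinations(names, 2):
        if text_similarity(texts[a], texts[b]) >= threshold:
            adjacent[a].add(b)
            adjacent[b].add(a)
    seen, clusters = set(), []
    for start in names:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:  # breadth-first search over one component
            current = queue.popleft()
            component.append(current)
            for neighbour in sorted(adjacent[current]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(sorted(component))
    return clusters


# Each text is represented by a short concept list (hypothetical data).
texts = {
    "d1": ["dog", "cat"],
    "d2": ["cat", "mammal"],
    "d3": ["car", "truck"],
    "d4": ["truck", "vehicle"],
}
print(cluster(texts, threshold=0.5))
```

In this toy setup, d1 and d2 share only the word "cat", yet "dog" and "mammal" still contribute to their similarity through the taxonomy; a plain word-overlap comparison in the Vector Space Model would not capture that. Because clusters are taken as connected components of the similarity graph, no assumption is made about their shape.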
Keywords/Search Tags: text clustering, semantic similarity, text representation, clustering algorithm, semantic network