
Document Clustering Based On The Semantic Network Of Forestry Thesaurus

Posted on: 2011-06-15
Degree: Master
Type: Thesis
Country: China
Candidate: L R Li
Full Text: PDF
GTID: 2178360305464321
Subject: Computer application technology
Abstract/Summary:
From the perspective of ontological semantics, this paper attempts to improve the measurement of document similarity by using the semantic knowledge encoded in an ontology. The work combines ontological semantics with document clustering in order to improve clustering quality. For this purpose, a thesaurus-based document clustering method is proposed, where the "thesaurus" is treated as a kind of ontology.

First, the features of the documents are extracted and collected with the help of the thesaurus, and the processed documents are represented by TF-IDF (Term Frequency-Inverse Document Frequency) weights. Next, the similarity between terms is calculated from the semantic relations among the terms. The similarity between documents is then obtained from the TF-IDF weights and the term relationships. Finally, the documents are clustered with the K-means algorithm.

The key technologies related to document clustering are studied and discussed, including the Vector Space Model, feature extraction and collection, the calculation of term similarity, and the calculation of document similarity.

The experiments are designed and implemented using data from the Chinese-English Forestry Thesaurus and the Chinese Forestry Literature Database, and the results are compared with those of a clustering method that does not use the thesaurus. The experimental results show that clustering with the thesaurus is clearly improved compared with clustering without it.
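The similarity steps described in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's actual formulas, so everything specific here is an assumption: the `RELATED` table and its weights are hypothetical stand-ins for thesaurus relations, the IDF smoothing is one common choice, and the document similarity uses a soft cosine measure as a plausible way to combine TF-IDF weights with term similarity. The K-means step is left out.

```python
import math
from collections import Counter

# Hypothetical thesaurus relations with illustrative weights; the thesis
# abstract does not state its actual relation weights.
RELATED = {
    frozenset(("pine", "conifer")): 0.8,    # assumed broader/narrower pair
    frozenset(("spruce", "conifer")): 0.8,  # assumed broader/narrower pair
    frozenset(("pine", "spruce")): 0.5,     # assumed sibling terms
}


def term_similarity(a, b):
    """1.0 for identical terms, the assumed weight for related terms, else 0.0."""
    if a == b:
        return 1.0
    return RELATED.get(frozenset((a, b)), 0.0)


def tf_idf(docs):
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Smoothed IDF (an assumption) keeps every weight positive.
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def soft_dot(u, v):
    """Dot product in which every pair of terms contributes by its similarity."""
    return sum(
        wu * wv * term_similarity(tu, tv)
        for tu, wu in u.items()
        for tv, wv in v.items()
    )


def document_similarity(u, v):
    """Soft cosine similarity between two sparse TF-IDF vectors."""
    norm = math.sqrt(soft_dot(u, u)) * math.sqrt(soft_dot(v, v))
    return soft_dot(u, v) / norm if norm else 0.0


docs = [
    ["pine", "plantation"],
    ["conifer", "stand"],
    ["soil", "erosion"],
]
vectors = tf_idf(docs)
print(document_similarity(vectors[0], vectors[1]))  # related via pine/conifer
print(document_similarity(vectors[0], vectors[2]))  # no shared or related terms
```

With a plain cosine measure the first two documents would score zero because they share no literal term; the term-similarity table is what lets the related pair contribute. A clustering step would then group documents using this similarity in place of a purely lexical one.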
Keywords/Search Tags:Thesaurus, document clustering, similarity, forestry