
Study On Text Clustering Based On Topic Sentence Vector Model

Posted on: 2014-06-14
Degree: Master
Type: Thesis
Country: China
Candidate: L Wang
GTID: 2268330401985829
Subject: Computer software and theory
Abstract/Summary:
Text clustering is one of the most important research branches in data mining. With the rapid development of the Internet, the volume of textual information keeps growing, and text clustering has become an important technique for organizing, summarizing, and navigating text collections effectively. It is attracting attention from a growing number of researchers.

The objects of text clustering are unstructured, so their representations are high-dimensional and sparse. Similarity measures based on the vector space model treat a text as a set of independent words and ignore semantic and structural information, so they cannot calculate the similarity between texts accurately. Traditional clustering algorithms also have shortcomings when applied to such objects, which leads to unsatisfactory text clustering results.

To address these problems, this thesis first selects feature items from the word-segmented text through word filtering and term frequency-inverse document frequency (TF-IDF) weighting, which reduces dimensionality. It then identifies topic sentences in each text according to the feature items, calculates the weight of each sentence, and represents the text with a topic sentence vector model. Finally, it calculates the similarity between texts based on the semantic relations in How-Net.

For the clustering algorithm, the thesis proposes a method that determines the number of clusters and the initial cluster centers. Aimed mainly at the shortcomings of the k-medoids algorithm, it determines the optimal number of clusters through a rule of thumb and a similarity-diversity function, and obtains the initial cluster centers with an agglomerative hierarchical clustering algorithm.

A Chinese text clustering system was designed and implemented, and a clustering experiment on a real corpus was run with it. The experimental results show that the proposed algorithm is feasible and effective.
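
The clustering step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say which linkage criterion is used, so average linkage is assumed; the number of clusters k is passed in directly rather than derived from the rule of thumb and similarity-diversity function; and the distance matrix is built from toy integers instead of How-Net based similarities (a distance such as 1 - similarity would be one possible substitute). The function names `medoid`, `agglomerative_seeds`, and `k_medoids` are hypothetical.

```python
def medoid(members, dist):
    """Return the member with the smallest total distance to the others."""
    return min(members, key=lambda m: sum(dist[m][j] for j in members))


def agglomerative_seeds(dist, k):
    """Merge the two closest clusters (average linkage, an assumption)
    until k clusters remain, then return the medoid of each cluster."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    return [medoid(c, dist) for c in clusters]


def k_medoids(dist, k, max_iter=100):
    """k-medoids whose initial medoids come from agglomerative clustering."""
    medoids = agglomerative_seeds(dist, k)
    for _ in range(max_iter):
        # Assignment step: attach every object to its nearest medoid.
        groups = {m: [] for m in medoids}
        for i in range(len(dist)):
            nearest = min(medoids, key=lambda m: dist[i][m])
            groups[nearest].append(i)
        # Update step: recompute each medoid; keep the old one if a group is empty.
        updated = [medoid(g, dist) if g else m for m, g in groups.items()]
        if sorted(updated) == sorted(medoids):
            break
        medoids = updated
    labels = [min(range(k), key=lambda c: dist[i][medoids[c]])
              for i in range(len(dist))]
    return medoids, labels


# Toy data: six "documents" placed on a line, standing in for real texts.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)  # [1, 4] [0, 0, 0, 1, 1, 1]
```

Seeding from the hierarchical result makes the run deterministic, whereas plain k-medoids depends on a random choice of initial centers.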
Keywords/Search Tags: text clustering, How-Net, k-medoids, similarity-diversity function