Font Size: a A A

Research And Application Of K-Means Algorithm Based On Concept Lattice

Posted on:2011-04-12Degree:MasterType:Thesis
Country:ChinaCandidate:Y X LiFull Text:PDF
GTID:2178360302499231Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Along with the rapid development of Internet, search engine becomes a major channel of accessing to information. However, one search result of search engines will be thousands and all of them are mixed together, it will be needle in a haystack to find the information they want for users. One of the effective ways to improve the quality of search engine is that, to cluster the similar web texts in the web search results to a class using text clustering technique. Clustering of web search results can provide a easy-to-navigate information navigation for user, and assist users to locate the subject category which meet their needs rapidly, thereby the efficiency of search engine retrieval is enhanced.Clustering is the process of grouping objects according to the similarity in the case of that classes are unknown. Vector space model are used by most text representations and similarities are calculated based on it. Vector space model calculates the weight using TFIDF (Term Frequency Inverse-Document-Frequency). It has the advantage of reflecting the importance of keywords to the text, but this representation model has brought two major problems:(1) the dimension of feature vector is too high; (2) the text vector space is regarded as a vector space which consist of orthogonal term vectors, assuming that there is no semantic links between words, but the reality is that the words of texts is often semanticly associated, therefore the reliability of the results have been affected.Concept lattice is a ordered set of concepts, the process of establishing concept lattice is that of clustering concepts. In the concept lattice, the extension of a concept is a set of objects which belong to the concept, and the content is the set of attributes all these objects sharing. For the given formal context, the concept lattice constructed must existed and be unique.K-Means algorithm is the most widely used clustering algorithm based on partition, to deal effectively with large text set. This article proposes K-MeansBCC(K-Means Algorithm Based on Concept Lattice) combining concept lattice and K-Means algorithm. First, Generate concept lattice on the basis of texts as objects and tokens in the texts as attributes; second, extracte the formal concepts in the concept lattice, represent texts using these concepts and definite the similarity function between concepts; finally, cluster the texts using K-Means algorithm. Representing texts using concepts can reduce the dimension of feature words and improve the clustering performance. Aiming at the problem of the existing K-Means algorithm, such as artificial assignation of number of final clustering and random selection of initial centers, we propose a method based on the density. We applied the K-MeansBCC algorithm in the clustering module of "Haisou", and compared it with K-Means algorithm, the experimental results show that, K-MeansBCC algorithm is obviously reasonable and effective.
Keywords/Search Tags:Concept lattice, Text Cluster, Text Representation, K-Means Algorithm, Concept Similarity
PDF Full Text Request
Related items