
Optimize SOM Algorithm To Apply In Text Clustering

Posted on: 2009-11-17 | Degree: Master | Type: Thesis
Country: China | Candidate: A X Sun | Full Text: PDF
GTID: 2178360272463229 | Subject: Computer application technology
Abstract/Summary:
With the rapid development of network technology and the rapid expansion of information, data mining and knowledge discovery technologies have emerged to extract useful information from this sea of data. Because text is the most important form in which information exists, text mining is correspondingly one of the most important fields of data mining, and clustering is one of its fundamental techniques. Research on text clustering has developed considerably in recent years. Because text is unstructured data, preprocessing must first transform it into a structured form before it can be clustered. This thesis therefore begins with a systematic introduction to text preprocessing techniques such as word segmentation, stemming, and dimensionality reduction. Clustering is the key technology in text clustering. Since the 1950s a variety of clustering algorithms have been invented, of which the SOM algorithm is a well-known one. The thesis then focuses on the SOM algorithm and makes two important improvements to it.

The SOM neural network is a kind of artificial neural network that simulates the signal-processing characteristics of the human brain. The basic idea of SOM clustering is to use network training to map similar input vectors to the same output node, which clusters the input vectors.

This thesis improves the SOM algorithm in two respects. The first takes into account the goal of text clustering, the minimum average deviation (also expressed as the average within-cluster similarity), and proposes an improved learning strategy. The improved algorithm introduces the equal deviation error theory into the learning process of the neural network: it guides learning by adjusting the deviation of each cluster so that the clustering result has the smallest average deviation.
This improved algorithm not only solves the problem of under-used and over-used neurons but also greatly enhances the quality of the text clustering results.

The second improvement addresses the long training time caused by random weight initialization. A hierarchical clustering method is used to detect data-dense regions, and the centers of the K detected regions are used to initialize the connection weights. Experiments show that the improved SOM reduces the training time of the network and is less prone to converging to a local optimum. In addition, to make the clustering results easy to present, several important keywords are selected to label each cluster, so that the content of the clusters can be understood correctly and the performance and efficiency of information processing can be improved.
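The initialization idea can be illustrated with a minimal sketch, assuming a one-dimensional SOM, toy 2-D vectors standing in for document vectors, and centroid-linkage agglomerative clustering as the hierarchical method. All function names, parameters, and data below are hypothetical and are not taken from the thesis. The update rule is the standard SOM rule, not the improved equal-deviation learning strategy, whose details the abstract does not give.

```python
import math
import random

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerative_centers(data, k):
    """Merge the two clusters with the closest centroids until k remain,
    then return the k centroids (assumed stand-in for the thesis's
    hierarchical detection of data-dense regions)."""
    clusters = [[p] for p in data]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist2(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]

def train_som(data, init_weights, epochs=20, lr0=0.3, radius0=0.5, seed=0):
    """Standard SOM training on a 1-D chain of nodes (node index = position).
    Learning rate and neighbourhood radius decay linearly per epoch."""
    rng = random.Random(seed)
    weights = [list(w) for w in init_weights]  # copy; leave the input intact
    samples = list(data)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)
        radius = max(radius0 * (1.0 - frac), 1e-3)
        rng.shuffle(samples)
        for x in samples:
            # Best-matching unit: the node whose weight vector is closest to x.
            bmu = min(range(len(weights)), key=lambda n: dist2(x, weights[n]))
            for n, w in enumerate(weights):
                # Gaussian neighbourhood around the BMU on the chain.
                h = math.exp(-((n - bmu) ** 2) / (2 * radius ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])
    return weights

def assign(data, weights):
    """Map every input vector to the index of its best-matching node."""
    return [min(range(len(weights)), key=lambda n: dist2(x, weights[n]))
            for x in data]

# Toy data: three well-separated groups of four 2-D points each.
offsets = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]
data = [[cx + dx, cy + dy]
        for cx, cy in [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
        for dx, dy in offsets]

centers = agglomerative_centers(data, k=3)  # initial weights, not random
weights = train_som(data, centers)
labels = assign(data, weights)
print(labels)
```

Starting the nodes at the detected centers means each node already sits inside a dense region, so training mainly refines the weights instead of first having to pull randomly placed nodes toward the data.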
Keywords/Search Tags: Text Clustering, Self-Organizing Feature Map, Equal Error, Weight Initialization, Label