Research On SOFM Text Clustering Algorithm

Posted on: 2018-07-23
Degree: Master
Type: Thesis
Country: China
Candidate: L J Tan
GTID: 2348330542991457
Subject: Applied Mathematics
Abstract/Summary:
In recent years, with the rapid development and popularization of network and database technology, people can quickly and easily obtain and store large amounts of data, and eighty percent of that data is text. How to obtain useful information accurately and quickly from large-scale text data has therefore become an urgent problem. Data mining technology has emerged in response, and text clustering, one of its important branches, has become a research hotspot in recent years. Because text data is unstructured, it must be transformed into a structured form before clustering through a series of preprocessing steps: word segmentation, stop-word removal, feature word selection, weight calculation, and representation in a mathematical model.

This thesis applies the traditional self-organizing feature map (SOFM) neural network algorithm to text clustering and proposes two improvements that make it more suitable for large-scale text data. The first addresses the random selection of initial connection weights in the traditional SOFM algorithm, which may lead to a training result in which all samples are grouped together. We propose a SOFM text clustering algorithm based on improved initial connection weights. The proposed selection method places the initial connection weights close to the input patterns of the text data, which improves the accuracy of the clustering results and, at the same time, speeds up convergence.

The second improvement addresses the data sparsity and the curse of dimensionality caused by the high dimensionality of text data represented in the vector space model. We propose a SOFM text clustering algorithm based on principal component analysis (PCA). Compared with feature selection methods, the proposed algorithm reduces dimensionality while retaining an appropriate number of useful feature words, so that important information is not lost. Comparative simulation experiments show that the algorithm further improves clustering accuracy and speeds up clustering.
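The two ideas summarized above can be illustrated with a minimal sketch. The abstract does not give the thesis's rule for selecting initial connection weights, so the code assumes one simple stand-in: copying randomly chosen input vectors, which keeps the initial weights close to the input patterns. The function names (`pca_reduce`, `init_weights_from_samples`, `train_sofm`, `assign_clusters`), the 1-D map topology, the linear decay schedules, and the random toy data are all assumptions made for illustration and are not taken from the thesis.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def init_weights_from_samples(X, n_units, rng):
    """Assumed initialization: copy n_units randomly chosen input vectors.

    Requires n_units <= len(X). This is a stand-in, not the thesis's rule.
    """
    idx = rng.choice(len(X), size=n_units, replace=False)
    return X[idx].copy()

def train_sofm(X, n_units=4, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a 1-D SOFM and return its weights (n_units x n_features)."""
    rng = np.random.default_rng(seed)
    W = init_weights_from_samples(X, n_units, rng)
    positions = np.arange(n_units)  # unit coordinates on the 1-D map
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                   # linearly decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 1e-3)  # shrinking neighborhood width
        for x in X[rng.permutation(len(X))]:
            # Winner: the unit whose weight vector is closest to the input.
            winner = np.argmin(np.linalg.norm(W - x, axis=1))
            # Gaussian neighborhood around the winner on the map.
            h = np.exp(-((positions - winner) ** 2) / (2.0 * sigma ** 2))
            # Pull each unit toward x, scaled by its neighborhood value.
            W += lr * h[:, None] * (x - W)
    return W

def assign_clusters(X, W):
    """Label each row of X with the index of its nearest SOFM unit."""
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    return d.argmin(axis=1)

# Toy stand-in for a document-term weight matrix: 20 documents, 6 terms.
rng = np.random.default_rng(0)
X = rng.random((20, 6))
Z = pca_reduce(X, 3)            # dimension reduction before clustering
W = train_sofm(Z, n_units=4)
labels = assign_clusters(Z, W)
```

In a real pipeline, `X` would be the weighted vector space model produced by the preprocessing steps, and the number of retained components would be chosen from the explained variance instead of being fixed at 3.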
Keywords/Search Tags:Text clustering, SOFM algorithm, Initial connection weights, Dimension reduction, Principal component analysis method