Application Of The Clustering Analysis In The Large Vocabulary Chinese Character Recognition

Posted on: 2008-09-07
Degree: Master
Type: Thesis
Country: China
Candidate: J Yang
Full Text: PDF
GTID: 2178360245475360
Subject: Communications and information processing
Abstract/Summary:
With the rapid development of science, the need to analyze and manage tremendous amounts of data is becoming more and more important. Clustering analysis was introduced to discover the underlying model in large data sets, and it has been widely used in data mining, pattern recognition, image processing, and other fields. This thesis mainly studies the application of clustering techniques to recognition over large databases.

First, we summarize the principles, structure, and basic ideas of clustering in detail. Many clustering algorithms have been investigated; they fall into several types: partitioning, hierarchical, density-based, and model-based clustering. Each type has its own advantages and disadvantages, and each has been improved in different respects by different researchers. In Chapter 3 we study three classic clustering algorithms (K-means, LVQ, and kernel clustering) and also experiment with MLVQ, an improved LVQ algorithm. Finally, we select the K-means algorithm for large-vocabulary HCC recognition. Two feature extraction algorithms were used in the experiments: the Gabor feature and the gradient feature. The experiments show that the gradient feature gives better recognition accuracy than the Gabor feature, and that recognition accuracy can be further improved by the LDA algorithm.

In the course of our research, we found that the codebook produced by clustering requires a lot of memory and that recognition time is also very long. Both are obstacles to the widespread adoption of large-data recognition in the real world. We therefore employ the Split VQ algorithm and a two-layer clustering algorithm to improve the efficiency of recognition in time and space. These two algorithms are shown to preserve recognition accuracy while greatly reducing recognition time and codebook memory.

Conventional K-means needs to know the exact number of clusters before clustering the data; otherwise, clustering performance may be poor. The Rival Penalized Competitive Learning (RPCL) algorithm can automatically select the correct number of clusters, but it is sensitive to the learning rate and the de-learning rate, especially the de-learning rate. Chapter 5 presents an improved RPCL algorithm based on evaluating the competitive ability of the winner and the rival; it can determine the number of clusters without requiring a de-learning rate to be selected. Our experiments show that this improved algorithm finds the correct clustering more quickly and conveniently than the RPCL algorithm...
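
For readers unfamiliar with RPCL, the winner/rival update mentioned above can be sketched roughly as follows. This is a minimal sketch of the classic RPCL rule only, not the improved variant proposed in the thesis, whose details the abstract does not give. The function name `rpcl`, the parameter names `lr` (learning rate) and `delr` (de-learning rate), their default values, and the initialisation from random samples are all illustrative assumptions.

```python
import random

def rpcl(data, k, lr=0.05, delr=0.002, epochs=50, seed=0):
    """Classic RPCL sketch: the winner moves toward each sample,
    the rival (second-best unit) is pushed slightly away from it."""
    rng = random.Random(seed)
    # Assumption: initialise the k units at k distinct random samples.
    units = [list(p) for p in rng.sample(data, k)]
    wins = [1] * k  # winning counts for frequency-sensitive competition
    order = list(data)
    for _ in range(epochs):
        rng.shuffle(order)
        for x in order:
            total = sum(wins)
            # Squared distance weighted by each unit's relative winning frequency.
            score = [
                (wins[j] / total) * sum((a - w) ** 2 for a, w in zip(x, units[j]))
                for j in range(k)
            ]
            ranked = sorted(range(k), key=score.__getitem__)
            winner, rival = ranked[0], ranked[1]
            # Learning step: pull the winner toward the sample.
            units[winner] = [w + lr * (a - w) for w, a in zip(units[winner], x)]
            # De-learning step: push the rival away from the sample.
            units[rival] = [w - delr * (a - w) for w, a in zip(units[rival], x)]
            wins[winner] += 1
    return units, wins
```

When `k` is set larger than the true number of clusters, the surplus units are meant to be driven away from the data so that only the remaining units mark clusters. How well this works depends on the choice of `delr` relative to `lr`, which is the sensitivity the thesis sets out to remove.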
Keywords/Search Tags: Clustering Analysis, Character Recognition, K-means Algorithm, Feature Extraction