
Research On Chinese Text Clustering Based On AP Algorithm

Posted on: 2013-08-21
Degree: Master
Type: Thesis
Country: China
Candidate: M D Tang
Full Text: PDF
GTID: 2248330371488849
Subject: Circuits and Systems
Abstract/Summary:
With the rapid development of information technology, the volume of available knowledge is growing dramatically, and data mining provides an effective theoretical basis for finding the desired information in massive data sets. The data to be mined comes in many forms; this thesis focuses on text written in Chinese characters. It mines information from Chinese text data using the AP (Affinity Propagation) algorithm and several improvements to it, and ultimately clusters a text collection. The work is divided into two parts: the first covers Chinese text preprocessing, and the second covers the clustering algorithm, AP, together with a number of improvements to it.

Because of the way Chinese is written, words are not separated by spaces, which makes segmentation difficult. In addition, Chinese semantics give rise to segmentation ambiguity, and unknown words have to be identified. The data therefore needs to be preprocessed before mining. ICTCLAS, software provided free of charge by the Chinese Academy of Sciences, is used for word segmentation. After segmentation, a program processes the data, computes the feature matrix, the feature vectors, and the similarity matrix, and writes the results to the corresponding files.

AP is chosen as the core algorithm for clustering the data set. First, the performance of AP is observed through comparative tests against the K-means algorithm, and AP is then improved in several respects. Second, the calculation of the similarity matrix, which serves as the input to AP, is improved by reducing the dimension of the feature vectors that represent the text set; this improves both the computation speed and how well the vectors represent the texts. Third, the calculation of the damping factor λ, which is introduced into the AP iterations to prevent oscillation, is improved, giving better control over the robustness of the algorithm. Fourth, the calculation of the preference p is improved according to the needs of clustering, so that the number of clusters can be controlled. The complete improved AP algorithm is implemented in MATLAB and compared with the original AP algorithm. The experiments show that the improved AP algorithm has better clustering performance than the original. The improved algorithm is then used to cluster a Chinese text set of 100 txt documents, and the results are processed and observed.

The first half of the experiment uses the object-oriented language Java to read and write the texts, preprocess them, compute the similarity matrix, and write the similarity matrix to an Excel file. The second half uses MATLAB to implement the clustering algorithm and writes the results to an Excel file.
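
The abstract does not give the improved formulas for λ or p, so the following Python sketch shows only the original AP message passing that those improvements start from. It uses a fixed damping factor and sets the preference to the median similarity, a common default that is assumed here. The function name `affinity_propagation`, the toy one-dimensional points, and the negative squared distance used as similarity are illustrative assumptions, not taken from the thesis, whose implementation is in MATLAB.

```python
import statistics


def affinity_propagation(similarity, preference, damping=0.7, max_iter=200):
    """Cluster n items given an n x n similarity matrix.

    Returns (exemplars, labels), where labels[i] is the index of the
    exemplar chosen for item i. Hypothetical helper for illustration.
    """
    n = len(similarity)
    # Copy the matrix and place the preference p on the diagonal.
    s = [[preference if i == k else similarity[i][k] for k in range(n)]
         for i in range(n)]
    r = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    a = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)

    for _ in range(max_iter):
        # Responsibility: r(i,k) = s(i,k) - max over k' != k of a(i,k') + s(i,k')
        for i in range(n):
            scores = [a[i][k] + s[i][k] for k in range(n)]
            best = max(range(n), key=scores.__getitem__)
            first = scores[best]
            second = max(scores[k] for k in range(n) if k != best)
            for k in range(n):
                new = s[i][k] - (second if k == best else first)
                # Damping mixes the old and new message to suppress oscillation.
                r[i][k] = damping * r[i][k] + (1.0 - damping) * new

        # Availability: evidence from other items that k is a good exemplar.
        for k in range(n):
            pos = [max(0.0, r[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]  # positive responsibilities from i != k
            for i in range(n):
                if i == k:
                    new = total
                else:
                    new = min(0.0, r[k][k] + total - pos[i])
                a[i][k] = damping * a[i][k] + (1.0 - damping) * new

    # An item is an exemplar when its self-responsibility plus
    # self-availability is positive.
    exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
    if not exemplars:
        return [], [-1] * n
    labels = [i if i in exemplars else max(exemplars, key=lambda k: s[i][k])
              for i in range(n)]
    return exemplars, labels


# Toy data (assumed): two well-separated groups on a line.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
sim = [[-(x - y) ** 2 for y in points] for x in points]
off_diagonal = [sim[i][k] for i in range(len(points))
                for k in range(len(points)) if i != k]
exemplars, labels = affinity_propagation(
    sim, preference=statistics.median(off_diagonal))
print(exemplars, labels)
```

In this formulation a lower preference generally yields fewer clusters and a higher one yields more, while a damping factor closer to 1 slows each update but makes oscillation less likely. These are the two parameters the thesis reports tuning.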
Keywords/Search Tags: AP Algorithm, Chinese Text, Clustering, Similarity Matrix, Damping Factor, Preference