Research On Chinese Text Clustering Based On AP Algorithm

Posted on:2013-08-21

Degree:Master

Type:Thesis

Country:China

Candidate:M D Tang

Full Text:PDF

GTID:2248330371488849

Subject:Circuits and Systems

Abstract/Summary:

With the rapid development of information technology, the volume of available knowledge has grown dramatically, and data mining provides an effective theoretical basis for finding the desired information in massive data. Data in data mining takes many forms; this thesis focuses on text written in Chinese characters. It mines information from Chinese text data using the AP (Affinity Propagation) algorithm and related improvements, and ultimately clusters a text set. The study is divided into two parts: the first covers Chinese text processing; the second covers the clustering algorithm, namely the AP algorithm, and several improvements to it.

Because of the characteristics of Chinese character encoding, Chinese words are not delimited by spaces, which makes segmentation difficult; Chinese semantics also cause segmentation ambiguity, and unknown words have to be identified. The data therefore must be preprocessed before mining. ICTCLAS, the software provided free of charge by the Chinese Academy of Sciences, is chosen to segment the sentences. After word segmentation, a program processes the data, computes the feature matrix, the feature vectors, and the similarity matrix, and writes the results to the corresponding files.

The AP algorithm is chosen as the core clustering algorithm for the data set. First, the performance of the AP algorithm is observed in comparative tests against the K-means algorithm, and several improvements are then made to the AP algorithm. Second, the calculation of the similarity matrix, which serves as the input of the AP algorithm, is improved by reducing the dimension of the feature vectors that represent the text set; this improves both the computation speed and how well the features represent the texts. Third, the calculation of the damping factor λ, which is introduced into the AP iterations to prevent oscillation, is improved, which strengthens the robustness of the algorithm. Fourth, the calculation of the preference p is improved according to the needs of clustering, so that the number of clusters can be controlled. The whole improved AP algorithm is implemented in MATLAB. Experimental comparison shows that the improved AP algorithm has better clustering performance than the original AP algorithm. The improved AP algorithm is then used to cluster a Chinese text set of 100 txt documents, and the processing results are observed.

The first half of the experiment uses the object-oriented language Java to implement text reading and writing, preprocessing, and calculation of the similarity matrix, and writes the similarity matrix to an Excel file. The second half uses MATLAB to implement the clustering algorithm and writes the results to an Excel file.
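To make the roles of the damping factor λ and the preference p concrete, the following is a minimal sketch of the standard (unimproved) AP message-passing updates. It is not the thesis's MATLAB implementation or its improved variant. The negative squared Euclidean similarity, the median default for p, the fixed λ = 0.5, and all function and parameter names are assumptions made for illustration only.

```python
from statistics import median


def neg_sq_euclidean(vectors):
    """Assumed similarity: s(i, k) = -||x_i - x_k||^2."""
    return [[-sum((a - b) ** 2 for a, b in zip(u, v)) for v in vectors]
            for u in vectors]


def affinity_propagation(S, preference=None, damping=0.5,
                         max_iter=200, stable_iters=15):
    """Standard AP on a dense similarity matrix S with at least 2 rows.

    preference (p) goes on the diagonal of S; lower values tend to give
    fewer clusters. damping (lambda) mixes old and new messages to
    suppress oscillation.
    """
    n = len(S)
    if preference is None:
        # Assumed default: median of the off-diagonal similarities.
        preference = median(S[i][k] for i in range(n)
                            for k in range(n) if i != k)
    S = [row[:] for row in S]
    for i in range(n):
        S[i][i] = preference
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    previous, stable, exemplars = None, 0, []
    for _ in range(max_iter):
        # r(i, k) = s(i, k) - max over k' != k of (a(i, k') + s(i, k'))
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            best = max(range(n), key=scores.__getitem__)
            runner_up = max(v for k, v in enumerate(scores) if k != best)
            for k in range(n):
                new = S[i][k] - (runner_up if k == best else scores[best])
                R[i][k] = damping * R[i][k] + (1 - damping) * new
        # a(i, k) = min(0, r(k, k) + sum over i' not in {i, k} of max(0, r(i', k)))
        # a(k, k) = sum over i' != k of max(0, r(i', k))
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
        # A point k is an exemplar when r(k, k) + a(k, k) > 0.
        exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
        stable = stable + 1 if exemplars and exemplars == previous else 0
        previous = exemplars
        if stable >= stable_iters:
            break
    if not exemplars:
        return [], [-1] * n
    # Each non-exemplar joins the exemplar it is most similar to.
    labels = [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
              for i in range(n)]
    return exemplars, labels


# Toy feature vectors standing in for document vectors (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
exemplars, labels = affinity_propagation(neg_sq_euclidean(points))
```

In this sketch, lowering `preference` generally yields fewer exemplars and raising `damping` toward 1 slows the updates while reducing oscillation; these are the two quantities the thesis reports recalculating, though its specific formulas are not given in the abstract.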

Keywords/Search Tags:

AP Algorithm, Chinese Text, Clustering, Similarity Matrix, Damping Factor, Preference

Related items

1	Study On Similarity-based Text Clustering Algorithm And Its Application
2	Study On Similarity-based Text Clustering Algorithm And Its Application
3	Chinese Text Clustering Based On Text Similarity
4	Search Of Group Intelligent Text Clustering Methods Based On Semantic Similarity
5	Research On Text Clustering Based On Semantic Similarity
6	Research On Text Clustering Algorithm Based On Spectral Clustering
7	Research On The Calculation Method Of Han-Thai Bilingual News Text Similarity With News Elements
8	Web Mining Algorithm Based On Anchor Text Similarity And Time Factor
9	Study Of Text Clustering Algorithm Based On Semantics
10	Study On The Chinese Text Clustering Algorithm Based On Semantic Similarity