Research On Clustering Algorithms For Source Code Mining

Posted on: 2011-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: M Z Meng
Full Text: PDF
GTID: 2178330332480687
Subject: Computer software and theory

Abstract/Summary:
Source code data is characterized by large volume, many nominal attributes, and similar properties. To mine software engineering data efficiently, a fast and efficient approach is needed. K-means clustering is a concise and practical algorithm with broad applications. However, it does not optimize the features of the samples, its clustering results are often unsatisfactory, and its efficiency depends heavily on the distribution of the samples. To address these issues, in this paper we propose a KFCM algorithm based on TF-IDF to cluster source code data. We also optimize the KFCM algorithm with a genetic algorithm, yielding a new algorithm called SGAKFCM, which addresses the local-minimum problem inherent in KFCM. Finally, we apply the KFCM and SGAKFCM algorithms to mine source code data. The experimental results indicate that both algorithms are suitable for large data sets, with high efficiency and good results.

The main research issues of the paper are as follows:

(1) KFCM algorithm based on TF-IDF. Because the KFCM algorithm cannot cluster the text data of source code directly, we use the TF-IDF method to transform that text into numerical data that KFCM can process.

(2) We use the SGAKFCM algorithm to cluster the TF-IDF data, which addresses the local-minimum problem inherent in KFCM.

We implemented the KFCM and SGAKFCM algorithms on the Eclipse and Matlab platforms and evaluated them on the source code of WEKA. The FCM, KFCM and SGAKFCM algorithms were then each used to analyze the output.
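
The TF-IDF step in item (1) can be illustrated with a small Python sketch. It is not the implementation used in this work (which was built on Eclipse and Matlab). The tokenizer, the weighting variant (term frequency times log(N / document frequency)), and the names `tokenize` and `tfidf_matrix` are assumptions made only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(source):
    """Split source text into lowercase identifier-like tokens (assumed tokenizer)."""
    return [t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)]


def tfidf_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF weight of term j in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        rows.append([
            (counts[t] / total) * math.log(n_docs / doc_freq[t])
            for t in vocabulary
        ])
    return vocabulary, rows


# Three toy "source files" (hypothetical input, not the WEKA data).
docs = [
    "int add(int a, int b) { return a + b; }",
    "int sub(int a, int b) { return a - b; }",
    "void log(String msg) { print(msg); }",
]
vocab, matrix = tfidf_matrix(docs)
```

Each row of `matrix` is a numeric vector of the kind a clustering algorithm such as KFCM could take as input. Terms that occur in only one file receive the largest weights, and terms absent from a file receive zero.
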

By comparing the results of the three clustering algorithms, we conclude that the KFCM algorithm achieves satisfactory clustering quality and high efficiency on software engineering data with nominal attributes. The experimental results show that the KFCM algorithm based on TF-IDF performs satisfactorily on source code mining. The main contributions of this paper are using TF-IDF to represent source code data and adopting a genetic algorithm to optimize KFCM, which addresses the local-minimum problem inherent in the KFCM algorithm.
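
For readers unfamiliar with KFCM (kernel fuzzy c-means), the following Python sketch shows one common formulation: a Gaussian kernel with cluster prototypes kept in the input space. The abstract does not say which variant, kernel, or parameters were used, so the kernel choice, the defaults `m`, `sigma` and `n_iter`, and the function names are all assumptions. The starting centers are passed in explicitly because the result depends on them, which is the local-minimum sensitivity that SGAKFCM is meant to address.

```python
import math


def gaussian_kernel(x, v, sigma):
    """Gaussian kernel K(x, v) = exp(-||x - v||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, v))
    return math.exp(-sq_dist / sigma ** 2)


def memberships_for(points, centers, m, sigma, eps=1e-12):
    """Fuzzy membership of every point in every cluster; each row sums to 1."""
    rows = []
    for x in points:
        # The kernel-induced distance is proportional to 1 - K(x, v).
        # eps avoids division by zero when a point coincides with a center.
        inv = [
            (1.0 - gaussian_kernel(x, v, sigma) + eps) ** (-1.0 / (m - 1.0))
            for v in centers
        ]
        total = sum(inv)
        rows.append([w / total for w in inv])
    return rows


def kfcm(points, initial_centers, m=2.0, sigma=1.0, n_iter=30):
    """Alternate membership and center updates for a fixed number of iterations."""
    centers = [list(c) for c in initial_centers]
    dims = len(points[0])
    for _ in range(n_iter):
        u = memberships_for(points, centers, m, sigma)
        for i, v in enumerate(centers):
            # Each point is weighted by u^m times its kernel similarity to the center.
            weights = [
                (u[k][i] ** m) * gaussian_kernel(x, v, sigma)
                for k, x in enumerate(points)
            ]
            denom = sum(weights)
            if denom > 0.0:
                centers[i] = [
                    sum(w * x[d] for w, x in zip(weights, points)) / denom
                    for d in range(dims)
                ]
    return centers, memberships_for(points, centers, m, sigma)


# Two well-separated toy clusters (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, memberships = kfcm(points, [(1.0, 1.0), (4.0, 4.0)], sigma=2.0)
```

In practice the rows of a TF-IDF matrix would take the place of `points`. A poor choice of `initial_centers` can leave both prototypes in the same group, which is why a global search such as a genetic algorithm is a natural complement.
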

Keywords/Search Tags: Data Mining, Source Code Mining, Kernel Function, KFCM Algorithm, Genetic Algorithm