
Research On Non-IID K-means Clustering Algorithm

Posted on: 2019-12-09
Degree: Master
Type: Thesis
Country: China
Candidate: Q Q Xie
GTID: 2428330548486999
Subject: Computer application technology
Abstract/Summary:
As an important branch of data mining, cluster analysis is a key technique for partitioning or grouping data. Without prior knowledge, it divides data objects into several categories according to given requirements or rules. Because of its high practical value, cluster analysis has become a very active research topic in data mining. The k-means algorithm, a classical clustering algorithm, is widely used across many industries. However, k-means has several shortcomings: the initial centers are chosen at random, the choice of k is vague and subjective, and the Euclidean distance treats all variables as equally important. In addition, k-means assumes that the data are independent and identically distributed (IID). In practical applications the data are often not IID; that is, there are couplings or dependencies among the attribute values, attributes, and objects of the data source. Ignoring this non-IID structure may discard important information and make the analysis results inaccurate.

To address these shortcomings, this thesis attempts to improve the k-means algorithm under the non-IID condition. To reduce the randomness of initial center selection, a new method for selecting the initial centers is proposed and discussed in depth, based mainly on the coupling relationships and the correlations between data attributes. In the improved algorithm, the sample object with the largest number of coupled-similar points is taken as the initial center, and a modified Pearson correlation coefficient is used to represent the correlation between attributes more reasonably.

To address the shortcomings of the similarity measure and the clustering criterion function in k-means, two improvements are proposed. The first is an improved similarity measure based on coupled attribute analysis, motivated by the fact that the Euclidean distance treats the different attributes of a data item equally and cannot distinguish their importance. The second is an improved clustering criterion function based on coupled attribute analysis, motivated by the fact that the original criterion function cannot handle uneven distributions within a class. Both improvements use non-IID learning theory to explore the coupling relationships between object attributes, considering the coupling of attribute values within an attribute as well as the coupling of attribute values between attributes.

The improved algorithm is verified experimentally on several UCI data sets. The experimental results show that the improved k-means algorithm has good stability and high accuracy.
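
The abstract does not define the coupled similarity or the modified Pearson correlation coefficient, so the following is only a minimal sketch of the general idea of seeding k-means from dense regions with a correlation-aware distance. It assumes purely numeric attributes, uses the plain (unmodified) Pearson coefficient, and introduces a hypothetical weighting scheme and a hypothetical `radius` parameter; none of these names or choices come from the thesis.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant attribute has no defined correlation; treat it as 0.
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def attribute_weights(data):
    """Hypothetical weighting (an assumption, not the thesis's formula):
    1 + mean absolute correlation with the other attributes, normalised
    so that the weights sum to 1."""
    cols = list(zip(*data))
    m = len(cols)
    raw = []
    for j in range(m):
        others = [abs(pearson(cols[j], cols[k])) for k in range(m) if k != j]
        raw.append(1.0 + (sum(others) / len(others) if others else 0.0))
    total = sum(raw)
    return [r / total for r in raw]


def weighted_distance(a, b, w):
    """Euclidean distance with one weight per attribute."""
    return math.sqrt(sum(wj * (x - y) ** 2 for wj, x, y in zip(w, a, b)))


def pick_initial_centers(data, k, radius):
    """Repeatedly take the remaining point with the most neighbours within
    `radius` as a center, then drop that point and its neighbours."""
    w = attribute_weights(data)
    remaining = list(range(len(data)))
    centers = []
    while remaining and len(centers) < k:
        best, best_nbrs = None, []
        for i in remaining:
            nbrs = [j for j in remaining
                    if weighted_distance(data[i], data[j], w) <= radius]
            if len(nbrs) > len(best_nbrs):  # ties keep the earlier point
                best, best_nbrs = i, nbrs
        centers.append(data[best])
        covered = set(best_nbrs)
        remaining = [j for j in remaining if j not in covered]
    return centers


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
              (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.1, 5.0)]
    print(pick_initial_centers(points, k=2, radius=0.5))
```

Because the seeding is deterministic, repeated runs on the same data start from the same centers, which is the kind of stability the abstract attributes to a non-random choice of initial centers. The resulting centers would then be passed to an ordinary k-means loop that uses the same weighted distance.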
Keywords/Search Tags: Non-IIDness, Initial Central Point, Pearson Correlation Coefficient, Similarity Measure, Coupling Relation Analysis