
Clustering Incomplete Data Using Pseudo Nearest Neighbor And Interval-valued Distance

Posted on: 2017-08-05
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Chen
Full Text: PDF
GTID: 2348330488459744
Subject: Control theory and control engineering
Abstract/Summary:
Handling missing data is a challenging problem that arises frequently in data analysis and pattern classification. Data sets can be incomplete as a result of random noise, human error, and similar causes, yet traditional clustering methods are not directly applicable to incomplete data. If not handled properly, incomplete data may lead to large errors or biased clustering results. This thesis studies clustering algorithms for incomplete data based on pseudo-nearest-neighbor intervals and interval-valued distance. Results on several incomplete data sets demonstrate the effectiveness of the proposed algorithms. The main work includes:

1. To account for the uncertainty of missing attribute values, a fuzzy c-means clustering algorithm based on pseudo-nearest-neighbor intervals of incomplete data is given. The data are first completed using the pseudo-nearest-neighbor interval approach, and the resulting data set is then clustered with the fuzzy c-means algorithm for interval-valued data. The proposed algorithm estimates the missing attribute values without normalization and thus captures the essence of pattern similarities in the original, untouched data set. Additionally, the pseudo-nearest-neighbor interval representation takes into account the implicit uncertainty of missing attribute values and also considers the angle between incomplete data and other data.

2. To compute distances that involve missing attribute values, a fuzzy c-means clustering algorithm for incomplete data using the triangle inequality is proposed. First, an interval representation of distance based on the triangle inequality is presented, which can be used to measure the distance between incomplete data and prototypes. The proposed interval distance makes full use of neighborhood information in incomplete data sets and can also represent the uncertainty of missing attribute values to some degree. The triangle inequality also helps to estimate the range of the interval. Then a clustering algorithm based on the proposed distance for incomplete data is given. The proposed algorithm clusters the incomplete data without elimination or imputation and can thus avoid the possible accumulation and propagation of errors through the iterative optimization procedure.
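The two interval ideas summarized above can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the abstract does not give the pseudo-nearest-neighbor rule or its angle term, so the sketch substitutes a plain nearest-neighbor interval built on a partial distance over the attributes observed in both vectors. The names `partial_distance`, `neighbor_intervals`, `triangle_bounds`, and the neighbor count `q` are hypothetical. The second idea relies only on the triangle inequality, which for any metric bounds the unknown distance between a datum x and a prototype v by |d(x, y) - d(y, v)| <= d(x, v) <= d(x, y) + d(y, v) for a reference point y.

```python
import numpy as np


def partial_distance(a, b):
    """Euclidean distance over attributes observed in both vectors,
    rescaled to the full dimension (an assumed stand-in metric)."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    if not mask.any():
        return np.inf  # no shared attributes, so the pair is not comparable
    diff = a[mask] - b[mask]
    return np.sqrt(a.size / mask.sum() * np.sum(diff ** 2))


def neighbor_intervals(X, q=2):
    """Represent each missing value (NaN) as an interval [lower, upper]
    spanned by the values of the q nearest rows that observe that attribute.
    Observed values become degenerate intervals with lower == upper."""
    lower, upper = X.copy(), X.copy()
    for i, x in enumerate(X):
        missing = np.isnan(x)
        if not missing.any():
            continue
        dists = np.array([
            np.inf if j == i else partial_distance(x, y)
            for j, y in enumerate(X)
        ])
        order = np.argsort(dists)
        for k in np.flatnonzero(missing):
            # Closest comparable rows in which attribute k is observed.
            donors = [j for j in order
                      if np.isfinite(dists[j]) and not np.isnan(X[j, k])][:q]
            if donors:
                values = X[donors, k]
                lower[i, k], upper[i, k] = values.min(), values.max()
    return lower, upper


def triangle_bounds(d_xy, d_yv):
    """Interval that must contain d(x, v), given d(x, y) and d(y, v),
    by the triangle inequality of a metric."""
    return abs(d_xy - d_yv), d_xy + d_yv
```

Calling `neighbor_intervals` on an array whose missing entries are encoded as `np.nan` returns two arrays of the same shape that together describe an interval-valued data set, which an interval-valued fuzzy c-means routine could then consume. How the thesis chooses the reference point and estimates its distance to the incomplete datum is not stated in the abstract, so `triangle_bounds` only shows the inequality itself.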
Keywords/Search Tags: Pseudo nearest neighbor, Fuzzy c-means, Incomplete data, Clustering