Attribute Weighted Three-way Clustering Model For Incomplete Data

Posted on: 2017-01-20
Degree: Master
Type: Thesis
Country: China
Candidate: T Su
GTID: 2348330533450187
Subject: Computer technology
Abstract/Summary:
Clustering analysis is widely used to identify hidden structure in data in the era of big data. It has been applied successfully in areas such as machine learning, data mining, pattern recognition and image analysis. However, data access limitations, random noise, misinterpretation of data and data loss can all produce missing values in a data set. A data set that contains missing values is called an incomplete data set.

Existing clustering approaches usually either assign an object to a cluster or do not, which is the result of a two-way decision. Because an incomplete data set is uncertain or imperfect, it is unreasonable to assign such objects to a cluster with certainty. This thesis therefore studies clustering for incomplete data and makes four contributions.

Firstly, an attribute weighted three-way clustering model for incomplete data is proposed. The model divides the incomplete data set into four types of subsets according to the missing rate of attributes and the attribute weights. The four types are sufficient data, valuable data, inadequate data and invalid data, and the useful information they carry decreases in that order. A three-way decision strategy is then used to process the four types of data. The model does not assign an uncertain incomplete object to a definite cluster but assigns it to the boundary region of clusters.

Secondly, a method for obtaining attribute weights based on the cover rate of objects is built. Semi-supervised clustering usually has a small amount of known information available, which can guide the clustering process well. This thesis proposes a method that derives attribute weights and an attribute order from a small number of labeled objects by calculating the cover rate of each attribute between clusters. The model is therefore suitable for both semi-supervised and unsupervised clustering.
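
The three-way decision idea in the first contribution can be illustrated with a minimal sketch. It is not the thesis's algorithm: the function name `three_way_assign`, the thresholds `alpha` and `beta`, and the assumption that each object already has a similarity score in [0, 1] to every cluster are all hypothetical, chosen only to show how an object ends up in a core region, a boundary region, or outside a cluster.

```python
def three_way_assign(similarities, alpha=0.7, beta=0.4):
    """Split clusters into core and boundary memberships for one object.

    similarities: dict mapping cluster id -> similarity in [0, 1] (assumed input).
    alpha, beta:  hypothetical thresholds; the thesis may define them differently.
    """
    if not 0.0 <= beta < alpha <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= beta < alpha <= 1")

    core, boundary = [], []
    for cluster_id, sim in similarities.items():
        if sim >= alpha:
            # Confident enough: the object definitely belongs to this cluster.
            core.append(cluster_id)
        elif sim > beta:
            # Uncertain: defer the decision and keep the object in the boundary.
            boundary.append(cluster_id)
        # Otherwise the object is treated as not belonging to this cluster.
    return core, boundary


core, boundary = three_way_assign({"c1": 0.9, "c2": 0.5, "c3": 0.1})
```

Under this sketch, an object with many missing values would typically receive lower or less reliable similarities and so fall into boundary regions instead of being forced into a single cluster.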
For unsupervised clustering, attribute weights can be specified by the user or randomly generated. For semi-supervised clustering, attribute weights can be calculated with the proposed method.

Thirdly, an improved partial Euclidean distance formula and an interval-valued filling method for missing values, based on objects in the neighborhood, are designed. The partial Euclidean distance formula is improved by combining the missing rate with attribute importance. In addition, because effective recovery of missing data is the key to clustering incomplete data, this thesis proposes an interval-valued method that fills each missing value based on the objects in its neighborhood.

Finally, four three-way clustering methods based on density peaks are presented: a semi-supervised filled method, a semi-supervised unfilled method, an unsupervised filled method and an unsupervised unfilled method. Experiments were carried out on these methods and some conclusions were obtained. Users can choose one of the methods according to their requirements.
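
The abstract does not give the improved distance formula or the filling rule, so the following is only a sketch of the general ideas behind the third contribution. `partial_distance` implements the standard partial Euclidean distance (differences over attributes observed in both objects, rescaled by the share of attributes used); the optional `weights` argument is an assumed generalization, not the thesis's formula. `interval_fill` is a hypothetical neighborhood-based filler that replaces each missing value with the (min, max) interval of that attribute among the `k` nearest objects. Missing values are represented as `None`.

```python
import math


def partial_distance(x, y, weights=None):
    """Partial Euclidean distance over attributes observed in both x and y."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    if weights is None:
        weights = [1.0] * len(x)  # unweighted case: the standard partial distance

    shared = [j for j in range(len(x)) if x[j] is not None and y[j] is not None]
    shared_weight = sum(weights[j] for j in shared)
    if shared_weight == 0:
        return math.inf  # nothing comparable between the two objects

    squared = sum(weights[j] * (x[j] - y[j]) ** 2 for j in shared)
    # Rescale so objects compared on few attributes are not favoured.
    return math.sqrt(sum(weights) / shared_weight * squared)


def interval_fill(target, others, k=3, weights=None):
    """Replace each missing value in target with a (low, high) interval
    taken from the k nearest objects; observed values become (v, v)."""
    neighbours = sorted(
        others, key=lambda o: partial_distance(target, o, weights)
    )[:k]

    filled = []
    for j, value in enumerate(target):
        if value is not None:
            filled.append((value, value))
            continue
        observed = [o[j] for o in neighbours if o[j] is not None]
        # Leave the gap as None if no neighbour has this attribute observed.
        filled.append((min(observed), max(observed)) if observed else None)
    return filled


d = partial_distance([1.0, None, 3.0], [1.0, 5.0, 6.0])
intervals = interval_fill([1.0, None], [[1.0, 2.0], [1.1, 4.0], [9.0, 100.0]], k=2)
```

Keeping an interval instead of a single imputed number preserves the uncertainty of the missing value, which fits the three-way view of deferring decisions about uncertain objects.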
Keywords: incomplete data, three-way decision, clustering, similarity measure