
Mixed Data Mining Methods Based On Rough Sets Theory

Posted on: 2015-11-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Zhao
Full Text: PDF
GTID: 1228330461474376
Subject: Computer Science and Technology
Abstract/Summary:
In many practical application areas, much of the data to be processed is mixed. A common scenario is data that combines numerical features and nominal features. Mining such mixed data has become a challenging task. The research in this dissertation focuses on the similarity relations between samples in incomplete information systems, feature selection and text selection for mixed data, imbalanced classification for mixed data, and outlier detection for mixed data. Concretely, the research achievements include the following aspects:

In the first part, building on the existing extended rough set models, the relationships between samples in incomplete information systems are studied further. The neighborhood-tolerance rough set model, the variable precision tolerance rough set model and the variable precision neighborhood-tolerance rough set model are proposed, and their properties are discussed. Of these, the neighborhood-tolerance rough set model and the variable precision neighborhood-tolerance rough set model are applicable to mixed data. Furthermore, the concepts of neighborhood-tolerance information entropy and neighborhood-tolerance conditional entropy are introduced, and a feature selection algorithm based on neighborhood-tolerance conditional entropy is constructed.

In the second part, sample selection methods and sample-based learning methods are studied. First, a text selection method is presented. This method uses the variable precision tolerance relation to evaluate the similarity between texts and takes the variable precision tolerance classes as text clusters. All text clusters can be obtained in a single pass over the texts, and each cluster can be represented by its cluster center. This considerably reduces the size of the data, which facilitates classification. Second, a sample selection method based on the neighborhood rough set is given. In this method, the samples in the neighborhood decision region are regarded as interior samples and are deleted. The samples in the neighborhood boundary region are divided into three parts: noise samples, samples close to the decision boundary, and samples located away from the decision boundary. The samples close to the decision boundary are then chosen as the selected sample set. Last, a weighted prototype classification method is presented. This method uses a self-generating prototypes algorithm to divide the whole sample set into small sample subsets. The mean of each subset is taken as a prototype, and each prototype is assigned a weight. A test sample is assigned the class label of the sample subset whose prototype is nearest according to the weighted prototype distance formula.

In the third part, imbalanced classification problems are studied. To alleviate the evident bias of the decision boundaries toward the minority class, a synthetic minority over-sampling technique based on the neighborhood rough set (NRS-SMOTE) is constructed. The main characteristics of this technique are: 1) an under-sampling technique is used to clean noise; 2) only the minority samples in the decision boundary, represented as the neighborhood decision boundary region, are used for synthesis, rather than all minority samples; 3) the number of synthetic samples generated from a minority sample is determined by the class distribution in the neighborhood of that sample; 4) NRS-SMOTE is applicable to mixed data.

In the last part, outlier detection in mixed data is studied. Based on the neighborhood information granule, an outlier detection method is constructed. In this method, the outlierness factor of a sample is determined by a weighted sum of the size of the sample’s neighborhood and the sample’s neighborhood density. The size of a sample’s neighborhood is the number of samples it contains, and the neighborhood density reflects how densely the samples are distributed within that neighborhood.
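
As an illustration of the neighborhood size used in the last part, the following Python sketch counts the samples in each sample's neighborhood for a small mixed data set. The abstract does not give the neighborhood definition, so the rule used here is an assumption: two samples are neighbors when their nominal features match exactly and the Euclidean distance over their numerical features is at most a threshold `delta`. The names `in_neighborhood`, `neighborhood`, `neighborhood_sizes` and `delta` are hypothetical. The density term and the weights of the outlierness factor are not specified in the abstract, so the sketch stops at the neighborhood size.

```python
from typing import List, Sequence, Tuple

# A mixed sample: (numerical features, nominal features).
Sample = Tuple[Sequence[float], Sequence[str]]


def in_neighborhood(x: Sample, y: Sample, delta: float) -> bool:
    """Assumed rule (not taken from the dissertation): nominal features
    must match exactly, and the Euclidean distance over the numerical
    features must not exceed delta."""
    x_num, x_nom = x
    y_num, y_nom = y
    if tuple(x_nom) != tuple(y_nom):
        return False
    dist = sum((a - b) ** 2 for a, b in zip(x_num, y_num)) ** 0.5
    return dist <= delta


def neighborhood(index: int, samples: List[Sample], delta: float) -> List[int]:
    """Indices of all samples in the neighborhood of samples[index],
    including the sample itself."""
    center = samples[index]
    return [j for j, s in enumerate(samples) if in_neighborhood(center, s, delta)]


def neighborhood_sizes(samples: List[Sample], delta: float) -> List[int]:
    """Number of samples in each sample's neighborhood."""
    return [len(neighborhood(i, samples, delta)) for i in range(len(samples))]


# Toy data: three nearby "red" samples, one distant "red" sample,
# and one "blue" sample whose nominal value differs from all others.
samples: List[Sample] = [
    ((0.10, 0.20), ("red",)),
    ((0.15, 0.25), ("red",)),
    ((0.12, 0.18), ("red",)),
    ((0.90, 0.95), ("red",)),
    ((0.11, 0.21), ("blue",)),
]

sizes = neighborhood_sizes(samples, delta=0.2)
print(sizes)  # [3, 3, 3, 1, 1]
```

In this toy data the last two samples have a neighborhood containing only themselves, which is the kind of signal a neighborhood-based outlierness factor could draw on. How the size is combined with the density is left to the dissertation itself.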
Keywords/Search Tags: Mixed data, Rough sets theory, Feature selection, Sample selection, Variable precision neighborhood-tolerance class, Neighborhood-tolerance relation, Neighborhood information granule