
The Study Of Some Issues For Unsupervised And Semi-supervised Dimensionality Reduction

Posted on: 2017-11-22    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y T Wang    Full Text: PDF
GTID: 1318330536968282    Subject: Computer Science and Technology
Abstract/Summary:
With the development of science and technology, pattern recognition plays an important role in many areas of social life, such as large-scale text recognition, mass face image recognition, and remote sensing image recognition. At the same time, the emergence of high-dimensional datasets presents tremendous new challenges: computational complexity is extremely high, and learning results are harder to interpret. Dimensionality reduction is a pivotal research problem in high-dimensional data processing. It maps high-dimensional data into a low-dimensional space, extracts the meaningful and important features for recognizing data and discriminating between class labels, and removes irrelevant and redundant features from the dataset. Although research on dimensionality reduction has produced fruitful results, many challenging problems remain owing to new characteristics of real-world data such as high dimensionality, huge quantity, and incomplete class labels. This work therefore further explores problems in dimensionality reduction by improving existing methods and developing new theories and techniques. The main contributions of this thesis are as follows:

(1) To address feature extraction with incompletely labeled data, where label information is under-utilized and neighboring data may follow a multimodal distribution, we propose a semi-supervised local Fisher discriminant analysis based on the reconstruction probability class. The reconstruction probability class assigns each unlabeled point a probability of belonging to each class label, determined by its nearest labeled neighbors, thereby integrating labeled and unlabeled data. A weighted distance between adjacent points is introduced to compute the between-class and within-class scatter matrices, so that data from different classes are separated while multimodal neighboring data from the same class are drawn closer during dimensionality reduction.

(2) For measuring relevance in unlabeled datasets, we propose an unsupervised relevance gain metric based on information theory, which evaluates a feature's importance and the relevance between features effectively and efficiently, avoiding the repeated execution of a specific learning algorithm during unsupervised feature selection. A feature's importance in the unsupervised setting is defined as its average mutual information with all other features. Under the naïve Bayes assumption, which supposes that the features are not independent but conditionally independent given the value of a latent class, this importance is a lower bound on the mutual information between the feature and the latent class.

(3) Unsupervised feature selection suffers from several problems: measuring feature relevance without labels, low execution efficiency, and convergence to local optima. We therefore propose two unsupervised feature selection methods: one based on relevance gain and Markov blanket clustering, and one based on relevance gain and particle swarm optimization. The former constructs and partitions a directed acyclic feature graph to obtain feature clusters, then selects representative features from each cluster to form the selected feature subset. The latter builds on the former and introduces particle swarm optimization into the feature selection process; swarm intelligence, through cooperation and competition among particles, guides the optimization search and quickly and effectively reaches a globally optimal feature subset.

(4) In engineering practice, labeling is time-consuming and expensive, whereas unlabeled data can be readily obtained. Given a limited amount of labeled data and a large number of unlabeled data, purely supervised or unsupervised feature selection uses only part of the available information, discarding either the unlabeled data or the labels. To solve this problem, we propose a semi-supervised representative feature selection method based on information theory and relevance analysis. A trade-off parameter integrates the mutual information of the labeled data with the relevance gain of the unlabeled data, yielding a feature relevance metric that better exploits both labeled and unlabeled information.
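The reconstruction probability class described in contribution (1) can be sketched as follows. This is a minimal illustration, assuming a k-nearest-labeled-neighbor voting rule; the function name, the value of k, and the Euclidean metric are illustrative assumptions rather than the thesis's exact formulation.

```python
import numpy as np

def reconstruction_probability_class(X_lab, y_lab, X_unl, k=3):
    """Sketch: assign each unlabeled point a class-probability vector
    from the labels of its k nearest labeled neighbors (assumption:
    Euclidean distance and uniform neighbor voting)."""
    classes = np.unique(y_lab)
    probs = np.zeros((len(X_unl), len(classes)))
    for i, x in enumerate(X_unl):
        # Distances from this unlabeled point to every labeled point.
        d = np.linalg.norm(X_lab - x, axis=1)
        nn = np.argsort(d)[:k]
        for j, c in enumerate(classes):
            # Fraction of the k nearest labeled neighbors carrying class c.
            probs[i, j] = np.mean(y_lab[nn] == c)
    return classes, probs
```

An unlabeled point lying inside one class's cluster then receives a probability near 1 for that class, which is what allows it to contribute to the weighted scatter matrices alongside the labeled data.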
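The unsupervised importance score of contribution (2), a feature's average mutual information with all other features, might be computed as in the sketch below for discrete features. The plug-in mutual information estimator and the function names are illustrative assumptions, not code from the thesis.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical (plug-in) mutual information, in nats, between two
    discrete sequences of equal length."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                px, py = np.mean(x == xv), np.mean(y == yv)
                mi += pxy * np.log(pxy / (px * py))
    return mi

def relevance_gain(X):
    """Score each feature by its average mutual information with all
    other features (the unsupervised importance described above)."""
    n_features = X.shape[1]
    scores = np.zeros(n_features)
    for i in range(n_features):
        others = [mutual_information(X[:, i], X[:, j])
                  for j in range(n_features) if j != i]
        scores[i] = np.mean(others)
    return scores
```

Because no learning algorithm appears in the loop, the score can rank all features in a single pass over pairwise statistics, which is the efficiency argument made in the abstract.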
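The trade-off in contribution (4) can be illustrated with a small self-contained sketch: a parameter beta blends a supervised term (mutual information between a feature and the label, on the labeled data) with an unsupervised term (average mutual information with the remaining features, on the unlabeled data). The specific blending formula, parameter name, and estimator here are assumptions for illustration.

```python
import numpy as np

def discrete_mi(x, y):
    """Empirical mutual information (nats) between two discrete sequences."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

def semi_supervised_relevance(X_lab, y_lab, X_unl, beta=0.5):
    """Hypothetical trade-off score: beta weights the supervised mutual
    information with the label; (1 - beta) weights the unsupervised
    average mutual information with the other features."""
    n_features = X_lab.shape[1]
    scores = np.zeros(n_features)
    for i in range(n_features):
        supervised = discrete_mi(X_lab[:, i], y_lab)
        unsupervised = np.mean([discrete_mi(X_unl[:, i], X_unl[:, j])
                                for j in range(n_features) if j != i])
        scores[i] = beta * supervised + (1 - beta) * unsupervised
    return scores
```

With beta = 1 the score degenerates to purely supervised relevance and with beta = 0 to the purely unsupervised relevance gain, so the parameter interpolates between the two regimes the abstract contrasts.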
Keywords/Search Tags:dimensionality reduction, feature extraction, feature selection, Markov blanket, feature relevance, mutual information, relevance gain