
Research On Feature Weighting And Feature Selection-based Data Mining Algorithms

Posted on: 2014-02-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Zhu
Full Text: PDF
GTID: 1228330392460359
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the development of information technology, data mining has become one of the important tasks of artificial intelligence and database research, and it has been studied extensively over the past decades. Data mining is not only the analysis step of the "Knowledge Discovery in Databases" (KDD) process, which attempts to discover patterns in large datasets, but also an important step of decision support, which extracts information from a dataset and transforms it into an understandable structure for further use. Data mining draws on methods at the intersection of artificial intelligence, machine learning, pattern recognition, statistics, and database systems.

Data mining has made many important advances, but it still faces many challenges. Several crucial ones are as follows: 1) datasets are becoming larger and more complex, which makes them difficult to handle with relational databases and traditional data mining and machine learning algorithms; 2) as the data dimensionality increases, the volume of the space grows so fast that the available data become sparse, a phenomenon known as the curse of dimensionality; 3) the subject of data mining may appear differently when examined by different disciplines, for example biology, neuroscience, economics, and business. Interdisciplinary work therefore requires researchers to connect and integrate several academic schools of thought, professions, or technologies, along with their specific perspectives, in the pursuit of a common task.

Motivated by these challenges, several issues have been addressed in the past decades. Data stream analysis was proposed to handle large-scale and streaming data. Feature weighting and feature selection methods were introduced to process high-dimensional data. Furthermore, interdisciplinary programs have arisen from new research developments; bioinformatics, which combines molecular biology with computer science, is currently one of the research focuses in the data mining area.

In this thesis, we propose a collection of new algorithms, algorithm improvements, and novel applications of existing methods for feature weighting and feature selection. The study has two main parts: the research and improvement of feature weighting-based subspace clustering analysis, and the study and application of feature selection-based classification methods. Furthermore, we apply the proposed algorithms to text clustering in information retrieval, gene expression data clustering, face image classification, and the prediction of disulfide connectivity. The main contributions and innovations of this dissertation are as follows:

1) We extend the online learning strategy and scalable clustering techniques to soft subspace clustering and propose two online soft subspace clustering methods, OFWSC and OEWSC, as well as two streaming soft subspace clustering algorithms, FuStreCA and EnStreCA. The proposed evolving soft subspace clustering algorithms not only reveal the important local subspace characteristics of high-dimensional data, but also leverage the effectiveness of the online learning scheme and the ability of scalable clustering methods to handle large or streaming data.

2) We introduce a multiobjective evolutionary soft subspace clustering algorithm, MOSSC, which simultaneously optimizes the weighted within-cluster compactness and the weighted between-cluster separation, incorporated within two different cluster validity criteria. MOSSC not only inherits the merits of soft subspace clustering but also gains the beneficial properties of the multiobjective evolution strategy.

3) We present two novel soft subspace clustering algorithms that exploit the advantages of the competitive agglomeration strategy: fuzzy weighting subspace clustering with competitive agglomeration (FWSCA) and entropy weighting subspace clustering with competitive agglomeration (EWSCA). FWSCA and EWSCA converge within a few iterations regardless of the initial number of clusters, and converge to the same optimal partition regardless of initialization.

4) We propose a novel feature selection method, termed Sparse Score, based on sparse representation theory. We first obtain the sparse representation reconstruction coefficient matrix by L1-minimization, and then evaluate the importance of each feature by its power of sparse preserving, which gives the method its name. We compare the proposed method with other score function-based feature selection algorithms on UCI and Yale datasets. Experimental results demonstrate the effectiveness and efficiency of our method.

5) We introduce an efficient feature selection technique for predicting disulfide connectivity. Using it, we find that the high-dimensional feature vector contains much redundant information and that prediction accuracy can be further improved when the vector is reduced to a lower-dimensional but more compact feature space. We also find that the global protein features contribute little to the formation and prediction of disulfide bridges.
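
The abstract describes Sparse Score only in outline (contribution 4), so the following is a minimal sketch of how a score of this kind could be computed, not the dissertation's implementation. Several things here are assumptions rather than details taken from the abstract: the L1-minimization step is replaced by an L1-penalized least-squares (Lasso) problem solved with plain coordinate descent; the score of a feature is taken to be its sparse-reconstruction error divided by its variance, with lower values meaning the feature is better preserved; and the names `soft_threshold`, `sparse_code`, `sparse_scores`, and the parameter `lam` are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the proximal step for an L1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sparse_code(X, i, lam=0.1, n_iter=200):
    """Approximate sample X[i] as a sparse combination of the other samples.

    Assumption: minimizes 0.5 * ||x_i - sum_j s_j x_j||^2 + lam * ||s||_1
    by coordinate descent, as a stand-in for exact L1-minimization.
    Returns the coefficient list s, with s[i] fixed at 0.
    """
    n, d = len(X), len(X[0])
    s = [0.0] * n
    residual = list(X[i])  # x_i minus the current reconstruction
    norms = [sum(v * v for v in x) for x in X]
    for _ in range(n_iter):
        for j in range(n):
            if j == i or norms[j] == 0.0:
                continue  # a sample never reconstructs itself
            # Correlation of sample j with the residual, with j's own term added back.
            rho = sum(X[j][r] * residual[r] for r in range(d)) + norms[j] * s[j]
            new_value = soft_threshold(rho, lam) / norms[j]
            delta = new_value - s[j]
            if delta != 0.0:
                for r in range(d):
                    residual[r] -= delta * X[j][r]
                s[j] = new_value
    return s


def sparse_scores(X, lam=0.1, n_iter=200):
    """Score each feature by how well the sparse reconstruction preserves it.

    Assumption: score = reconstruction error of the feature / variance of the
    feature, so lower is better; constant features get an infinite score.
    """
    n, d = len(X), len(X[0])
    S = [sparse_code(X, i, lam, n_iter) for i in range(n)]  # coefficient matrix
    scores = []
    for r in range(d):
        column = [x[r] for x in X]
        mean = sum(column) / n
        variance = sum((v - mean) ** 2 for v in column) / n
        error = 0.0
        for i in range(n):
            reconstructed = sum(S[i][j] * X[j][r] for j in range(n))
            error += (X[i][r] - reconstructed) ** 2
        scores.append(error / variance if variance > 0 else float("inf"))
    return scores
```

Under these assumptions, features would be ranked by ascending score and the top-ranked ones kept. Pure-Python coordinate descent is used only to keep the sketch self-contained; for real data a dedicated Lasso or basis pursuit solver would be the more practical choice.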
Keywords/Search Tags: data mining, feature weighting, clustering analysis, feature selection, bioinformatics