Research On The Online Unsupervised Feature Selection Algorithms On The Data Stream

Posted on: 2021-11-25   Degree: Master   Type: Thesis
Country: China   Candidate: R Ma
GTID: 2518306548495924   Subject: Computer Science and Technology
Abstract/Summary:
Unsupervised feature selection for large-scale, high-dimensional data has long been a research hotspot in machine learning. In particular, online unsupervised feature selection on dynamic, high-speed data streams is urgently needed in fields such as online anomaly diagnosis, social data analysis, and video image management. Data streams generated by practical applications are generally characterized by high dimensionality, high velocity, and complex types, which poses new challenges for online unsupervised feature selection. First, high-dimensional data are often distributed in an irregular nonlinear structure, which makes it difficult to mine the underlying structure. Second, new instances arrive at any time, which leads to unstable interactions in the data stream. Third, data generated in practical applications usually contain both numerical and categorical features, which raises the problems of real-time measurement and interaction mining on mixed data streams. To address these problems, this thesis first studies unsupervised feature selection algorithms on static datasets. On this basis, it further studies online unsupervised feature selection algorithms on numerical data streams and on mixed data streams, and makes the following main contributions.

High-dimensional data are often distributed in an irregular nonlinear structure, and the distances between instances become approximately equal in high-dimensional space, so the exact distance values lose much of their meaning. However, most existing unsupervised feature selection algorithms construct the data structure from exact distance values, which can easily discard too much useful information or retain too much useless information, and thus reduces the accuracy of feature subset selection. To solve this problem, we propose Unsupervised Feature Selection via Local Total-order Preservation (UFSLTP). UFSLTP uses the concept of a total-order relation on the dataset to compare the distances between data instances. This is a ternary relation on the dataset: one instance is closer to a given instance than another instance is. On this basis, we mine the local total-order relation in the original feature space. UFSLTP weights each feature by its ability to preserve the local total-order relation and then selects the features with the highest weights to form a new low-dimensional space. Experiments show that, compared with existing unsupervised feature selection algorithms, the clustering performance of UFSLTP, measured by Normalized Mutual Information (NMI), improves by 15.32% on average.

The interactions in complex data streams can be summarized at three levels: the individual level, the aggregation level, and the streaming level. Existing online unsupervised feature selection methods consider only some of these levels, which results in poor real-time performance and accuracy and in a suboptimal feature subset. Therefore, we propose Feature Selection via Multi-Cluster graph structure Preservation (FSMCP) for numerical data streams. Based on the concept of the total-order relation, we design a Multi-Cluster graph structure for numerical data streams that integrates the three levels of interaction. At the individual level, the instance-to-instance total-order relation represents the relationship between the arriving instances. At the aggregation level, the global cluster-to-cluster total-order relation and the local cluster-to-cluster total-order relation represent the relationship between the instances that have already arrived (a 'cluster' here is a kind of data sketch of the arrived instances). At the streaming level, the instance-to-cluster total-order relation represents the relationship between the arriving instances and the arrived instances. FSMCP weights each feature by its ability to preserve the Multi-Cluster graph structure at each moment and then selects the features with the highest weights to form a new low-dimensional space. Compared with the baseline methods, FSMCP is more efficient than the offline methods while producing feature subsets of similar or even better quality, and it outperforms the existing online feature selection methods with an average NMI improvement of 26.41%.

Practical application environments contain both numerical and categorical features, yet at present there is no online unsupervised feature selection algorithm for mixed data streams. Therefore, we propose Feature Selection based on the Heterogeneous Distance (FSHD). It adopts the mixed-data measurement model Heterogeneous Euclidean Overlap Metric (HEOM) to measure the distance between instances. HEOM is a heterogeneous distance function: it applies different distance metrics to numerical and categorical features and then combines the per-feature measurements into a total distance. FSHD constructs the Multi-Cluster graph structure based on HEOM, weights each feature by its ability to preserve that structure, and then selects the features with the highest weights to form a new low-dimensional space. Experiments show that FSHD obtains feature subsets of stable quality in a timely manner. Compared with the online methods for numerical data streams, FSHD obtains feature subsets of higher quality, with the corresponding clustering NMI improved by 85.89% on average. FSHD also outperforms the offline methods for mixed datasets, with an NMI improvement of 36.52%.
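
As an illustration of the distance model described above, the following minimal Python sketch computes HEOM in its commonly cited form (overlap distance for categorical features, range-normalized absolute difference for numerical features, a distance of 1 for missing values, combined as a Euclidean sum) and uses it to test the ternary "closer than" relation. It is not the thesis implementation: the function names `heom` and `closer_than`, the `categorical` index set, and the precomputed `ranges` mapping are assumptions made for this example, and on a real stream the feature ranges would have to be maintained incrementally, which is not shown here.

```python
import math


def heom(x, y, categorical, ranges):
    """HEOM distance between two instances given as equal-length sequences.

    categorical: set of feature indices treated as categorical (assumed input).
    ranges: dict mapping each numerical feature index to its (max - min) range
            (assumed to be known in advance for this sketch).
    """
    total = 0.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        if xi is None or yi is None:
            d = 1.0  # missing value: maximal per-feature distance
        elif i in categorical:
            d = 0.0 if xi == yi else 1.0  # overlap metric for categorical features
        else:
            r = ranges[i]
            d = abs(xi - yi) / r if r > 0 else 0.0  # range-normalized difference
        total += d * d
    return math.sqrt(total)


def closer_than(anchor, b, c, categorical, ranges):
    """Ternary relation: True if b is strictly closer to anchor than c is."""
    return heom(anchor, b, categorical, ranges) < heom(anchor, c, categorical, ranges)


# Hypothetical mixed instances: (numerical, categorical, numerical)
a = (1.0, "red", 10.0)
b = (2.0, "red", 12.0)
c = (1.5, "blue", 10.0)
categorical = {1}
ranges = {0: 10.0, 2: 20.0}

print(heom(a, b, categorical, ranges))
print(heom(a, c, categorical, ranges))
print(closer_than(a, b, c, categorical, ranges))
```

In this example, `b` differs from `a` only slightly in the numerical features, while `c` differs in the categorical feature, so the relation reports `b` as closer to `a`. Only the ordering of the two distances is used, not their exact values, which is the idea behind the total-order relation.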
Keywords/Search Tags: Data Stream, Feature Selection, Online Learning, Unsupervised Learning