Data-Driven Feature Selection With Redundancy-Complementariness Dispersion And Feature Envelopment Frontiers

Posted on: 2017-03-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y S Zhang
GTID: 1318330482494404
Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of modern society, data are becoming increasingly complex and large in scale. Feature selection, a heavily studied topic in dimensionality reduction for big data, has become an essential research direction for socioeconomic decision-making and business intelligence. The choice of parameters in feature selection has a significant influence on the quality of the selected features and on how the data are re-expressed. From the perspective of information theory, the joint mutual information between a feature set S = {F1, ..., Fk} and the class C can be expanded as a sum of interaction information terms between the features and the class at different orders (dimensions), and many current feature selection methods are lower-order approximations of this expansion. From the viewpoint of Brown et al. (2012), determining the parameters of a feature selection method is equivalent to choosing a concrete feature selection criterion. In traditional feature selection methods, however, almost all parameters are predefined, and setting them typically relies on domain-specific experience. For example, the MIFS feature selection method gives no guidance on how to choose its redundancy weight β when it is applied in different real-world fields. Under these circumstances, how to find proper parameters without prior knowledge, from the perspective of higher-order information loss, is a key issue in dimensionality reduction.

In this dissertation, two methodological frameworks are proposed to address this issue. First, we treat the parameters of the second-order correlations among features as a necessary correction for the omitted higher-order information.
We analyze in depth the redundancy-complementariness dispersion in the lower-order correlations that is caused by higher-order information loss, and then introduce a modification factor (the parameters), driven by higher-order information, to partially mitigate the interference of this dispersion so that features can be evaluated more precisely. Second, taking into account that the bias of the criteria and the feature correlations change across fields and over time in the real world, a Data Envelopment Analysis (DEA) based feature selection framework is proposed. It exploits the "data-driven" nature of DEA to accommodate various feature correlations, multiple feature evaluation criteria, and different data environments.

Under the first framework, the parameters in feature selection are determined by the redundancy-complementariness dispersion caused by the omission of higher-order information. We analyze this dispersion in depth and introduce a modification factor for it in the lower-order dimension (parameter determination). On this basis, a novel feature selection method named Redundancy-Complementariness Dispersion-based Feature Selection (RCDFS) is proposed. RCDFS tackles the problem caused by the dispersion through the modification factor (weight) of that dispersion. In addition, the popular "summation" operation in second-order approximated feature evaluation criteria is proved to give lower bounds on the higher-order redundancy and complementariness.

Since the "prior knowledge" about field-specific feature evaluation criteria and feature correlations is actually contained in the concrete data of those fields, a super-DEA feature selection model is proposed under the second framework. This framework uses multiple feature evaluation criteria to build a feature envelopment frontier for feature evaluation and ranking.
We implement this model in a method called feature selection with Multi-Criteria based Super-DEA (MCSD) and analyze its complexity. Classification experiments show that MCSD is superior to IG, ReliefF, mRMR, CMIM, FOU, and JMI in most cases.

With the development of traffic and transportation technology, the accident rate has increased rapidly in recent years. Risky driving behavior, an important factor in serious traffic accidents, increases the demand for driving behavior recognition and prediction based on video surveillance systems in real-time traffic safety management. Automatically learning vehicle behavior from video is a very challenging task. Following Wright et al. (2009) and Mo et al. (2014), any new vehicle trajectory can be approximated by a linear combination of training vehicle trajectories. Sparse reconstruction can therefore be applied to trajectory learning and vehicle behavior classification. Since trajectory data potentially contain many redundant and noisy features, the proposed feature selection methods are embedded in the sparse reconstruction trajectory learning model with l2-lp minimization. Specifically, a feature selection method is first applied to build the new trajectory dictionary, and then an Orthogonal Matching Pursuit-quasi-Newton (OMPN) algorithm is proposed, which applies recently developed lp (0<p<1) techniques to the l2-lp minimization problem to obtain sparser solutions. Experimental results show the superiority of the proposed method. They also show that the variants embedded with RCDFS and MCSD both outperform the original one, indicating that the proposed data-driven feature selection methods have theoretical significance and broad applicability.
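
To make the parameter problem raised in the abstract concrete, the following is a minimal sketch of the classical MIFS criterion, which scores a candidate feature f as I(f; C) - β Σ I(f; s) over the already selected features s. It is not the dissertation's RCDFS or MCSD method. The function names, the toy features f1, f2, f3, and the labels are hypothetical and chosen only for illustration, and mutual information is estimated from raw frequency counts on discrete data.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information I(X;Y) in bits for two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def mifs_select(features, labels, k, beta):
    """Greedy MIFS: at each step pick the feature that maximizes
    I(f; C) - beta * sum of I(f; s) over already selected features s."""
    remaining = dict(features)  # copy so the caller's dict is not modified
    selected = []
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(remaining[name], labels)
            redundancy = sum(
                mutual_information(remaining[name], features[s])
                for s in selected
            )
            return relevance - beta * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical toy data: f2 is a near copy of f1 (relevant but redundant),
# f3 is only weakly relevant but statistically independent of f1.
LABELS = [0, 0, 0, 1, 1, 1, 1, 1]
FEATURES = {
    "f1": [0, 0, 0, 0, 1, 1, 1, 1],
    "f2": [0, 0, 0, 0, 1, 1, 1, 0],
    "f3": [0, 0, 1, 1, 0, 0, 1, 1],
}

if __name__ == "__main__":
    for beta in (0.0, 0.5, 1.0):
        print(beta, mifs_select(FEATURES, LABELS, k=2, beta=beta))
```

On this toy data, β = 0 ignores redundancy and selects f1 and f2, while β = 1 penalizes the near copy and selects f1 and f3. The selected subset therefore depends on a weight that MIFS itself does not determine, which is the gap the data-driven frameworks above aim to close.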
Keywords/Search Tags: Data-driven, feature selection, information theory, redundancy-complementariness dispersion, data envelopment analysis, sparse reconstruction