Research On Rough Set Based Semi-supervised Feature Selection Algorithm For Mixed Data

Posted on: 2020-10-19  Degree: Master  Type: Thesis
Country: China  Candidate: X Fan  Full Text: PDF
GTID: 2428330590496524  Subject: Computer technology
Abstract:
With the growing popularity and maturity of artificial intelligence and machine learning, mining the intrinsic information in data has become more important. How to extract the key information in data effectively and improve the efficiency of machine learning is a key problem to be solved. In the information age, enterprises try to collect as many data sets as possible (e.g. log data sets and online data sets) to meet the requirements of different business and intelligent computation tasks. Because computing capacity and response time are limited, more features in a data set mean higher time and monetary costs, and unnecessary features may even lower the generalization ability of the learning algorithm. Feature selection aims to select the more effective features from the original feature set and thereby reduce the data dimension, which is an important means of improving the performance of learning algorithms. It reduces the data dimension while retaining the original meaning of the data, so research on feature selection methods has attracted the attention of many scholars.

Information theory provides some of the most commonly used evaluation measures in feature selection; among them, mutual information can identify both linear and non-linear correlation effectively. However, mutual information is not well suited to numerical data, and real-life data are often mixed-type data containing both numerical and nominal features. How to measure the mutual information between numerical features properly and efficiently therefore has high practical research value. In the era of big data, a large amount of data has accumulated in various industries, but labels usually have to be assigned through expensive manual annotation, so only a small part of the data can be labeled. How to make effective use of data with only a small number of labels has become a hot issue in machine learning. Facing the challenges of mixed-type data, semi-supervised data and big data, this dissertation proposes solutions to each in turn, and finally implements a Spark-based semi-supervised feature selection algorithm for mixed-attribute big data. The main research work and innovations of this dissertation are as follows.

A feature selection algorithm combining the neighborhood rough set discernibility matrix with the mRMR principle is proposed. Following the principle of maximal relevance and minimal redundancy (mRMR), the significance of a feature is defined using neighborhood entropy and neighborhood mutual information, which handles mixed data better. A dynamic discernibility set is defined on the basis of the discernibility matrix, and its dynamic evolution is used as the policy for deleting redundant features and narrowing the search range. The iteration stops when the stop condition given by the discernibility matrix is met, and the optimized feature subset is then returned. Experimental results show that the proposed method can effectively improve classification accuracy.

To address the large error in mutual information estimates on small data sets, symmetric uncertainty is used instead of neighborhood mutual information when evaluating feature significance, and a semi-supervised algorithm based on neighborhood symmetric uncertainty is proposed. The relevance of attributes is calculated from the labeled data, while the redundancy between attributes is calculated from the full data set, so that as much of the information in the data as possible is used to evaluate redundancy. Experimental results show that the proposed method achieves higher classification accuracy with fewer features on small data sets.

To address the high complexity of computing entropy and mutual information based on neighborhood rough sets, a fast method for computing neighborhood mutual information based on data sorting is proposed. This method reduces the complexity of computing neighborhood entropy from O(n^2) to O(n log n). An approximate value of the joint neighborhood entropy is then calculated under an infinity-norm neighborhood relation, and the neighborhood mutual information is estimated from it. Experimental results show that the proposed method significantly reduces the computation time of neighborhood mutual information while maintaining high approximation accuracy on large data sets.

Building on the Spark framework and the preceding work, a semi-supervised feature selection algorithm for mixed data based on column partitioning is proposed. The fast neighborhood mutual information algorithm provides the required computation speed while still supporting mixed data. The heuristic search process is optimized by removing the duplicate calculations found in traditional algorithms. Experimental results show that the proposed algorithm selects better feature subsets, which may improve the performance of classifiers, and that it can cope with the challenges posed by massive data.
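
The sorting idea behind the reduction from O(n^2) to O(n log n) can be illustrated for a single numerical feature. What follows is a minimal sketch under stated assumptions, not the algorithm from the thesis: it assumes the commonly used definition of neighborhood entropy, NH = -(1/n) * sum_i log2(|d(x_i)| / n), where d(x_i) is the set of samples within distance delta of x_i. The function names and the parameter delta are hypothetical and chosen only for illustration.

```python
import math
from bisect import bisect_left, bisect_right


def neighborhood_entropy_naive(values, delta):
    """O(n^2): compare every sample against every other sample."""
    n = len(values)
    total = 0.0
    for x in values:
        # Size of the delta-neighborhood of x (x itself is always included).
        count = sum(1 for y in values if abs(x - y) <= delta)
        total += math.log2(count / n)
    return -total / n


def neighborhood_entropy_sorted(values, delta):
    """O(n log n): sort once, then find each neighborhood by binary search."""
    n = len(values)
    ordered = sorted(values)
    total = 0.0
    for x in values:
        # In sorted order, the neighbors of x form the contiguous run
        # of values lying in the interval [x - delta, x + delta].
        count = bisect_right(ordered, x + delta) - bisect_left(ordered, x - delta)
        total += math.log2(count / n)
    return -total / n


if __name__ == "__main__":
    # Illustrative data only (integers, so interval boundaries are exact).
    sample = [1, 2, 2, 5, 9, 10]
    print(neighborhood_entropy_naive(sample, 1))
    print(neighborhood_entropy_sorted(sample, 1))
```

The sketch covers one feature only. The abstract states that the joint neighborhood entropy over several features is approximated under an infinity-norm neighborhood relation; the example does not attempt that step. With floating-point inputs, samples lying exactly on a neighborhood boundary may be counted differently by the two functions because of rounding.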
Keywords: Feature selection, Big data, Mixed-type data, Semi-supervised, Rough set