
Granulation-mechanism-based Efficient Rough Feature Selection Algorithm

Posted on: 2014-02-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Wang
GTID: 1228330401463043
Subject: Computer application technology
Abstract/Summary:
Data mining is now regarded as a significant approach to knowledge discovery in the information society; it aims to transform data into useful information. With the rapid development of information technology, including the internet and databases, both the size and the dimensionality of data sets are increasing at an unprecedented rate, ushering in the era of "large-scale data with high dimension". The scale and high dimensionality of these data pose major challenges for traditional data mining algorithms, and the search for efficient and effective data mining algorithms has quickly become a global concern in many areas.

Feature selection is an important data preprocessing technique in data mining. However, existing feature selection algorithms usually have low computational efficiency, especially on large-scale data sets. In this dissertation, efficient feature selection for large-scale data sets is studied systematically on the basis of rough set theory. The main contributions are as follows.

1. Based on the idea of decomposition and fusion, an efficient framework for feature selection is constructed. Following the idea of sample estimation, two key steps are discussed. The first is decomposition: a large granule is decomposed into a family of small granules whose distribution is similar to that of the large one. The second is fusion: the estimates obtained from the small granules are fused to generate a final feature subset for the large data set. The framework provides new ways of analyzing big data.

2. Using the framework, two efficient rough feature selection algorithms are developed, one for nominal data and the other for hybrid data. Each is obtained by embedding a typical feature selection algorithm, for nominal data or for hybrid data respectively, in the framework. The two developed algorithms can find an effective result efficiently, especially on large-scale data sets. Experiments illustrate the effectiveness of the two algorithms and of the framework.

3. For dynamic data sets, group incremental mechanisms, dimension incremental mechanisms and updating mechanisms of three representative information entropies are introduced. Since data in a database can be updated in three ways, the corresponding mechanisms of the three information entropies are proven by analyzing the changes of elementary granules and of the granular space in dynamic data sets.

4. On the basis of these mechanisms, three efficient rough feature selection algorithms are proposed for dynamic data sets: a group incremental feature selection algorithm, a dimension incremental feature selection algorithm, and a feature selection algorithm for data sets with varying data values. Both theoretical analysis and experiments illustrate the effectiveness and efficiency of the three algorithms. In addition, the main ideas can be extended to the fusion of two or more data sets. It is hoped that this study provides new approaches to the fusion of multi-source data sets.

In summary, on the basis of an analysis of the limitations of existing feature selection algorithms for large-scale data sets, several efficient rough feature selection algorithms are introduced. Experiments illustrate that these algorithms are effective and efficient. Hence, the work in this dissertation makes an important contribution to knowledge discovery for large-scale data sets.
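
The decompose-then-fuse idea in contributions 1 and 2 can be illustrated with a minimal Python sketch for nominal data. This is not the dissertation's algorithm: the abstract does not specify the embedded selector or the fusion rule, so the sketch assumes a greedy forward selection driven by the classical rough-set dependency degree (the fraction of samples in the positive region), a random split into equally sized small granules, and a simple vote threshold for fusion. All function names and parameters (`dependency`, `greedy_reduct`, `decompose_and_fuse`, `n_granules`, `min_votes`) are hypothetical.

```python
import random
from collections import Counter, defaultdict


def dependency(rows, labels, features):
    """Rough-set dependency degree: the fraction of rows whose equivalence
    class under `features` contains a single decision label."""
    seen_labels = defaultdict(set)
    sizes = Counter()
    for row, label in zip(rows, labels):
        key = tuple(row[f] for f in features)
        seen_labels[key].add(label)
        sizes[key] += 1
    consistent = sum(n for key, n in sizes.items() if len(seen_labels[key]) == 1)
    return consistent / len(rows)


def greedy_reduct(rows, labels):
    """Forward selection: repeatedly add the feature that raises the
    dependency degree most, until it matches that of the full feature set."""
    all_features = list(range(len(rows[0])))
    target = dependency(rows, labels, all_features)
    selected = []
    while dependency(rows, labels, selected) < target:
        best = max(
            (f for f in all_features if f not in selected),
            key=lambda f: dependency(rows, labels, selected + [f]),
        )
        selected.append(best)
    return selected


def decompose_and_fuse(rows, labels, n_granules=4, min_votes=2, seed=0):
    """Decompose: shuffle the row indices and deal them into `n_granules`
    small granules. Fuse (assumed rule): keep every feature selected on at
    least `min_votes` granules."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    votes = Counter()
    for g in range(n_granules):
        idx = order[g::n_granules]
        if not idx:  # more granules than rows: nothing to estimate from
            continue
        sub_rows = [rows[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        votes.update(greedy_reduct(sub_rows, sub_labels))
    return sorted(f for f, count in votes.items() if count >= min_votes)
```

Calling `decompose_and_fuse(rows, labels)` on a list of equal-length tuples of nominal values returns the indices of the features retained by the vote. Each call to `greedy_reduct` only sees one small granule, which is where the efficiency gain of such a scheme would come from; the quality of the fused subset depends on how well each granule reflects the distribution of the full data set.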
Keywords/Search Tags: Large-scale data with high dimension, Dynamic data, Hybrid data, Rough set, Information entropy, Information granularity, Multi-granulation, Feature selection