
Cost Sensitive Learning Method On Heterogeneous Data

Posted on: 2017-11-23
Degree: Master
Type: Thesis
Country: China
Candidate: A J Fan
Full Text: PDF
GTID: 2348330485955637
Subject: Applied Mathematics
Abstract/Summary:
With the development of computer technology, ever larger volumes of data are produced in real-world applications. Data mining is a powerful technology with great potential to help people extract the most important information from the data they have collected. It analyzes data for relationships that have not previously been discovered, and it uncovers information that queries and reports cannot effectively reveal. In practice, data mining is most commonly applied in scientific research, engineering, finance, medical analysis, and other fields.

Rough set theory, proposed by Pawlak in 1982, can be seen as a mathematical approach to vagueness. It has attracted worldwide attention from many researchers and practitioners, who have contributed substantially to its development and applications. Rough set theory overlaps with many other theories, especially in research areas such as machine learning and data mining. Pawlak's rough set model is suited to data with nominal attributes, and a neighborhood rough set model has been successfully constructed to deal with data with numerical attributes. Pawlak's rough set model is one of the three main theories of granular computing, and it has given rise to extensions such as covering-based rough sets and decision-theoretic rough sets.

Cost-sensitive learning is one of the ten most challenging problems in data mining. Costs are intrinsic to data. The test cost is the time, money, or other resources one pays to obtain a data item of an object. The misclassification cost is the penalty incurred when an object is assigned to class J while its real class is K. The main aim of cost-sensitive learning is to determine a minimal feature subset by considering the trade-off between test costs and misclassification costs while retaining a suitably high accuracy in representing the original features. Cost-sensitive learning is one of the most fundamental problems in data mining research and has drawn attention from many researchers.

In real-world applications, datasets are mostly heterogeneous, for example containing both nominal and numerical attributes. Some recently proposed cost-sensitive attribute selection algorithms handle this by treating data of different types as a single type. A common method is to consider all attributes as nominal variables, or to view all attributes as real-valued variables. Discretizing numerical attributes may cause information loss, because the degrees of membership of numerical values to the discretized values are not considered. Conversely, when a nominal attribute is treated as numerical and normalized to [0,1], two objects with different values of that attribute may be wrongly placed in the same neighborhood. These methods decrease the discriminating capability of an attribute and increase the test cost of the attribute subset. Cost-sensitive learning on heterogeneous data is therefore more effective and of greater practical significance than cost-sensitive learning on homogeneous data.

In this thesis, we deal with the cost-sensitive learning problem on heterogeneous data. The dissertation has two parts. In the first part, we study the test-cost-sensitive attribute reduction problem. First, we propose an improved artificial bee colony algorithm to tackle the minimal test cost attribute reduction problem on nominal data. Experimental results show that the proposed algorithm significantly outperforms existing algorithms. Second, we introduce an adaptive neighborhood model, and an algorithm based on it, to address the test-cost-sensitive attribute reduction problem on heterogeneous data. Experimental results demonstrate that the proposed algorithm is more effective and of greater practical significance than the previous algorithm.

In the second part, we deal with cost-sensitive feature selection on heterogeneous data. Test cost and misclassification cost are the two important types of cost in cost-sensitive feature selection. Based on the adaptive neighborhood model, we propose an adaptive cost algorithm for heterogeneous data and construct a cost-sensitive feature selection algorithm. Experimental results show that the proposed algorithm significantly outperforms existing algorithms.
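
The trade-off between test costs and misclassification costs described in the abstract can be illustrated with a small Python sketch. This is not the thesis's algorithm. It only shows one common way to score a candidate feature subset, assuming the total cost is the sum of the selected features' test costs plus the average misclassification cost over the objects. The function names, feature names, and cost values are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Sequence


def total_test_cost(selected: Iterable[str], test_costs: Dict[str, float]) -> float:
    """Sum the test cost of every selected feature (assumed additive)."""
    return sum(test_costs[name] for name in selected)


def average_misclassification_cost(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    cost_matrix: Sequence[Sequence[float]],
) -> float:
    """Average penalty per object.

    cost_matrix[k][j] is the penalty for deciding class j when the
    real class is k; the diagonal is normally zero.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    total = sum(cost_matrix[k][j] for k, j in zip(y_true, y_pred))
    return total / len(y_true)


def average_total_cost(
    selected: Iterable[str],
    test_costs: Dict[str, float],
    y_true: Sequence[int],
    y_pred: Sequence[int],
    cost_matrix: Sequence[Sequence[float]],
) -> float:
    """Test cost of the subset plus average misclassification cost."""
    return total_test_cost(selected, test_costs) + average_misclassification_cost(
        y_true, y_pred, cost_matrix
    )


# Hypothetical costs: three features and a two-class cost matrix.
example_test_costs = {"temperature": 2.0, "blood_test": 10.0, "x_ray": 25.0}
example_cost_matrix = [[0.0, 50.0], [200.0, 0.0]]

# Hypothetical labels and predictions obtained using only the chosen subset.
example_true = [0, 0, 1, 1]
example_pred = [0, 1, 1, 0]

example_cost = average_total_cost(
    ["temperature", "blood_test"],
    example_test_costs,
    example_true,
    example_pred,
    example_cost_matrix,
)
print(example_cost)  # 12.0 test cost + 62.5 average misclassification cost = 74.5
```

Under this assumed objective, a feature selection procedure would compare such totals across candidate subsets: adding an expensive feature is only worthwhile if it lowers the average misclassification cost by more than its own test cost.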
Keywords/Search Tags: Granular computing, cost-sensitive learning, attribute reduction, feature selection