Research On Minority Area Estimation-based Over-sampling Algorithm

Posted on: 2023-06-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Sun
GTID: 1528307097474174
Subject: Computer Science and Technology
Abstract/Summary:
The class imbalance problem, a significant challenge in data mining, occurs when the number of samples in one class (the minority) is markedly smaller than in the other (the majority). Learning from imbalanced data is difficult. On the one hand, certain distributions, such as noise or class overlap, make the problem more complex; these are referred to here as complex distributions. On the other hand, the loss of minority class information at the decision boundary adds further difficulty. To cope with class imbalance under complex distributions and minority information loss, this dissertation studies over-sampling from three perspectives, which respectively involve computing the inner, outer and boundary areas of the minority class.

(1) To handle class imbalance with complex distributions, some over-sampling techniques cooperate with a clustering algorithm to guarantee that new synthetic samples are generated within clusters. However, samples that are far apart but belong to the same minority sub-region are generally assigned to different groups, owing to the characteristics of clustering algorithms themselves. A new grouping algorithm, named Direction Distribution-based Minority Sub-region Estimation, is therefore proposed. It exploits the intuitive observation that minority samples in the same sub-region lie in almost the same direction relative to the majority samples, and uses this to estimate minority sub-regions while avoiding the negative impact of the distance factor in clustering algorithms. New synthetic samples are then generated in those minority sub-regions. Experimental results on visualized 2D datasets indicate that the method can group minority class samples into clusters without the number of clusters being specified in advance, and that it is robust to outliers. Experimental results on real-world datasets show performance comparable to other state-of-the-art over-sampling methods.

(2) Apart from cooperation with clustering algorithms, no pure over-sampling method has been specifically designed for complex distributions (such as noise, class overlap and disjuncts). To fill this gap, this dissertation proposes a searchlight-scanned over-sampling method, which treats filling the minority area with data as analogous to a searchlight scanning a target area in real life. By regarding the minority area as the target area and the majority area as the barrier area, a series of searchlight structures are computed that first pass through the corresponding minority area and are then stopped by the majority area. Synthetic samples are then generated within those structures. In addition, the method provides a new relation between a pair of minority class points that indicates whether they lie in the same minority area, and it gives a geometric definition of the searchlight structure in the data space. Experiments on real-world datasets demonstrate the method's ability to handle complex distributions and show that it outperforms current state-of-the-art over-sampling methods.

(3) Few over-sampling methods focus on the decision boundary between classes, and none of them directly computes the specific area of the decision boundary for the imbalance problem. A novel method named Decision Boundary Computation-based Oversampling is proposed to fill this gap. It uses the intuitive observation that boundary samples and their surrounding areas jointly constitute the decision boundary, and computes the partition belonging to the minority class by subtracting the majority class partition from the joint one. This makes fuller use of the boundary information carried by both the boundary samples and their nearby areas, and implicitly compensates for the inherent information insufficiency of the minority class. New synthetic samples are then generated in the minority class partition of the decision boundary. In addition, the method provides a theoretical reference for many classification tasks by defining the local decision boundary area, the local majority area and the local minority area. Extensive experiments indicate that the proposed method performs well compared with other state-of-the-art methods.

Theoretically, learning from imbalanced data helps in understanding data classification, and further research on over-sampling can improve classification performance on the minority class and offer a new view of the potential characteristics of imbalanced data. Practically, identifying the minority class is meaningful in applications such as credit card fraud detection, civil aircraft fault monitoring and disease prediction. Faced with complex distributions, however, many existing over-sampling methods perform poorly at identifying key minority samples. This dissertation therefore focuses on over-sampling to improve minority identification, and its results may help decision makers reach decisions or solve the corresponding problems in real life.
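
As a point of reference for the three methods summarized above, the following is a minimal sketch of the classic SMOTE-style baseline: a synthetic sample is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. It is not an implementation of any algorithm from the dissertation, which differ in how they estimate the area where samples may be placed. The function name `smote_like`, its parameters `k` and `seed`, and the toy data are illustrative assumptions.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Generic SMOTE-style interpolation (hypothetical helper, not the
    dissertation's method): create n_new points, each lying between a
    minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority samples, nearest first.
        others = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )
        neighbour = minority[rng.choice(others[:k])]
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + t * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class: the four corners of the unit square (assumed data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=5)
```

Because each new point is a convex combination of two minority samples, every point in `new_points` falls inside the unit square spanned by the toy data. Plain interpolation of this kind ignores where the majority class lies, which is the limitation under noise and class overlap that the area-estimation methods in the abstract aim to address.
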
Keywords/Search Tags: data mining, class imbalance problem, over-sampling, complex distribution, cluster, area estimation, decision boundary