
Multi-threshold Based Contrast Pattern Mining And Its Application In Classification Of Imbalanced Datasets

Posted on: 2020-02-24 | Degree: Master | Type: Thesis
Country: China | Candidate: Y G Lan | Full Text: PDF
GTID: 2428330623951420 | Subject: Computer technology
Abstract/Summary:
Data mining is one of the most active research areas of the 21st century. As data mining technology has developed, it has become possible to extract knowledge that is easy to understand and easy to store from irregular data.

Traditional contrast pattern mining algorithms for imbalanced datasets are usually based on a single support threshold. As the dataset grows, some itemsets with strong discriminative power but low support may be lost, even though such itemsets could improve the performance of a contrast-pattern-based classifier to some extent. In addition, traditional contrast pattern mining algorithms generally run on a single machine and execute sequentially. Because of the CPU, memory and other limits of a single machine, they hit a bottleneck, and on today's large-scale, high-dimensional data they show low efficiency and poor scalability.

Although traditional contrast pattern classification algorithms for imbalanced datasets can solve the class bias problem to some extent, they often have the following defects. First, when the dataset is large, they still suffer from class bias because of their excessive dependence on pattern support. Second, they tend to weight the samples being classified according to pattern support, so the classification result is influenced too strongly by pattern support.

In response to the above problems, this thesis mainly does the following work:

(1) It proposes a parallel mining algorithm based on multiple support thresholds. During contrast pattern mining, the algorithm dynamically selects a reasonable support threshold for each itemset according to its frequency count, which addresses the problem that traditional contrast pattern mining algorithms may filter out itemsets that have discriminative power. In addition, the algorithm can run on MapReduce-like batch processing frameworks such as Spark to process large-scale, high-dimensional datasets. The basic idea is to divide the mining space into small, independent units. Since these units do not depend on one another, they can be mined in parallel. Experiments on multiple UCI datasets show that the proposed algorithm mines more discriminative contrast patterns, which improve classification accuracy in the subsequent classification experiments. The algorithm can also mine large-scale data in a relatively short time, and the mining time can be reduced by adding computing nodes, so it scales well.

(2) Based on the degree of imbalance (IR) of the dataset, it proposes a contrast pattern classification algorithm for imbalanced datasets that uses a reward and punishment coefficient. The algorithm calculates the reward and punishment coefficient of a sample from the degree of imbalance IR of each class, and uses this coefficient to calculate the classification score of the sample for the different classes. This overcomes, to some extent, the defects of traditional contrast pattern classification algorithms for imbalanced datasets.
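As a rough illustration of the multi-threshold idea in (1), the following Python sketch compares a single support threshold with per-itemset thresholds on a toy dataset. It is not the thesis's algorithm: the abstract gives no formulas, so the threshold rule (`multi_threshold`, with the hypothetical parameters `beta` and `floor`), the growth-rate test (`min_growth`) and all function names are assumptions made for this example, and the code runs on one machine rather than on Spark.

```python
from collections import Counter
from itertools import combinations


def itemset_counts(transactions, max_len=2):
    """Count how many transactions contain each itemset of size <= max_len."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_len + 1):
            counts.update(combinations(items, k))
    return counts


def multi_threshold(itemset, item_support, beta=0.5, floor=0.05):
    """Assumed rule: an item's threshold scales with its own support
    (never below `floor`); an itemset takes the lowest threshold of its items."""
    return min(max(beta * item_support[i], floor) for i in itemset)


def mine_contrast_patterns(target, other, threshold_fn, min_growth=2.0, max_len=2):
    """Return {itemset: growth_rate} for itemsets that pass their support
    threshold in `target` and whose support ratio target/other >= min_growth."""
    t_counts = itemset_counts(target, max_len)
    o_counts = itemset_counts(other, max_len)
    item_support = {s[0]: c / len(target) for s, c in t_counts.items() if len(s) == 1}
    patterns = {}
    for itemset, count in t_counts.items():
        sup_t = count / len(target)
        if sup_t < threshold_fn(itemset, item_support):
            continue  # not frequent enough under this threshold rule
        sup_o = o_counts[itemset] / len(other)  # Counter yields 0 if absent
        growth = float("inf") if sup_o == 0 else sup_t / sup_o
        if growth >= min_growth:
            patterns[itemset] = growth
    return patterns


# Toy data: item "r" is rare in the target class but never occurs in the other.
target = [{"x", "r"}] * 2 + [{"x"}] * 6 + [{"y"}] * 2
other = [{"x"}] * 8 + [{"y"}] * 2

single = mine_contrast_patterns(target, other, lambda s, sup: 0.3)
multi = mine_contrast_patterns(target, other, multi_threshold)
print("single threshold:", sorted(single))
print("multi threshold: ", sorted(multi))
```

In this toy data the fixed threshold of 0.3 discards every itemset containing the rare item `r`, while the per-itemset thresholds keep them. Because each itemset is tested independently of the others, the loop over itemsets could in principle be partitioned across workers, which is the kind of independence the abstract relies on for parallel mining.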
Keywords/Search Tags:data mining, classification, parallel mining, contrast pattern mining, unbalanced data set