Research on Optimization of Parallel Discretized Data Preparation in Data Mining

Posted on: 2020-03-30
Degree: Master
Type: Thesis
Country: China
Candidate: H H Yuan
Full Text: PDF
GTID: 2438330596997511
Subject: Electronic and communication engineering
Abstract/Summary:
With the wide application of Big Data, Data Mining and Machine Learning, as important methods of data processing, have become a hot research topic. Many data mining algorithms require that the attribute values to be processed are discrete. Therefore, choosing a good discretization for data preprocessing is a top priority. In addition, no single discretization algorithm suits every environment, so an appropriate discretization method must be selected according to the characteristics of the data set and the learning environment. Based on an in-depth study of the current state of data discretization technology in China and abroad, this thesis determines the distribution model from the detected statistical characteristics of the data set, gives a decision basis for selecting among different discretization methods, and designs an automatic mechanism for selecting the optimal discretization method.

This study proposes an Auto Optimize Algorithm (AOA) that selects the optimal discretization method for a given environment by parallel comparison. For a given data set, the algorithm first detects the statistical characteristics of the data set to obtain its distribution characteristics, then detects and removes outliers according to those characteristics. Second, it discretizes the data set in parallel with the candidate discretization methods chosen according to the data set's distribution. Finally, it compares the minimum Euclidean distance (MED) formed by three parameters (entropy, variance index, and stability) across the different discretization methods. The optimal discretization preprocessing result is obtained from this automatic comparison of the three parameters.

Simulations on association rule mining over sample data sets from Beijing (temperate climate) and Sanya (tropical climate) compare AOA with four discretization preprocessing methods: Equal Width Discretization (EWD), Equal Frequency Discretization (EFD), discretization based on the mean and standard deviation, and k-means discretization (KMEANS). Across different minimum support thresholds, mining after AOA preprocessing yields more association rules and a higher average confidence at essentially comparable run time, and therefore gives better mining results. An automatic tool for selecting the optimal discretization algorithm is also implemented based on AOA.
Keywords/Search Tags: Data mining, Data preparation, Parallel invocation, Distribution detection, Data discretization
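
The parallel-comparison step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and several parts are assumptions: all function names are hypothetical, only two candidate methods (equal width and equal frequency) are included, mean within-bin variance stands in for the "variance index", the stability parameter is omitted because the abstract does not define it, and the reference point `ideal` for the Euclidean distance is chosen purely for illustration. Distribution detection, outlier removal, and normalization of the scores are left out.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def equal_width(values, k):
    """Equal Width Discretization: split [min, max] into k equal-length intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # fall back to 1.0 when all values are equal
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency(values, k):
    """Equal Frequency Discretization: each bin gets roughly len(values) / k points.

    Simplification: tied values may be split across neighbouring bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = min(rank * k // len(values), k - 1)
    return labels


def entropy(labels):
    """Shannon entropy (in bits) of the bin labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def within_bin_variance(values, labels):
    """Mean within-bin variance (assumed stand-in for the 'variance index')."""
    bins = {}
    for v, b in zip(values, labels):
        bins.setdefault(b, []).append(v)
    total = 0.0
    for members in bins.values():
        mean = sum(members) / len(members)
        total += sum((v - mean) ** 2 for v in members)
    return total / len(values)


def score(values, k, method):
    """Discretize with one method and return its (entropy, variance) score."""
    labels = method(values, k)
    return (entropy(labels), within_bin_variance(values, labels))


def select_method(values, k, methods, ideal):
    """Score all candidates in parallel; pick the one closest to `ideal`."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(score, values, k, fn)
                   for name, fn in methods.items()}
        scores = {name: f.result() for name, f in futures.items()}
    best = min(scores, key=lambda name: math.dist(scores[name], ideal))
    return best, scores


# Hypothetical usage: maximum entropy for k = 2 bins is 1 bit, ideal variance is 0.
methods = {"EWD": equal_width, "EFD": equal_frequency}
best, scores = select_method([0, 1, 2, 3, 4, 5, 6, 100], 2, methods, ideal=(1.0, 0.0))
```

Because the two scores here are on different scales, the variance term dominates the distance. A fuller version would presumably normalize each parameter before comparing, and would add the remaining candidate methods and the stability measure.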