
Discretization Method And Parallelization Based On Multi-scale And Information Entropy

Posted on: 2022-02-01    Degree: Master    Type: Thesis
Country: China    Candidate: Q X Yin    Full Text: PDF
GTID: 2518306521995069    Subject: Computer technology
Abstract/Summary:
Nowadays, with the continuous development of information technology, data in various fields has grown explosively. To extract potentially valuable knowledge from such large volumes of data and improve data utilization, the data must be preprocessed in advance to obtain high-quality data that can be mined and analyzed directly. Discretization, as an important data preprocessing technique, is of great significance for improving the efficiency and quality of data mining.

Real-world data sets contain many attributes that can be arranged in an ordered, hierarchical structure of concepts; that is, they exhibit multi-scale characteristics. Multi-scale techniques can reveal the internal structure and hierarchical characteristics of data objects and provide multiple perspectives from which to recover the essential properties of the data. Moreover, applying multi-scale segmentation to images or data can reduce the time complexity of an algorithm. By introducing the concept of multi-scale into the discretization process, the data is divided into multiple levels in a reasonable way, and candidate cut points of different granularities are obtained, yielding more valuable information and improving the quality of the discretized data.

Therefore, this thesis proposes a data discretization algorithm based on multi-scale analysis and information entropy (MSE). The algorithm first divides the data set into reasonable scales to obtain candidate cut-point sets of different forms. Information entropy is then computed for the candidate cut points, and the candidate cut points with the smallest information entropy are selected using the MDLPC stopping criterion. Finally, the best cut-point set is obtained. Experiments on UCI data sets verify that the algorithm effectively improves the efficiency of discretization.

To handle massive data effectively, parallel/distributed computing is introduced to further improve the efficiency of discretization. Discretization itself
is an iterative process. As a memory-based distributed computing framework, Spark supports iterative computation well through its efficient DAG scheduling and strong fault-tolerance mechanism. Based on the Spark parallel computing platform, this thesis proposes a novel parallel optimization of the discretization algorithm. To achieve effective parallelization, each attribute of the data set is processed in parallel and independently. Experimental results verify that as the data size grows, the parallel version achieves a speedup of up to 6 times over the serial MSE algorithm.
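The entropy-driven cut-point selection described above can be illustrated with a minimal Python sketch. It evaluates every boundary between distinct sorted values and keeps the one with the smallest class-weighted information entropy; the function names are illustrative, and the multi-scale division and MDLPC stopping criterion from the thesis are omitted for brevity.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (entropy, cut_point) for the candidate cut that minimizes
    the weighted class-information entropy of the induced two-way split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        # Only consider cuts between distinct attribute values.
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        e = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or e < best[0]:
            best = (e, cut)
    return best
```

For example, for values `[1, 2, 3, 10, 11, 12]` with labels `a, a, a, b, b, b`, the selected cut is 6.5, which separates the two classes perfectly (weighted entropy 0). A full implementation would repeat this selection recursively on each interval until the MDLPC criterion rejects further splitting.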
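The per-attribute decomposition behind the parallelization can be sketched in plain Python. The thesis runs this on Spark; here a `ThreadPoolExecutor` stands in to show only the structural idea that attributes do not interact and can therefore be discretized independently. The `midpoint_cuts` helper is a deliberately simplified stand-in for the per-attribute MSE step, not the thesis algorithm itself.

```python
from concurrent.futures import ThreadPoolExecutor

def midpoint_cuts(values, labels):
    """Simplified per-attribute step: candidate cuts at midpoints between
    distinct sorted values (labels are unused in this stand-in)."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

def discretize_all(data, labels, max_workers=4):
    """Dispatch each attribute to its own worker.

    data: dict mapping attribute name -> list of values (one per record).
    Because each attribute is discretized independently, the per-attribute
    jobs can run concurrently; on Spark the same decomposition would map
    attributes across the cluster instead of across local threads.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(midpoint_cuts, vals, labels)
                   for name, vals in data.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

A usage example: `discretize_all({"x": [1, 3, 5], "y": [2, 4, 8]}, ["a", "b", "a"])` returns `{"x": [2.0, 4.0], "y": [3.0, 6.0]}`. The design choice mirrored here is that parallelism is over attributes rather than over records, so each worker sees a complete column and needs no cross-worker communication during cut-point search.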
Keywords/Search Tags:Discretization, Information entropy, Multi-scale, Parallel computing