
Discretization Method And Parallelization Based On Multi-scale And Information Entropy

Posted on: 2022-02-01    Degree: Master    Type: Thesis
Country: China    Candidate: Q X Yin    Full Text: PDF
GTID: 2518306521995069    Subject: Computer technology
Abstract/Summary:
Nowadays, with the continuous development of information technology, data in various fields has grown explosively. To extract potentially valuable knowledge from such large volumes of data and improve data utilization, the data must be preprocessed in advance to obtain high-quality data that can be mined and analyzed directly. Discretization, as an important data preprocessing technique, is of great significance for improving the efficiency and quality of data mining.

Real-world data sets contain many attributes that can be arranged in an ordered, hierarchical structure of concepts; that is, they exhibit multi-scale characteristics. Multi-scale techniques can reveal the internal structure and hierarchical characteristics of data objects and provide multiple perspectives from which to recover the essential properties of the data. Moreover, applying multi-scale segmentation to images or data can reduce the time complexity of an algorithm. By introducing the concept of multi-scale into the discretization process, the data is divided into multiple levels in a reasonable way, and candidate cut points of different granularities are obtained, yielding more valuable information and improving the quality of the discretized data.

Therefore, this thesis proposes a data discretization algorithm based on multi-scale analysis and information entropy (MSE). The algorithm first divides the data set into reasonable scales to obtain candidate cut-point sets of different forms. Information entropy is then computed for the candidate cut points, and the candidate cut points with the smallest information entropy are selected using the MDLPC stopping criterion. Finally, the best cut-point set is obtained. Experiments on UCI data sets verify that the algorithm effectively improves the efficiency of discretization.

To handle massive data effectively, parallel/distributed computing is introduced to further improve the efficiency of discretization. Discretization itself
is an iterative process. As a memory-based distributed computing framework, Spark supports iterative computation well through its efficient DAG scheduling and strong fault-tolerance mechanism. Based on the Spark parallel computing platform, this thesis proposes a novel parallel optimization of the discretization algorithm. To achieve effective parallelization, each attribute of the data set is processed in parallel and independently. Experimental results verify that as the data size grows, the parallel version achieves a speedup of up to 6 times over the serial MSE algorithm.
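The entropy-driven cut-point selection described above can be illustrated with a minimal Python sketch. It evaluates every boundary between distinct sorted values and keeps the one with the smallest class-weighted information entropy; the function names are illustrative, and the multi-scale division and MDLPC stopping criterion from the thesis are omitted for brevity.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (entropy, cut_point) for the candidate cut that minimizes
    the weighted class-information entropy of the induced two-way split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        # Only consider cuts between distinct attribute values.
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        e = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or e < best[0]:
            best = (e, cut)
    return best
```

For example, for values `[1, 2, 3, 10, 11, 12]` with labels `a, a, a, b, b, b`, the selected cut is 6.5, which separates the two classes perfectly (weighted entropy 0). A full implementation would repeat this selection recursively on each interval until the MDLPC criterion rejects further splitting.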
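The per-attribute decomposition behind the parallelization can be sketched in plain Python. The thesis runs this on Spark; here a `ThreadPoolExecutor` stands in to show only the structural idea that attributes do not interact and can therefore be discretized independently. The `midpoint_cuts` helper is a deliberately simplified stand-in for the per-attribute MSE step, not the thesis algorithm itself.

```python
from concurrent.futures import ThreadPoolExecutor

def midpoint_cuts(values, labels):
    """Simplified per-attribute step: candidate cuts at midpoints between
    distinct sorted values (labels are unused in this stand-in)."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

def discretize_all(data, labels, max_workers=4):
    """Dispatch each attribute to its own worker.

    data: dict mapping attribute name -> list of values (one per record).
    Because each attribute is discretized independently, the per-attribute
    jobs can run concurrently; on Spark the same decomposition would map
    attributes across the cluster instead of across local threads.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(midpoint_cuts, vals, labels)
                   for name, vals in data.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

A usage example: `discretize_all({"x": [1, 3, 5], "y": [2, 4, 8]}, ["a", "b", "a"])` returns `{"x": [2.0, 4.0], "y": [3.0, 6.0]}`. The design choice mirrored here is that parallelism is over attributes rather than over records, so each worker sees a complete column and needs no cross-worker communication during cut-point search.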
Keywords/Search Tags:Discretization, Information entropy, Multi-scale, Parallel computing