Research On The Discretization Algorithm Of Big Data Based On Spark

Posted on:2021-04-15

Degree:Master

Type:Thesis

Country:China

Candidate:Y Chen

Full Text:PDF

GTID:2438330602498320

Subject:Software engineering

Abstract/Summary:

With the rapid development of the Internet, "big data" has become a keyword of the 21st century. It has great potential to advance science and technology as well as social production and daily life, and within just a few years it has become a research hotspot in information technology and related scientific fields. Discretization is one of the most important tasks in the preprocessing stage of data analysis. Its purpose is to simplify and reduce continuous-valued data, improve classification accuracy, and support the underlying learning process while retaining as much of the original information as possible. The characteristics of big data (large scale, complex data, heavy consumption of computing resources, etc.) make traditional discretization algorithms perform poorly on traditional computing architectures and algorithm models. This thesis designs two distributed discretization algorithms based on the Spark-MR programming model: DHLG and DSK. DHLG uses Hellinger entropy from information theory to measure the divergence of discrete intervals, derives the amount of information each interval provides from its divergence value, and selects the top-k boundary points to divide the range of a continuous variable into k discrete intervals. DSK discretizes by unsupervised classification: the AP algorithm optimizes the selection of discrete sub-domains, a SOM neural network competes for the optimal cluster centers, and K-means clustering then divides the value range of each continuous feature attribute into k discrete sub-domains. Experimental results on a real sensor data set show that the proposed DSK and DHLG algorithms achieve better running time and better discretization quality than existing comparable algorithms.
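The final step of DSK, in which K-means splits the range of one continuous attribute into k sub-domains, can be illustrated with a minimal single-machine sketch. It is not the thesis's Spark implementation: the AP and SOM stages are omitted, quantile-based initial centers are an assumption standing in for them, and the function names (`kmeans_1d`, `cut_points`, `discretize`) and the sample readings are hypothetical.

```python
from bisect import bisect_right


def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm on one continuous feature; returns sorted centers."""
    data = sorted(values)
    n = len(data)
    # Assumption: quantile-based initial centers replace the AP/SOM stage.
    centers = [data[min(n - 1, (2 * i + 1) * n // (2 * k))] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # Keep the previous center if a cluster ends up empty.
        updated = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def cut_points(centers):
    """Midpoints between adjacent centers bound the k sub-domains."""
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]


def discretize(x, cuts):
    """Map a continuous value to the index of its interval (0 .. k-1)."""
    return bisect_right(cuts, x)


# Hypothetical sensor readings, used only for illustration.
readings = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.8, 10.1, 10.4]
centers = kmeans_1d(readings, k=3)
cuts = cut_points(centers)
labels = [discretize(x, cuts) for x in readings]
```

Here `labels` holds the discrete interval index assigned to each reading. In a distributed setting the assignment and center-update steps would presumably be expressed as map and reduce operations over partitions, but that mapping is not described in the abstract and is not shown here.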

Keywords/Search Tags:

Big Data Mining, Discretization, Apache Spark, Hellinger-Entropy, Unsupervised Classification

Related items

1	Spatial Data Mining Classification Method And Its Application
2	OCTWAS - Online Check-pointer for Workflows on Apache Spark
3	Research On Association Mining Optimization Based On Spark Distributed And Application Of Comprehensive Decision
4	Research On The Application Of Inconsistency In Data Discretization And Classification
5	Research On Taxi Trajectory Organization Method Based On Apache Spark
6	The Research On Discretization Oriented To Naïve Bayes Algorithm
7	Using apache spark for scalable gene sequence analysis
8	Bayesian Classification Algorithm Based On Attribute Discretization And Its Application
9	Research Of The Deployment Of Data Mining Project--Application Of Credit Card Mining System Based On Classification
10	An Algorithm Of Discretization Based On Entropy In Application Of Decision Tree