Research On The Discretization Algorithm Of Big Data Based On Spark

Posted on: 2021-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y Chen
Full Text: PDF
GTID: 2438330602498320
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, "big data" has become a keyword of the 21st century. It has great potential to advance science and technology as well as social production and everyday life, and in just a few years it has become a research hotspot in information technology and related scientific fields. Discretization is one of the most important tasks in the preprocessing stage of data analysis. Its purpose is to simplify and reduce continuous-valued data, improve classification accuracy, and support the underlying learning process while retaining as much of the original information as possible. The characteristics of big data (large scale, complex data, heavy consumption of computing resources, etc.) make traditional discretization algorithms perform poorly under traditional computing architectures and algorithm models. This thesis designs two distributed discretization algorithms based on the Spark-MR programming model: DHLG and DSK. DHLG uses the Hellinger-Entropy from information theory to measure the divergence of discrete intervals, derives the amount of information from the divergence value of each interval, and selects the top-k boundary points to divide the range of the continuous variable into k discrete intervals. DSK uses unsupervised classification for discretization: the AP algorithm optimizes the selection of discrete sub-domains, an SOM neural network competes for the optimal cluster centers, and K-means clustering divides the value range of each continuous feature attribute into k discrete sub-domains. Experimental results on a real sensor data set show that the proposed DSK and DHLG algorithms achieve better running time and better discretization quality than existing comparable algorithms.
Keywords/Search Tags: Big Data Mining, Discretization, Apache Spark, Hellinger-Entropy, Unsupervised Classification
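
The idea of ranking boundary points by a Hellinger-based divergence can be illustrated with a minimal single-machine sketch in plain Python. This is not the thesis's DHLG algorithm and does not use Spark: the abstract does not give the exact Hellinger-Entropy formula, so the sketch assumes the standard Hellinger distance between the class distributions on the two sides of a candidate boundary. The function names (`hellinger`, `class_distribution`, `top_cut_points`, `discretize`) and the parameter `n_cuts` are hypothetical and chosen only for illustration.

```python
import bisect
import math
from collections import Counter


def hellinger(p, q):
    """Standard Hellinger distance between two discrete distributions (dicts)."""
    keys = set(p) | set(q)
    total = sum(
        (math.sqrt(p.get(c, 0.0)) - math.sqrt(q.get(c, 0.0))) ** 2 for c in keys
    )
    return math.sqrt(total / 2.0)


def class_distribution(labels):
    """Relative class frequencies of a non-empty list of labels."""
    counts = Counter(labels)
    n = len(labels)
    return {c: k / n for c, k in counts.items()}


def top_cut_points(values, labels, n_cuts):
    """Assumed scoring rule: rate each candidate boundary by the Hellinger
    distance between the class distributions to its left and to its right,
    then keep the n_cuts highest-scoring boundaries (returned in ascending order)."""
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    scored = []
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal neighbours admit no boundary between them
        cut = (lo + hi) / 2.0  # candidate boundary: midpoint of neighbours
        left = class_distribution([lab for _, lab in pairs[:i]])
        right = class_distribution([lab for _, lab in pairs[i:]])
        scored.append((hellinger(left, right), cut))
    # Highest score first; ties broken by the smaller cut value.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return sorted(cut for _, cut in scored[:n_cuts])


def discretize(values, cuts):
    """Map each value to the index of the interval it falls into."""
    return [bisect.bisect_right(cuts, v) for v in values]
```

For example, with `values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]` and labels `["a", "a", "a", "b", "b", "b"]`, `top_cut_points(values, labels, 1)` returns `[6.5]`, the only boundary that separates the two classes completely. Note that `n_cuts` boundaries produce `n_cuts + 1` intervals in this sketch. The scoring loop is quadratic in the number of records; a distributed version would have to compute the per-boundary class counts in parallel, which is outside the scope of this illustration.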