| With the growth of big data research, technologies for analyzing and processing big data have become a research hotspot. This has prompted the emergence of big data processing platforms and algorithms for various large-scale distributed systems. The MapReduce distributed processing framework has risen rapidly as a mainstream technology for big data analysis and processing; it solves a complex problem by taking a “divide and conquer” approach. MapReduce thus provides a fast and effective means of handling current big data processing problems. Cluster analysis is important in machine learning and data mining, and within cluster analysis the greedy EM algorithm is a practical and important method. However, as the amount of information and the volume of data transmission and communication grow rapidly, existing methods cannot load such large data sets into memory at one time. Moreover, the traditional greedy EM algorithm relies on single-machine serial iteration, so its convergence slows drastically as the amount of data increases. To address this slow convergence on large-scale data sets, the greedy EM algorithm is distributed using the ideas of the MapReduce framework, and a greedy EM algorithm based on MapReduce is proposed. The algorithm adopts the greedy strategy and obtains its intermediate and final values through two stages, Mapper and Reducer. Specifically, the Mapper stage distributes the data, processes it on each node, and generates the corresponding key-value pairs; the Reducer stage then integrates these key-value pairs, finally yielding an optimal Gaussian mixture model that satisfies the convergence conditions, together with the number of components of that model. Finally, three groups of experiments show that, without the number of model components being specified in advance, the algorithm determines the number of components accurately, greatly improves convergence speed on large data sets, and has good robustness and scalability. |
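
The division of work between the Mapper and Reducer stages can be illustrated with a minimal Python sketch of one EM iteration for a one-dimensional Gaussian mixture. This is not the paper's implementation: it runs in a single process rather than on a cluster, and it omits both the greedy component-insertion step and any numerical safeguards. The function names (`mapper`, `reducer`, `m_step`, `em_iteration`) and the choice of sufficient statistics as the emitted values are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mapper(chunk, params):
    """E-step on one data split (hypothetical helper).

    Emits (component index, partial sufficient statistics) key-value pairs.
    params is a list of (weight, mean, variance) tuples.
    """
    for x in chunk:
        weighted = [w * gaussian_pdf(x, m, v) for (w, m, v) in params]
        total = sum(weighted)  # assumed non-zero; no underflow guard here
        for k, p in enumerate(weighted):
            r = p / total  # responsibility of component k for point x
            yield k, (r, r * x, r * x * x)


def reducer(pairs):
    """Sum the partial statistics emitted for each component key."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    for k, (r, rx, rxx) in pairs:
        sums[k][0] += r
        sums[k][1] += rx
        sums[k][2] += rxx
    return sums


def m_step(sums, n_points, min_var=1e-6):
    """Turn the reduced statistics into updated (weight, mean, variance)."""
    params = []
    for k in sorted(sums):
        r, rx, rxx = sums[k]
        mean = rx / r
        var = max(rxx / r - mean * mean, min_var)  # floor avoids zero variance
        params.append((r / n_points, mean, var))
    return params


def em_iteration(chunks, params):
    """One EM iteration: map over every split, reduce by key, then update."""
    pairs = (pair for chunk in chunks for pair in mapper(chunk, params))
    n_points = sum(len(chunk) for chunk in chunks)
    return m_step(reducer(pairs), n_points)
```

Because each mapper call needs only its own split and the current parameters, the splits could in principle be processed on separate nodes, with only the small per-component sums passed to the reducer. In this sketch, the greedy strategy described in the abstract would sit outside `em_iteration`, adding components and re-running the iteration until the convergence conditions are met.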