| K-Means, a typical machine learning algorithm, is widely used in big data settings. However, it suffers from several problems: random selection of the initial centroids can lead to locally optimal solutions, and on large data sets it needs many iterations, takes a long time to compute, and yields low accuracy. In the big data context, optimizing and applying machine learning algorithms on top of big data computing frameworks has become a hot research topic, and many such frameworks now include machine learning libraries. With the emergence of real-time search engines, social software, and similar applications, real-time data processing has increasingly attracted researchers' attention. The traditional batch computing idea of storing data first and computing afterwards is no longer applicable to real-time stream data processing, so how to build a high-throughput, low-latency big data stream computing framework has become a key problem that urgently needs to be solved. To address these problems, this paper studies the optimization of the K-Means algorithm on the Flink platform, its parallel acceleration, and the task scheduling strategy of the Flink platform. The specific research contents are summarized as follows: (1) To address the local optima and slow clustering caused by centroid selection in K-Means on large data sets, a CK-Means clustering optimization and parallelization strategy based on the Flink platform is proposed. At the optimization level, the Canopy algorithm is used to determine the number of clusters K and to select the initial centroids; at the parallel acceleration level, a parallel acceleration strategy for CK-Means is designed on the Flink platform, and the impact of different degrees of parallelism on computation time is analyzed. Compared with K-Means, CK-Means achieves higher accuracy and better performance. The clustering time of CK-Means under
different degrees of parallelism first decreases and then rises, and the minimum clustering time is related to the size of the data set. (2) To improve the clustering speed and accuracy of K-Means, a K-Means parallel acceleration strategy based on k-d tree partitioning is proposed. At the algorithm optimization level, the k points in the data set that are farthest apart are selected as the initial centroids; at the task parallelization level, a k-d tree partitioning algorithm is proposed to partition the data set and parallelize the tasks; at the execution environment level, different numbers of processes and CPU cores are configured to verify the acceleration effect of the resulting algorithm, F-KMeans. (3) To improve the resource utilization of the Flink computing framework, a resource-aware task scheduling strategy for the Flink stream computing environment is proposed. It targets the problem that task scheduling on the Flink platform ignores the relationship between the resource requirements of tasks and the available resources of nodes, which leads to uneven task loads across nodes and in turn reduces system throughput. First, based on the resource data monitored by the GlobalState module, and considering how well task resource requirements match the available resources of nodes, a task selection algorithm and a node selection algorithm are proposed to select the tasks to be executed and the optimal scheduling nodes. Second, the tasks to be executed are scheduled onto the optimal scheduling nodes by the resource-aware scheduling strategy. Finally, experiments are designed and carried out, and the results verify that the proposed algorithm is effective. |
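
The Canopy step in item (1) can be illustrated with a minimal single-machine sketch. It is not the Flink implementation described in the abstract. It assumes Euclidean distance and two hypothetical thresholds, `t1` (loose) and `t2` (tight), with `t1 > t2`. The abstract does not say how each canopy is turned into an initial centroid, so the sketch assumes the mean of the canopy members. The function name `canopy_centers` is illustrative only.

```python
import math


def canopy_centers(points, t1, t2):
    """Return one seed centroid per canopy (assumed: mean of canopy members).

    points: sequence of equal-length numeric tuples.
    t1, t2: hypothetical loose/tight distance thresholds, t1 > t2.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")

    remaining = list(points)
    centers = []
    while remaining:
        # Take the next unassigned point as the center of a new canopy.
        center = remaining.pop(0)
        members = [center]
        still_free = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                # Within the loose threshold: the point joins this canopy.
                members.append(p)
            if d >= t2:
                # Outside the tight threshold: the point may still start
                # or join another canopy later.
                still_free.append(p)
        remaining = still_free

        # Assumption: use the canopy mean as the initial centroid.
        dim = len(center)
        centers.append(
            tuple(sum(m[i] for m in members) / len(members) for i in range(dim))
        )
    return centers
```

The number of returned centers would serve as K, and the centers themselves as the initial centroids passed to K-Means. The result depends on the choice of `t1` and `t2` and on the order in which points are visited.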