With the rapid development of the Internet industry, the era of "Big Data" has arrived. Hadoop, with its high fault tolerance, reliability, scalability, and efficiency, together with its low cost and simplicity, has become prominent in the field of massive data processing. However, as Hadoop clusters grow in size and serve more users, operating and maintaining them becomes increasingly difficult. Real-time performance monitoring and analysis of Hadoop clusters is therefore needed to keep cluster performance high. In this thesis, we first give a brief overview of cluster monitoring metrics and monitoring techniques, and then design and implement a performance monitoring system on a Hadoop-based data analysis platform according to the needs of cluster operation and maintenance personnel. The system lets these personnel see, in real time, the state of the cluster, the operating status of each component, and the resource usage of each node, so that they can respond to cluster warnings promptly and keep the cluster running efficiently. Second, we collect and analyze data on HDFS data distribution and data access. We find that data distribution is unbalanced and that the data access trend on each DataNode is consistent with the trend in that DataNode's resource consumption. We therefore propose a data distribution optimization strategy and study the impact of data distribution on HDFS data access and on running jobs. Finally, we draw the following conclusions from experiments. The Balancer can optimize data distribution to achieve a balanced distribution. The more balanced the data distribution, the shorter the data access time and the job running time. Moreover, as the numbers of concurrent users and concurrent jobs increase, the influence of data distribution on file access time and job execution time grows.