
Research On The Communication Problem Of Large Scale Scientific Computing On High Performance Cluster Environment

Posted on: 2005-05-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Tang
GTID: 1118360122493288
Subject: Computer software and theory
Abstract/Summary:
Linux clusters are an emerging solution for high performance computing. A Linux cluster is generally built from commodity products such as personal computers or workstations and 100 Mbit Fast Ethernet. This architecture is called 'loosely coupled', in contrast to the traditional 'tightly integrated' MPP system. A loosely coupled architecture inevitably raises concerns about reliability and stability, in terms of both correctness and performance.

Recent large scale scientific computing applications on Linux clusters show that communication time accounts for a growing share of an application's total wall clock time. This share rises as the number of nodes increases, which indicates that the parallelism and scalability of real applications drop. Although the theoretical peak performance of a modern cluster may be much higher than that of an older MPP system, its best measured performance on the largest problem is lower, which implies lower efficiency.

The interconnect between the individual nodes (PCs or workstations) is called the cluster area network (cLAN). The cLAN ties the nodes of the cluster together and lets them work simultaneously on one problem, and it is the focus of this thesis. We try to clarify and quantify how to use a cluster effectively and how to tune the performance of real applications from the perspective of communication.

Historically there have been two schools of thought on cLAN performance. The first is primarily interested in round-trip latency and large-message bandwidth as indicators of network performance. Jack Dongarra et al. measured latency and bandwidth for a large class of multiprocessor systems (Convex, Cray, IBM, Intel, KSR, Meiko, nCUBE, NEC, SGI, and TMC) using a ping-pong benchmark. Luecke et al. evaluated the communication performance of Linux and NT clusters, the SGI Origin 2000, the IBM SP, and the Cray T3E. More recently, Petrini et al. examined the performance of Quadrics networks using uni- and bi-directional ping benchmarks.

The second school adopts a more detailed model of network performance. In 1993, David Culler et al. introduced the LogP performance model of parallel computation. The model builds on the observation that modern parallel systems essentially consist of complete computers connected by a communication fabric. Culler et al. measured the model's parameters for the Intel Paragon, Meiko CS-2, and Myrinet. Iannello et al. measured the same parameters for an implementation of Fast Messages running on Myrinet. Further research extended the LogP model to account for other factors that influence application performance and tailored it to different communication layers (e.g. MPI) and architectures.

However, with the rapid development of cLAN hardware and software (e.g. User Level Communication, Message Pipeline), neither school can explain every behavior of large scale parallel programs running on a cLAN. New research, new models, and new explanations are therefore urgently needed.

The main contributions of this thesis are as follows. In chapter 2, we quantify the advantages and disadvantages of the main kinds of Linux cluster configurations. In chapter 3, we give a detailed test report on state-of-the-art Linux clusters in P.R. China, together with some important conclusions. In chapter 4, we find that some real applications run much slower on Myrinet 2000 than on 100 Mbit Fast Ethernet, which seems anomalous. We tracked down this abnormal behavior and identified its underlying cause. Further tests on Gigabit Ethernet and InfiniBand suggest that these low-latency, high-bandwidth cluster area networks may have similar problems. We therefore put forward the concept of the 'hot spot test': a set of specific tests based on the particular characteristics of the target platform, designed to capture any abnormal communication behavior it may exhibit. The result...
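
The two views of network performance described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names, the linear model t = latency + size / bandwidth, and the sample numbers are assumptions chosen for illustration. The LogP expressions follow the usual textbook formulation (L for latency, o for per-message overhead, g for gap).

```python
def fit_latency_bandwidth(samples):
    """Least-squares fit of t = latency + size / bandwidth.

    samples: list of (message_size_bytes, one_way_time_seconds) pairs,
    e.g. half of each round-trip time from a ping-pong benchmark.
    Returns (latency_seconds, bandwidth_bytes_per_second).
    """
    n = len(samples)
    mean_size = sum(size for size, _ in samples) / n
    mean_time = sum(t for _, t in samples) / n
    sxx = sum((size - mean_size) ** 2 for size, _ in samples)
    sxy = sum((size - mean_size) * (t - mean_time) for size, t in samples)
    slope = sxy / sxx                       # seconds per byte
    latency = mean_time - slope * mean_size
    return latency, 1.0 / slope


def logp_small_message_time(L, o):
    """LogP one-way time of one small message: send overhead + latency + receive overhead."""
    return o + L + o


def logp_pipelined_time(n, L, o, g):
    """LogP time until the last of n back-to-back small messages is received."""
    return o + (n - 1) * max(g, o) + L + o


# Synthetic, illustrative data only (not measurements from the thesis):
# 50 microseconds latency, 12.5 MB/s bandwidth (roughly 100 Mbit/s).
sizes = [0, 1024, 65536, 1048576]
samples = [(s, 50e-6 + s / 12.5e6) for s in sizes]
latency, bandwidth = fit_latency_bandwidth(samples)
```

The first function summarizes a network with two numbers, as the latency and bandwidth school does. The LogP functions separate processor overhead from wire latency, which is why that model can distinguish networks that look alike in a ping-pong test.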
Keywords/Search Tags:Large Scale Cluster System, performance tests and evaluation of HPC system, Optimization of Communication, communication behavior pattern, hot spot test, performance portable, User Level Communication, LINPACK standard, FFT standard