
Research On Parallel Recognition Method And Application

Posted on: 2010-02-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Yan
Full Text: PDF
GTID: 1118360302965855
Subject: Computer software and theory
Abstract/Summary:
The idea of parallel processing is almost as old as the computer itself. Because parallel processing is the principal means of increasing computing power, parallel computing has been one of the most active branches of computer technology over the past two decades, driven by the continuing improvement of processor speed. In short, parallel computing is supercomputing performed on high-performance systems such as parallel or distributed computers, and its material basis is clearly the high-performance parallel computer. To generate fast parallel code for such machines, the research and development of parallel compilation technology has likewise become a key topic in computer science. Automatic parallelization of serial programs is one of the most important elements of parallel compilation: it converts a serial program into an equivalent parallel program that runs efficiently on a parallel computer, and it overcomes the difficulties of programming and software porting on parallel systems. At the same time, automatic parallelization plays an important role in exploiting the large body of serial software accumulated over many years; it reduces the difficulty of parallel programming, improves the performance of parallel programs, and greatly lowers both the programmer's burden and the cost of parallel program development.
This dissertation applies data dependency analysis together with program parallelization techniques such as loop permutation and unimodular transformation to the problems of parallel recognition, computation decomposition, and data decomposition that arise when a parallel compiler automatically parallelizes serial programs, so that a program's parallelism can be exploited to the greatest extent.

At present, automatic parallelization has developed differently for different hardware environments. Auto-parallelizing compilation for shared memory is already quite mature, but two problems remain: first, the negative impact on efficiency caused by parallel regions with too little work; second, how to split a large computing task into many small tasks whose loads are reasonably balanced. Auto-parallelizing compilation for distributed memory, on the other hand, still faces many difficulties arising from data decomposition, so determining a good strategy for data decomposition and computation decomposition has become the key research problem there. A good strategy achieves maximum data locality and program parallelism, and whether the data decomposition is reasonable directly affects both the parallelism obtained and the communication cost at run time.

This dissertation proposes a series of parallel processing methods for the automatic parallelization of serial programs during compilation. First, to understand the mechanism of parallel processing and the performance of a parallelizing compilation system, a parallel algorithm for automatic generation of the syntax analysis table was designed for the parallel compiler.
The efficiency of the algorithm's parallel execution is demonstrated on a small, simple example. Second, the question of parallel grain size in automatic parallelization is studied: after analyzing three kinds of parallel grain, an implementation of medium-grain parallelism is presented, built around two algorithms, parallel recognition of basic blocks and parallel optimization during medium-grain processing. These two algorithms resolve the conflict between the workload of parallel threads and their overhead, and avoid the loss of execution efficiency that parallel processing can otherwise cause. Finally, the automatic parallelization of loops, which usually account for the main part of a serial program's computation, is studied. An auto-parallelization method for tightly nested loops is put forward, and, to address the high cost of the hardware environments on which parallel programs run, methods for parallel recognition, data decomposition, and computation decomposition based on tightly-nested-loop auto-parallelization are presented for multi-core systems. As a result, the execution speed of serial programs is improved and the applicability of automatic parallelization is broadened.

First of all, because component speeds face physical limits, research on high-performance computers has focused mainly on exploiting computational parallelism. When programs run on high-performance parallel computers, we want a compilation system with automatic parallelization that mines program parallelism and converts serial programs into parallel programs automatically; this makes the parallel compiler the key to automatic parallelization research.
Parallel compilation systems fall into two categories: one translates a source program written in a parallel language into parallel target code; in the other, automatic parallelization of serial programs is embedded in the compiler as one pass of parallelization. In either case the compiler needs syntax analysis, and the key step of syntax analysis is constructing the syntax analysis table. Because a grammar contains many nonterminals and terminals, constructing and accessing this table becomes the bottleneck when syntax analysis is to be sped up. To improve the performance of the parallel compiler, this dissertation takes a typical syntax analysis method, LL(1) analysis, and proposes a parallel method for automatic generation of the LL(1) analysis table in a multi-processor environment. The algorithm constructs the local portions of the analysis table in multiple threads and assembles the whole table through an intercommunication mechanism among the threads. This greatly improves the speed of syntax analysis in the parallel compiler and has both theoretical and practical significance for accelerating program parallel processing and raising implementation efficiency.

Second, to mine the parallelism of serial programs as fully as possible and improve their running efficiency on a high-performance parallel computer, another important task of the parallelizing compiler is to recognize all the parallel processing units in a serial program. In general, parallel granularity receives the most attention during parallel recognition, since it determines how fully the parallel computer's performance can be exploited.
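As a rough illustration of the scheme described above, the following sketch builds each nonterminal's rows of an LL(1) table in a separate worker and merges the thread-local results; the toy grammar and its FIRST/FOLLOW sets are hypothetical, not taken from the dissertation:

```python
from concurrent.futures import ThreadPoolExecutor

# A hypothetical LL(1) grammar with precomputed FIRST and FOLLOW sets.
# Each worker fills the parse-table rows for its own nonterminals, so
# the partial tables can be built independently and merged afterwards.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["id"], ["(", "E", ")"]],
}
FIRST = {"E": {"id", "("}, "E'": {"+", ""}, "T": {"id", "("}}
FOLLOW = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"}}

def first_of(seq):
    """FIRST set of a production body (terminals stand for themselves)."""
    if not seq:
        return {""}
    return FIRST.get(seq[0], {seq[0]})

def build_row(nt):
    """Build the table entries for one nonterminal (thread-local work)."""
    row = {}
    for prod in GRAMMAR[nt]:
        f = first_of(prod)
        for t in f - {""}:
            row[(nt, t)] = prod
        if "" in f:                      # nullable production: use FOLLOW
            for t in FOLLOW[nt]:
                row[(nt, t)] = prod
    return row

def build_table_parallel(workers=3):
    table = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for row in pool.map(build_row, GRAMMAR):
            table.update(row)            # merge the thread-local rows
    return table

table = build_table_parallel()
assert table[("E", "id")] == ["T", "E'"]
assert table[("E'", ")")] == []          # epsilon entry via FOLLOW(E')
```

Because each row depends only on precomputed FIRST/FOLLOW sets, the workers need no locking; the merge step plays the role of the inter-thread communication the dissertation describes.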
The larger the parallel grain, the smaller the overhead a parallel program spends on synchronization and communication, and the higher its speed-up ratio; but an overly large grain brings two drawbacks, lower overall parallelism and lower execution speed. Parallel computers support three kinds of parallel grain: coarse grain, medium grain, and fine grain. Theoretical analysis and experiments show that coarse-grain parallelism keeps the overhead of data dependence recognition and synchronization small but restricts the parallel potential, while fine-grain parallelism exposes deep parallelism at the cost of large recognition and synchronization overhead. On balance, medium-grain parallelism is the most appropriate granularity. This dissertation therefore proposes a medium-grain parallel recognition algorithm that automatically identifies parallelizable basic blocks in a serial program; executing these basic blocks in parallel improves the program's running speed. To avoid the high overhead that parallel execution can introduce, several parallel optimization techniques are also proposed, namely basic-block merging optimization and processor-distribution optimization.
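One standard way to decide whether two basic blocks may execute in parallel is to apply Bernstein's conditions to their read and write sets. The dissertation's recognition algorithm is not reproduced here; the following is only a minimal sketch of that classic check, with hypothetical block contents:

```python
def may_run_in_parallel(block_a, block_b):
    """Bernstein's conditions: two blocks can run in parallel only if
    neither writes a variable the other reads or writes. Each block is
    represented as a (reads, writes) pair of variable-name sets."""
    reads_a, writes_a = block_a
    reads_b, writes_b = block_b
    return not (writes_a & reads_b or    # flow dependence a -> b
                writes_b & reads_a or    # anti dependence  a <- b
                writes_a & writes_b)     # output dependence

b1 = ({"x"}, {"y"})        # y = f(x)
b2 = ({"x"}, {"z"})        # z = g(x)
b3 = ({"y"}, {"w"})        # w = h(y)
assert may_run_in_parallel(b1, b2)       # only a shared read: independent
assert not may_run_in_parallel(b1, b3)   # b3 reads the y that b1 writes
```

A recognizer along these lines would then group mutually independent blocks for the medium-grain threads, while the merging optimization fuses blocks whose workload is too small to justify a thread.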
With these optimizations, a parallelized serial program runs noticeably faster than it would without them.

In the computer industry, hardware and software development are complementary and mutually reinforcing, and the emergence of multi-core microprocessor chips has triggered a software revolution. In the single-core era, programmers rarely considered parallel programming: even multi-threaded programs could only execute concurrently on a single core, so true parallelism was impossible. There was virtually no performance difference between single-threaded and multi-threaded programs, and because of thread-switching overhead a multi-threaded program was often slower. On multi-core platforms, however, the difference is significant, and a well-designed multi-threaded program can greatly outperform its single-threaded counterpart. How to use the hardware characteristics of multi-core processors to automatically parallelize serial programs that traditional methods cannot handle has therefore become an urgent problem.

Finally, an auto-parallelization method for tightly nested loops on multi-core processors is presented. Based on the physical characteristics of multi-core processors, a solution for data-access locality in the automatic parallelization of tightly nested loops is given, and a method for computing workload is proposed within the computation-decomposition algorithm.
Using the workload of a loop iteration, this method determines the parallel grain when tightly nested loops are parallelized, solving the overhead problem caused by parallel threads with too little work; a model of loop auto-parallelization based on workload decomposition can then be constructed. In addition, a data-decomposition strategy tailored to the storage structure of chip multiprocessors is presented for loop-level parallel data. A chip multiprocessor has not only shared memory and a shared level-2 cache but also a private level-1 cache per core, so the author improves access speed during data decomposition by increasing the cache hit ratio. To address data-access locality, the data-decomposition strategy is designed around the sizes of the level-1 and level-2 caches, allowing serial loops to execute on multi-core processors under a multi-threaded model. Because data-access locality is increased, intercommunication overhead among threads is reduced, and the parallel execution efficiency of serial programs is improved.

The results show that by applying data dependency analysis to serial programs we can identify their independent segments to the greatest extent, and executing those segments under a multi-threaded model implements automatic parallelization correctly. This method thus solves the problems of increased communication overhead and wasted processor resources caused by the synchronization mechanisms of traditional multi-threaded parallelism, while guaranteeing the correctness of parallel execution.
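As an illustration of the kind of loop-level data dependency analysis involved, the classic GCD test decides whether two affine array references can ever touch the same element across iterations. This is a textbook technique, not necessarily the exact test used in the dissertation:

```python
from math import gcd

def gcd_test(a, b, c):
    """Classic GCD dependence test for references A[a*i + c1] (write)
    and A[b*j + c2] (read): the equation a*i - b*j = c, with
    c = c2 - c1, has an integer solution only if gcd(a, b) divides c.
    If it does not, the references are independent, and the loop can
    be parallelized safely with respect to this pair."""
    if a == 0 and b == 0:
        return c == 0          # constant subscripts: overlap iff equal
    return c % gcd(a, b) == 0  # True means a dependence is *possible*

# A[2*i] written and A[2*i + 1] read: c = 1, gcd(2, 2) = 2,
# and 2 does not divide 1, so the two references never overlap.
assert gcd_test(2, 2, 1) is False
assert gcd_test(1, 1, 3) is True   # A[i] vs A[j + 3] may overlap
```

Note the test is conservative in the usual way: a `True` result only means a dependence cannot be ruled out, so a compiler would fall back to serial execution for that loop.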
Moreover, the models of data decomposition and computation decomposition, designed around the hardware features of the multi-core system, achieve the data localization and load balancing that parallel processing has always emphasized.
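A minimal sketch of such cache-aware computation and data decomposition might look like the following; the L1 size, the workload threshold, and all parameter names are illustrative assumptions, not values taken from the thesis:

```python
import os

def plan_decomposition(n_iters, bytes_per_iter, flops_per_iter,
                       l1_bytes=32 * 1024, min_flops=100_000,
                       n_cores=None):
    """Sketch of workload-driven decomposition: parallelize only when
    the total work exceeds a threshold (so thread overhead does not
    dominate), and size each chunk so one thread's working set fits in
    its private L1 cache. Returns (thread_count, iterations_per_chunk).
    The 32 KiB L1 and the 100k-flop threshold are assumed figures."""
    n_cores = n_cores or os.cpu_count() or 1
    total_flops = n_iters * flops_per_iter
    if total_flops < min_flops:
        return 1, n_iters                 # grain too small: stay serial
    chunk = max(1, min(n_iters // n_cores,       # balance the load...
                       l1_bytes // bytes_per_iter))  # ...within L1
    return n_cores, chunk

threads, chunk = plan_decomposition(1_000_000, bytes_per_iter=8,
                                    flops_per_iter=2, n_cores=4)
assert threads == 4
assert chunk * 8 <= 32 * 1024             # each chunk's data fits in L1
```

The two branches mirror the abstract's two goals: the threshold avoids the small-workload parallel regions criticized earlier, and the L1-bounded chunk size raises the cache hit ratio that the data-decomposition strategy relies on.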
Keywords/Search Tags:parallelization compiler, parallel recognition, data dependence, medium grain parallel, computation decomposition, data decomposition