
High-efficiency Reconfigurable Array Computing: Architecture, Methodology And Application Mapping Technology

Posted on: 2015-08-21  Degree: Doctor  Type: Dissertation
Country: China  Candidate: L Zhou  Full Text: PDF
GTID: 1108330479979668  Subject: Electronic Science and Technology
Abstract/Summary:
Benefiting from the rapid development of integrated circuits, both the computing capability and the flexibility of hardware processing platforms have improved substantially. At the same time, applications in several domains demand ever more computing performance and flexibility as their quality-of-service requirements and standards evolve. Software-defined radio is a typical example: it requires high-performance hardware to guarantee real-time operation, and it also needs a degree of hardware reconfigurability because multiple protocols coexist and evolve rapidly in the domain.

Coarse-grained reconfigurable array (CGRA) computing architectures satisfy both requirements. A CGRA offers high computing capability by exploiting parallelism at several levels across the many processing elements (PEs) in its array, while its word-level reconfigurable function units provide a useful degree of flexibility. However, traditional CGRAs achieve this performance and flexibility at the cost of efficiency: in both hardware area and power consumption, a large gap remains between CGRAs and ASICs. Moreover, the domain-specific design methodology and the exploitation of instruction- and loop-level parallelism strongly influence the performance actually achieved in CGRA design and deployment. High-efficiency CGRA architectures, their design methodology, and the supporting mapping technology are therefore important open issues in reconfigurable-computing research.

Taking a 2D mesh-connected CGRA computing platform as the baseline, this dissertation investigates five topics: a novel high-efficiency architecture, an application-specific design methodology, instruction-level parallelism on CGRAs, loop-level parallelism on CGRAs, and reconfigurable accelerators. Detailed evaluation and analysis are given for each.
The major contributions and innovations can be summarized as follows.

1) This dissertation proposes a new CGRA architecture that improves area and power efficiency. Motivated by the low resource-utilization ratio of traditional arrays and by the observed patterns of data communication, we propose a clustered CGRA architecture. Function units are divided into two types according to their characteristics: simple, low-latency PEs and complex, long-latency PEs. Several complex PEs and several simple PEs together form a reconfigurable cluster, and data communication within a cluster is carried out through a shared register file. Custom function units can also be included in the complex PEs, guided by profiling of the target application. Compared with traditional CGRA architectures, the proposed clustered architecture saves hardware area, improves utilization, and maintains flexibility with higher computation efficiency, thanks to resource sharing and the reduced number of complex PEs. Because resources inside a cluster are tightly connected, data transfer is easier: when application mapping exploits this local connectivity, the routing procedure can be trimmed, freeing more resources for computation. Experimental results show that the architecture achieves higher area and power efficiency, and that cluster-based mapping is simpler and more effective. We also explored the architectural design space of the clustered CGRA; area and power efficiency approach their best only when a proper number of complex and simple PEs is chosen.

2) For the problem of designing application-specific function units, this dissertation presents an approach that automatically profiles application source code and generates the corresponding hardware architecture.
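The design-space exploration described above finds that efficiency peaks only at a proper ratio of complex to simple PEs per cluster. A minimal sketch of such a sweep is shown below; all area, throughput, and utilization numbers are hypothetical placeholders, not figures from the dissertation.

```python
# Toy design-space sweep for a clustered CGRA: choose how many simple and
# complex PEs to place in a cluster. The per-PE area/throughput constants
# and the utilization factor are illustrative assumptions only.

AREA_SIMPLE, AREA_COMPLEX = 1.0, 4.0   # relative silicon area per PE
OPS_SIMPLE, OPS_COMPLEX = 1.0, 3.0     # relative ops/cycle per PE

def area_efficiency(n_simple, n_complex, complex_util=0.6):
    """Throughput per unit area; complex PEs are assumed underutilized."""
    area = n_simple * AREA_SIMPLE + n_complex * AREA_COMPLEX
    throughput = n_simple * OPS_SIMPLE + n_complex * OPS_COMPLEX * complex_util
    return throughput / area

def best_cluster(max_pes=8):
    """Enumerate cluster compositions (>= 1 PE of each type) and keep the best."""
    configs = [(s, c) for s in range(1, max_pes) for c in range(1, max_pes)
               if s + c <= max_pes]
    return max(configs, key=lambda sc: area_efficiency(*sc))

if __name__ == "__main__":
    s, c = best_cluster()
    print(f"best cluster: {s} simple + {c} complex PEs, "
          f"efficiency {area_efficiency(s, c):.3f}")
```

With these toy constants the sweep favors many simple PEs and few complex ones; a real exploration would replace the constants with synthesized area and profiled utilization data.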
We analyze current research on computation-pattern selection and hardware generation and point out the drawback of treating the two separately: both problems amount to finding the relevancy of vertex pairs in the data-flow graph (DFG), and constraints couple the two procedures. Based on this formulation, the dissertation proposes a unified method in which pattern selection and hardware generation are performed iteratively, avoiding local optima and driving the result toward a global optimum. An ant colony algorithm merges the two procedures well: under the combined effect of local and global heuristics, the relevancy of vertices is determined automatically, and the result is clearly improved. The method accelerates applications at low hardware-area cost.

3) For the problem of instruction-level parallelism, an ant-colony-based algorithm for mapping a DAG onto a CGRA is proposed, together with optimizations for the clustered CGRA. The dissertation first gives an integer linear programming model for the mapping problem, then shows that mapping amounts to assigning operations to PEs while optimizing all execution times. We propose an ant colony algorithm that uses as-soon-as-possible execution via maze routing as the local heuristic and the pheromone left by ants as the global optimization mechanism. Building on this algorithm, we find that certain vertices with strong affinity heavily influence mapping quality, and we derive the principle that parent-child operations and operations sharing a common child should be mapped as close together as possible. By defining vertex affinity, we add affinity and cluster-distance computation to the mapping procedure and use them as an additional indicator when assigning operations to PEs.
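The pheromone-guided mapping loop can be sketched as follows. This is an illustrative skeleton of the general ant-colony idea, not the dissertation's algorithm: the cost model, heuristic, and pheromone update are simplified stand-ins (the maze-routing local heuristic and affinity indicator are omitted).

```python
# Minimal ant-colony sketch for assigning DAG operations to PEs.
# Illustrative only: a real CGRA mapper would score assignments by
# schedule length after routing, not by the toy cost used here.
import random

def aco_map(ops, pes, cost, n_ants=20, n_iters=30, rho=0.1, seed=0):
    """ops: operation ids; pes: PE ids; cost(assign) -> lower is better."""
    rng = random.Random(seed)
    tau = {(o, p): 1.0 for o in ops for p in pes}  # pheromone trails
    best, best_cost = None, float("inf")
    for _ in range(n_iters):
        for _ in range(n_ants):
            # each ant assigns every op to a PE, biased by pheromone
            assign = {o: rng.choices(pes, weights=[tau[(o, p)] for p in pes])[0]
                      for o in ops}
            c = cost(assign)
            if c < best_cost:
                best, best_cost = dict(assign), c
        # evaporate all trails, then reinforce only the best-so-far
        # solution (min-max ant system style)
        for k in tau:
            tau[k] *= (1.0 - rho)
        for o, p in best.items():
            tau[(o, p)] += 1.0 / best_cost
    return best, best_cost
```

As a usage example, mapping four operations onto two PEs with a load-balance cost (maximum number of operations per PE) converges to a balanced assignment with cost 2.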
In the resulting min-max ant colony algorithm, the search direction is strengthened and better results are obtained. Moreover, by using the proposed indicator to prune some candidate solutions, the search space is reduced without degrading result quality. Compared with other iterative improvement algorithms, our method delivers shorter mapping times and better, more stable results, supporting effective exploitation of instruction-level parallelism on CGRAs.

4) For the problem of loop-level parallelism, this dissertation proposes a modulo scheduling algorithm based on a genetic algorithm, and designs a faster, more effective modulo scheduling algorithm for the clustered CGRA architecture. In the genetic-algorithm-based scheduler, we define priorities for data dependencies and route them in priority order, so that critical dependencies are satisfied first. When several dependencies share the same data producer but have different consumers, we use a Steiner tree to share routing resources during scheduling; a fast approximate Steiner-tree solver achieves this sharing well. In addition, the dissertation analyzes the numbers of loop-carried and intra-loop dependencies, then confines each iteration to one cluster and places adjacent iterations in adjacent clusters, exploiting the local connectivity of the clustered CGRA. Through loop unrolling and extension, the modulo scheduler exploits loop-level parallelism across different CGRA clusters. A greedy algorithm can schedule the loop kernel onto a cluster quickly and effectively; since no routing step is needed, the routing complication in modulo scheduling is avoided entirely. Experiments show that the cluster-based approach greatly reduces mapping time.
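For context, any modulo scheduler starts from the standard lower bound on the initiation interval (II) from general software-pipelining theory, II ≥ max(ResMII, RecMII). The computation below follows that textbook formulation and is not taken from the dissertation.

```python
# Standard lower bound on the initiation interval for modulo scheduling:
# II >= max(ResMII, RecMII). ResMII comes from resource pressure,
# RecMII from dependence cycles through loop-carried edges.
from math import ceil

def res_mii(n_ops, n_pes):
    """Resource-constrained bound: ops per iteration over available PEs."""
    return ceil(n_ops / n_pes)

def rec_mii(cycles):
    """Recurrence-constrained bound over elementary dependence cycles.
    cycles: list of (total_latency, total_distance) pairs, where distance
    is the sum of loop-carried dependence distances around the cycle."""
    return max(ceil(lat / dist) for lat, dist in cycles)

def mii(n_ops, n_pes, cycles):
    return max(res_mii(n_ops, n_pes), rec_mii(cycles))
```

For example, 10 operations on 4 PEs give ResMII = 3, while a recurrence of latency 4 and distance 1 gives RecMII = 4, so the scheduler must start from II = 4.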
Compared with other algorithms, the proposed method implements loop pipelining on the clustered CGRA better and achieves greater application speedups.

5) For the turbo product code algorithm, which cannot be implemented efficiently on the CGRA, this dissertation designs and implements a flexible, configurable VLSI accelerator. We also analyze what the FFT and Viterbi algorithms have in common, and design and implement a reconfigurable accelerator for both. The dissertation presents modifications to the turbo product code decoding algorithm, which is widely used in channel coding; the modified algorithm has slightly improved performance and is better suited to hardware implementation. The proposed high-efficiency turbo product code decoder supports multiple code types, satisfying the needs of different protocols. In addition, we analyze the commonality between FFT and Viterbi, two algorithms frequently used in wireless communication, and find that they can share part of their hardware because of the similarity in their computation and memory-access patterns. The designed reconfigurable accelerator saves 20% area compared with implementing the two independently. It can be configured to perform either function, increasing flexibility while maintaining the high efficiency of an ASIC accelerator, and is thus better suited to the flexibility requirements of software-defined radio.
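The FFT/Viterbi commonality mentioned above can be illustrated with a behavioral model of one datapath element that is configured either as a radix-2 FFT butterfly or as a Viterbi add-compare-select (ACS) unit: both modes are built around a shared pair of adders. The model is a simplification I introduce for illustration (real-valued butterfly, scalar branch metric), not the dissertation's accelerator design.

```python
# Behavioral model of a reconfigurable datapath element acting as either
# a radix-2 FFT butterfly or a Viterbi ACS unit. The "mode" string stands
# in for a hardware configuration word; the shared structure is the adders.

def shared_unit(mode, a, b, w=1.0):
    if mode == "fft":
        # radix-2 butterfly (real-valued for simplicity): (a + w*b, a - w*b)
        t = w * b
        return a + t, a - t
    elif mode == "viterbi_acs":
        # ACS: add branch metric w to both path metrics, compare, select
        # the survivor; also return which path won (the decision bit)
        m0, m1 = a + w, b + w
        return min(m0, m1), 0 if m0 <= m1 else 1
    raise ValueError(f"unknown mode: {mode}")
```

In FFT mode, `shared_unit("fft", 1.0, 2.0, 3.0)` yields the butterfly pair `(7.0, -5.0)`; in ACS mode the same adders produce the surviving path metric and a decision bit. Time-multiplexing one structure between the two functions is the intuition behind the reported area saving.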
Keywords/Search Tags: Coarse-grained reconfigurable array, reconfigurable computing, domain-specific computing, instruction-level parallelism, loop-level parallelism, reconfigurable accelerator