Research On Fault-tolerant Technologies For NoC-based Manycore Systems With Redundant Cores

Posted on:2016-11-22

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Z X Wu

Full Text:PDF

GTID:1108330479478683

Subject:Microelectronics and Solid State Electronics

Abstract/Summary:

As chip feature sizes shrink and system complexity grows, the fault tolerance of manycore systems has become a problem that cannot be ignored. Processor cores are the essential components that provide processing power in manycore systems, and employing redundant cores is a commonly used scheme for tolerating core faults. Minimizing the performance degradation caused by permanent core faults at the least cost is therefore one of the greatest challenges researchers currently face. In manycore systems, core faults affect not only the physical topology of the chip but also software execution. To ensure the manageability of the system and the balance of task workloads, and to reduce the impact of physical topology changes on task execution after processor core faults, this dissertation takes NoC-based manycore systems with redundant cores as its research object. Focusing on the problem of tolerating permanent processor core faults, the research covers four aspects: fault tolerance of the management structure, task migration, virtual topology restoration, and physical topology restoration. The main work is as follows.

(1) Fault-tolerant strategies for the management structure of manycore systems. When permanent core faults occur in a manycore system, the first problem to solve is how the system can recover from the failure. Because the management structure is directly in charge of resource management in the manycore system, the system can recover from a failure only if the management structure itself has strong fault-tolerant capability. To enhance the fault resilience of the management structure, this dissertation studies a role-changeable fault-tolerant management approach.
Starting from typical hierarchical management, the approach adopts a role-changeable management structure. On top of this structure, a mutual-monitoring mechanism, an adaptive management mechanism, a voting mechanism, and a self-waking mechanism are proposed, so that each core is able to judge and to reconstruct the management structure. Experiments show that the approach enables the management structure to tolerate processor core faults in various distributions. With an overhead of 20 KB of ROM and 35.6 KB of RAM, the manycore system can successfully reconstruct the management structure under various fault conditions and maintain operation. In normal operation, the approach introduces only 1.48% computational overhead.

(2) A load-balancing task migration algorithm. Once management of the manycore system has been recovered, tasks on faulty cores must be moved to fault-free cores to continue execution. Finding optimal migration destinations for tasks is a task assignment problem, which is NP-complete, so the optimal solution is difficult to obtain in a short time. To obtain a satisfactory load-balanced migration scheme in a relatively short time, this dissertation improves the standard genetic algorithm and studies an adaptive-crossover, An chaotic mapping disturbed genetic migration algorithm. The algorithm replaces the fixed crossover rate of the standard genetic algorithm with an adaptive crossover rate to accelerate convergence, and uses a decreasing crossover point selection scheme to alleviate premature convergence and to balance search speed between the early and late search periods. In addition, to further improve the local search ability of the algorithm, the An chaotic mapping is used to apply a disturbance to the best individual in each generation.
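The ideas in item (2) can be illustrated with a minimal sketch. It is not the dissertation's algorithm: the adaptive crossover rule, all parameter values, and the function names are assumptions made for illustration, the logistic map is swapped in for the An chaotic mapping (whose definition is not given in this abstract), and the decreasing crossover point selection scheme is omitted. Fitness here is simply the standard deviation of per-core load, lower being better.

```python
import random
import statistics


def load_stdev(assign, task_load, base_load):
    """Standard deviation of per-core load after placing migrated tasks (lower is better)."""
    total = list(base_load)
    for task, core in enumerate(assign):
        total[core] += task_load[task]
    return statistics.pstdev(total)


def adaptive_rate(pair_best, pop_best, pop_avg, pc_max=0.9, pc_min=0.5):
    """Assumed rule: pairs better than average get a lower crossover rate to protect them."""
    if pair_best >= pop_avg or pop_avg == pop_best:
        return pc_max
    return pc_min + (pc_max - pc_min) * (pair_best - pop_best) / (pop_avg - pop_best)


def chaotic_disturb(best, cost, n_cores, x, steps=20):
    """Perturb single genes of the best individual with a logistic-map sequence."""
    best = list(best)
    best_cost = cost(best)
    for _ in range(steps):
        x = 4.0 * x * (1.0 - x)  # logistic map, a stand-in chaotic source
        gene = min(int(x * len(best)), len(best) - 1)
        x = 4.0 * x * (1.0 - x)
        core = min(int(x * n_cores), n_cores - 1)
        trial = best[:gene] + [core] + best[gene + 1:]
        trial_cost = cost(trial)
        if trial_cost < best_cost:  # keep only improving moves
            best, best_cost = trial, trial_cost
    return best, x


def migrate(task_load, base_load, pop_size=30, generations=60, pm=0.05, seed=1):
    """Return a list mapping each migrated task index to a surviving core index."""
    rng = random.Random(seed)
    n_tasks, n_cores = len(task_load), len(base_load)

    def cost(assign):
        return load_stdev(assign, task_load, base_load)

    pop = [[rng.randrange(n_cores) for _ in range(n_tasks)] for _ in range(pop_size)]
    x = 0.3141  # chaotic state in (0, 1)
    for _ in range(generations):
        pop.sort(key=cost)
        costs = [cost(p) for p in pop]
        pop_best, pop_avg = costs[0], sum(costs) / len(costs)
        elite, x = chaotic_disturb(pop[0], cost, n_cores, x)
        nxt = [elite]  # elitism: the disturbed best individual always survives
        while len(nxt) < pop_size:
            # Binary tournament: the population is sorted, so a lower index is fitter.
            a, b = (min(rng.sample(range(pop_size), 2)) for _ in range(2))
            child = list(pop[a])
            rate = adaptive_rate(min(costs[a], costs[b]), pop_best, pop_avg)
            if n_tasks > 1 and rng.random() < rate:
                cut = rng.randrange(1, n_tasks)
                child = pop[a][:cut] + pop[b][cut:]
            child = [rng.randrange(n_cores) if rng.random() < pm else g for g in child]
            nxt.append(child)
        pop = nxt
    return min(pop, key=cost)
```

For example, `migrate([5, 3, 8, 2, 7, 4], [10, 10, 10, 10])` distributes six orphaned tasks over four surviving cores that each already carry a load of 10. Because the disturbed elite is carried into every generation, the best cost never gets worse from one generation to the next.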
Experiments show that the proposed algorithm improves fitness and standard deviation by 33.9% and 27.1% on average, respectively, and that its optimization process is better than that of the standard genetic algorithm. Compared with the other four algorithms, the proposed algorithm produces a more balanced task distribution, which helps mitigate local overheating and promotes even aging of the entire chip.

(3) A virtual topology reconfiguration fault-tolerant method. Permanent processor core failures in manycore systems change the physical topology. To reduce the performance loss caused by these changes in the traditional 2D mesh NoC manycore system and to shorten the system recovery time, this dissertation studies a fast two-step topology reconfiguration algorithm for virtual topology restoration. The algorithm considers both the DF value of the mapping result and the computational complexity. By defining the mapping domain and using the Hungarian algorithm to solve the maximum matching problem, an initial solution is generated quickly. By restricting twisted mappings, the search area of the Tabu search is reduced. The final mapping solution is then produced by a fast Tabu search optimization of the initial mapping solution. In addition, the virtual topology layer of the previously proposed message passing model is extended with the proposed algorithm. Experimental results show that the proposed algorithm has low time overhead for fault tolerance. When faults are randomly distributed, it improves the DF optimization effect by 5.81% on average over the reference algorithm. When faults are clustered together, the improvement reaches 15.40%.
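The first step described in item (3), an initial mapping obtained by maximum matching over a restricted mapping domain, might look like the sketch below. Several things are assumptions, since the abstract does not define them: the mapping domain is taken to be the fault-free physical cores within a Manhattan radius of the logical node's position, the radius is grown until a perfect matching exists, and spares are modeled as extra columns on the right. The DF metric, the twisted-mapping restriction, and the Tabu search refinement are not shown. The matching itself is the standard augmenting-path (Hungarian) method for bipartite graphs.

```python
def mapping_domain(logical, healthy, radius):
    """Fault-free physical cores within Manhattan distance `radius` of a logical node."""
    lx, ly = logical
    return [p for p in healthy if abs(p[0] - lx) + abs(p[1] - ly) <= radius]


def max_matching(logical_nodes, healthy, radius):
    """Augmenting-path maximum bipartite matching of logical nodes to physical cores."""
    owner = {}  # physical core -> logical node currently mapped to it

    def try_place(node, seen):
        for core in mapping_domain(node, healthy, radius):
            if core in seen:
                continue
            seen.add(core)
            # Take a free core, or evict the current owner if it can move elsewhere.
            if core not in owner or try_place(owner[core], seen):
                owner[core] = node
                return True
        return False

    matched = sum(try_place(node, set()) for node in logical_nodes)
    return matched, {node: core for core, node in owner.items()}


def initial_mapping(rows, cols, spare_cols, faulty):
    """Smallest radius (and its mapping) at which every logical node gets a core, else None."""
    logical_nodes = [(x, y) for y in range(rows) for x in range(cols)]
    healthy = [
        (x, y)
        for y in range(rows)
        for x in range(cols + spare_cols)
        if (x, y) not in faulty
    ]
    for radius in range(rows + cols + spare_cols):
        matched, mapping = max_matching(logical_nodes, healthy, radius)
        if matched == len(logical_nodes):
            return radius, mapping
    return None  # more faults than spares
```

With a 3×3 logical mesh, one spare column, and a fault at `(1, 1)`, calling `initial_mapping(3, 3, 1, {(1, 1)})` finds a complete mapping at radius 1. Keeping the radius small keeps each logical node close to its original position, which is the intuition behind restricting the mapping domain.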
Hence, the proposed algorithm adapts well to different fault distributions.

(4) A physical topology reconfiguration fault-tolerant method. Although the virtual topology technique can alleviate the impact of physical topology changes on upper-layer software, for the traditional 2D mesh NoC manycore system the technique alone does not guarantee complete recovery of system performance. To solve this problem, routers and multiplexers are added to the traditional 2D mesh NoC architecture, and a reconfigurable 2D mesh structure whose physical topology can be restored is studied. A topology reconfiguration algorithm based on this structure is then studied to find effective topology reconfiguration schemes. By obtaining local optima step by step, the algorithm gradually approaches the global optimum; under certain conditions it modifies the initial solution and performs a new search. Experiments show that the area overhead of applying the proposed structure to the Intel 80-core chip is only 3.8%. For manycore systems with a single column of redundant cores, and for those with one column and one row of redundant cores, whose working network size does not exceed 12×12, the proposed algorithm achieves a successful reconfiguration rate of more than 90% when the number of faulty cores does not exceed 5.1% and 7.7% of the total number of cores, respectively. The presented method provides a low-area-cost solution for complete recovery of system performance.
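To make the role of a redundant column concrete, the sketch below shows a deliberately simple row-shift baseline. It is not the reconfiguration algorithm studied in the dissertation, and the function names are hypothetical. It assumes one spare column on the right and lets each row bypass its faulty core by shifting toward the spare, so it fails as soon as any row holds two faults. A small Monte Carlo helper estimates how often such a restricted scheme succeeds.

```python
import random


def row_shift_reconfigure(rows, cols, faulty):
    """Map a rows x cols logical mesh onto a rows x (cols + 1) physical array.

    Returns {logical (x, y): physical (x, y)}, or None if some row has too many faults.
    """
    mapping = {}
    for y in range(rows):
        healthy = [x for x in range(cols + 1) if (x, y) not in faulty]
        if len(healthy) < cols:
            return None  # this row has more faults than its single spare can cover
        for lx, px in enumerate(healthy[:cols]):
            mapping[(lx, y)] = (px, y)
    return mapping


def success_rate(rows, cols, n_faults, trials=2000, seed=0):
    """Fraction of random fault patterns the row-shift baseline can repair."""
    rng = random.Random(seed)
    cells = [(x, y) for y in range(rows) for x in range(cols + 1)]
    ok = sum(
        row_shift_reconfigure(rows, cols, set(rng.sample(cells, n_faults))) is not None
        for _ in range(trials)
    )
    return ok / trials
```

In this baseline a single fault is always repairable, while two faults in the same row are not. A structure with added routers and multiplexers, as described above, is meant to relax exactly this kind of restriction.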

Keywords/Search Tags:

Network-on-Chip, manycore system, core-level redundancy, fault tolerance

Related items

1	The Research Of Redundancy And Fault-Tolerant Technology Based On Real-Time Operation System
2	Study Of Fault Tolerance Method Based On Redundancy Transmission For Network On Chip Soft Error
3	Research On Linux Kernel-Level Fault-Tolerant Technology Supporting Multi-process
4	Research On Fault-tolerance Technology Of Fault-Aware And High-Reliability Router In Three-Dimensional Network-on-Chip
5	Network-on-chip Enabled Manycore Architectures for Cyber-physical Syste
6	Research On High-efficiency On-chip Routing Architecture And Its Optimization Techniques For Many-core Communication
7	Research On The Key Technology Of Fault Tolerance For Network-on-chip
8	Research Of The Fault-detect And Fault-tolerant Methods On Router In Network-on-Chip
9	Research On The Key Techniques Of Soft Error Tolerance Design On Multi Core Microprocessor
10	XDFT: An Extensible Dynamic Fault Tolerance Mechanism For Cooperative Plotting System