Font Size: a A A

Loop unrolling along wavefronts and wavefront based techniques for exploiting instruction and thread level parellelism

Posted on:2014-03-12Degree:Ph.DType:Thesis
University:Santa Clara UniversityCandidate:Steinbrecher, JohannFull Text:PDF
GTID:2458390008455971Subject:Engineering
Abstract/Summary:
Loop unrolling is a loop transformation where a few loop iterations are grouped as a super iteration to have more independent instructions that feed multiple pipelines in a system, to decrease the total loop overhead or explore data locality by reusing loaded data. In this thesis, loop unrolling is characterized by the unrolling factor, the number of iterations in a super iteration and the unrolling direction, the choice of iterations to be grouped to form the super iteration. Data dependencies exist between different iterations. To increase the number of independent instructions in the super iteration, independent iterations should be grouped to form the super iteration. In most previous work, loops are unrolled along the inner loop, which may group dependent iterations together and not be the best unrolling direction. This thesis uses a linear schedule to group independent iterations on the same wave front, a hyper-plane in the loop iteration space. Then, the loop is unrolled along the wavefront which guarantees all iterations in the same super iteration are independent. The selection of the optimal unrolling factor is based on the assumption that if all the pipelines are saturated, i.e., there is no bubble, the performance should not be bad. Two necessary conditions and a sufficient condition for optimality are presented and used to find the optimal unrolling factor for systems with low memory access latencies. The total execution time is expressed as a function of algorithm parameters such as the data dependence structure and instruction counts, architecture parameters such as the number of pipelines and number of registers and the unrolling factor. An example, the longest common subsequence problem is used to demonstrate the method. An implementation of the proposed unrolling on the longest common subsequence problem scores a speedup of 1.14 on an IntelRTMAtom(TM) micro-architecture over traditional methods. Furthermore, it is presented how loop unrolling along wavefronts can aim in partitioning a loop nest in multiple threads. A sufficient condition for creating communication free threads is presented and applied. Additionally, new implementation strategies for supernode transformations based on the concept of wavefronts are presented.
Keywords/Search Tags:Unrolling, Loop, Super iteration, Wavefronts, Iterations, Presented
Related items