Richard Littin, Department of Computer Science, University of Waikato
Control and data flow speculation can improve processor performance through increased ILP. First it is demonstrated how aggressive speculation on both control decisions and data values can produce parallelism in a simple ``sequential'' problem. It is then shown through simulation that a speculative architecture can extract parallelism from more complex code examples. Finally a discussion about the effects of limiting resources in an aggressive speculative architecture is entered into.
In modern computer architectures one means to increase processor performance is to process instructions in parallel. This is commonly known as instruction level parallelism (ILP). In dynamic forms of ILP instructions enter a re-order buffer where they are pooled awaiting execution. These in-flight instructions are processed out-of-order to utilize the available computation resources. Where resources are adequate the size of the in-flight instruction window places upper limit on parallelism that can be obtained. The challenge is to maximize number of in-flight instructions in the instruction window.
When an instruction enters the re-order buffer its input dependencies are checked against the other instructions in the buffer. Once the input dependencies have been resolved the instruction is executed, possibly out-of-order with respect to the programmed order of other instructions in the buffer. Upon completion of execution instructions are stored in retirement buffers so that they can be retired from the system in their programmed order. Retiring instructions in the same order as they would have been retired in an in-order system ensures that the program performs correct computation. Freeing the instructions from the re-order buffer in order means that the program's execution can proceed forward.
The problem is that data and control flow dependencies limit the number of instructions that can be pooled in the re-order buffer at a given time. Program control flow is regulated through branching decisions based on values determined by data and control variables. This means that instructions after a branch cannot be loaded into the re-order buffer until the branch condition has been evaluated. Since, in typical programs, one in seven instructions is a branch [2] it becomes difficult to keep a re-order buffer of more than a few entries full. One way to break the constraints imposed by data and control dependencies is to use speculation.
In modern CPUs the most common forms of speculation are those that predict the direction that program control will proceed. Compilers unroll loops, reducing the number of branches, where it can be determined that the loop will execute in a defined pattern. At run time branch prediction mechanisms record information about the directions taken by branches and make a guess as to the direction to take in future branches. This dynamic prediction increases the pool of in-flight instructions, but when the predictions made are incorrect any instructions on the incorrectly predicted paths need to be quashed.
Programs contain blocks of code that are guaranteed to execute no matter what combination of branches precede them. This form of control flow speculation, code convergence [8], provides checkpoints in the quashing process. Although current production CPUs do not make use of code convergence, research architectures such as the Wisconsin Multiscalar Machine [9] and Stanford Hydra [7] use it to speculatively execute future tasks.
To further extend the level of speculation more aggressive data flow speculation can be used. Data flow speculation takes on several forms:
The speculated values can then be used to allow execution to proceed. With these forms of speculation the system requires the ability to undo (rollback) any incorrect execution caused by incorrectly predicted values and then re-execute with correct values. All forms of speculation need to keep a record of speculated events (state) so that they can be rolled back and re-executed if necessary.
To show how data and control speculation can improve performance through parallelism consider the addition of two binary coded decimal numbers in Figure 1 a). Figure 1 b) shows the C code to perform the addition assuming the BCD numbers are stored in arrays. Initially the carry value is set to zero and it forms a data dependence between iterations of the loop.
Speculating on the value of carry between loop iterations, all three can be run in parallel, if it is zero. Figure 2 shows the flow of data for the speculative execution of the three loop iterations assuming arbitrary resources. On the first step all the digits are loaded. On the second the digit pairs are added and the carry values are loaded. At this point the carry value is zero for all three iterations. On the third and fourth steps the carrys are added and the division/modulus operations take place. On step five three values are written to their appropriate locations in the ans array and three values for carry are written.
The relationships between the carry writes and reads of the 2nd and 3rd loop iteration are determined. It is discovered that the value read (0) on cycle two by iteration 2 was not correct (it should have been a 1). This incorrect data speculation point is indicated by the dashed circles. The computation dependent on the value of carry in iteration 2 (the shaded region) has to be re-executed. On step six the new value for carry is loaded (forwarded) and the remaining computation for iteration 2 is re-executed. On step nine new values are stored in ans and carry. This time the carry value data dependence with the 3rd loop iteration has been speculatively guessed correctly so it is not re-executed.
The speculative execution of the BCD addition completed its computation in nine steps. If no speculation had been done on the carry value then the best execution time would have been 13 steps. The inter-loop dependence would have forced a linearisation to take place. To summarize, the data speculative computation gave a parallelism improvement of 13/9 = 1.44 over the best conservative parallel execution and parallelism of 15/9 = 1.67 over the case where each iteration had been executed in sequence.
The WarpEngine [1] is a theoretical architecture that performs aggressive data and control speculation to extract parallelism. Event synchronization is based on the Time Warp [3] mechanism which is used in parallel discrete event simulation. The event granularity of the WarpEngine is a fixed size group of instructions organized into a block. Each block roughly corresponds to a basic block in conventional programming languages.
Within an active block, known as a frame, instructions are processed using data flow techniques. Instructions are activated for processing when their inputs are ready. Communication between frames is achieved through read and write operations on the memory system. All computation instructions within a block are activated when the block is activated. Instructions associated with control flow (branching) can be conditionally executed.
A tree based structure [6] is used to efficiently control block activation so that speculation can take place. It also provides an ordering amongst blocks for synchronization purposes. Using code convergence information, along with dynamically evaluated branches, a tree structure can be generated to control the program. The program's sequential, or virtual, order is maintained by performing a left-to-right pre-order traversal of the control tree. Speculation is achieved by treating blocks as independent entities and activating blocks at the same level of the tree in parallel. Inter-block dependencies are maintained by the memory system.
When evaluating features of the WarpEngine architecture a virtual ordered functional level simulator is used. In this simulation model events are processed in their programmed (virtual) order by performing a pre-order traversal of the control tree. This provides a simple method of ensuring that all events have valid inputs. To determine program run times, information about the real-time (CPU cycle) at which events would have occurred is maintained. The CPU cycle at which the last event is retired is given as the program's finishing time.
To examine various programming concepts several algorithms have been coded. These include:
Many aspects of the WarpEngine architecture have been examined, but due to the space constraints of this paper only the results of potential parallel tests are shown. In the following section potential parallelism is calculated using
| sequential execution time |
| length of critical path |
In the first set of results arbitrary resources were assumed when performing simulation runs. With this lack of resource restrictions it is possible to determine the parallelism available in a given application. Figure 3 graphs the parallelism obtained for N x N matrix multiplication. As expected, with this highly parallelizable algorithm, a lot of parallelism is obtained. The parallelism obtained increases with problem size.
Figure 4 shows the potential parallelism for the
AVL [4] binary tree insertion algorithm when run on different
random data sets. This sort of problem is usually deemed ``hard to
parallelize'' as the program control path is highly coupled to the data
being processed and cannot be predicted statically. Speculative execution
extracts significant amounts of parallelism though. This is a situation
where aggressive speculation wins over conservative methods. The
parallelism obtained varies for each data set for a given problem
size. Unlike matrix multiplication where control flow decisions are fixed,
the course of action taken in AVL insertion is dependent on the
data. Varying the data means that different decisions are made during
program execution and the parallelism fluctuates accordingly.
Figure 5 shows the amount of parallelism obtained when computation resources are limited for matrix multiplication. Each line represents a fixed number of frames that can be active at a given time. The number of frames available corresponds to the instruction window size in contemporary architectures. Once again it is assumed that there are enough computation resources available.
Each plot ascends to a peak and then drops off as the problem size increases even though the available potential parallelism increases with problem size. This diminishing parallelism phenomenon holds for all frame limitations and for a variety of different problems, although it is most marked in matrix based problems. The point at which, and the magnitude of, the peak changes as a function of the number of available frames and the problem size.
The problem of diminishing obtainable parallelism occurs because as some problems increase in size their inner sequential paths get longer. More of the speculative execution ends up waiting in retirement buffers for prior (in virtual order) execution to complete. When frame resources (instruction buffers) are limited this prevents other more speculative instructions from being activated, effectively preventing them from running in parallel.
Using code convergence to map programs onto a tree based control structure provides an adequate speculation control mechanism. This is an effective event ordering mechanism, but for the architecture to perform well an efficient mechanism is needed to disambiguate memory accesses. One method for doing this is through the use of timestamps.
When re-order buffer space is limited the amount of parallelism obtained can drop off even when the parallelism available increases. This phenomenon is not visible in today's architectures as their re-order buffer sizes are relatively small. As larger amounts of parallelism are sought this effect will be experienced and it is an important issue that needs to be addressed.
So far all tests have used ``toy'' problems. Programs with more complex data and control dependencies need to be examined to see if the same trends hold.
This document was generated using the LaTeX2HTML translator Version 98.1p1 release (March 2nd, 1998)
Copyright © 1993, 1994, 1995, 1996, 1997, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
latex2html -split 0 -top_navigation -bottom_navigation gradconf.
The translation was initiated by Richard H Littin on 1999-03-17