Data and Control Speculative Execution

Richard Littin, Department of Computer Science, University of Waikato

Control and data flow speculation can improve processor performance through increased ILP. First it is demonstrated how aggressive speculation on both control decisions and data values can produce parallelism in a simple ``sequential'' problem. It is then shown through simulation that a speculative architecture can extract parallelism from more complex code examples. Finally a discussion about the effects of limiting resources in an aggressive speculative architecture is entered into.

Introduction

In modern computer architectures one means to increase processor performance is to process instructions in parallel. This is commonly known as instruction level parallelism (ILP). In dynamic forms of ILP instructions enter a re-order buffer where they are pooled awaiting execution. These in-flight instructions are processed out-of-order to utilize the available computation resources. Where resources are adequate the size of the in-flight instruction window places upper limit on parallelism that can be obtained. The challenge is to maximize number of in-flight instructions in the instruction window.

When an instruction enters the re-order buffer its input dependencies are checked against the other instructions in the buffer. Once the input dependencies have been resolved the instruction is executed, possibly out-of-order with respect to the programmed order of other instructions in the buffer. Upon completion of execution instructions are stored in retirement buffers so that they can be retired from the system in their programmed order. Retiring instructions in the same order as they would have been retired in an in-order system ensures that the program performs correct computation. Freeing the instructions from the re-order buffer in order means that the program's execution can proceed forward.

The problem is that data and control flow dependencies limit the number of instructions that can be pooled in the re-order buffer at a given time. Program control flow is regulated through branching decisions based on values determined by data and control variables. This means that instructions after a branch cannot be loaded into the re-order buffer until the branch condition has been evaluated. Since, in typical programs, one in seven instructions is a branch [2] it becomes difficult to keep a re-order buffer of more than a few entries full. One way to break the constraints imposed by data and control dependencies is to use speculation.

Speculation

In modern CPUs the most common forms of speculation are those that predict the direction that program control will proceed. Compilers unroll loops, reducing the number of branches, where it can be determined that the loop will execute in a defined pattern. At run time branch prediction mechanisms record information about the directions taken by branches and make a guess as to the direction to take in future branches. This dynamic prediction increases the pool of in-flight instructions, but when the predictions made are incorrect any instructions on the incorrectly predicted paths need to be quashed.

Programs contain blocks of code that are guaranteed to execute no matter what combination of branches precede them. This form of control flow speculation, code convergence [8], provides checkpoints in the quashing process. Although current production CPUs do not make use of code convergence, research architectures such as the Wisconsin Multiscalar Machine [9] and Stanford Hydra [7] use it to speculatively execute future tasks.

To further extend the level of speculation more aggressive data flow speculation can be used. Data flow speculation takes on several forms:

Value prediction. Hardware mechanisms predict the values of data items by observing patterns in the data. Prediction accuracies of up to 80% have been observed [5]. An example of where this method works well is index counter variables.
Address prediction. The locations of data can be predicted (addresses of array elements) and their values can then be pre-loaded. This is known as prefetching.
Memory system optimism. Multiprocessing systems with local caches return the value that they have for the given location. Due to cache incoherence the value may not be correct, but the correct value will be returned eventually.

The speculated values can then be used to allow execution to proceed. With these forms of speculation the system requires the ability to undo (rollback) any incorrect execution caused by incorrectly predicted values and then re-execute with correct values. All forms of speculation need to keep a record of speculated events (state) so that they can be rolled back and re-executed if necessary.

Execution of BCD arithmetic

To show how data and control speculation can improve performance through parallelism consider the addition of two binary coded decimal numbers in Figure 1 a). Figure 1 b) shows the C code to perform the addition assuming the BCD numbers are stored in arrays. Initially the carry value is set to zero and it forms a data dependence between iterations of the loop.


  
Figure 1: Binary coded decimal arithmetic, a) an example, b) C code to perform the BCD addition.
\begin{figure}
\begin{center}
\epsfysize=3.0 cm
\leavevmode \epsfbox{eqn1.eps}
\end{center}\end{figure}

Speculating on the value of carry between loop iterations, all three can be run in parallel, if it is zero. Figure 2 shows the flow of data for the speculative execution of the three loop iterations assuming arbitrary resources. On the first step all the digits are loaded. On the second the digit pairs are added and the carry values are loaded. At this point the carry value is zero for all three iterations. On the third and fourth steps the carrys are added and the division/modulus operations take place. On step five three values are written to their appropriate locations in the ans array and three values for carry are written.

The relationships between the carry writes and reads of the 2nd and 3rd loop iteration are determined. It is discovered that the value read (0) on cycle two by iteration 2 was not correct (it should have been a 1). This incorrect data speculation point is indicated by the dashed circles. The computation dependent on the value of carry in iteration 2 (the shaded region) has to be re-executed. On step six the new value for carry is loaded (forwarded) and the remaining computation for iteration 2 is re-executed. On step nine new values are stored in ans and carry. This time the carry value data dependence with the 3rd loop iteration has been speculatively guessed correctly so it is not re-executed.


  
Figure 2: Flow analysis for the speculative execution of the example in Figure 1.
\begin{figure}
\begin{center}
\epsfysize=8.2 cm
\leavevmode \epsfbox{eqn-spe.eps}
\end{center}\end{figure}

The speculative execution of the BCD addition completed its computation in nine steps. If no speculation had been done on the carry value then the best execution time would have been 13 steps. The inter-loop dependence would have forced a linearisation to take place. To summarize, the data speculative computation gave a parallelism improvement of 13/9 = 1.44 over the best conservative parallel execution and parallelism of 15/9 = 1.67 over the case where each iteration had been executed in sequence.

The WarpEngine

The WarpEngine [1] is a theoretical architecture that performs aggressive data and control speculation to extract parallelism. Event synchronization is based on the Time Warp [3] mechanism which is used in parallel discrete event simulation. The event granularity of the WarpEngine is a fixed size group of instructions organized into a block. Each block roughly corresponds to a basic block in conventional programming languages.

Within an active block, known as a frame, instructions are processed using data flow techniques. Instructions are activated for processing when their inputs are ready. Communication between frames is achieved through read and write operations on the memory system. All computation instructions within a block are activated when the block is activated. Instructions associated with control flow (branching) can be conditionally executed.

A tree based structure [6] is used to efficiently control block activation so that speculation can take place. It also provides an ordering amongst blocks for synchronization purposes. Using code convergence information, along with dynamically evaluated branches, a tree structure can be generated to control the program. The program's sequential, or virtual, order is maintained by performing a left-to-right pre-order traversal of the control tree. Speculation is achieved by treating blocks as independent entities and activating blocks at the same level of the tree in parallel. Inter-block dependencies are maintained by the memory system.

Experimentation

When evaluating features of the WarpEngine architecture a virtual ordered functional level simulator is used. In this simulation model events are processed in their programmed (virtual) order by performing a pre-order traversal of the control tree. This provides a simple method of ensuring that all events have valid inputs. To determine program run times, information about the real-time (CPU cycle) at which events would have occurred is maintained. The CPU cycle at which the last event is retired is given as the program's finishing time.

To examine various programming concepts several algorithms have been coded. These include:

Dynamic structures. AVL and naive binary tree insertion,
Sorting. Heapsort and quicksort,
Array manipulation. Matrix multiplication and Gauss-Jordan elimination, and
Recursion. Recursive Fibonacci number generation.
Each problem has been examined using various problem sizes and where appropriate different data sets.

Many aspects of the WarpEngine architecture have been examined, but due to the space constraints of this paper only the results of potential parallel tests are shown. In the following section potential parallelism is calculated using

sequential execution time

length of critical path

Results

In the first set of results arbitrary resources were assumed when performing simulation runs. With this lack of resource restrictions it is possible to determine the parallelism available in a given application. Figure 3 graphs the parallelism obtained for N x N matrix multiplication. As expected, with this highly parallelizable algorithm, a lot of parallelism is obtained. The parallelism obtained increases with problem size.


  
Figure 3: Potential parallelism for matrix multiplication.
\begin{figure*}
\center{
\leavevmode \epsfbox{potpar-mat.eps}
}
\end{figure*}

Figure 4 shows the potential parallelism for the AVL [4] binary tree insertion algorithm when run on different random data sets. This sort of problem is usually deemed ``hard to parallelize'' as the program control path is highly coupled to the data being processed and cannot be predicted statically. Speculative execution extracts significant amounts of parallelism though. This is a situation where aggressive speculation wins over conservative methods. The parallelism obtained varies for each data set for a given problem size. Unlike matrix multiplication where control flow decisions are fixed, the course of action taken in AVL insertion is dependent on the data. Varying the data means that different decisions are made during program execution and the parallelism fluctuates accordingly.

  
Figure 4: Potential parallelism for AVL binary tree insertion.
\begin{figure*}
\center{
\leavevmode \epsfbox{potpar-avl.eps}
}
\end{figure*}

Limited Resources

Figure 5 shows the amount of parallelism obtained when computation resources are limited for matrix multiplication. Each line represents a fixed number of frames that can be active at a given time. The number of frames available corresponds to the instruction window size in contemporary architectures. Once again it is assumed that there are enough computation resources available.


  
Figure 5: Potential parallelism for matrix multiplication with limited resources.
\begin{figure*}
\center{
\leavevmode \epsfbox{limit-mat.eps}
}
\end{figure*}

Each plot ascends to a peak and then drops off as the problem size increases even though the available potential parallelism increases with problem size. This diminishing parallelism phenomenon holds for all frame limitations and for a variety of different problems, although it is most marked in matrix based problems. The point at which, and the magnitude of, the peak changes as a function of the number of available frames and the problem size.

The problem of diminishing obtainable parallelism occurs because as some problems increase in size their inner sequential paths get longer. More of the speculative execution ends up waiting in retirement buffers for prior (in virtual order) execution to complete. When frame resources (instruction buffers) are limited this prevents other more speculative instructions from being activated, effectively preventing them from running in parallel.

Summary

An architecture that incorporates aggressive speculation on both control decisions and data values has the ability extract significant amounts of ILP. This has been shown to be true for those programs that depend on data for decision making and are otherwise hard to parallelize.

Using code convergence to map programs onto a tree based control structure provides an adequate speculation control mechanism. This is an effective event ordering mechanism, but for the architecture to perform well an efficient mechanism is needed to disambiguate memory accesses. One method for doing this is through the use of timestamps.

When re-order buffer space is limited the amount of parallelism obtained can drop off even when the parallelism available increases. This phenomenon is not visible in today's architectures as their re-order buffer sizes are relatively small. As larger amounts of parallelism are sought this effect will be experienced and it is an important issue that needs to be addressed.

So far all tests have used ``toy'' problems. Programs with more complex data and control dependencies need to be examined to see if the same trends hold.

Information

More information about the WarpEngine can be found at:
http://www.cs.waikato.ac.nz/timewarp/wengine/

Bibliography

1
John G. Cleary, Murray W. Pearson, and Husam Kinawi.
The architecture of an optimistic CPU: The WarpEngine.
In Proceedings of HICSS, volume 1, pages 163-172, Hawaii, 1995.

2
John L. Hennessy and David A. Patterson.
Computer Architecture A Quantitative Approach.
Morgan Kaufmann Publishers, Inc., San Francisco, second edition, 1996.

3
David Jefferson.
Virtual time ii: Storage management in distributed simulation.
In Proceedings of the 9th Annual ACM Symposium on Principles of Distributed Computing, pages 75-89, August 1990.

4
H. R. Lewis and L. Denenberg.
Data Structures & Their Algorithms.
Harper Collins, 1991.

5
Mikko H. Lipasti and John Paul Shen.
Extending the dataflow limit via value prediction.
In Proceedings of the 29th Annual ACM/IEEE Symposium on Microarchitecture, pages 226-237, Paris, France, December 1996.

6
Richard H. Littin, J. A. David McWha, Murray W. Pearson, and John G. Cleary.
Block based execution and task level parallelism.
In John Morris, editor, The 3rd Australasian Computer Architecture Conference (ACAC'98), volume 20 of Australian Computer Science Communications, pages 57-66, Perth, Australia, February 1998. Springer-Verlag.

7
Jeffrey Oplinger, David Heine, Shih-Wei Liao, Basem A. Nayfeh, Monica S. Lam, and Kunle Olukotun.
Software and hardware for exploiting speculative parallelism with a multiprocessor.
Technical Report CSL-TR-97-715, Computer Systems Laboratory, Stanford University, Stanford, CA 94305-4070, February 1997.

8
Murray W. Pearson, Richard H. Littin, J. A. David McWha, and John G. Cleary.
Applying Time Warp to CPU design.
In High Performance Computing Conference '97, pages 290-295, Bangalore, India, 1997. IEEE.

9
G. S. Sohi, S. Breach, and T. N. Vijaykumar.
Multiscalar processors.
In 22nd International Symposium on Computer Architecture (ISCA-22), pages 414-425, New York, 1995. ACM Press.

About this document ...

This document was generated using the LaTeX2HTML translator Version 98.1p1 release (March 2nd, 1998)

Copyright © 1993, 1994, 1995, 1996, 1997, Nikos Drakos, Computer Based Learning Unit, University of Leeds.

The command line arguments were:
latex2html -split 0 -top_navigation -bottom_navigation gradconf.

The translation was initiated by Richard H Littin on 1999-03-17


Richard H Littin
1999-03-17