What is the WarpEngine?

The WarpEngine is a new general purpose computer architecture which implements the Time Warp mechanism directly in hardware. The virtual memory model seen by the programmer of the WarpEngine has a single linear address space modified by a single thread of execution. Thus there is no need for locks or other explicit synchronising actions when accessing memory. One consequence of this is that existing "dusty desk" code can be parallelised just by recompiling to the WarpEngine. The intended physical implementation is multiple CPUs with their own caches and local memory. Also each CPU simultaneously executes multiple threads of control.

See the WarpEngine Papers.


Time Warp

TimeWarp is an algorithm for distributing and parallelising discrete event simulations. The algorithm was originally proposed by Jefferson in 1985.

Parallelising discrete event simulations is a difficult problem because of the need to synchronise the simulation times of different simulation objects without violating the principle of causality [8]. The TimeWarp algorithm solves this by executing events optimistically and later, if necessary, rolling-back (undoing) events that should not in fact have been executed. A simulation executed using TimeWarp is usually decomposed as a series of objects which communicate by passing messages. In simulation terms, these messages can be thought of as scheduling events which are specified to occur at a particular receiving object at a particular time.

To rollback the execution of an event two things must be accomplished: restore the state of the system; and undo the effect of any messages sent by the event. This latter is accomplished by sending anti-messages following the earlier messages. When an anti-message is received, it can cause two effects. If the original message had not as yet been executed (it is queued somewhere waiting to be scheduled) then it is annihilated and no further action is taken. If the original message had been executed then it will be rolledback (with possible transmission of further anti-messages), the antimessage and message will "annihilate" and execution will resume as if the original message (and anti-message) had never been received.

A common technique for restoring the state of objects on a rollback is to take a copy of the object's state each time it receives a message (much more efficient techniques are possible but do not concern us here). Then when a rollback occurs, the appropriate previous state can be copied back into the object. One side effect of this is that each object builds up a queue of old state copies. It is hence necessary to eventually garbage collect these old state copies (this is referred to as "fossil" collection). This can only be done when it is known that a state copy will never be required for a rollback. This is done by periodically computing a value called the GVT (Global Virtual Time) which is equal to the minimum time of any currently active object or message between objects. It is guaranteed by the TimeWarp mechanism that no object will rollback to a virtual time prior to GVT, and hence it is possible to fossil collect all state copies with time stamps less than GVT. GVT is also used to "commit" other interactions with the outside world. For example, a request to print cannot be honoured until GVT passes the time that the request was made.

Good performance by TimeWarp relies on a number of factors. First, the cascades of anti-messages induced by rollbacks must damp out quickly and not continue building. The overheads required to restore state on a rollback also cannot be too expensive. In practice, this means that the cost of recording state changes during forward execution must be low. Given all of this, TimeWarp is capable of parallelising and speeding up problems that are otherwise intractable. TimeWarp is the only algorithm known which is capable of robustly speeding up a wide range of discrete event simulation problems but that this is contingent on good implementations with low overheads.

It has been recognised ever since the algorithm was originally proposed that TimeWarp is not just an algorithm for parallelising discrete event simulations but is also a general purpose technique for synchronising parallel computation. For example, it can be used as an undo mechanism in editors, consistent checkpointing for fault tolerance, incremental recovery in data bases, selective undo in groupware tools, or debugging for parallel and distributed programs. In this context it can be seen as a generalisation of optimistic commit protocols. It has also been described as an algorithm for synchronising AND-parallel Prolog. Fujimoto has proposed a Virtual Time Machine for use with discrete event simulation systems. Fujimoto's work is seminal for the current WarpEngine design.


Tree Structured Execution

As a first step toward parallelising sequential code using Time Warp, it can be mapped onto an execution tree. The idea is that basic blocks of execution are mapped to each node of the execution tree. A depth first left-to-right traversal of the tree gives the correct sequential order of execution. By arranging the control in this way it is possible to remove (or ameliorate) control dependencies during parallel execution.

To exploit this it is essential that a compiler be able to generate code that maps well onto this tree structure. For example, it is important that many nodes generate more than one child, otherwise control dependencies force sequential execution.


Timestamps

Each node in the execution tree can be associated with a string that gives the path from the root to the node. A zero is used for a left branch, one for a right branch and ¥ to terminate the string. A lexicographic ordering where ¥ < 0 < 1, places these strings in the same order as the sequential execution order of the associated nodes. Thus the strings can be used as timestamps for the virtual sequence.

Time Warp makes frequent use of timestamps. They are created when new nodes are executed and compared whenever a memory operation occurs. It is desirable to have a finite representation of the timestamps for efficient use of resources. This is particularly true of a hardware based implementation where any representation that does not fit into a single machine word of 32 or 64 bits would cause great difficulties. Even in a purely software context this is still an important consideration. It is essential however, that the basic operations on the timestamps, including comparison and creation, remain fast.

Rescaling

A problem that arises for any finite representation is the need to re-use old timestamps. As the execution tree grows, eventually there will be more nodes than can be represented. Because branches of the tree grow in depth unevenly it is the number of levels available to a branch that causes the restriction. This uneven growth tends to cause inefficient use of the timestamps and some will remain unused and be wasted.

As GVT advances, early timestamps will become available for re-use. However, to make use of these requires that current timestamps be re-allocated while retaining their ordering. In some cases the rescaling operation cannot reclaim space in the tree even though space may exist.

Cancellation, however, has a high cost, as work has been done and then thrown away to no good effect. Although these nodes are latest in the virtual sequence, and most likely to be rolled back, it is better, if possible, to avoid it.

Rescaling is expensive as every timestamp in the system has to be altered. Also, while rescaling is being performed it may be necessary to suspend execution of the program to avoid inconsistencies in the timestamps, further raising the cost of rescaling. To minimise the performance losses due to rescaling two main types of solutions exist. The first is to use a representation that requires rescaling less frequently. The second solution is to reduce the cost of rescaling.

More details about timestamp schemes can be found in the paper Timestamp Representations for Virtual Sequences in the WarpEngine Papers.


Time-Space Cache and Memory System

David and Murray know about this.

Compilers and Assemblers

No compiler yet, but if anyone is interested in writing one we would like to here from you.

There is an assembler for the WarpEngine simulator.


Simulation

We have a simulator.

Performance

We have performed some experiments to determine the amount of potential parallelism that the WarpEngine can obtain from various pieces of code. The code fragments are all simple algorithms that are representative of the types of operations that are commonly performed in real applications. The following graphs show the potential parallelism that can be extracted using a WarpEngine that has infinite resources, unlimited bandwidth to memory and functional units, and 0 cost rollbacks. All instruction timings are representative of those in today's architectures, with uniform memory access times assumed.

Binary Tree Insertion

Two tree insertion routines are examined. In both cases a sequence of the specified number of insertions is done. The values to be inserted were generated randomly. The parallelism that is extracted comes mainly from the fact that the insertions can (to a certain extent) be run in parallel.

The first of the two is a naive binary tree insertion. That is, no explicit attempt is made to balance the tree so that the worst case insertion time is O(N). However, because the values inserted were random the expected tree depth (and insertion time) is O(log2N). This algorithm does not parallelism well on architectures that use branch speculation, as all possible paths through the tree need to be speculated for each insertion. The parallelism obtained by optimistic execution relies upon code convergence between the different insertions. (Explicit shared memory code that locks only the leaves of the tree may achieve parallelism similar to optimistic systems.)

The second tree insertion program uses the AVL algorithm. This is guaranteed to have (amortized) O(log2N) insertion time. Again an optimistic system can achieve parallelism via code re-convergence, while speculative execution cannot. Because of the possibility of rotations of internal nodes an explicit shared memory system needs to lock all the nodes from the root to the leaf during insertion. This serializes the computation removing most opportunities for parallelism.

Sorting

The following programs sort an array of the specified length. HeapSort and QuickSort are both efficient sorting algorithms (on a sequential processor), but differ in the amount of dynamic parallelism that is obtainable. HeapSort repeatedly modifies a root element which serializes computation through that point. QuickSort, on the other hand, subdivides the problem into smaller problems which are independent of each other. We have implemented two forms of QuickSort that differ in the manner in which they select their pivot elements. The first selects a pivot value and moves from left to right through the array swapping elements until all elements less than the pivot are to the left of it. The second version selects the pivot value and then moves in towards the centre swapping from either end of the array until the pivot is in the correct position. Both versions have a critical execution path of approximately O(N) giving potential speedup of O(log2N). It is not entirely clear why quicksort 1 does significantly better than quicksort 2 in the later results. Although, it is note-worthy that the sub-sorts after the partition step can start earlier for quicksort 1 because the left sub-sort can start after 1 partition step and the right sub-sort after N/2 partition steps. In quicksort 2 the sub-sorts cannot effectively start until the entire partition is complete in N steps.

Matrix Multiplication

Matrix Multiplication is coded to multiply two NxN matrices. The inner loop, is linear, and because of the data dependency through the accumulator variable has a critical path of O(N). The result is total parallelism of O(N2).

In matrix multplication there is a substantial amount of parallelism that can be obtained. With no modification (ie. use of parallel constructs) of the original sequential source code the WarpEngine can extract the maximum amount of parallelism that is available.

Gauss-Jordan Elimination

Gauss-Jordan elimination is a version of matrix inversion of an NxN matrix that makes a careful choice of which row to use as a pivot at each step: which requires a loop of O(N). As a consequence the overall critical path is O(N2) and the parallelism O(N).

In Gauss-Jordan elimination, as in matrix multiplication, there is a substantial amount of parallelism that can be seen at the source code level. But unlike matrix multiplication there is a decision that is dependent on the data being manipluated - so a traditional parallel coding techniques would find it hard to extract large amounts parallelism. The WarpEngine has no problems extracting this parallelism.

Recursive Fibonacci Number Generation


List Insertion

List insertion is just a special case of binary tree insertion, but the graph shows how the WarpEngine can extract the maximum amount of parallelism out of the data dependencies that are in place. This is shown on the green line, where the amount of potential parallelism that is extracted is dependent on the data that is being store in the list.

The above graphs indicate that a machine based on the Time Warp mechanism has the potential to extract large amounts of parallelism out of standard sequential code.

Performance Limitations

Restricting number of frames

???

Memory access orders

A sequential machine executes all memory accesses in order. A machine which attempts to execute many instructions in parallel must be careful, as memory operations to a particular location are constrained in the order in which they can occur. These constraints are often described in terms of the `hazards' that exist. For example, if logically a write is followed by a read then they must be executed in that order. This is called a Read After Write (RAW) hazard. In contrast two successive reads to a location can be reordered and there is no RAR (Read After Read) hazard.

When building a parallel computer there is a trade off between increased parallelism and the extra complexity of hardware needed to ensure that hazards such as RAW are not violated. Such machines can be classified by the sets of re-ordering that are possible. For example, if reads can be re-ordered then this is said to be an RPR machine (Reads can Pass Reads). Similarly if (whenever it is safe) a later write can be interchanged with an earlier read then the machine is said to be an WPR (Writes can Pass Reads) machine. WPW (Writes Pass Writes) and RPW (Reads Pass Write) round out the possible re-orderings.

To examine the performance properties of each of the reordering types we define a set of abstract machine models and analyze the potential parallelism that is available for each machine. The machines are formed by taking combinations of the memory ordering relaxation types. They each assume infinite resources and unlimited memory bandwidth. Today's architecture are of RPR type.

The results support the current wisdom that for parallelism up to 10 or so the RPR-RPW machine is a good choice. It provides parallelism of greater than 10 for all but two of the programs (heap and qu2). However, to consistently move into the region of parallelism of more than 15 clearly requires more. The RPR-RPW machine gives parallelism of 90 or more for 3 of the 8 algorithms, but the other 5 languish between 2 and 15. Adding WPR capability to give the RPR-RPW-WPR machine gives no noticeable improvement. Adding WPW to give RPR-RPW-WPW does improve the three least parallel programs by factors of between 2 and 100. The final step to the ALL machine improves two programs: heap from 9.8 to 19.9 and qu1 from 18 to 94. The result is that the ALL machine has only one program below 19 (qu2 at 5.45) and 6 of the 8 programs have parallelism of 90 or more.

The general conclusion is that to achieve reliable speedups in the region from 20 to 100 it will be necessary to use an ALL architecture. While some of the simpler programs achieved good speedups on simpler architectures the more complex codes such as heap and qu1 improve significantly with each step in capability. We expect this phenomenon to be even more marked for more realistic programs with less regular data and control dependencies.

The WarpEngine allows memory accesses to happen in any order through the use of its timestamped memory system. More detailed infromation about memory ordering problems can be found in "Effects of Re-ordered Memory Operations on Parallelism" in the WarpEngine Papers.


Problems and Solutions

There are plenty of problems, but not as many solutions yet.

Maintained by Richard Littin, suggestions given by David McWha.

Last modified: Mon Sep 6 15:53:04 NZST 1999