See the WarpEngine Papers.
Parallelising discrete event simulations is a difficult problem because of the need to synchronise the simulation times of different simulation objects without violating the principle of causality [8]. The TimeWarp algorithm solves this by executing events optimistically and later, if necessary, rolling-back (undoing) events that should not in fact have been executed. A simulation executed using TimeWarp is usually decomposed as a series of objects which communicate by passing messages. In simulation terms, these messages can be thought of as scheduling events which are specified to occur at a particular receiving object at a particular time.
To rollback the execution of an event two things must be accomplished: restore the state of the system; and undo the effect of any messages sent by the event. This latter is accomplished by sending anti-messages following the earlier messages. When an anti-message is received, it can cause two effects. If the original message had not as yet been executed (it is queued somewhere waiting to be scheduled) then it is annihilated and no further action is taken. If the original message had been executed then it will be rolledback (with possible transmission of further anti-messages), the antimessage and message will "annihilate" and execution will resume as if the original message (and anti-message) had never been received.
A common technique for restoring the state of objects on a rollback is to take a copy of the object's state each time it receives a message (much more efficient techniques are possible but do not concern us here). Then when a rollback occurs, the appropriate previous state can be copied back into the object. One side effect of this is that each object builds up a queue of old state copies. It is hence necessary to eventually garbage collect these old state copies (this is referred to as "fossil" collection). This can only be done when it is known that a state copy will never be required for a rollback. This is done by periodically computing a value called the GVT (Global Virtual Time) which is equal to the minimum time of any currently active object or message between objects. It is guaranteed by the TimeWarp mechanism that no object will rollback to a virtual time prior to GVT, and hence it is possible to fossil collect all state copies with time stamps less than GVT. GVT is also used to "commit" other interactions with the outside world. For example, a request to print cannot be honoured until GVT passes the time that the request was made.
Good performance by TimeWarp relies on a number of factors. First, the cascades of anti-messages induced by rollbacks must damp out quickly and not continue building. The overheads required to restore state on a rollback also cannot be too expensive. In practice, this means that the cost of recording state changes during forward execution must be low. Given all of this, TimeWarp is capable of parallelising and speeding up problems that are otherwise intractable. TimeWarp is the only algorithm known which is capable of robustly speeding up a wide range of discrete event simulation problems but that this is contingent on good implementations with low overheads.
It has been recognised ever since the algorithm was originally proposed that TimeWarp is not just an algorithm for parallelising discrete event simulations but is also a general purpose technique for synchronising parallel computation. For example, it can be used as an undo mechanism in editors, consistent checkpointing for fault tolerance, incremental recovery in data bases, selective undo in groupware tools, or debugging for parallel and distributed programs. In this context it can be seen as a generalisation of optimistic commit protocols. It has also been described as an algorithm for synchronising AND-parallel Prolog. Fujimoto has proposed a Virtual Time Machine for use with discrete event simulation systems. Fujimoto's work is seminal for the current WarpEngine design.
To exploit this it is essential that a compiler be able to generate code that maps well onto this tree structure. For example, it is important that many nodes generate more than one child, otherwise control dependencies force sequential execution.
Time Warp makes frequent use of timestamps. They are created when new nodes are executed and compared whenever a memory operation occurs. It is desirable to have a finite representation of the timestamps for efficient use of resources. This is particularly true of a hardware based implementation where any representation that does not fit into a single machine word of 32 or 64 bits would cause great difficulties. Even in a purely software context this is still an important consideration. It is essential however, that the basic operations on the timestamps, including comparison and creation, remain fast.
As GVT advances, early timestamps will become available for re-use. However, to make use of these requires that current timestamps be re-allocated while retaining their ordering. In some cases the rescaling operation cannot reclaim space in the tree even though space may exist.
Cancellation, however, has a high cost, as work has been done and then thrown away to no good effect. Although these nodes are latest in the virtual sequence, and most likely to be rolled back, it is better, if possible, to avoid it.
Rescaling is expensive as every timestamp in the system has to be altered. Also, while rescaling is being performed it may be necessary to suspend execution of the program to avoid inconsistencies in the timestamps, further raising the cost of rescaling. To minimise the performance losses due to rescaling two main types of solutions exist. The first is to use a representation that requires rescaling less frequently. The second solution is to reduce the cost of rescaling.
More details about timestamp schemes can be found in the paper Timestamp Representations for Virtual Sequences in the WarpEngine Papers.
There is an assembler for the WarpEngine simulator.
Two tree insertion routines are examined. In both cases a sequence of the
specified number of insertions is done. The values to be inserted were
generated randomly. The parallelism that is extracted comes mainly from the
fact that the insertions can (to a certain extent) be run in parallel.
The first of the two is a naive binary tree insertion. That is, no explicit attempt is made to balance the tree so that the worst case insertion time is O(N). However, because the values inserted were random the expected tree depth (and insertion time) is O(log2N). This algorithm does not parallelism well on architectures that use branch speculation, as all possible paths through the tree need to be speculated for each insertion. The parallelism obtained by optimistic execution relies upon code convergence between the different insertions. (Explicit shared memory code that locks only the leaves of the tree may achieve parallelism similar to optimistic systems.)
The second tree insertion program uses the AVL algorithm. This is
guaranteed to have (amortized) O(log2N) insertion time. Again an
optimistic system can achieve parallelism via code re-convergence, while
speculative execution cannot. Because of the possibility of rotations of
internal nodes an explicit shared memory system needs to lock all the nodes
from the root to the leaf during insertion. This serializes the computation
removing most opportunities for parallelism.
The following programs sort an array of the specified length. HeapSort and
QuickSort are both efficient sorting algorithms (on a sequential
processor), but differ in the amount of dynamic parallelism that is
obtainable. HeapSort repeatedly modifies a root element which serializes
computation through that point. QuickSort, on the other hand, subdivides
the problem into smaller problems which are independent of each other.
We have implemented two forms of QuickSort that differ in the manner in
which they select their pivot elements. The first selects a pivot value and
moves from left to right through the array swapping elements until all
elements less than the pivot are to the left of it. The second version
selects the pivot value and then moves in towards the centre swapping from
either end of the array until the pivot is in the correct position. Both
versions have a critical execution path of approximately O(N) giving
potential speedup of O(log2N). It is not entirely clear why
quicksort 1 does significantly better than quicksort 2 in the later
results. Although, it is note-worthy that the sub-sorts after the partition
step can start earlier for quicksort 1 because the left sub-sort can start
after 1 partition step and the right sub-sort after N/2 partition steps. In
quicksort 2 the sub-sorts cannot effectively start until the entire
partition is complete in N steps.
Matrix Multiplication is coded to multiply two NxN matrices. The inner
loop, is linear, and because of the data dependency through the accumulator
variable has a critical path of O(N). The result is total parallelism of
O(N2).
In matrix multplication there is a substantial amount of parallelism that
can be obtained. With no modification (ie. use of parallel constructs) of
the original sequential source code the WarpEngine can extract the maximum
amount of parallelism that is available.
Gauss-Jordan elimination is a version of matrix inversion of an NxN matrix
that makes a careful choice of which row to use as a pivot at each step:
which requires a loop of O(N). As a consequence the overall critical path
is O(N2) and the parallelism O(N).
In Gauss-Jordan elimination, as in matrix multiplication, there is a
substantial amount of parallelism that can be seen at the source code
level. But unlike matrix multiplication there is a decision that is
dependent on the data being manipluated - so a traditional parallel coding
techniques would find it hard to extract large amounts parallelism. The
WarpEngine has no problems extracting this parallelism.
List insertion is just a special case of binary tree insertion, but the
graph shows how the WarpEngine can extract the maximum amount of
parallelism out of the data dependencies that are in place. This is shown
on the green line, where the amount of potential parallelism that is
extracted is dependent on the data that is being store in the list.
The above graphs indicate that a machine based on the Time Warp mechanism has the potential to extract large amounts of parallelism out of standard sequential code.
???
Memory access orders
A sequential machine executes all memory accesses in order. A machine which attempts to execute many instructions in parallel must be careful, as memory operations to a particular location are constrained in the order in which they can occur. These constraints are often described in terms of the `hazards' that exist. For example, if logically a write is followed by a read then they must be executed in that order. This is called a Read After Write (RAW) hazard. In contrast two successive reads to a location can be reordered and there is no RAR (Read After Read) hazard.
When building a parallel computer there is a trade off between increased parallelism and the extra complexity of hardware needed to ensure that hazards such as RAW are not violated. Such machines can be classified by the sets of re-ordering that are possible. For example, if reads can be re-ordered then this is said to be an RPR machine (Reads can Pass Reads). Similarly if (whenever it is safe) a later write can be interchanged with an earlier read then the machine is said to be an WPR (Writes can Pass Reads) machine. WPW (Writes Pass Writes) and RPW (Reads Pass Write) round out the possible re-orderings.
To examine the performance properties of each of the reordering types we define a set of abstract machine models and analyze the potential parallelism that is available for each machine. The machines are formed by taking combinations of the memory ordering relaxation types. They each assume infinite resources and unlimited memory bandwidth. Today's architecture are of RPR type.
The results support the current wisdom that for parallelism up to 10 or so
the RPR-RPW machine is a good choice. It provides parallelism of greater
than 10 for all but two of the programs (heap and qu2). However, to
consistently move into the region of parallelism of more than 15 clearly
requires more. The RPR-RPW machine gives parallelism of 90 or more for 3 of
the 8 algorithms, but the other 5 languish between 2 and 15. Adding WPR
capability to give the RPR-RPW-WPR machine gives no noticeable
improvement. Adding WPW to give RPR-RPW-WPW does improve the three least
parallel programs by factors of between 2 and 100. The final step to the
ALL machine improves two programs: heap from 9.8 to 19.9 and qu1 from 18 to
94. The result is that the ALL machine has only one program below 19 (qu2
at 5.45) and 6 of the 8 programs have parallelism of 90 or more.
The general conclusion is that to achieve reliable speedups in the region from 20 to 100 it will be necessary to use an ALL architecture. While some of the simpler programs achieved good speedups on simpler architectures the more complex codes such as heap and qu1 improve significantly with each step in capability. We expect this phenomenon to be even more marked for more realistic programs with less regular data and control dependencies.
The WarpEngine allows memory accesses to happen in any order through the use of its timestamped memory system. More detailed infromation about memory ordering problems can be found in "Effects of Re-ordered Memory Operations on Parallelism" in the WarpEngine Papers.
Maintained by Richard Littin, suggestions given by David McWha.
Last modified: Mon Sep 6 15:53:04 NZST 1999