Timestamp Representations for Virtual Sequences

 

 

John G. Cleary, J. A. David McWha, Murray Pearson

 

Dept of Computer Science, University of Waikato, Private Bag 3105, Hamilton, New Zealand.

{jcleary, jadm, mpearson}@cs.waikato.ac.nz

 

Keywords: optimism, Time Warp, timestamps, parallel execution

 

 

Abstract

The problem of executing sequential programs in parallel using the optimistic algorithm Time Warp is considered. This is done by first mapping the sequential execution to a control tree and then assigning timestamps to each node in the tree.

For such timestamps to be effective in either hardware or software they must be finite, this implies that they must be periodically rescaled to allow old timestamps to be reused. A number of timestamp representations are described and compared on the basis of: their complexity; the frequency and cost of rescaling; and the cost of performing basic operations, including comparison and creation of new timestamps.

 

  1. Introduction
  2. Optimistic execution algorithms using causality error detection and recovery have been used successfully for speeding up parallel discrete event simulation [6, 7]. Time Warp [9], the best known of these algorithms, allows events to be scheduled for execution without regard for causality conflicts. When a conflict does occur the simulation recovers using rollback, which returns the simulation to an earlier correct state.

    The virtual time paradigm [9] forms the basis for the Time Warp algorithm. Virtual time is a totally ordered time scale, which restricts the causality between events. Each event is assigned a virtual time value and can only causally affect events with a later virtual time. In simulations virtual times are typically integer or real values explicitly computed in the model code.

    Although most widely used for parallel discrete event simulation, optimistic techniques including rollback can also be used as synchronisation mechanisms for more general parallel computation tasks. For example they can be used as an undo mechanism in editors [13], incremental recovery in databases [11], selective undo in groupware tools [12], or debugging for parallel and distributed programs [4]. The application we will focus on, though, is general purpose sequential computation.

    We associate timestamps with each part of the sequential computation. The ordering of the timestamps is the same as the original sequence and timestamps can be generated freely so long as they conform to this requirement. To distinguish this from simulations, where timestamps are computed explicitly, it will be called a virtual sequence (rather than a virtual time).

    Given some efficient way of generating such a virtual sequence, any conservative or optimistic algorithm for parallel simulation can be used to parallelize the sequential execution. Two different systems using Time Warp to implement these ideas have appeared in the literature. Back and Turner [1, 2, 14] have constructed a software based system for C++ and Cleary et al [5] have designed a general purpose CPU architecture which transparently executes sequential code, aiming to achieve up to one hundred-fold parallelism.

    As observed by Back and Turner [2], a major difficulty in using Time Warp with a virtual sequence is that it may be necessary to allocate an arbitrary number of new timestamps between any pair of existing timestamps, to allow structures such as dynamically bounded loops to execute in parallel with following code. The solution in Back and Turner (infinite length timestamps similar to the length representation described below) works well where the size of the timestamp representation is not important. Unfortunately, in real systems there are storage and bandwidth limits. When a large amount of parallelism is being extracted many events are active at one time and many messages are being passed around the system. If the sequential computation is in an imperative language then it is also necessary to record the timestamp of each write to and read from memory to maintain the correct data dependencies. If the timestamp representation is allowed to become arbitrarily large then storage and bandwidth limits will be quickly reached. This is of particular importance in hardware implementations, where timestamps of varying lengths are difficult to implement. Although hardware and software implementations have many of the problems and solutions in common they differ in some key ways, which we will identify throughout this paper.

    In section 2 we review execution structures used by the Time Warp algorithm and the mechanisms (fossil collection, cancelback and rescaling) used to guarantee execution progress. Section 3 describes a basic fixed length timestamp representation, which is further refined in sections 4 and 5. Section 6 draws some conclusions about the best representation to use.

  3. Optimistic Execution
  4. In this section we review the basic mechanisms necessary to map a sequential execution onto a parallel optimistic execution.

    1. Execution trees
    2. To aid in time stamping and parallelizing sequential code it is mapped onto an execution tree. The tree can be executed sequentially by a depth-first left to right traversal. A parallel execution of the tree might be achieved in a number of ways. Here we focus on the use of the Time Warp algorithm which executes nodes in parallel and uses rollback to resynchronize out of order memory operations.

      To exploit parallelism it is essential that a compiler be able to generate code that maps well onto this tree structure. For example, it is important that most nodes generate more than one child, otherwise control dependencies force sequential execution. Figure 2.1 shows a tree that might be generated for the code sequence: A; B; C; D;

       

      Figure .1 Tree generated for sequential code

      In the case of a for loop, where the bounds are known statically, a balanced binary tree can be generated using the ‘split’ nodes (marked S) to generate more children. Figure 2.2 shows a tree generated for:

      for i = 1 to 8 do iter;

      follow; /*statements following for loop*/

       

      Figure .2 Execution tree for simple for loop

      Notice that statements immediately following the for loop are available for execution in parallel with the for loop.

      While loops are more difficult, as the number of iterations to be executed may be unknown. However, it is possible to schedule iterations of the loop speculatively by generating an increasing number of nodes in a similar fashion to that described in section 5 of [2], as shown in Figure 2.3.

      In this example arcs labelled cond x point to subtrees which are scheduled conditionally. To allow these subtrees to be executed speculatively each of the conditionals is optimistically set to true allowing the execution tree shown to grow quickly. As soon as a conditional evaluates to false (ie. the loop terminates), execution of the subtree below it is rolled back and terminated. Also notice in this structure there is a linear "backbone" (the nodes labelled B), which starts exponentially increasing numbers of iterations through the use of a subtree with an extra level of split nodes in each successive node of the backbone. This structure balances the overhead of executing loops which contain only a small number of iterations (which is the most common case [8]) against the cost of executing loops with a large number of iterations. The speculation distance will be limited, ultimately, by the finite precision of the timestamps, which will present a rollback explosion occurring.

      When generating code for conditional statements (such as if-then-else) a number of possibilities exist. In the most conservative case the condition is evaluated first to determine which branch should be executed. A more aggressive approach is to execute the conditional and one of the branches speculatively. In the event that the correct branch was executed speculatively a speedup occurs. However, if the wrong branch was executed its execution has to be rolled back and the correct branch taken. A still more aggressive approach is to execute both possible branches, as well as the condition, in parallel and always roll back one of the branches. Each branch decision is allocated different timestamps and forms a different branch of the tree to allow them to execute in parallel.

    3. Fossil collection and cancelback
    4. The above figures represent a static picture of the execution trees. When Time Warp is applied to their execution, multiple nodes of the tree can be executed in parallel without regard for any data dependencies that may exist between them. In the case that a dependency does exist between two nodes and they are executed out of order the dependent node must be rolled back. This means that enough information must be maintained within the system to ensure that any data dependencies which are not met can be detected and corrected.

      As nodes cannot produce results which affect nodes executing earlier in the virtual sequence there will always be an earliest active node in the system, which cannot be rolled back. The virtual time of this node is known as Global Virtual Time (GVT) and represents the current point in the program which can be considered to have safely completed execution. Any resources (including timestamps) that have been allocated to nodes with timestamps less than this time can be released back to system for reuse. The process of reclaiming these resources is known as fossil collection. In most cases this is enough to ensure execution can progress, because if a node is scheduled for execution and there are not enough resources available to do so, then the node can wait for GVT to progress, freeing up the necessary resources.

      However, there are situations where fossil collection alone is not enough to ensure execution will always be able to progress. It is possible for a node scheduled for execution to become the GVT holder before resources necessary to execute it become available, as they are being used by nodes later in the virtual sequence. Cancelback is an additional mechanism that ensures that execution can proceed. It allows a node and all its descendants to be cancelled, freeing up resources within the system. The node is then rescheduled for execution later by assigning it a timestamp at the end of the virtual sequence when one becomes available. For example, if the node at the root of the tree, or anywhere on the right edge of the tree, did not generate the maximum number of children possible the right-most timestamp at the next level of the tree could be assigned to the cancelled node, allowing it to re-execute. The original description of the cancelback algorithm [10] showed that it guarantees that a Time Warp simulation can always complete in the same space as required by a sequential implementation.

    5. Timestamps
    6. Each node in the execution tree can be associated with a string that gives the path from the root to the node. A zero is used for a left branch, one for a right branch and ∆ to terminate the string (Figure 2.4). A lexicographic ordering where ∆ < 0 < 1, places these strings in the same order as the sequential execution order of the associated nodes. Thus the strings can be used as timestamps for the virtual sequence.

       

      Figure .4 Example execution tree showing timestamps

      Time Warp makes frequent use of timestamps. They are created when new nodes are executed and compared whenever a memory operation occurs. It is desirable to have a finite representation of the timestamps for efficient use of resources. This is particularly true of a hardware based implementation where any representation that does not fit into a single machine word of 32 or 64 bits would cause great difficulties. Even in a purely software context this is still an important consideration. It is essential however, that the basic operations on the timestamps, including comparison and creation, remain fast.

    7. Rescaling

    A problem that arises for any finite representation is the need to re-use old timestamps. As the execution tree grows, eventually there will be more nodes than can be represented. Because branches of the tree grow in depth unevenly it is the number of levels available to a branch that causes the restriction. This uneven growth tends to cause inefficient use of the timestamps and some will remain unused and be wasted.

    As GVT advances, early timestamps will become available for re-use. However, to make use of these requires that current timestamps be re-allocated while retaining their ordering. For example, consider the execution tree in Figure 2.5 which uses a timestamp representation with at most three symbols (including the terminating ∆). A new node 8 is to be added to the tree. In this example all nodes shaded with cross hatching have been fossil collected, so their timestamps can be reclaimed for reuse. This can be achieved by selecting the node which is the GVT holder and traversing up the tree until a node is found which has all currently active nodes in its subtree and making this the new root of the tree. All nodes below it can then be rescaled.

    In some cases the rescaling operation cannot reclaim space in the tree even though space may exist. For example in Figure 2.6 it is not possible to rescale the tree because the root node has a right-most child that is not an ancestor of the current GVT holder. A solution is to cancel node 3, and delete its parent, node 1. This allows rescaling which provides room to create node 8. For all finite representations investigated in this paper, cancellation is sufficient to permit forward progress.

    Cancellation, however, has a high cost, as work has been done and then thrown away to no good effect. Although these nodes are latest in the virtual sequence, and most likely to be rolled back, it is better, if possible, to avoid it. Figure 2.7 shows a less extreme version of the previous example. In this case it is possible to rescale just the left half of the tree (without cancelling any nodes). Such a rescaling is achieved by moving the node that is the GVT holder as far to the left as possible and then up the tree until a node which has a right most child is detected. All descendants of the GVT holder can then be rescaled, making space for the new node to be added.

    Rescaling is expensive as every timestamp in the system has to be altered. Also, while rescaling is being performed it may be necessary to suspend execution of the program to avoid inconsistencies in the timestamps, further raising the cost of rescaling. To minimise the performance losses due to rescaling two main types of solutions exist. The first is to use a representation that requires rescaling less frequently. The second solution is to reduce the cost of rescaling. The following sections describe different timestamp representations which can be used to address this problem.

    (b) Partially rescaled execution tree

  5. Timestamp representations
    1. Length representation

    To make timestamp comparison easy we can map the timestamps from bit strings to integer values. This makes it possible to compare timestamps bitwise, as for integers. The bit string timestamp can be converted to an integer by padding the bit string out to I bits with zeros and then appending the length of the string (less the terminating ∆). Some integers remain unused in the representation. A 32 bit representation (I = 27 bits, L = 5 bits) would allow 28 levels to be represented, while a 64 bit representation (I = 58 bits, L = 6 bits) allows 59 levels.

    Comparison is done with a bitwise comparison on the complete timestamp given by concatenating the two parts. Given the parent’s timestamp is the integer P, the left child’s is P + 1 and the right child’s is P + 2n-d+l + 1 where n is the maximum number of levels in the tree and d is the depth of the parent. The number of bits needed for the length is é log2 nù . Rescaling operates on the two parts of the integer separately. The string part is left shifted and the length is decremented. This provides a simple arithmetic operation to perform the functions of generate, compare and rescale easily in either hardware or software. The maximum tree depth for this representation is still quite limited, however.

  6. Rescaling less often
  7. By increasing the maximum depth of execution trees the frequency of rescaling operations can be reduced. One representation that can achieve this uses a mantissa and exponent and varies the range of timestamps allocated to children.

    1. Exponential representation
    2. This representation uses a scheme similar to floating point number representations to allow different parts of the tree to grow to different depths. It is comprised of two parts: a mantissa and an exponent. The exponent is the number of leading zeros in the timestamp, while the mantissa is the normalised tail of the timestamp. In the example in Figure 4.1 the timestamp 0010∆ becomes 2,10∆, where the number before the comma is the exponent and the string after the comma is the mantissa.

      A complete exponential representation requires that the mantissa be coded using the length representation above. Using this representation, the left-most path from the root of the tree can grow to (2n-1+m) levels for a representation that uses an n bit exponent and where m levels are available from the length representation in the mantissa. Moving to the right, the depth of the paths decreases until reaching the right-most path where the maximum depth is m-1 levels. The proportions in which a timestamp is divided into exponent and mantissa will be the subject of some optimisation based on application. Suppose, however, that a 32 bit representation for a binary tree uses an exponent of 16 bits and a mantissa of 16 bits. The maximum depth of the left-most branch is 65548 levels, while the right-most branch has a maximum of only 12 levels. A 64 bit representation using 32 bits for the exponent and 32 bits for the mantissa gives a maximum left branch depth of (232+27) levels and a maximum right branch depth of 27 levels.

      This representation favours execution of the left side of the tree. That is, nodes that are early in the virtual sequence can have longer strings and so will not exhaust the maximum depth so soon. Thus rescale, and possibly cancel, operations can be reduced. This has two additional advantages. First, by delaying execution of nodes to the right of the tree which are more speculative it helps balance the overall execution. Second, a compiler can take advantage of the representation by pushing more computation down to the left of the tree. Provided the compiler can schedule the critical parts of the execution to the left of the tree, execution can progress for much longer without needing to rescale using this representation.

      Comparison of two timestamps is more complex for this representation. Generally timestamps with a larger exponent are earlier in the virtual sequence, however, any timestamp with only a ∆ in the mantissa will be earlier than a timestamp with a longer mantissa. Hardware implementations will have a greater ability to optimise this operation, but will still be complex.

      The rescaling and creation operations are equally complex. So while this representation has promise in that it may be able to avoid rescaling and cancellation operations the details make it unsatisfactory. The next representation seeks to retain the ability to avoid rescaling without the attendant complexity.

    3. Arithmetic coding

    Rescaling is necessary no matter what finite representation is used. However, consider a perfectly balanced execution tree with leaves where the only leaf node with a child is the right-most one. In such a tree rescaling would only be necessary once every 2n nodes, thus amortising its cost over much computation. It is difficult, however, for either the compiler or the run-time system to achieve such a perfect balance. This section describes a time stamp representation, based on arithmetic coding [3], that allows dynamic allocation of timestamps and the possibility of approximating the perfectly balanced tree.

    Each node is allocated a range of (finite integer) timestamps for itself and its descendants. The interval is described by a lower bound, which becomes the timestamp of the node itself, and an upper bound. A descendant can then be allocated a timestamp sub-interval anywhere within its parent’s interval, excluding the timestamp of the parent itself. This provides an implicit lower bound, shown in brackets in Figure 4.2. By allocating the sub-intervals in proportion to the size of the respective sub-tree then a balanced allocation of timestamps is possible.

    Figure 4.2 shows a simple example of arithmetic coding using four bit timestamps. The children are allocated timestamp intervals in the proportions shown on the arrows. For example, the interval at the root node is split with one part to the right and three parts to the left. The limits of the tree are reached when the interval contains just one timestamp, as in the right-most leaf. Such trees may have leaves at widely varying depths, according to the size of the allocated sub-intervals. A 32 bit timestamp could conceivably give a tree depth of up to 232 nodes anywhere in the tree, if required before rescaling is necessary. Likewise a 64 bit timestamp would allow a depth of up to 264 nodes, as necessary. Unfortunately twice as much storage at the node is required for arithmetic coding timestamps as for any other scheme, because both an upper and a lower bound must be stored. Hence, to stay within the same storage limits the size of the timestamp must be halved, severely restricting the total number of timestamps available.

    The proportions of sub-intervals can be calculated statically by the compiler or dynamically at run-time. Dynamic allocation is necessary for some situations, for example, loops where the bounds are not known until execution, such as while loops or general recursion. Imperfect estimation of the sizes of sub-trees causes problems in the parts of the tree that are under-allocated and extra rescaling and cancellation may be necessary for computation to continue. This representation will do no worse than the static unbalanced techniques described earlier (allocating equal weights to each child gives almost the same codes as the integer representation). The better the estimates of sub-tree size, the less often rescaling is necessary.

    The individual timestamps associated with each node can be represented as simple integers, with timestamp comparison using normal bitwise compares. Calculation of the child intervals needs only integer multiplication (albeit with careful attention paid to rounding, see [3] for details). Rescaling requires that both the upper and lower bounds at each node be left shifted. In the case of the upper bounds 1’s are shifted into the low order positions and in the case of the lower bounds 0’s are shifted in.

  8. Reducing the cost of rescaling
    1. Moving head representation

    While the representations described in the previous section can reduce the frequency of rescaling operations, rescaling is still very expensive as every timestamp in the system must be altered individually.

    A technique to reduce the cost of rescaling is based on the observation that when rescaling occurs in either the length or the arithmetic coding representations, the same bits are removed off the front of all timestamps being rescaled. To take advantage of this a global index (the head index), which defines the start of all timestamps in the system, can be used. Rescaling requires only this single global variable to be altered. The bits in the timestamp are assumed to wrap around, so that no precision is lost when rescaling, as shown in Figure 5.1. A tail index is used to terminate each timestamp because, after the head has been moved, no assumptions can be made about the value of the unused bits. Note that the tail index performs the same function as the length indicator in the length representation in section 3.1.

    The general rescaling operation described for the length representation is the same for this representation except that it is only necessary to alter the head pointer as shown in the example in Figure 5.2. Note that it is not necessary to perform any operation on the timestamps themselves and hence the cost of rescaling is greatly reduced.

    The main advantage of this representation is that the cost of rescaling has been significantly reduced as only the head index has to be altered. This is particularly true in software where otherwise a scan of all timestamps in the system would be needed. Even in hardware the gain is significant. It might be possible in hardware to rescale all timestamps in parallel, but this would clearly require much hardware and a possible suspension of processing while this occurred. The moving head does require that the new value of the head be broadcast to all parts of the system but this can be done as part of other activities such as broadcasting values of GVT.

    One disadvantage of the scheme as described is that the entire tree must be rescaled, one part cannot be rescaled. However, with a slightly more complex scheme where a head index is associated with sub-parts of the tree it should be possible to do partial rescaling. We are currently investigating this possibility which was also suggested by one of the anonymous referees.

  9. Conclusions
  10. The paper has investigated one of the major problems in using Time Warp (and similar optimistic techniques) in parallelizing sequential computations - namely the representation and efficient processing of timestamps. The focus has been on finite representations, which have the advantage of being compact and potentially fast to process. Unfortunately they imply the need to periodically rescale the timestamps in the system and sometimes to cancel back already processed code.

    An exponential (or floating point) representation has some potential as it may reduce the frequency of rescaling. However, in a software context the cost of rescaling will be so high that this is unlikely to be viable overall.

    The moving head representation has the most promise as it reduces the cost of rescaling to that of updating a single global constant. This reduces the cost and complexity of rescaling in both hardware and especially in software. A moving head could be combined with either the fixed length or arithmetic coding representations. Arithmetic coding has an advantage in that it can potentially use the bits in the timestamps more efficiently. However, it is unclear to what extent this is able to offset the increased complexity and the need to maintain both upper and lower bounds to the timestamps on each node (although the timestamps themselves do not need this). Also, arithmetic coding requires good static or dynamic estimates of the size of subtrees as they are generated. It is not clear to what extent this is possible.

    The main contribution of this paper is to show that finite timestamp representations are possible. This removes one of the barriers to applying the mature technique of optimistic processing in the novel domain of parallelizing sequential computation. This holds out the promise of parallelizing and speeding up code that it would otherwise be impossible or too expensive to parallelize. It is clearly necessary to follow this up with a more precise characterisation of the costs and benefits of the different representations. We are currently running simulations to investigate to what extent the representations restrict the amount of parallelism available, how often rescaling and cancelback is necessary, and how quick the individual timestamp operations are.

  11. References

[1] Back, A. and Turner, S., (1995) "Using Optimistic Execution Techniques as a Parallelisation Tool for General Purpose Computing," Proceedings of HPCN Europe, Milan, Italy, pp21-26, May.

[2] Back, A. and Turner, S., (1995) "Time-Stamp Generation for Optimistic Parallel Computing," Proceedings of 28th Annual Simulation Symposium, Phoenix, AZ, USA, pp. 144-153, April.

[3] Bell, T.C., Cleary, J.G., and Witten, I.H. (1990) "Text compression," Prentice Hall, Englewood Cliffs, NJ.

[4] Choi, J.D., Miller, B.P. and Netzer, R.H.B., (1991) "Techniques for Debugging Parallel Programs with Flowback Analysis," ACM Transactions on Programming Languages and Systems 13(4), pp. 491-530, October.

[5] Cleary, J.G., Pearson, M. and Kinawi, H., (1995) "The Architecture of an Optimistic CPU: The WarpEngine," Proceedings of HICSS, vol 1 pp. 163-172.

[6] Fujimoto, R.M., (1989) "Time Warp on a Shared Memory Multiprocessor," Transactions of the Society for Computer Simulation 6(3), pp. 211-239, July.

[7] Fujimoto, R.M., (1990) "Parallel Discrete Event Simulation," Communications of the ACM, 33(10), pp. 30-53, October.

[8] Hennessy, J.L. and Patterson, D.A. (1996) "Computer Architecture: A Quantitative Approach," second edition, Morgan Kaufmann Publishers, San Francisco.

[9] Jefferson, D.R., (1985) "Virtual Time," ACM Transactions on Programming Languages and Systems, 7(3), pp. 404-425, July.

[10] Jefferson, D.R., (1990) "Virtual Time II: Storage Management in Distributed Simulation," Proceedings of the Ninth Annual ACM Symposium on Principles of Distributed Computing, pp. 75-89, August.

[11] Jefferson, D.R. and Motro, A., (1986) "The Time Warp Mechanism for Database Concurrency Control," Proceedings of IEEE International Conference on Data Engineering.

[12] Prakash, A. and Knister, M.J., (1992) "Undoing Actions in Collaborative Work," ACM Conference on Computer Supported Cooperative Work: Sharing Perspectives, Toronto, pp. 273-280, November.

[13] Thimbleby, H., (1990) "User Interface Design," ACM Press, pp. 261-286.

[14] Turner, S.J. and Back, A. (1994) "General Purpose Optimistic Parallel Computing," Proceedings of 7th PARSYS User Group Meeting, Oxford, April.