WarpEngine Logo

Architecture Updates

& Revisions for 1995

Results of discussions from meetings held on:

July 20th, 27th

August 3rd, 10th, 17th, 24th, 31st

September 7th, 14th, 21st, 28th

October 5th, 12th, 19th, 26th

November 2nd, 9th, 16th, 23rd, 30th

December 7th, 14th, 21st


20 July 1995

In situations where there are multiple children given the same child number in a block, any associated STOT's have to be linked to the child they correspond to.
eg.
	:
	CHILD	CC1	@A	1
	:
	CHILD	CC2	@B	1
	:
	STOT	CC3	V1	Rx	1
	:
which child does the STOT belong to?

The assembler either has to assure that the STOT's register Rx is the same in both children (a hard problem), or that the STOT belongs to one of the CHILD instructions. A solution is to associate a tag with each child in the block, ie. extend the child numbering to 1.a, 1.b, 2.a, etc... This makes the previous piece of code look like:

	:
	CHILD	CC1	@A	1.a
	:
	CHILD	CC2	@B	1.b
	:
	STOT	CC3	V1	Rx	1.a
	:
How the hardware deals with multiple children at the same child number is a separate issue. In the above code the two children have different conditional codes, but there may be cases where for short periods of time both condition codes are true. This means that both children will be fired. In situations like this time-space cache may have to cope with multiple writes to the same address at the same time. The T-S cache will have to maintain both copies until one of them is anti-written. To remove this necessity from the T-S cache design it has been decided that we can assume that the processor will never fire multiple children if it finds two more with the same time stamp. These children will be held until the appropriate anti-messages have been sent (and received) making only one of the children valid - the others will be removed.


27 July 1995

At this stage the Time-Space cache does not have to rescale time stamps. This feature is left out so that Stephen's 420 project (a hardware design of the Time-Space cache) does not have to be redesigned. In the Warp Engine time stamp rescaling in the TS cache will have to be implemented. Each operation of the cache (read, write, anti-read ...) has to be described in pseudo code so that the internal workings of the cache are known.

Also discussed was the layout of the Warp Lab and what shelving, white boards hardware etc... is to be placed in the lab. This is not very interesting unless you were there.


3rd August 1995

Upon reading Stephen's Time-Space cache specifications, the question, Does the Time-Space cache need to buffer anti-reads that occur before a read to the same address? was asked. If an anti-read (for a given location) is received before a read then the anti-read needs to be stored until such time that the read request has been received. The same is true with anti-writes and writes. A solution to this problem is to assume that all memory requests are received (by the Time-Space cache) in the order that they were given. This removes the necessity for the above mentioned buffering.

Also as part of the specification for the Time-Space cache each component of the cache should have a set of modes associated with it. ie. waiting for input, GVT, fossil collection, etc... From these modes a finite state machine can be developed. This will both show the state of the cache at any given time and help in the hardware layout stage.


10th August 1995

A walk through of the Time-Space cache was performed for a given sequence of memory operations using the specifications supplied by Stephen. This found some potential bugs in the specifications, such as:
  • setting the initial values for some of the acknowledge and response lines,
  • the times at which some acknowledge, mode and request lines are reset.

    Some of the operations within each of the modules of the T-S cache need to be better defined so that there are no ambiguities as to the operation of these components given any set of input parameters. ie. for each set of possible control line inputs some explicit action needs to be defined.

    This walk through was only interesting if you were there.

    The possibility of another instruction (for the Warp Engine) was mentioned. This came about for the case when a value is loaded form memory and a page fault occurs. In this situation some kernel code needs executed to deal with the page fault. As the instruction set currently stands a load and store to the same address can be done with the same time stamp. If the page fault occurs on the load then there is no time to perform the kernel code before the write is executed.

    A solution is to replace the LOD instruction with one similar to a STOT. An address is given to the instruction and the value at that address is then stored in the specified register of the child. This allows time for the kernel code to be executed (the time between parent and child). - This proposal needs more thought as to how this instruction will affect the way in which programs are written.


    17th August 1995

    John talked about the results of Husam's latest set of tests on the maximum speedup and number of entries in the time-space cache. Using the Gauss-Jordan elimination algorithm on various sized matrices the maximum T-S cache size peaked at <=100 entries for 16 to 18 processors in each case. Maximum speedup was:

    Matrix Size

    Max Speedup

    Number Procs

    Max T-S size

    Number Procs

    5x5x211303816
    10x10x2211005510-20
    20x20x23520010510-20

    Richard has to get his WarpEngine simulator to produce the same statistics and to compare the results to those of Husam's.

    Murray talked about the number of gates and the complexity of the gates to build the T-S cache. Each cell is constructed as follows:

    	+---------+---------+---------+---------+---+
    field	| Address |TimeStamp|   Tag   |  Value  |Dir|
    # bits	|    32   |    32   |    32   |    32   | 2?|
    	+---------+---------+---------+---------+---+	
    
    These cells are aligned as a linked list. Each read and write is inserted into the T-S cache in timestamp order. To evaluate the timestamp ordering 241 gates are associated with the TimeStamp field of each cell, and similarly there are 64 gates associated with the Address field to determine address equality. Thus for each bit in the cell there are <=20 gates associated with it. In total for a T-S cache with 200 entries there are approximately 0.5M gates (130x20x200 = 520000).

    When performing a read the value in the cell meeting the following condition is returned:

    greatest_TimeStamp(	(cell_Address == I_Address) 
    			&& (cell_TimeStamp < I_TimeStamp)
    			&& (cell_Dir == write) )
    


    24th August 1995

    Richard discussed some data flow architectures that have been developed, including:
  • MIT's static architecture
  • TI's DDP (data-driven processor)
  • LAU (ONERA/CERT France)
  • MIT's dynamic architecture
  • Manchester
  • DDM1 (data-driven machine)
  • EDFG
  • Monsoon (MIT)
  • EM-4 (Japan)

    Other interesting (non dataflow) parallel architectures mentioned were:

  • Alewife (MIT - message passing)
  • DDM (Bristol - data diffusion machine - hierarchical memory structure)
  • J-Machine/M-Machine (MIT - message passing distributed memory)
  • *T/*T-NG (MIT - stock hardware distributed message passing)
  • VLIW machines

    Some of the above machines mentioned above had parts similar to our proposed WarpEngine. These features include the routing networks between instruction storage and processors (MIT's static architecture), the two input multiple output instruction format (LAU) and the grouping of instructions to form instruction blocks (EDFG). In dynamic data flow machines some form of tagging is required so that messages to the same node can be distinguished. Many of the above architectures solve this by associating tags with the data and instructions as they move around the system, where as the WarpEngine solves this using time stamps.

    From the discussion it was unclear as to why these architectures have not succeeded (none have been mass produced). One reason may be the fact that single threaded programs do not perform well on them due to the long communications paths (ie. the five stage cycle of the Machester dataflow machine) or their instruction scheduling mechanisms (ie. the 8-to-1 multiplexed pipeline in Monsoon). Another feature of these architectures adding to the problem is the (relatively) large size of the messages that are sent. The WarpEngine may solve the problem through its optimistic execution approach and relatively small messages.

    Further investigation into the reason why dataflow architectures have not succeeded has to be carried out.


    31st August 1995

    No meeting.


    7th September 1995

    Lance had completed his WarpEngine Assembler. After discussing some details such as the bit orderings within instructions, the instruction opcode numbers and other minor aspects, it was decided that Richard and Lance should get together and see if Richard's WarpEngine simulator can run the code produced by Lance's Assembler.

    John had some ideas for changes to the instruction set and will mail them to the appropriate people for discussion at a later meeting (21st September).

    Richard continued his discussion about other dataflow architectures in the pursuit of the reason for failure of these machines. This discussion will be continued at the next meeting.


    14th September 1995

    Richard continued (and completed) his discussion on the possible reason(s) why no dataflow computer has been successfully mass produced. The following points were considered:
  • The long dataflow latencies between successive instructions in a thread. Some machines, such as the Manchester Dataflow Machine, have multi-stage packet routes (ie. a switch, a node store, a token matcher, processing elements). Individual data-instruction packets have to traverse this circuit, so successive instructions within a thread are separated by a long delay. Therefore single threaded programs will perform badly.
  • Large instruction and data packets. These contribute to the latencies mentioned above, and therefore require higher bandwidth communications paths. Some machines even have variable sized packets which must add complexity to the communications hardware.
  • Bottlenecks are inherent in the machines that have token matching units as all the packets within the system must pass through that unit. Thus the overall size of the system is limited by the speed of the token matching unit.
  • They tend to have broken away from conventional memory-register based architectures and thus do not use the cheap mass produced components (such as memory and floating point units) that are available.
  • The instructions sets in some of the architectures are not very powerful. In some cases the standard benchmark programs cannot be implemented. Many of these machines cannot deal with with hardware faults or handle software and hardware generated exceptions. The ability to handle exceptions is a necessity for any computer.
  • Another hindering point is the high level languages used for programming. They either have their own special language for programming or use a specialised or cut down version of a commonly used language. Dataflow programming languages are hard to learn (as most programmers are taught using conventional control-flow based languages) and therefore provide a BIG hurdle in terms of software availability for these platforms.

    To build a successful dataflow machine the above mentioned points need to be addressed. The WarpEngine (though not purely dataflow) does provide solutions (in some form or another) to most of these problems.

    Richard's simulator has reached the point where some statistical information can be collected. He has to create and collect the following:

  • instruction usage - what instructions are used and how often are the used - how many destination fields are used in the instruction.
  • Perform runs on various programs including: Gauss-Jordan elimination, heap sort, fibonacci sequence generation (both recursive and iterative), and other programs that might be of interest.
  • Produce histograms of the memory locations in the T-S cache (read and write separate and combined).
  • Experiment with different timing delays for various instructions.


    21st September 1995

    John discussed his new proposed instruction set for the WarpEngine and described how the new instructions are going to work. Basically the format of the instructions have changed slightly and the child, stot and load functions have been altered. The 32 bit I-word is now broken down to:
  • CC (3 bits) - constant control bits.
  • SC (1 bit) - select control bit.
  • Op (5 bits) - operand code (unchanged).
  • F (23 bits) - op code dependent constant.

    The instructions that have been removed or replaced are, CHILD, CHILD4, FORK, TRAP, STOT, STO (now ST), LOD, LDV, and NOP. They have been replaced by CH, MV and MA. NOP has been replaced by the use of the select control bits. CH is a more powerful version of the CHILD instruction, but its exact layout has needs to be determined. The STOT and LOD instructions have been replaced with MV and MA which Move a Value (or value at an Address) to a given child - this allows us to deal with memory page faults. The CHILD4 and LDV instructions have simply been deemed unnecessary. The rest of the comparison, logic and arithmetic instructions remain the same except that their destination register determination has been changed to allow for the reduced F-field length. The provision for floating point instructions has also been allowed, although their exact instruction layout has not been determined.

    NOTE: These new instructions and their layouts are not set in stone and may change. The new instruction set will be added to the WarpEngine instruction set web page when it has been finalised.


    28th September 1995

    John continued the discussion on the instruction set, with the Arithmetic, Logic and Floating Point formats finalised. The I-word format has been rearranged to:
  • CC (3 bits) - constant control bits.
  • Op (5 bits) - operand code (unchanged).
  • F (24 bits) - op code dependent constant.

    The SC bit has now been incorporated into the op-code. Op-codes 0-15 use their low order bit, SC, to control the Select register. This means that there are only 8 op-codes in this range and 24 in total. So the first 8 instructions can be conditionally executed (by setting the SC bit), and the remaining 16 cannot - this makes more sense than having all instructions conditional. NOP is achieved by using SC=0 and never setting it.
    The CH (now CHILD again) instruction takes the address of the child block of code in R1 (A1) and a migration hint in R2. This allows the child to be dynamically allocated to a processor at run-time. In the F-field there:

  • c (2 bits) - the child number.
  • mi (2 bits) - migration treatment: 00 - don't migrate; 01 - migrate only if no free frames on local machine; 10 - migrate optionally; 11 - always migrate.
  • dest (6 bits) - destination for child reference to be sent to.
  • a (14 bits) - static child block address. The child block address is calculated using A1+a*128.

    Two more functions were discussed but not finalised, they were ITOF and FTOI to allow conversion between integer and floating point numbers.

    Also discussed was a new strategy for the time stamps. This involved having three bits of a time stamp per child (not just 2). The extra bit is added so that there are time intervals in which faults can be handled. The bit patterns for these are:

  • 000 - page faults for incoming MA messages.
  • 001 - time at which execution occurs.
  • 010 - reserved.
  • 011 - instruction faults.
  • 100 - child 0.
  • 101 - child 1.
  • 110 - child 2.
  • 111 - child 3.

    NOTE: Once again the instruction set and time stamps will be explained in more detail when they have been finalised. They will be added to the WarpEngine Instruction Set page when complete.


    5th October 1995

    Richard brought along a paper which compares 3 dataflow architectures (Manchester Data-Flow Machine, Stateless Data-Flow Machine and CSIRAC II, all from Manchester). The paper describes a method for comparing their relative performances based on the number of tokens use in the execution of various benchmark programs. He has to give a talk next week on the details of that paper and those of another paper on "Hardware Support for a Functionally-Programmed Tagged Token Architecture".

    Also discussed were some of the preliminary results that Richard has gathered from his simulator. One notable result was that instructions that return more than one copy of a result (OR's, ADD, DIV, etc..) tend to, on average, return around 2 results. Thus the number of results returned from the functional units to the child frames can be halved if the frames can route multiple instances internally. This result may be significant when determining the types of switching mechanisms to use in the routing networks. Other results were as expected.

    One issue that was raised is how the WarpEngine is going to deal with Free Lists (for dynamic memory allocation). This needs some deep thought!


    12th October 1995

    Discussions were held about various aspects of the WarpEngine, including creating a simple compiler, the possibility of constant modification in instruction memory, and some ideas about the Free List's problem. The compiler would allow programs to be written at the level of the instruction set. It is intended to be used for generating small fragments of assembler, not as a full blown language. John's idea for allowing constants in the instruction blocks to be modified would mean that loop invariants can be calculated outside the loop and passed in when the instructions are loaded from memory. The advantage is that many STOT's (MV's and MA's) can be saved - thus reducing code size, although there will added complexity in the otherwise simple instruction memory. A possible solution for the free lists is to have multiple free lists with some mechanism to determine which list to use. It was thought that the mechanism will have to be left for the programmer to control - not a very good solution. More thought still needs to be put into this.

    Murray and Richard have to create outlines for papers about aspects of the WarpEngine. Murray's paper is going to talk about the Time-Space cache, while Richard's is to be about the WarpEngine simulator. Another paper about the WarpEngine in general will be written - but at a later date when more data has been collected.


    19th October 1995

    The two papers of Murray's and Richard's were discussed. The main points considered were that their introductions and backgrounds should not overlap too much and what information should be included in each. All important details were noted and the papers are to continue down their creation path. You had to be there to get anymore out of the discussion.

    Richard found a paper on a system called Tetra. Tetra is a tool for evaluating serial program performance under the resource and control constraints of fine-grained parallel processors. He has to look into it further and report back at next weeks meeting.


    26th October 1995

    No meeting.


    2nd November 1995

    This meeting was a catchup session on what everybody has been doing over the last couple of weeks. John has finalised the latest version of the instruction set which has now been put on the web. Murray and Richard have continued writing their papers.

    This then led into a discussion about the speedup results that Richard's simulator has produced. For matrix multiplication there was approximately a 1.5 times speedup for the fine-grained parallelism, and >O(N^2) speedup overall. The fine-grained result is very dependent on the cycle timings of the instructions and may change when more accurate cycle times are given.

    Murray then went into more detail about how the TS-Cache is to be implemented - at the gate level. The cache is to produce one result per cycle. A problem that might occur is when a frame is reset and a rollback occurs on that frame (but the old version of it that has been removed). How does the system cope with the anti-reads and anti-writes that may have been produced from the old version of the frame? The answer is that this will never happen.

    The process for reads and writes in the TS-Cache are known and can both be performed in one cycle. In the case where a write produces multiple anti-read messages (form reads with later timestamps in the cache) then one message is returned per cycle. For the anti-read case can the TS-cache receive an anti-read before are read for a given location? - this could happen through network delays. This event is going to be rare so any processing can take longer if necessary.

    Other issues that Murray will talk about next week are, how the cache handles GVT fossil collection and the process of timestamp subtraction. Richard is to collect some results from the simulator and report on them.


    9th November 1995

    Richard showed some graphs of the results of some runs on the WarpEngine simulator. Many of the results were interesting especially the number of instructions and reads & writes per frame. Also the number of rollbacks, frames in use at a given time, the size of timestamps, and the speedup characteristics of the various programs. From these initial graphs it was decided that they should be re-plotted with normalised ranges. ie. plot frames in use as a fraction of frames in the program. Another statistic to collect is a trace of the number of frames in use in each bucket through out the programs execution.

    Other thoughts that arose were:

  • Should child frames be queued in the time-space cache? If so what would would need to be stored (timestamp,code address,hint address), and how would the cache control mechanisms have to be changed to deal with them?
  • How would throttling the optimism of the WarpEngine be achieved? Through the timestamp re-alignment process?
  • How to simulate transient children (those falsely created in an optimistic execution). These are not handled in the current version of the simulator as the will never occur.

    Things to look forward to:

  • a new optimised version of the Matrix Multiply code.
  • a recoding of the naive fibonacci code.
  • recode some of these code fragments with the new instruction set so that comparisons can be done between the two instruction sets.


    16th November 1995

    In today's meeting Murray continued his talk about the time-space cache implementation, focusing on how GVT Fossil collection and time-stamp re-calculation was going to work. To get the TS-cache to perform time-stamp re-alignments the make up of a time-stamp and the re-calculation process needs to be known. The rest of the meeting was devoted to coming up with time-stamp requirements.

    Firstly, each timestamp is to be 32 bits long in a floating point notation. 16 bit exponent and 16 bit mantissa. So the sub-ranges in a execution tree would look like: (exp man)

                                       (0000 0000 - FFFF FFFF)
                         /              /                  \             \
    (0000 0000 - FFFF 3FFF)(FFFF 4000 - FFFF 7FFF)(FFFF 8000 - FFFF BFFF)(FFFF C000 - FFFF FFFF)
    
    Points to consider about the time-stamps and the TS-cache if the re-calculation process is to be distributed amongst the TS-cache entries are:
  • the cache needs to maintain its "one operation per cycle" property.
  • how long a period is each entry going to stay in the TS-cache? and
  • how many entries will there be for a given location?
  • how many new versions of the time-stamps will be in the cache at any time?

    Some possible alternatives are:

  • only recalculate timestamp when that location is used - not the entire cache as a whole. This may hurt the "one operation per cycle" policy.
  • give each location two time-stamp, an old and a new one. This would only work if it is guaranteed that there is only ever one new version of the time-stamps at a time (highly unlikely). Also the system will eventually hang if there is an old timestamp in the distant future.
  • have two caches, one for each version of time-stamps. Same problems as previous.
  • broadcast 2 time-stamps with each read and write. The cache would then check both timestamps when performing its tasks. This method has performance trade-offs.

    None of the pervious alternatives were that appealing so it was concluded that the system needs to actively substitute old time-stamps for new ones. This means that the cache will re-align time-stamps all in one hit through the use of specialised gates. This is all good and well but the issue of the time-stamp re-alignment process still remains. Some possibilities to solve this problem are:

  • subtract 1 from the exponents of all times-stamps in system each time GVT passes a time-stamp whose exponent is >= 1. This would mean that each time-stamp in the TS-cache would have subtract 1 circuitry attached to the exponent field - a viable alternative. The problem with this method is that the range in the mantissa does not get larger. So there are no new time spaces created, and thus the system will not proceed.
  • a second alternative, an extension of the first, is to subtract from the exponent and shift the mantissa. ie. the shift and align method would be
    	exponent  mantissa
    	+-------+ +--------+
    	| exp   | | x  FFF |
    	+-------+ +--------+
    	if x == 1 then
    	+-------+ +--------+
    	| exp-1 | | FFF 0  |
    	+-------+ +--------+
    	else
    	+-------+ +--------+
    	| exp   | | x-1FFF |
    	+-------+ +--------+
    
    
    This is also achievable in the TS-cache with the gate complexity dependent on the base chosen for the exponent. This problem is still open to much deep thought and any suggestions or solutions are welcome.


    23rd November 1995

    No meeting.


    30th November 1995

    No meeting.


    7th December 1995

    No meeting again. We must be getting lazy, or is it the holiday season?.


    14th December 1995

    And again...


    21st December 1995

    Merry Christmas


    1996
    Send any questions or suggestions about this page to mpearson@cs.waikato.ac.nz or rhl@cs.waikato.ac.nz .