![]()
Results of discussions from meetings held on:
August 3rd, 10th, 17th, 24th, 31st
September 7th, 14th, 21st, 28th
November 2nd, 9th, 16th, 23rd, 30th
: CHILD CC1 @A 1 : CHILD CC2 @B 1 : STOT CC3 V1 Rx 1 :which child does the STOT belong to?
The assembler either has to assure that the STOT's register Rx is the same in
both children (a hard problem), or that the STOT belongs to one of the CHILD
instructions. A solution is to associate a tag with each child in the block,
ie. extend the child numbering to 1.a, 1.b, 2.a, etc...
This makes the previous piece of code look like:
: CHILD CC1 @A 1.a : CHILD CC2 @B 1.b : STOT CC3 V1 Rx 1.a :How the hardware deals with multiple children at the same child number is a separate issue. In the above code the two children have different conditional codes, but there may be cases where for short periods of time both condition codes are true. This means that both children will be fired. In situations like this time-space cache may have to cope with multiple writes to the same address at the same time. The T-S cache will have to maintain both copies until one of them is anti-written. To remove this necessity from the T-S cache design it has been decided that we can assume that the processor will never fire multiple children if it finds two more with the same time stamp. These children will be held until the appropriate anti-messages have been sent (and received) making only one of the children valid - the others will be removed.
Also discussed was the layout of the Warp Lab and what shelving, white boards hardware etc... is to be placed in the lab. This is not very interesting unless you were there.
Also as part of the specification for the Time-Space cache each component of the cache should have a set of modes associated with it. ie. waiting for input, GVT, fossil collection, etc... From these modes a finite state machine can be developed. This will both show the state of the cache at any given time and help in the hardware layout stage.
Some of the operations within each of the modules of the T-S cache need to be better defined so that there are no ambiguities as to the operation of these components given any set of input parameters. ie. for each set of possible control line inputs some explicit action needs to be defined.
This walk through was only interesting if you were there.
The possibility of another instruction (for the Warp Engine) was mentioned. This came about for the case when a value is loaded form memory and a page fault occurs. In this situation some kernel code needs executed to deal with the page fault. As the instruction set currently stands a load and store to the same address can be done with the same time stamp. If the page fault occurs on the load then there is no time to perform the kernel code before the write is executed.
A solution is to replace the LOD instruction with one similar to a STOT. An address is given to the instruction and the value at that address is then stored in the specified register of the child. This allows time for the kernel code to be executed (the time between parent and child). - This proposal needs more thought as to how this instruction will affect the way in which programs are written.
Matrix Size | Max Speedup |
Number Procs | Max T-S size |
Number Procs |
| 5x5x2 | 11 | 30 | 38 | 16 |
| 10x10x2 | 21 | 100 | 55 | 10-20 |
| 20x20x2 | 35 | 200 | 105 | 10-20 |
Richard has to get his WarpEngine simulator to produce the same statistics and to compare the results to those of Husam's.
Murray talked about the number of gates and the complexity of the gates to build the T-S cache. Each cell is constructed as follows:
+---------+---------+---------+---------+---+ field | Address |TimeStamp| Tag | Value |Dir| # bits | 32 | 32 | 32 | 32 | 2?| +---------+---------+---------+---------+---+These cells are aligned as a linked list. Each read and write is inserted into the T-S cache in timestamp order. To evaluate the timestamp ordering 241 gates are associated with the TimeStamp field of each cell, and similarly there are 64 gates associated with the Address field to determine address equality. Thus for each bit in the cell there are <=20 gates associated with it. In total for a T-S cache with 200 entries there are approximately 0.5M gates (130x20x200 = 520000).
When performing a read the value in the cell meeting the following condition
is returned:
greatest_TimeStamp( (cell_Address == I_Address) && (cell_TimeStamp < I_TimeStamp) && (cell_Dir == write) )
Other interesting (non dataflow) parallel architectures mentioned were:
Some of the above machines mentioned above had parts similar to our proposed WarpEngine. These features include the routing networks between instruction storage and processors (MIT's static architecture), the two input multiple output instruction format (LAU) and the grouping of instructions to form instruction blocks (EDFG). In dynamic data flow machines some form of tagging is required so that messages to the same node can be distinguished. Many of the above architectures solve this by associating tags with the data and instructions as they move around the system, where as the WarpEngine solves this using time stamps.
From the discussion it was unclear as to why these architectures have not succeeded (none have been mass produced). One reason may be the fact that single threaded programs do not perform well on them due to the long communications paths (ie. the five stage cycle of the Machester dataflow machine) or their instruction scheduling mechanisms (ie. the 8-to-1 multiplexed pipeline in Monsoon). Another feature of these architectures adding to the problem is the (relatively) large size of the messages that are sent. The WarpEngine may solve the problem through its optimistic execution approach and relatively small messages.
Further investigation into the reason why dataflow architectures have not succeeded has to be carried out.
John had some ideas for changes to the instruction set and will mail them to the appropriate people for discussion at a later meeting (21st September).
Richard continued his discussion about other dataflow architectures in the pursuit of the reason for failure of these machines. This discussion will be continued at the next meeting.
To build a successful dataflow machine the above mentioned points need to be addressed. The WarpEngine (though not purely dataflow) does provide solutions (in some form or another) to most of these problems.
Richard's simulator has reached the point where some statistical information can be collected. He has to create and collect the following:
The instructions that have been removed or replaced are, CHILD, CHILD4, FORK, TRAP, STOT, STO (now ST), LOD, LDV, and NOP. They have been replaced by CH, MV and MA. NOP has been replaced by the use of the select control bits. CH is a more powerful version of the CHILD instruction, but its exact layout has needs to be determined. The STOT and LOD instructions have been replaced with MV and MA which Move a Value (or value at an Address) to a given child - this allows us to deal with memory page faults. The CHILD4 and LDV instructions have simply been deemed unnecessary. The rest of the comparison, logic and arithmetic instructions remain the same except that their destination register determination has been changed to allow for the reduced F-field length. The provision for floating point instructions has also been allowed, although their exact instruction layout has not been determined.
NOTE: These new instructions and their layouts are not set in stone and may change. The new instruction set will be added to the WarpEngine instruction set web page when it has been finalised.
The SC bit has now been incorporated into the op-code. Op-codes 0-15 use their
low order bit, SC, to control the Select register. This means that there are
only 8 op-codes in this range and 24 in total. So the first 8 instructions can
be conditionally executed (by setting the SC bit), and the remaining 16 cannot -
this makes more sense than having all instructions conditional.
NOP is achieved by using SC=0 and never setting it.
The CH (now CHILD again) instruction takes the address of the child block of
code in R1 (A1) and a migration hint in R2. This allows the child to be dynamically
allocated to a processor at run-time. In the F-field there:
Two more functions were discussed but not finalised, they were ITOF and FTOI to allow conversion between integer and floating point numbers.
Also discussed was a new strategy for the time stamps. This involved having three bits of a time stamp per child (not just 2). The extra bit is added so that there are time intervals in which faults can be handled. The bit patterns for these are:
NOTE: Once again the instruction set and time stamps will be explained in more detail when they have been finalised. They will be added to the WarpEngine Instruction Set page when complete.
Also discussed were some of the preliminary results that Richard has gathered from his simulator. One notable result was that instructions that return more than one copy of a result (OR's, ADD, DIV, etc..) tend to, on average, return around 2 results. Thus the number of results returned from the functional units to the child frames can be halved if the frames can route multiple instances internally. This result may be significant when determining the types of switching mechanisms to use in the routing networks. Other results were as expected.
One issue that was raised is how the WarpEngine is going to deal with Free Lists (for dynamic memory allocation). This needs some deep thought!
Murray and Richard have to create outlines for papers about aspects of the WarpEngine. Murray's paper is going to talk about the Time-Space cache, while Richard's is to be about the WarpEngine simulator. Another paper about the WarpEngine in general will be written - but at a later date when more data has been collected.
Richard found a paper on a system called Tetra. Tetra is a tool for evaluating serial program performance under the resource and control constraints of fine-grained parallel processors. He has to look into it further and report back at next weeks meeting.
This then led into a discussion about the speedup results that Richard's simulator has produced. For matrix multiplication there was approximately a 1.5 times speedup for the fine-grained parallelism, and >O(N^2) speedup overall. The fine-grained result is very dependent on the cycle timings of the instructions and may change when more accurate cycle times are given.
Murray then went into more detail about how the TS-Cache is to be implemented - at the gate level. The cache is to produce one result per cycle. A problem that might occur is when a frame is reset and a rollback occurs on that frame (but the old version of it that has been removed). How does the system cope with the anti-reads and anti-writes that may have been produced from the old version of the frame? The answer is that this will never happen.
The process for reads and writes in the TS-Cache are known and can both be performed in one cycle. In the case where a write produces multiple anti-read messages (form reads with later timestamps in the cache) then one message is returned per cycle. For the anti-read case can the TS-cache receive an anti-read before are read for a given location? - this could happen through network delays. This event is going to be rare so any processing can take longer if necessary.
Other issues that Murray will talk about next week are, how the cache handles GVT fossil collection and the process of timestamp subtraction. Richard is to collect some results from the simulator and report on them.
Other thoughts that arose were:
Things to look forward to:
Firstly, each timestamp is to be 32 bits long in a floating point notation. 16 bit exponent and 16 bit mantissa. So the sub-ranges in a execution tree would look like: (exp man)
(0000 0000 - FFFF FFFF)
/ / \ \
(0000 0000 - FFFF 3FFF)(FFFF 4000 - FFFF 7FFF)(FFFF 8000 - FFFF BFFF)(FFFF C000 - FFFF FFFF)
Points to consider about the time-stamps and the TS-cache if the re-calculation
process is to be distributed amongst the TS-cache entries are:
Some possible alternatives are:
None of the pervious alternatives were that appealing so it was concluded that the system needs to actively substitute old time-stamps for new ones. This means that the cache will re-align time-stamps all in one hit through the use of specialised gates. This is all good and well but the issue of the time-stamp re-alignment process still remains. Some possibilities to solve this problem are:
exponent mantissa +-------+ +--------+ | exp | | x FFF | +-------+ +--------+ if x == 1 then +-------+ +--------+ | exp-1 | | FFF 0 | +-------+ +--------+ else +-------+ +--------+ | exp | | x-1FFF | +-------+ +--------+This is also achievable in the TS-cache with the gate complexity dependent on the base chosen for the exponent. This problem is still open to much deep thought and any suggestions or solutions are welcome.