Next: Non-block transfer traffic Up: Architectural Support for Block Previous: Simple read requests

Block transfer support

We will now examine how block transfers can be performed in the multiprocessor. Remote-to-local block transfers are performed in the base system using a sequence of read and write commands. Each read command causes a single word to be transferred from the remote memory to one of the processor's registers and is followed by a write command which writes the received word into the processor's local memory. Each read that misses in the cache causes a read request to be sent through the network. The writes are all local writes; they do not cause any interprocessor traffic.
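The base system's word-by-word copy can be sketched as follows. This is a minimal model, not the machine's actual interface: the block size, the dict-based memories, and the helper names are illustrative assumptions.

```python
# Sketch of the base system's remote-to-local block transfer.
# WORDS_PER_BLOCK, the dict-based memories, and the helper names are
# illustrative assumptions, not the machine's actual interface.
WORDS_PER_BLOCK = 4

def remote_read(remote_mem, addr):
    """Models one read command: a request packet crosses the network
    and a single word is returned to a processor register."""
    return remote_mem[addr]

def copy_block(remote_mem, local_mem, src, dst):
    """Copy one block word by word; returns the number of network requests."""
    requests = 0
    for i in range(WORDS_PER_BLOCK):
        word = remote_read(remote_mem, src + i)  # one network round trip
        requests += 1
        local_mem[dst + i] = word                # local write: no network traffic
    return requests
```

Note that the model generates one network request per word, which is exactly the overhead the enhanced architecture below is designed to eliminate.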

Now consider an ``enhanced'' multiprocessor in which the operating system (or the user program) can transfer a block from a remote memory to the local memory using the single instruction:

[Table: block transfer instruction with source and destination addresses]

The figure below shows the new node interface. A new queue, the incoming block queue, has been added; it is large enough to contain the largest block of data that will be transferred. The outgoing queue has also been made large enough to contain an entire block.

[Figure: Enhanced node architecture]

Consider what happens when the processor executes a block transfer instruction in the enhanced multiprocessor. First, the processor signals the local node interface that a block transfer is to occur, and transfers the source and destination addresses to the node interface. A request packet, called a block read request, is constructed and stored in the outgoing queue. The first empty slot on the ring is filled with the request packet.
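The request setup above can be sketched as follows; the packet fields and the queue object are illustrative assumptions, since the real packet format is not specified here.

```python
from collections import deque

def start_block_transfer(out_queue, src_addr, dst_addr, requester):
    """Build a block read request and queue it for the next empty ring slot.
    The packet fields are illustrative; the real packet format is not shown."""
    packet = {"type": "block_read_request",
              "src": src_addr,
              "dst": dst_addr,
              "from": requester}
    out_queue.append(packet)  # sits here until an empty slot passes by
    return packet
```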

When the remote node interface receives the request (and accepts it), it reads the entire block from memory. We will refer to this as the read phase. As it reads data from its memory, the remote node interface constructs data return packets and stores them in the outgoing queue. Eight bytes of data can be packed into each packet. When an empty slot appears on the ring, the packet at the head of the outgoing queue is sent to the requesting node. The packing and transmission of packets is called the transfer phase. If the network is heavily utilized, much (or all) of the block may be read from memory before it is sent; the outgoing queue must therefore be large enough to contain an entire block.
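A minimal sketch of the read and transfer phases, assuming an eight-byte payload per data return packet; the packet layout and queue model are illustrative assumptions.

```python
BYTES_PER_PACKET = 8  # eight bytes of data fit in each data return packet

def serve_block_read(memory, out_queue, src, block_bytes):
    """Read phase: fetch the whole block from memory.  Transfer phase:
    pack it into data return packets queued for empty ring slots.
    The packet layout is an illustrative assumption."""
    block = memory[src:src + block_bytes]           # read phase
    for off in range(0, block_bytes, BYTES_PER_PACKET):
        out_queue.append({"type": "data_return",    # transfer phase
                          "data": block[off:off + BYTES_PER_PACKET]})
```

In the worst case all of the packets sit in the queue before the first empty slot arrives, which is why the queue must hold an entire block.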

When the requesting node interface receives the data packets, it writes the data directly to the local memory through the cache-memory buffer. This is the write phase. If the data arrives on the network faster than it can be stored in memory, the incoming block queue holds the overflow. Thus, the incoming block queue should also be large enough to contain an entire block. Once the write phase has completed, the node interface signals the processor that it can continue.
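The write phase's overflow behavior can be modeled as follows; the arrival and drain rates are arbitrary illustrative values, chosen only so that data arrives faster than memory can absorb it.

```python
from collections import deque

def write_phase(packets, memory, dst, arrive_per_step=2, drain_per_step=1):
    """Model the write phase with data arriving off the ring faster than
    memory can absorb it; the incoming queue holds the overflow.
    The rates are illustrative.  Returns the peak queue occupancy."""
    queue, pending, written, peak = deque(), list(packets), 0, 0
    while pending or queue:
        for _ in range(arrive_per_step):      # packets arrive off the ring
            if pending:
                queue.append(pending.pop(0))
        peak = max(peak, len(queue))
        for _ in range(drain_per_step):       # memory drains the queue
            if queue:
                memory[dst + written] = queue.popleft()
                written += 1
    return peak
```

Since a transfer moves at most one block, the peak occupancy is bounded by the block size, which motivates sizing the queue to hold an entire block.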

Note that the data being written during the write phase goes through the cache-memory buffer. All other requests, including read requests originating from other nodes, go through the interface-memory buffer. Consider a node in the write phase of its own block transfer. If an unrelated request from another node arrives, the new request can be accepted, since the path through the interface-memory buffer is not being used by the block write. The round-robin scheduling of requests in the two memory buffers described earlier results in the memory bandwidth being shared between the block write and the new request. Note that this is only possible because we have assumed that the processor will not issue a new request until the previous request has been satisfied: during a block write, the cache is guaranteed not to issue a new request, and therefore will not use the cache-memory buffer.
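The round-robin sharing of memory bandwidth between the two buffers can be sketched as follows; the buffer contents are illustrative labels, not real request formats.

```python
from itertools import cycle

def round_robin_service(cache_buf, iface_buf):
    """Alternate memory service between the cache-memory buffer (the
    block write) and the interface-memory buffer (requests from other
    nodes).  A buffer's turn is skipped if it is empty, so a lone
    stream gets the full memory bandwidth."""
    served, turn = [], cycle(["cache", "iface"])
    while cache_buf or iface_buf:
        side = next(turn)
        if side == "cache" and cache_buf:
            served.append(cache_buf.pop(0))
        elif side == "iface" and iface_buf:
            served.append(iface_buf.pop(0))
    return served
```

With both buffers occupied the service order interleaves them, so the block write and the unrelated request each get half of the memory bandwidth.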

Intuitively, such a system should perform block transfers much faster than the base architecture: less time is wasted on individual request messages, and when many processors contend for the same memory, the enhanced architecture competes only once to transfer a page, whereas the base architecture must compete for each word (or cache line).



Steve Wilton
Tue Jul 30 14:40:51 EDT 1996