Next: Concluding remarks Up: Architectural Support for Block Previous: Non-block transfer traffic

Evaluation

The effectiveness of the proposed architectural enhancements for block transfers was investigated through trace-driven simulation. Two types of workloads were considered: address traces and synthetic algorithms. The address traces (SIMPLE, WEATHER, and FFT) are discussed in [6]. The synthetic algorithms produce the access patterns that would result from executing matrix multiply and successive over-relaxation (SOR) algorithms. To insert block transfers into the workloads, the page replication operations of a typical operating system were simulated as described in [2].
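As an illustration of what a synthetic workload of this kind might look like, the sketch below generates the memory references of a naive matrix multiply. The row-major layout, the 8-byte element size, the base addresses, and the name `matmul_trace` are all assumptions made for this example; the generator actually used in the simulations is not described in this section.

```python
from typing import Iterator, Tuple

ELEM_SIZE = 8  # bytes per matrix element (assumed, not from the paper)


def matmul_trace(n: int, base_a: int, base_b: int,
                 base_c: int) -> Iterator[Tuple[str, int]]:
    """Yield (op, address) pairs for C = A * B on n x n row-major matrices.

    'R' marks a read and 'W' a write. The running sum for C[i][j] is
    assumed to stay in a register, so C is written once per element.
    """
    for i in range(n):
        for j in range(n):
            for k in range(n):
                yield ("R", base_a + (i * n + k) * ELEM_SIZE)  # A[i][k]
                yield ("R", base_b + (k * n + j) * ELEM_SIZE)  # B[k][j]
            yield ("W", base_c + (i * n + j) * ELEM_SIZE)      # C[i][j]
```

A simulator could consume this stream in the same way as a recorded address trace, mapping each address to a page and a home node.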

Simulations were run using both the base and enhanced architectures. In all simulations, a 16-node system with a page size of 4KB was simulated. Block transfers occur as a result of page replications; therefore, all block transfers move 4KB of data. Since page replications are most common in the initialization phases of programs, we concentrated on the beginning of the address trace simulations (the first five million cycles of SIMPLE and WEATHER, 1.5 million cycles of FFT, and two million cycles of the two synthetic algorithms).
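The simulation parameters stated above can be collected in a small configuration sketch. The class name `SimConfig`, the dictionary `WINDOW_CYCLES`, and the workload labels `MATMUL` and `SOR` are hypothetical names chosen for this example; only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimConfig:
    nodes: int = 16             # 16-node system
    page_size: int = 4 * 1024   # 4KB pages, in bytes


# Number of simulated cycles examined for each workload.
WINDOW_CYCLES = {
    "SIMPLE": 5_000_000,
    "WEATHER": 5_000_000,
    "FFT": 1_500_000,
    "MATMUL": 2_000_000,  # synthetic matrix multiply
    "SOR": 2_000_000,     # synthetic successive over-relaxation
}


def block_transfer_bytes(cfg: SimConfig) -> int:
    """Every block transfer is a page replication, so it moves one page."""
    return cfg.page_size
```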

Figure (a) shows the average time required to replicate a 4KB page on each system. A page replication on the enhanced system takes between tex2html_wrap_inline173 and tex2html_wrap_inline175 as long as it does on the base multiprocessor.

 

Figure: Simulation Results

The enhanced multiprocessor is expected to suffer from a longer remote access time for non-block transfer requests, and Figure (b) shows that this is indeed the case. The vertical axis of this graph is the average remote access time for normal read and write requests. How much the enhancement slows these requests depends on the block transfer behavior of the workload. Workloads with long periods of low block transfer activity (such as FFT) show no marked increase in non-block transfer remote access time, while those that perform block transfers constantly (the WEATHER trace, for example) do.

The above results show that in the enhanced multiprocessor, replications are satisfied much faster than in the base multiprocessor, while non-block transfer requests take longer. The combined effect of these two factors on overall system performance is shown in Figure (c). The height of each bar represents the total number of program accesses satisfied in a fixed time (five million cycles for SIMPLE and WEATHER, 1.5 million cycles for FFT, and two million cycles for the two synthetic algorithms). A program access is defined as any memory reference made by the program. Many of these accesses result in page faults, which may in turn cause the operating system to perform a replication. The accesses performed by the operating system during a replication are not counted as program accesses. The number of program accesses is proportional to the fraction of the trace file or algorithm that has been completed.
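The metric described above can be sketched as a simple count over completed accesses. The event format (a completion cycle paired with a source tag), the tag strings, and the treatment of an access finishing exactly at the window boundary are assumptions made for this example, not details taken from the simulator.

```python
from typing import Iterable, Tuple


def count_program_accesses(events: Iterable[Tuple[int, str]],
                           window_cycles: int) -> int:
    """Count program accesses satisfied within the first window_cycles cycles.

    Each event is (completion_cycle, source), where source is 'program'
    for a reference made by the program and 'os' for an access issued by
    the operating system during a replication. OS accesses are excluded.
    """
    return sum(
        1
        for cycle, source in events
        if source == "program" and cycle < window_cycles
    )
```

Running the same workload on the base and enhanced configurations and comparing the two counts gives the bar heights being compared.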

As shown, the enhanced multiprocessor performs significantly better than the base multiprocessor in most cases. The improvement ranges from 20% for the FFT trace to almost 50% for the WEATHER trace. The only workload for which the enhanced system performs worse is the SOR algorithm, in which non-block transfer remote accesses are intermingled with block transfer requests.
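Assuming the quoted percentages are measured relative to the base multiprocessor (the text does not state the formula), the improvement for one workload could be computed as follows. The function name and the access counts in the usage lines are illustrative only and are not simulation results.

```python
def relative_improvement(base_accesses: int, enhanced_accesses: int) -> float:
    """Fractional change in program accesses satisfied, relative to the base.

    A positive value means the enhanced system completed more program
    accesses in the same number of cycles; a negative value means fewer.
    """
    if base_accesses <= 0:
        raise ValueError("base_accesses must be positive")
    return (enhanced_accesses - base_accesses) / base_accesses


# Illustrative counts only (not taken from the paper):
gain = relative_improvement(1000, 1200)   # enhanced system ahead
loss = relative_improvement(1000, 900)    # enhanced system behind
```

Under this definition a workload such as SOR, where the enhanced system satisfies fewer program accesses, yields a negative value.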



Steve Wilton
Tue Jul 30 14:40:51 EDT 1996