The effectiveness of the proposed architectural enhancements for block transfers was investigated through trace-driven simulation. Two types of workloads were considered: address traces and synthetic algorithms. The address traces, SIMPLE, WEATHER, and FFT, are discussed in [6]. The synthetic algorithms produce the access patterns that would result from executing matrix multiply and successive over-relaxation (SOR) algorithms. To insert block transfers into the workloads, the page replication operations of a typical operating system were simulated as described in [2].
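As an illustration of what a synthetic workload of this kind might look like, the following minimal sketch generates the memory reference stream of a naive matrix multiply. It is not the generator used in the study: the element size, the base addresses, the row-major layout, and the names `matmul_trace` and `pages_touched` are all assumptions made for this example. Only the 4KB page size comes from the simulated system.

```python
from typing import Iterable, Iterator, Set, Tuple

PAGE_SIZE = 4096   # 4KB pages, as in the simulated system
ELEM_SIZE = 8      # assumption: 8-byte matrix elements


def matmul_trace(n: int,
                 base_a: int = 0x100000,   # hypothetical base address of A
                 base_b: int = 0x200000,   # hypothetical base address of B
                 base_c: int = 0x300000    # hypothetical base address of C
                 ) -> Iterator[Tuple[str, int]]:
    """Yield ('R' or 'W', address) pairs for a naive C = A * B.

    Matrices are n x n and assumed to be stored in row-major order.
    The default bases are 1MB apart, so they only stay disjoint for small n.
    """
    for i in range(n):
        for j in range(n):
            for k in range(n):
                yield ("R", base_a + (i * n + k) * ELEM_SIZE)  # read A[i][k]
                yield ("R", base_b + (k * n + j) * ELEM_SIZE)  # read B[k][j]
            yield ("W", base_c + (i * n + j) * ELEM_SIZE)      # write C[i][j]


def pages_touched(trace: Iterable[Tuple[str, int]]) -> Set[int]:
    """Return the set of page numbers referenced by a trace."""
    return {addr // PAGE_SIZE for _, addr in trace}
```

Feeding such a stream to a simulated operating system would let each first touch of a remote page trigger a replication, which is where the 4KB block transfers enter the workload.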
Simulations were run on both the base and enhanced architectures. All simulations modeled a 16-node system with a page size of 4KB. Block transfers occur as a result of page replications; therefore, every block transfer moves 4KB of data. Since page replications are most common in the initialization phases of programs, we concentrated on the beginning of each workload: the first five million cycles of SIMPLE and WEATHER, the first 1.5 million cycles of FFT, and the first two million cycles of the two synthetic algorithms.
Figure a shows the average time required to replicate a 4KB page on each system. A page replication on the enhanced system takes between and as long as it does on the base multiprocessor.
Figure: Simulation Results
The enhanced multiprocessor is expected to suffer from a longer remote
access time for non-block transfer requests. Figure b shows that this is
indeed the case. The vertical axis in this graph is the average remote
access time for normal read and write requests. How much the enhancement
slows these requests depends on the block transfer behavior of the
workload. Workloads that have long periods of low block transfer activity
(such as FFT) do not show a marked increase in non-block transfer remote
access time, while those that are constantly performing block transfers
(the WEATHER trace, for example) do.
The above results show that in the enhanced multiprocessor, replications are
satisfied much faster than they are in the base multiprocessor, while
non-block transfer requests take longer. The extent to which these
two factors affect overall system performance is shown in
Figure c.
The height of each bar represents the total number of
program accesses satisfied in a fixed time
(five million cycles for SIMPLE and WEATHER, 1.5 million
cycles for FFT, and two million cycles for the two synthetic algorithms).
A program access is defined as
any memory reference made by the program. Many of these accesses will
result in page faults, which might, in turn, cause the operating system to
perform a replication. The accesses performed by the operating system
during a replication are not counted as program accesses. The number of
program accesses is therefore proportional to the fraction of the trace
file or algorithm that has been completed.
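The metric described above can be sketched as follows. This is a simplified illustration, not the simulator used in the study: it assumes a single stream of accesses satisfied one after another, and the `Access` record and the function name `program_accesses_in_budget` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Access:
    cycles: int        # cycles needed to satisfy this access
    by_program: bool   # False for accesses issued by the OS during a replication


def program_accesses_in_budget(accesses: Iterable[Access], budget: int) -> int:
    """Count program accesses that complete within a fixed cycle budget.

    Assumes accesses are satisfied strictly one after another. Accesses made
    by the operating system consume cycles but are not counted.
    """
    elapsed = 0
    count = 0
    for access in accesses:
        if elapsed + access.cycles > budget:
            break  # this access would not finish inside the budget
        elapsed += access.cycles
        if access.by_program:
            count += 1
    return count
```

Under this model, a faster replication shortens the operating system's share of the budget, while a slower remote access lengthens each program access; the count reflects the net effect of both.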
As shown, the enhanced multiprocessor performs significantly better than the base multiprocessor in most cases. The improvement ranges from 20% for the FFT trace to almost 50% for the WEATHER trace. The only workload for which the enhanced system performs worse is the SOR algorithm, in which non-block transfer remote accesses are intermingled with block transfer requests.
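For reference, an improvement percentage of this kind is conventionally computed as the relative change in program accesses satisfied within the same cycle budget. The sketch below assumes that definition; the function name `relative_improvement` is hypothetical, and the counts in the usage note are made-up illustrative values, not measurements from the study.

```python
def relative_improvement(base_accesses: int, enhanced_accesses: int) -> float:
    """Return the percentage change in program accesses relative to the base system.

    A positive result means the enhanced system satisfied more program
    accesses in the same time; a negative result means it satisfied fewer.
    """
    if base_accesses <= 0:
        raise ValueError("base_accesses must be positive")
    return 100.0 * (enhanced_accesses - base_accesses) / base_accesses
```

With hypothetical counts, `relative_improvement(1000, 1200)` gives 20.0, and a workload that regresses, as SOR does here, would yield a negative value.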