It is well known that the design of a multiprocessor operating system depends heavily on the architecture of the computer on which it runs [1]. The reverse is also true: the architecture of a multiprocessor should be strongly influenced by the requirements of the operating system. For example, an operating system for a non-uniform memory access (NUMA) multiprocessor commonly transfers large blocks of data from one memory unit to another. Such an operating system will run much more efficiently if the system architecture supports fast block transfers.
In this paper, we examine how one shared memory multiprocessor can be enhanced to support efficient block transfers. We are especially interested in blocks equal in size to a virtual page. In the base multiprocessor, a processor must issue one or more separate instructions to transfer each word in a page. In the enhanced system, any processor can issue a single instruction that causes the multiprocessor to transfer a block of words from a remote memory to that processor's local memory. We support remote-to-local transfers because they are common in many NUMA machines [2]. To examine the extent to which block transfer support affects performance, we developed a simulator that models the multiprocessor at the register transfer level.
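The contrast between the base and enhanced systems can be illustrated with a small sketch that counts the instructions a processor issues to move one page. This is not the simulator described above, only a toy model. The page size of 1024 words and the cost of two instructions per word (a load and a store) are assumptions chosen for illustration, and the function names are hypothetical.

```python
# Toy model: instructions issued by the processor to copy one page.
# PAGE_WORDS and instr_per_word are illustrative assumptions, not
# parameters of the multiprocessor studied in this paper.

PAGE_WORDS = 1024  # assumed page size in words


def copy_word_by_word(remote, local, instr_per_word=2):
    """Base system: the processor issues instructions for every word."""
    issued = 0
    for i, word in enumerate(remote):
        local[i] = word            # one remote read plus one local write
        issued += instr_per_word   # assumed: one load and one store
    return issued


def copy_block(remote, local):
    """Enhanced system: a single instruction starts the whole transfer."""
    local[:len(remote)] = remote   # stands in for the hardware moving the block
    return 1


remote_page = list(range(PAGE_WORDS))

local_page = [0] * PAGE_WORDS
base_count = copy_word_by_word(remote_page, local_page)

local_page_enhanced = [0] * PAGE_WORDS
enhanced_count = copy_block(remote_page, local_page_enhanced)

print(base_count, enhanced_count)
```

The sketch counts issued instructions only. It says nothing about how long the transfer takes, which depends on memory and interconnect timing and is the kind of question the register transfer level simulator is meant to answer.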