In the 1970s, the Xerox Office Systems Division produced an operating system called "Pilot" with a world-stop debugger [Redell80] called "CoPilot". When you hit a breakpoint or pressed a reset button on the computer, the entire virtual memory image would be written to disk, and the CoPilot debugger would be restored from a previously unmounted superior disk volume. In essence, the world-stop debugger controlled everything about the computer system from an unmounted disk volume, which served as a firewall. In this system the OS was just an application like any other. Using this idea, you worked with three volumes: your main volume, where you did your work; an inferior volume, where you debugged new code; and a superior volume, in case your OS crashed.
This is an amazing idea. It provides a strobe light for looking at software. You may set breakpoints and debug the debugger, running on a subordinate disk volume, using the debugger on a superior volume. You may debug or rewrite the file system with no worries about corrupting your execution environment. You may write a bootable new operating system (within reason) using the old debugger for the entire project. You may step every line of code in the system, including every line of every interrupt routine. We still can't do these things with 99% of the debuggers available today, 25 years later.
The only code that was untouchable was the debugger "nub", a piece of code that performed the world swap and talked to the network (in the case of remote debugging). Because the CPU microcode was stored and swapped with the volume, you could even debug microcode or run INTERLISP on a remote volume!
In the implementation of real-time software, the idea of a time-stop debugger is just as appealing as the world-stop debugger. However, there are a number of problems to be resolved. When time is frozen, important system services (e.g. timers, time-sliced tasking, time-triggered applications, watchdogs) do not work. Therefore, to implement a time-stop debugger, some sort of world-stop debugger is required.
When CoPilot was written in 1979, a line of code could be stepped in about 5 to 10 seconds. By 1986, main memory was 8 times bigger, and it took sixty seconds to swap in the debugger and another sixty seconds to swap it out. The debugger turned into a failure because its benefits didn't justify the 2-minute cost of stepping a line of code. This is amusing, because Xerox generally practiced a form of "future prediction" research, where you predict the hardware that will be widespread in 10 years, build it now, and get going on the software that takes advantage of that hardware. Clearly, they were right about how hardware would get better from 1978 to 1985, but they misjudged the size of the software binaries by a mile!
Eventually, the Xerox OSD programmers modified the system to allow debugging in your own environment, increasing productivity. However, self-corruption was possible, and many types of code (e.g. certain interrupts and critical debugger tasks) could not be stepped. They could still be debugged from the superior volume, though.
For effective time-stop debugging, you must be able to freeze time for some portion of the system, while time proceeds for the rest of the system (i.e. for the debugger itself). Thus, a time-firewall is necessary to allow part of the system to proceed in time, while the rest of the system is frozen in time.
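One way to picture the time-firewall is as a virtual clock that the debugged code reads instead of the real one. The sketch below is only an illustration under assumed names (FreezableClock, ManualTicks, freeze, thaw are all hypothetical and have nothing to do with Pilot or CoPilot): the debugger keeps reading the real tick source, while the debuggee sees a clock that stops at a breakpoint and resumes with no jump.

```python
class ManualTicks:
    """Stand-in for the real hardware tick source (hypothetical, for the demo)."""

    def __init__(self):
        self.t = 0

    def __call__(self):
        return self.t

    def advance(self, n):
        self.t += n


class FreezableClock:
    """Debuggee-side view of time; pauses while the debugger is in control."""

    def __init__(self, source):
        self._source = source      # callable returning real time in ticks
        self._frozen_at = None     # real time when the freeze began, or None
        self._skipped = 0          # real ticks hidden from the debuggee so far

    def now(self):
        # While frozen, report the instant of the freeze; otherwise real time.
        real = self._frozen_at if self._frozen_at is not None else self._source()
        return real - self._skipped

    def freeze(self):
        if self._frozen_at is None:
            self._frozen_at = self._source()

    def thaw(self):
        if self._frozen_at is not None:
            # Hide the whole debugging pause from the debuggee.
            self._skipped += self._source() - self._frozen_at
            self._frozen_at = None


real = ManualTicks()              # the debugger's side of the firewall
clock = FreezableClock(real)      # the debuggee's side of the firewall

real.advance(10)
clock.freeze()                    # breakpoint hit
real.advance(100)                 # debugger runs for 100 real ticks
frozen_reading = clock.now()      # debuggee still sees tick 10
clock.thaw()                      # continue execution
real.advance(5)
resumed_reading = clock.now()     # debuggee sees tick 15, not 115
```

In a real system every timer, time slice, and watchdog on the frozen side would have to be derived from the virtual clock; anything that reads the real source directly leaks through the firewall.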
These days most complex real-time systems involve many CPUs (Iridium has 7 onboard PowerPC 603 CPUs). A Globalstar transceiver subsystem (forward link or reverse link) has at least 20 VME cards. Without a facility to stop all the CPUs at once, debugging such a distributed system would probably be useless.
To simplify the implementation of a time-stop debugger, the hardware should be restricted to a single timing device. The clock fed to this device could be enabled or disabled via software.
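The benefit of a single gated timing device can be shown with a small simulation. This is a sketch under assumptions, not a description of any real hardware: GatedTimer, its enabled bit, and Watchdog are hypothetical names, and the oscillator is modeled as a plain loop. Because the watchdog is derived from the one counter, closing the gate stops it along with everything else.

```python
class GatedTimer:
    """Model of one hardware counter whose input clock can be gated."""

    def __init__(self):
        self.enabled = True    # assumed software-writable clock-enable bit
        self.count = 0

    def pulse(self):
        """One oscillator cycle; counted only while the gate is open."""
        if self.enabled:
            self.count += 1


class Watchdog:
    """Expires when `timeout` counted ticks pass without a kick."""

    def __init__(self, timer, timeout):
        self._timer = timer
        self._timeout = timeout
        self._last_kick = timer.count

    def kick(self):
        self._last_kick = self._timer.count

    def expired(self):
        return self._timer.count - self._last_kick >= self._timeout


def run(timer, cycles):
    """Drive the oscillator for a number of cycles."""
    for _ in range(cycles):
        timer.pulse()


timer = GatedTimer()
dog = Watchdog(timer, timeout=100)

run(timer, 50)                    # normal execution
timer.enabled = False             # debugger closes the gate at a breakpoint
run(timer, 1000)                  # oscillator keeps running, counter does not
bitten_while_stopped = dog.expired()
timer.enabled = True              # debugger reopens the gate
run(timer, 50)
bitten_after_resume = dog.expired()
```

With several independent timing devices the debugger would have to stop each one, and any skew between them would be visible to the software once time resumed.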
(C) 1999, Donald W. Gillies. All Rights Reserved.