Article - Feb 4, 1999.
The Iridium Experience - Performance Monitors
My job on the Iridium project was to help the systems engineers
improve system software performance and predictability during
overload. During this development, many hardware/software performance
monitors were implemented. In this article I examine the advantages and
disadvantages of the Iridium tools, in the hope that future projects
can learn from these experiences.
One area where UNIX lacks adequate tools is software performance
analysis. Sure, there are "prof" and "gprof", but these tools only
take statistical samples. If you have an inline function that is a
pig, or if your slow routines disable interrupts, these tools will
never notice. If you have a function whose runtime varies by 1000x,
they will never tell you. These tools are ill-suited to real-time
performance analysis.
Of all the software engineers on earth, real-time software
engineers write the most negative cycles. Thus, it is
surprising that most real-time kernels have little or no support for
performance testing. About the only kernel with good performance
tools is VxWorks, in my experience. I guess the kernel vendors assume
that most engineers have a good logic analyzer nearby. While a logic
analyzer is critical in debugging kernel faults or MMU problems, most
of them support software performance analysis only at the assembly
level. In other words, there are no tools for finding slow C++
functions to optimize.
The DAS Monitor
Tektronix makes a modular $50,000 logic analyzer called a DAS. These
boxes monitor 64 to 512 wires at the same time. The probes can trigger
on a wide variety of boolean expressions, and up to 256k samples can
be stored. There are two huge problems with the DAS. First, at
$50,000, you aren't going to buy them for all 100 engineers on your
project. And, if your code runs on UNIX, the DAS won't work there.
Second, the DAS is controlled by a clumsy, slow, and needless
X-windows GUI. Tektronix really should implement a system like the
VxWorks one: low-level scriptable UNIX commands, plus high-level
Tcl/Tk GUIs that can be reprogrammed. Tek should halt GUI development
until the scriptable commands are perfected. GUIs on software
tools are really just decelerators for expert programmers.
The Iridium systems engineers chose a well-known memory address to
dedicate to DAS monitoring. A macro was written to write code words
to this location. Then, when the DAS was running, the software would
look like

    DAS_WRITE(xyz_class_function_start);
    xyz_class_function(arg1, arg2, arg3);
    DAS_WRITE(xyz_class_function_end);
and the DAS was programmed to record (bus write, time sample) tuples.
The cost of taking a sample was about 4 instructions. This allowed us
to calculate the cost of running xyz_class_function(). The DAS
codewords were stored as large enum lists in a C++ class, one class per
subsystem. There were several problems with this setup.
- Subsystems had to manage the codeword space so as not to overlap.
- It was error-prone (taking 5-10 min) to start the DAS. I placed
10 "NULL" codeword writes in my code, and stepped them with the
debugger. This gave me 10 chances to get the DAS running. It would
often take all 10 tries to get the DAS properly configured.
- At one point we changed the memory map and all these macros began
smashing instruction memory. We got lucky later - enabling the MMU
turned the stray writes into memory faults, and we updated the
DAS_WRITE macro.
- The 256k buffer was too small; my 4kloc of test code for 3kloc of
source code would overflow it. The DAS could not transfer the
buffer to disk or broadcast data packets on Ethernet.
- The codeword writes had to be added manually to the code, and this
was error-prone. Coverage and unmatched-pair problems were
common.
- Monitoring would not work with the 603's caches turned on. We could
turn off caching entirely, but this changed the behavior of the whole
system. What we finally did was disable caching for the single MMU
page holding the monitor address - a special PowerPC feature that
can't be done on most processors.
The DAS log was pretty meaningless by itself - it contained a series
of timestamps and codewords. A PERL tool was written to substitute
symbolic names for the codewords, by parsing the C++ class
enumerations to reverse-map them. The PERL software had
several problems of its own.
- Interrupts and/or task preemption tended to ruin samples. On
small synthetic benchmarks you could turn off interrupts, but for
larger benchmarks this was never fixed.
- Not all DAS writes correlated in pairs. For example, some
subsystems had different _end codewords depending on how a procedure
terminated. Sometimes a C switch() statement had one codeword for the
switch and one for each switch arm.
- Inline functions as fast as 1 microsecond were measured; at that
scale you must cancel out the sampling overhead. A special calibration
routine was written to measure the cost of DAS_WRITE(), and this was
fed into the PERL calculator. After cancellation, a very short routine
would sometimes appear with a zero or negative runtime.
- There was a constant battle to keep the gcc compiler from
optimizing away the DAS_WRITE() functions. Finally we changed
DAS_WRITE() to make the well-known address volatile, and this fixed
the problem.
- When measuring fast inline functions, it was important to run them
10 times to smooth out cache effects. However, the compiler would
optimize away 9 of the 10 calls if we did not use every result.
- Special system events such as "turning on icaches" and "turning on
dcaches" were also written as DAS codes. Otherwise, it was hard to
know why there were times when a routine ran 3x faster than
normal.
Clearly, the DAS and gcc made this technique cumbersome, and the PERL
script was never completed to our satisfaction.
Since the Iridium system runs on both the PowerPC processor and on
UNIX, I wrote a UNIX software DAS using the gettimeofday() function.
This DAS used a giant lookup table of struct timeval (indexed by
codeword) to store time samples. So, code under DAS analysis would
macro-expand as follows:

    {
        gettimeofday(&das_array[ ARD_MALLOC_START ], NULL);
        // do some work to allocate memory (non-blocking)
        gettimeofday(&das_array[ ARD_MALLOC_END ], NULL);
    }
The _START and _END enum codewords were set 500 indices apart, so that
they could be correlated on-line. A DAS_Print() function was used to
subtract codewords and print the most recent runtime deltas for
functions on UNIX. This allowed us to tune primitives under UNIX,
using the much better tool set, before porting over to pSOS. I could
throw away samples damaged by context switches by eye. This
UNIX performance analyzer helped us tune the code several
times.
In general, these performance analyzers were useful. In a few
cases we found software bugs when a routine used an unexpected amount
of time. Some routines that were supposed to be deterministic and
fast were not, and this pointed out bugs in our implementation.
(C) 1999, Donald W. Gillies. All Rights Reserved.