Article - April 18, 2003.


The Clearinghouse: Anti-Entropy and Infection Debacle

1. Introduction

Grapevine was one of the earliest distributed email systems implemented anywhere. It was written for the Xerox Alto computer at the Xerox Palo Alto Research Center (Xerox PARC) in the mid-to-late 1970s [1]. Grapevine was more than an email system: it also included a hierarchical name lookup service and distribution list management. It stored this data in a replicated database distributed across America, Europe (Rank Xerox), and Japan (Fuji Xerox). Grapevine was perhaps the very first computer system to perform layer-4 routing for replication in a network.

An interesting thing about Grapevine (and the later XNS product, The Clearinghouse, which was based upon it) is that there was no "primary site" in the database update mechanism. Administrators at various sites worldwide could therefore make conflicting changes to the database, and the update propagation system (i.e. the mail delivery system) would cause the two updates to "fight" for permanence in the system. The one with the latest update timestamp was supposed to win, but for various reasons it might lose (more on this later). The sketch below shows this rule in miniature.
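
Here is a minimal sketch, in Python, of the timestamp-based last-writer-wins rule described above. The record layout and field names are my own illustrative assumptions, not the actual MESA data structures.

    from dataclasses import dataclass

    @dataclass
    class Record:
        name: str        # e.g. "gillies:osbu_north:xerox"
        attrs: dict      # attribute name -> attribute value
        timestamp: int   # update timestamp assigned by the originating site

    def merge(local: Record, incoming: Record) -> Record:
        # With no primary site, every replica applies this same rule.
        # Two conflicting administrator updates "fight" until the one
        # with the latest timestamp has reached every replica; if the
        # transport loses the later update, the earlier one wins by
        # default.
        return incoming if incoming.timestamp > local.timestamp else local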

Grapevine suffered from at least three problems. First, the 2-level naming system (e.g. "redell.pa") was too flat. Second, each user had a primary and a secondary mailbox; this machinery was overkill, because the 8010 and other Xerox D-machine hardware was much more reliable than the Alto hardware that prompted the feature. Third, there was a desire to manage all network resources - not just people, distribution lists, and mail server names - in a central directory, and the integrated design of Grapevine did not provide enough support for the types of queries that customers wanted to perform.

The solution was a new system called the XNS Clearinghouse. I worked in the XNS Distributed Systems Group at the Xerox Office Systems Division (OSD) in Palo Alto from 1984 to 1986. I worked primarily on the XNS Mailing System; however, the three services (Mail, Clearinghouse, Authentication) were hosted together on one system and sold as a single product.

The mail and authentication systems worked reasonably well, but The Clearinghouse was a trouble area, and it is therefore the subject of this article. In fact, The Clearinghouse was in trouble almost continually from 1984 to 1987, and external researchers from PARC were called in to fix the problem. These external researchers made some changes, lightened the protocols, and declared that the problem was solved. I believe that the roots of this problem were never fully addressed, and so in this article I will discuss the problem and set the stage for a general solution which I will present in a later article.

2. What is a Clearinghouse?

2.1 Clearinghouse Purpose

A Clearinghouse is a name lookup server, like an Internet DNS server. Xerox envisioned that a 3-level name system was all that was required for a global name system. They were correct to within +/- 1 level, but Xerox blew it in estimating the semantics of the three levels. This mistake was a lucky thing, because Xerox had a tremendous influence over the OSI protocols, and had they guessed correctly, we might all be using OSI protocols right now (*snicker*). Xerox names appeared as follows:

     Object:Org_Unit:Corporation

An object was something like a user name, a "group", or a machine. An object was nothing but a name in The Clearinghouse to which you could attach attributes. So a machine object could have the print server attribute, the file server attribute, the mail server attribute, etc. Groups were just generic objects that had a "group-member" attribute. My Xerox email was addressed to "Donald W. Gillies:osbu_north:xerox". My name "Donald W. Gillies" was an object in the organizational unit "osbu_north" (Office Systems Business Unit / Northern California), within the organization "xerox". There was a second name in The Clearinghouse - an alias - "gillies:osbu_north:xerox", and this alias pointed to my fully qualified name record. My attributes were things like a password, a home file server, and a mailbox server name. In theory, my name could also have served as a print server or a distribution group if additional attributes had been added to it. This sort of structure could exist in the database, but the admin interface prevented users from creating "hydra" names of this type. A sketch of the schema appears below.
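
To make the schema concrete, here is a minimal Python sketch of a few Clearinghouse records. The attribute names are illustrative assumptions on my part, not the actual XNS attribute identifiers.

    clearinghouse_db = {
        # A fully qualified name: Object:Org_Unit:Corporation.
        "Donald W. Gillies:osbu_north:xerox": {
            "password":    "<secret>",
            "file-server": "FS1:osbu_north:xerox",
            "mailbox":     "MS1:osbu_north:xerox",
        },
        # An alias is just another object whose attribute points back
        # at the fully qualified name record.
        "gillies:osbu_north:xerox": {
            "alias-for": "Donald W. Gillies:osbu_north:xerox",
        },
        # A group is a generic object with a "group-member" attribute.
        "mesa-hackers:osbu_north:xerox": {
            "group-member": ["Donald W. Gillies:osbu_north:xerox"],
        },
    }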

2.2 Clearinghouse System

For every few hundred users and machines you needed a Clearinghouse. The software system was about 25,000 lines of MESA code, and the Pilot OS was another 25,000 lines. This software ran on a 16-bit processor with virtual memory delivering about 0.5 VAX MIPS; the STAR 8010 was about as fast as a VAX-11/750 (the washing-machine-sized model) from Digital Equipment Corp. To get my estimate I ran some for(;;) loops on both machines, because the STAR 8010 did not have a C compiler and I did not want to translate Dhrystone 1.0 into the MESA language.

If you as a customer bought a Clearinghouse, you got a STAR 8010 with 384 KB of memory, a 20, 40, or 80 MB 8" disk, Ethernet, CPU, and a serial console display. You had to have an authentication system to be able to log into a Xerox network and access data in The Clearinghouse. Almost all RPCs in an XNS network relied upon the authentication system.

Logging in to an XNS system was very similar to logging in to a Novell NetWare network; however, the Novell system is completely different internally and much simpler to build than the XNS system. The Clearinghouse software was incredibly complicated and overloaded your network, so Novell improved on the idea by creating a simpler system (SAP - the Service Advertising Protocol) that would overload your network just as much as The Clearinghouse *snicker*, but with fewer lines of code.

The Clearinghouse attempted (unsuccessfully, in fact) to distribute database updates to other Clearinghouses through mail delivery. If you did not pay for a mail server, you got a "degenerate mail server" embedded in The Clearinghouse, with a database of 1500 pages (750 KB), which was used to propagate updates among Clearinghouse sites. Customers did not know about the degenerate mail servers.

3. Problems in The Clearinghouse System

3.1 Four Colors Suffice?

The XNS system was fixed at 3 levels, which is too few for a worldwide distributed system. The Internet has benefited from a fourth top level: ".com", ".edu", ".org", ".net", ".us", ".ca", and the many other country codes of the world. Also, people and machines are not intermingled in the DNS system. The DNS names only machines; mailing groups and users are implemented as one more layer in front of a DNS name, by the mail protocols, mailing-list servers, and many other such systems. Therefore, you can think of the DNS system as having 4-5 levels, if you count the groups/users added in front of typical 3-level DNS names. For example, gillies@people.ece.ubc.ca is a 5-level name.

3.2 Names, Addresses and Routes

The organization chart at Xerox rearranged itself more frequently than a cheap plastic mobile. Every six months our entire org chart would change, and all the department names would change, all the way up to the CEO. I started out working for the Systems Development Department (SDD) of OSD. By the time I finished, I worked for ISD, and SDD/OSD had long ceased to exist. During this time my domain remained osbu_north, much to the consternation of the executives and the people who had selected that name.

Xerox had made the twin mistakes of confusing names (which should be meaningless words) with acronyms for what we did, and of "mixing things up" by placing locations in the names. It's lucky that they didn't use "osbu_pa" for our Palo Alto location, which would have followed the Grapevine convention, because within 2 years we had moved from the Stanford Business Park to the Palo Alto Golf Course to the burned-out warehouse of a startup Xerox had killed (Daisy Printer Systems) in Sunnyvale. If we had started with "sdd_pa" we would have changed names to "isd_pa" and then to "isd_sunnyvale" for no meaningful reason, much to the confusion of everyone.

3.3 Lies, Damned Lies, and Clearinghouse

The Clearinghouse is described in an 80-page document from OSD that is rather long-winded and contains platitudes about how great the phone system is and what a great thing it is to have yellow pages in a network [2]. Don't believe much of anything in this paper. The only thing that is correct and accurate in the document is the naming schema. In particular, the following mistakes appear in the document.

3.3.1. The Bell System Yellow Pages are Great

Since the mid-1990s, YAHOO! has proved that it is better to have a yellow-pages system that anyone can make entries in. It is much better to have an open database than to rely upon network druids. The AltaVista and Google search sites have shown that it is perhaps better still to just link every computer together and crawl the distributed network to produce an automatic index of everything. By linking this page to "the known web", I guarantee that it will be available online in a few short weeks. And I guess the guys at Bell Labs figured this out when they produced the permuted index to the UNIX man pages way back in 1971.

Xerox was wrong to give tight naming control only to administrators. Even today, at a company as large and administrator-heavy as Qualcomm, you can still get your Qualnet content linked into the web search engine without the help of an administrator.

3.3.2. We have a solution to hierarchical lookup.

Maybe you do, but the solution described in the paper was never implemented. The discovery process for looking up a name consisted of the steps listed below.

1.  8010 user system boots.  It has its station ID in its 48-bit MAC
    address.  It broadcasts on the wire for a router: "what is my
    network?"  As a result, the 8010 system has the address "s_mac/s_net".
2.  8010 user system performs an expanding-ring broadcast.  It can get
    routing tables from the router and send directed broadcasts to
    remote Ethernets.  It uses a special Clearinghouse expanding-ring
    RPC primitive that only Clearinghouse servers respond to.
3.  8010 user system finds a Clearinghouse and is now open for business.

4.  User walks up to the 8010 and requests an FTP to "FS1:osbu_north:Xerox".
5.  8010 sends a request for name resolution to the Clearinghouse.
6.  Clearinghouse looks in its own database for "osbu_north:Xerox:CHS"
    and pulls the "nameservers" group from this object.  It then
    selects one of the names from the group, e.g. "a_name_svr:CHS:CHS".
7.  Clearinghouse looks in its own database for the address of
    "a_name_svr:CHS:CHS"; the answer is something like "a_mac/a_net".
8.  Clearinghouse sends a proxy request to resolve the attribute of
    "FS1:osbu_north:Xerox" to the remote CHS at "a_mac/a_net".
9.  Remote Clearinghouse responds with FS1's address to the local CHS.
10. Local CHS responds to the 8010 client stub with the address.
11. Client stub caches the name/address mapping so that the next
    lookup happens faster.  Then it goes back to step #4.

I believe that this is the correct procedure; it has been about 15 years since I worked on this system. If it is not accurate, then step 6 is replaced by three steps: (a) look up "Xerox:CHS:CHS" and pull out a group of domain servers for Xerox, (b) map one of those domain servers to an address using the ":CHS:CHS" domain, and (c) contact that domain server for Xerox, look up "osbu_north:Xerox:CHS" there to get the name of a host that can execute our desired lookup, e.g. "a_name_svr:CHS:CHS", and map that name to an address using our ":CHS:CHS" domain.

This is the actual solution that was used to build the system. It is interesting to note that this is a proxy system (your local Clearinghouse does all the heavy lifting of looking up remote names); the DNS system is _not_ a proxy system - users need a much more complicated protocol stub to resolve DNS names. Among the clients of the XNS Clearinghouse were the Memorywriter typewriters, which behaved like smart teletypes. You do not want to build a DNS-style protocol stub into every 8-bit Memorywriter typewriter, and this is why the proxy architecture was selected. The proxy lookup is sketched below.
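
Here is a minimal Python sketch of the proxy lookup in steps 4-11 above. The database layout and the stand-in for the network are simplifying assumptions; the real system spoke Courier RPC over XNS.

    import random

    class Clearinghouse:
        def __init__(self, db, network):
            self.db = db            # local replica: name -> attribute dict
            self.network = network  # address -> Clearinghouse (stand-in for XNS RPC)

        def resolve(self, name):
            # Resolve "Object:Domain:Org" on behalf of a client (proxy style).
            _, domain, org = name.split(":")
            if name in self.db:                  # we serve this domain locally
                return self.db[name]["address"]
            # Step 6: find the name servers for the foreign domain.
            servers = self.db[domain + ":" + org + ":CHS"]["nameservers"]
            # Step 7: map one server name to an address via the :CHS:CHS domain.
            server = random.choice(servers)
            addr = self.db[server]["address"]
            # Steps 8-10: forward the request; the remote CHS answers us,
            # and we hand the result straight back to the client.
            return self.network[addr].resolve(name)

The client stub stays trivial: it asks its local Clearinghouse one question and caches the answer, which is exactly what an 8-bit Memorywriter could afford.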

3.3.3. A mail system is all we need for updates.

No. The XNS Mail system was the first system I ever saw get into distributed deadlock. The 1500-page degenerate mail database would fill up, and the database would get into "silly window syndrome" (oscillations near the 1500-page limit). When a target for an email would free up a few pages of space, it would advertise immediately, and all the other Clearinghouses on the network, most of them with clogged databases of their own, would jump on the poor unsuspecting system and try to ram many messages into its database. In many cases, these databases were filled with messages for Clearinghouses that were crashed or partitioned from the network. As a result, a system that should have forwarded 20 messages per second was reduced to forwarding as few as 1 message per day.

This is a common problem at layer 3 (TCP), but it surprised us all, because nobody expected to see the same problem at layer 4. In fact, only our MIT software engineers knew something about control theory and could even understand what caused the problem. Once the system got into distributed deadlock, excessive retries and the silly window syndrome kept it there constantly.
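
The Clearinghouse had no defense against this oscillation. Here is a minimal sketch of the kind of receiver-side silly-window avoidance rule, in the spirit of TCP's fix, that might have damped it; the threshold and the page accounting are my own assumptions.

    DB_PAGES = 1500     # capacity of the degenerate mail database

    def pages_to_advertise(free_pages, threshold=DB_PAGES // 4):
        # Advertise free space only when a substantial chunk is
        # available.  The Clearinghouse advertised as soon as a few
        # pages freed up, so every clogged peer immediately dog-piled
        # on the newly freed space and re-filled it.  Withholding the
        # advertisement until (say) a quarter of the database is free
        # damps the oscillation near the 1500-page limit.
        return free_pages if free_pages >= threshold else 0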

3.4 Pick one transport and make it as reliable as possible.

No. Even the PhDs from PARC can do really stupid things like this, do them twice, get it wrong both times, and still insist it was the right thing to do. A mail system is, fundamentally, an unreliable transport. The early Xerox PhDs worked very hard to create an object commit system, to implement a transaction-transfer protocol so that messages could not be lost during forwarding (however, duplicates were possible in rare crash cases), and to put duplicate-elimination machinery in place at the destination. When this machinery didn't work, they argued that it wasn't important [1]. They argued that a distributed name system was self-healing, because administrators would re-enter the data that was lost.

The XNS internet suffered from at least 4 problems that made the mail transport system unreliable.

3.5 Dead Objects

Objects were deleted by creating a "dead object" and mailing it, using the unreliable transport, to all other servers. The "dead object" would wipe out the live copies of the object in the system and then, after a week or two, disappear from the database, resulting in a storage cost of zero. The sketch below shows the mechanism.
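
Here is a minimal Python sketch of the dead-object (tombstone) scheme and its fatal expiry step. The record layout, field names, and exact lifetime are my own assumptions.

    import time

    TOMBSTONE_LIFETIME = 14 * 24 * 3600   # "a week or two", in seconds

    def apply_update(db, name, incoming):
        # Last-writer-wins merge of a record received over the mail
        # transport.  A "dead object" is just a record whose "dead"
        # flag is set; while it lives, it shadows any live copy of
        # the object that arrives later with an older timestamp.
        current = db.get(name)
        if current is None or incoming["timestamp"] > current["timestamp"]:
            db[name] = incoming

    def expire_tombstones(db, now=None):
        # After a week or two the dead object itself disappears (zero
        # storage cost) -- and with it, the only proof that the name
        # was ever deleted.  A stale copy arriving later from a
        # partitioned or tape-restored replica resurrects the object.
        now = time.time() if now is None else now
        doomed = [n for n, r in db.items()
                  if r.get("dead") and now - r["timestamp"] > TOMBSTONE_LIFETIME]
        for name in doomed:
            del db[name]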

This didn't work, for two or three reasons. First, the network was almost always in some state of partition, and whenever good connectivity was re-established, invariably, some objects that had been deleted would come back to life. Or worse, an administrator would leave a crashed machine disconnected from the network for a week or two, then restart the machine and bring deleted objects back to life. Or worse still, a crashed machine would be restored from tape backup (preferably a backup months or years old), causing armies of deleted objects to rise from the dead.

This was the wrong design. All modern routing systems keep soft-state copies in all routers that die off if they are not refreshed, and they use a "primary site" to determine whether an object is dead or alive. This is how The Clearinghouse should have worked, but the mail update mechanism was thought to be not efficient enough to support refreshes, so naturally the stable system design was ruled out in favor of the unstable one. In fact, the dead-objects problem caused a lot of distributed updates in the very expensive "glue" domain of 150 nodes, so the designers may actually have produced MORE traffic by trying to "save bandwidth" in this area. A sketch of the refresh-based alternative appears below.
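
For contrast, here is a minimal sketch of the refresh-based, primary-site alternative argued for above. The timeout and the API are my own assumptions; the point is that deletion is free, because the primary simply stops refreshing.

    import time

    REFRESH_TIMEOUT = 14 * 24 * 3600   # entries die off if not refreshed

    class SoftStateReplica:
        def __init__(self):
            self.entries = {}   # name -> (attributes, last refresh time)

        def refresh(self, name, attrs):
            # The primary site for the object periodically re-asserts
            # that it is alive, as modern routing protocols do.
            self.entries[name] = (attrs, time.time())

        def expire(self):
            # Unrefreshed entries simply die off everywhere.  No "dead
            # object" has to chase every replica around a partitioned
            # network, and a tape-restored replica cannot resurrect a
            # deleted object for longer than one timeout period.
            now = time.time()
            self.entries = {n: (a, t) for n, (a, t) in self.entries.items()
                            if now - t < REFRESH_TIMEOUT}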

3.6 Simulate, simulate, simulate.

The Clearinghouse was never simulated, and this was a huge failure in systems engineering, especially for such a global distributed system.

4. Solving the problem

4.1. The Anti-Entropy Non-Solution

The implementors detected problems with the "heavyweight" mailing protocols and decided to implement an even more heavyweight protocol (called "anti-entropy") that could reliably and directly synchronize two databases. The anti-entropy process did a side-by-side comparison of the checksums/timestamps associated with each pair of objects and picked the winner (see the sketch below). It took a few minutes to run this protocol across an unloaded network with a small database. A new release of the software was designed to run pairwise anti-entropy every evening at midnight and try to synchronize all the databases served by a particular Clearinghouse.
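
Here is a minimal sketch of pairwise anti-entropy, assuming each replica maps a name to a (timestamp, value) pair; the real protocol compared checksums to skip identical objects, which is elided here.

    def anti_entropy(db_a, db_b):
        # Walk both replicas side by side and make them identical by
        # letting the later timestamp win each object.  This touches
        # every object in both databases, which is why the protocol
        # is so heavyweight.
        for name in set(db_a) | set(db_b):
            a, b = db_a.get(name), db_b.get(name)
            winner = max((rec for rec in (a, b) if rec is not None),
                         key=lambda rec: rec[0])    # rec[0] = timestamp
            db_a[name] = db_b[name] = winner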

This solution added insult to injury. First, the anti-entropy protocol made no attempt to talk to the nearest Clearinghouse, even though such routing information was available in the XNS RIP routers. Therefore, on any given evening, perhaps 75 Clearinghouses would fire up transcontinental connections to Europe and Japan over the 9600-baud overseas links. It didn't help that the IDP protocol (XNS's counterpart to TCP/IP) had an overly simple congestion-control algorithm. As a result, anti-entropy usually exacerbated the problem it was designed to fix.

4.2. Solving the problem right

The end-to-end argument in system design [3] provides a hint as to what was wrong with The Clearinghouse design. The mail system was intended to support all higher-layer protocols, including users and The Clearinghouse. This turned out to be impossible, and it led to slow and inefficient bloatware in the middle of the system. The correct approach was to implement only a lightweight "performance enhancement" for database updates, because the Clearinghouse application would have to reimplement the reliability machinery anyway, end to end, to guarantee end-to-end reliability. This argument was simply lost on most of the people in control of the Clearinghouse system.

4.3. More bad ideas from PARC

Several researchers visited from Xerox PARC when the Office Systems Division team called for help. They proposed an enhancement to the Clearinghouse transport protocol known as an "epidemic algorithm" (also known as "gossiping") to reduce XNS internet overload [4]. This algorithm selected nearby neighbors with higher probability, which reduced the load on the transcontinental links (a sketch appears below). This made things better, but at a fundamental level the system was still bloatware that would not scale to a worldwide internet. The epidemic algorithm, which replaced the anti-entropy algorithm, merely swapped a band-aid for a better, more powerful, lighter-weight band-aid. The original problem was never addressed by the men from PARC. This fiasco is probably one of the reasons why Xerox and the XNS protocol suite never succeeded in the marketplace. The Mailing [5], Authentication, and Clearinghouse design specs were half a decade late because of these problems. Perhaps the main influence this system had on the industry is that it was intentionally not copied by other groups, which was probably a good thing.
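
Here is a minimal sketch of the distance-weighted partner selection at the heart of the epidemic fix. The hop counts and the inverse-square weighting are illustrative assumptions; Demers et al. [4] experimented with several spatial distributions.

    import random

    def pick_gossip_partner(peers):
        # peers: list of (name, hops) pairs taken from the local
        # routing table, hops >= 1.  Choosing a partner with
        # probability proportional to 1/hops**2 means a Clearinghouse
        # usually gossips with a neighbor on its own Ethernet and only
        # rarely across a 9600-baud overseas link.
        weights = [1.0 / (hops ** 2) for _, hops in peers]
        names = [name for name, _ in peers]
        return random.choices(names, weights=weights)[0]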

5. References.

[1] A. D. Birrell, R. Levin, R. M. Needham, and M. D. Schroeder, "Grapevine: An exercise in distributed computing," Communications of the ACM 25(4), April 1982, pp. 260-273.

[2] D. C. Oppen and Y. K. Dalal, "The Clearinghouse: A decentralized agent for locating named objects in a distributed environment," ACM Transactions on Office Information Systems 1(3), July 1983, pp. 230-253. Also Technical Report OPD-T8103, Xerox Office Products Division, October 1981, 111 pages.

[3] J. H. Saltzer, D. P. Reed, and D. D. Clark, "End-to-end arguments in system design," ACM Transactions on Computer Systems 2(4), November 1984, pp. 277-288.

[4] A. Demers et al., "Epidemic algorithms for replicated database maintenance," Proc. 6th ACM Symposium on Principles of Distributed Computing (PODC), 1987, pp. 1-12.

[5] Xerox Corporation (Donald W. Gillies and David D. Redell), "Mailing Protocols (Xerox System Integration Standard)," Xerox Systems Institute Report No. XNSS 148805, Stamford, Connecticut, 1988, 114 pages.