A very brief and extremely selective history of OLPC and collaboration technology, performed entirely from memory

OLPC’s founders wanted to improve education, and in their vision, education requires communication. They envisioned a computer system in which students, and teachers, could easily work together on projects, and share all kinds of documents and media. To describe this vision, they adopted a buzzword: “collaboration”.

Implementing this collaboration required a network. In schools with electricity, that network could be provided by standard wireless networking systems, and in schools with excellent systems support, collaboration could be supported by a server. In schools without electricity, or outside of schools entirely, the laptops would have to talk to each other directly, and so OLPC became perhaps the first adopter of the then-draft IEEE 802.11s standard for “mesh networking”, using a chip sourced from Marvell to implement the required behaviors.

Collaboration also requires a software system to perform communication over the network, and for this, OLPC contracted Collabora, a free software development firm working on a then-new project called Telepathy. Telepathy’s original purpose was to provide an abstraction layer over chat services, like AIM, MSN Messenger, or Google Chat, so that a chat client could work without knowing the details of each system. OLPC contracted Collabora to extend Telepathy’s XMPP (i.e. Jabber) support to arbitrary data channels, not just human-readable text. They called these channels “Tubes”.

Both Marvell’s mesh system and Collabora’s Telepathy software took years to debug. Debugging was especially hindered by the NDAs surrounding the firmware on Marvell’s chip, which prevented volunteer experts from fixing problems or adding features. (Such NDAs have become deeply ingrained in the culture of wireless device manufacturers, not least due to concerns about liability for FCC compliance violations.) Telepathy too proved difficult for outsiders to improve, due in part to its use of specialized technologies like XMPP, and a large, intricate codebase.

When both systems seemed to be approaching a degree of reliability independently, testing began on using them together. OLPC’s engineers quickly discovered that the combined system was extremely fragile, even in somewhat idealized tests. In particular, two major problems were discovered. The first was that Telepathy’s serverless communications component, known as Salut, could not be used simultaneously by more than roughly a dozen users in a room. With more users than this, typical collaborative applications would begin to fail.

After a great deal of discussion and testing by expert engineers, a rough consensus was reached: the failure to support more users could be attributed to the behavior of multicast routing. Salut was written on the assumption that multicast packets are routed efficiently, and it makes deliberate use of multicast as a result. Marvell’s mesh routing algorithms did not provide efficient multicast routing for a large number of nearby users. (More efficient routing algorithms have been the subject of numerous research papers in recent years, but have not yet reached broad implementation.)

With Salut’s high volume of multicast traffic being routed inefficiently, a small number of users could quickly saturate the available wireless network bandwidth. Performance improved if a wireless access point was provided, but most wireless access points use the inefficient “basic rate” for all broadcast and multicast transmissions, which results in a similar saturation of bandwidth, typically seen at around 20 participating users. Salut did appear to work well on wired 100 Mb/s Ethernet, where broadcast is highly efficient and there is a great excess of bandwidth, but this use case was of little interest for OLPC, since its hardware did not have an Ethernet port, and its schools could not afford to install the wiring.
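To make the saturation argument concrete, here is a back-of-the-envelope sketch in Python. Every number in it is an illustrative assumption, not a measurement from OLPC's testing: a basic rate of 1 Mb/s, a unicast rate of 54 Mb/s, and 20 users each sending five 1000-byte multicast packets per second. It also ignores per-frame overhead, contention, and mesh rebroadcasts, all of which make the real situation worse.

```python
def airtime_fraction(users, packets_per_second, packet_bytes, rate_bits_per_second):
    """Fraction of channel airtime consumed by the offered traffic.

    Deliberately crude: payload bits only, with no preambles, headers,
    inter-frame spacing, collisions, or retransmissions.
    """
    offered_bits_per_second = users * packets_per_second * packet_bytes * 8
    return offered_bits_per_second / rate_bits_per_second


if __name__ == "__main__":
    # All figures below are assumptions chosen for illustration.
    BASIC_RATE = 1_000_000      # multicast sent at an assumed 1 Mb/s basic rate
    UNICAST_RATE = 54_000_000   # the same traffic at an assumed 54 Mb/s

    for label, rate in (("basic rate", BASIC_RATE), ("54 Mb/s", UNICAST_RATE)):
        share = airtime_fraction(users=20, packets_per_second=5,
                                 packet_bytes=1000, rate_bits_per_second=rate)
        print(f"{label}: {share:.1%} of airtime")
```

Under these assumptions the multicast traffic alone occupies about 80% of the channel at the basic rate, but only about 1.5% at 54 Mb/s. The exact figures do not matter; the point is that the transmission rate, not the volume of data, dominates the cost.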

A second problem was observed when a server was used to enhance collaboration. When communicating via a server, Salut and multicast are not used, and so network problems are substantially alleviated. However, the only server software recommended by Collabora, ejabberd, proved to have substantial scaling problems of its own. In testing, ejabberd had a tendency to crash when supporting more than about 100 simultaneous users, as might be common in even a small school. While several potential issues were identified and resolved, testing has proven difficult, and recent tests have run into server instabilities again. Debugging is made difficult both by the need to have several hundred active clients, and by ejabberd’s unusual internal structure (it is written in Erlang).

Telepathy also suffered from its immaturity. The developers implemented necessary features as fast as possible, and as a result those features were often not implemented in the most efficient way. For example, in many cases, Telepathy would take compact binary data, expand it using base64 so that it was valid plain text, and then send that text over a zlib-compressed channel, spending a great deal of CPU time in the process. Telepathy’s network transports have since been made much more efficient in many respects, but not until after OLPC began sending laptops to customers. Much efficiency work still remains.
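The base64-then-zlib round trip is easy to reproduce with Python's standard library. The sketch below is not Telepathy's code; the function name and the random test payload are my own, chosen only to illustrate the pattern described above. It compares the size of a payload after base64 encoding, after base64 followed by zlib, and after zlib alone.

```python
import base64
import os
import zlib


def wire_sizes(payload: bytes) -> dict:
    """Return the byte counts a payload would occupy under each encoding."""
    encoded = base64.b64encode(payload)  # 3 binary bytes become 4 text bytes
    return {
        "raw": len(payload),
        "base64": len(encoded),
        "base64_then_zlib": len(zlib.compress(encoded)),
        "zlib_only": len(zlib.compress(payload)),
    }


if __name__ == "__main__":
    # Random bytes stand in for already-compact binary data (an assumption).
    sizes = wire_sizes(os.urandom(30_000))
    for name, size in sizes.items():
        print(f"{name:>18}: {size} bytes")
```

For data that is already compact, base64 inflates the payload by a third, and zlib then spends CPU time mostly undoing that inflation rather than achieving any real compression.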

The final problem that I will mention here is opacity. Telepathy is an abstraction layer; its goal is to hide things from the developer. This can make it difficult to determine what went wrong when things are not working. This is especially true due to the way that Sugar uses Telepathy. Sugar carefully hides all mention of IP addresses, XMPP identifiers, and all other technical matters, behind an extremely simple non-textual user interface, designed to be used by children who do not yet read well.

Sugar still uses Telepathy for collaboration, and its deep integration into every collaborative Activity makes it unlikely to ever be otherwise. However, frustrated with the various issues with the Telepathy stack, some OLPC engineers began searching for a better architecture for networked collaboration. I’ll discuss their proposals in my next post.


3 Responses to A very brief and extremely selective history of OLPC and collaboration technology, performed entirely from memory

  1. Gabriel Eirea says:

    Thank you for this informative post.

    What you are saying is basically that OLPC knew that collaboration over the mesh did not work for a typical number of students in a class, and yet this feature was marketed as one of the great innovations of the XO. Everyone marveled at the “under a tree” networking mode, and teachers were encouraged to apply this technology in the classroom.

    I hope OLPC realizes the huge amount of frustration and class time lost by teachers and students all around Uruguay trying to use collaboration over the mesh to perform some very basic tasks. This is one of the main reasons why the adoption of the XO in the classroom has been very low: this frustration added to teachers’ fears about the new technology, and most of them thought they were doing something wrong. It took many months to finally realize that “the mesh didn’t work well or didn’t work at all”.

    I am extremely disappointed to learn that this was a known issue. So much pain could have been avoided.

  2. D Smith says:

    ejabberd should perform and scale well. I worked with it a couple of years ago and my benchmarks were excellent. A single node should scale well up to several thousand users.

    Make sure your OS limits are reasonable. Specifically, increase your open file limit. On a Linux system, use “ulimit -n” to set a reasonable value.

    The ejabberd community (http://www.ejabberd.im/) is always quite helpful.

  3. D Smith says:

    The other thing you might want to do is to force connection reuse for connections stuck in the TIME_WAIT state. Again, if you are on Linux:

    echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
    echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle

    In any case, if you are still having stability or performance issues, it’s most likely to be an OS limitation. ejabberd is run in many commercial and public sites.
