Chapter 5: Non-Stop Change

2015-09-08 :: Live Upgrade, Upgrade, Types, Schema

Ngnghm, or Ann for her human friends, decided that while stranded among us Humans, she would conduct an ethnographical study of Human computer systems, and took to heart to examining my programming habits. In return, I was more and more curious of how Houyhnhnm (pronounced “Hunam”) systems worked, or failed to work. That’s when, trying to imagine what the Houyhnhnm computing systems might keel over, with their making everything persistent, I had this a-ha moment: surely, they must have extreme trouble with live upgrade of their data schema, and their programmers must spend their time in hell trying to reconcile modifications in what amounts to an unrestricted distributed database, that anyone can modify at any time. Ann wasn’t sure what I was talking about that could be a major issue, and so interrogated me as to the Human practices with respect to handling change in persistent data. And she found that many of the issues stemmed from limitations with how Humans approached them; these issues were not intrinsic with the problem of ensuring persistence of data, and all but disappeared if you considered code change transactions as first-class objects.

The Best Laid Schemas of Houyhnhnms and Men

I challenged Ann to explain to me how Houyhnhnm systems dealt with code and data upgrade, because they were some of the hardest problems I had faced while working with Human computer systems. Ann wasn’t sure what I meant, or why whatever she suspected I meant would be particularly difficult. That surprised me, so I started by stating that any software that survives long enough has to change its code as the software became obsolete, in a phenomenon humorously known as bitrot: as time goes by, the software becomes no good anymore, as if it were being degraded by some kind of rot; of course, and that’s the joke, no external force can possibly cause the bits that compose the software to change and degrade — on the contrary, the problem is precisely that the software fails to change when the world around does change and requires the software to adapt so as to remain relevant. Ann told me that the analogy makes sense to Houyhnhnms but isn’t as humorous: it isn’t the software, but the relationship of the software with the world, that rots; the degradation thus applies not to the computer system, but to the computing system, as constituted not just by the computer, but also by the sentient programmers and users, and by the larger context in which the computation happens. To a Human focused on the software artifact, it may be a funny paradox; but to a Houyhnhnm focused on the softwaring process, it’s a painful truism.

I was intrigued, but I went on. As a program evolves, the way that it organizes data, its schema, will eventually be found wanting, as well designed as it might once have been for its original purpose. New data types, new file formats, new database schemas, will be needed, as the old one become insufficient and inadapted to newly required kinds of computations. Now, the correspondances between old and new data is often non-trivial, and upgrading data from the old format to the new format is hard; what more, upgrading all the data, consistently, without any race condition between old and new data processors, was even harder. And if the upgrade has to happen without any down time from the system as it keeps processing data, then achieving it without a hitch becomes a feat of virtuosity. Finally, bugs in such an upgrade process are both all too easy to introduce and all too hard to recover from. Therefore, Human computer systems shun these schema upgrades as much as possible, and some companies pay people full time just to manage the writing and/or smooth running of upgrade scripts that keep their data up to date, and to maintain the expensive database servers capable of doing it all dependably. However, humans are lucky enough to be able to avoid these difficult data upgrades altogether in most cases, precisely because Humans were not persisting data but throwing it away, so it didn’t need to be upgraded. When they really have to deal with the hard case, the best tools available to regular Humans, like protocol buffers or its successor Cap’n’Proto, are not very expressive in terms of types and leave a whole lot to the programmers in terms of managing upgrades; but at least they provide a framework for data to survive and be upgraded beyond the short life of a program that by necessity can only handle one version of the schema. And yet even these tools are only so good for their cost, and most people just do upgrades the hard way: either with completely disjoint old and new schemas, duplicated code, and big ad hoc functions to translate between them; or with ad hoc sharing of code and data between the schemas, a complexity that increases with each new version, and plenty of corner cases that are never considered, much less tested.

Surely, by persisting everything all the time, Houyhnhnm computing systems were forced to deal with this hard issue all the time, which of necessity must have made all programming difficult and tedious. Yet Ann claimed that she didn’t know data upgrade to be particularly tedious on Houyhnhnm systems.

Afterthought or Forethought

First, she said, the case that is easy in Human computer systems, namely eschewing any upgrade and dropping the old data is just as easy in Houyhnhnm computing systems: just don’t upgrade the data, and instead drop it, as you modify the code. If you don’t care about data, then by definition you don’t care whatever automation the system provides to upgrade it for you, and don’t care to fix it; if you know that for sure in advance, you can even tell the system, so it won’t bother saving the data, and you might get a nice speed up for it, and a reduced monthly bill for your globally replicated persistent data. However, in the cases the data actually matters and you do care about it, that’s when you’ll miss system support for data persistence, and that’s also when you’ll miss system support for upgrading your data.

Moreover, even in this easy case, since Houyhnhnms, as you may remember, save code and data together, you can simply fork the old data together with the correct version of the code that knows how to create and manipulate it, and keep using it while you work on a new version. “Orphaned data”, “version mismatch”, “DLL hell”, are issues that might sometimes delay upgrade, and cause a lot of grief to Houyhnhnm programmers indeed, but they never can prevent reusing current or old data — computation itself is immortal, even if its relationship to the world can suffer. And Houyhnhnms just don’t know the catastrophic failure modes of Human computer systems, such as a system becoming unusable due to failure in the middle of an upgrade, what more with missing or incomplete backups so you cannot downgrade, cannot upgrade, and must reinstall and reconstitute all local configuration from scratch.

Second, since the “canonical” representation of data is not low-level bytes, but high-level data types, a whole lot of the extrinsic complexity that Human computer systems have to deal with can instead be automated away: When the schema change is merely a change in low-level representation with identical high-level data types, then a trivial strategy is always available for upgrade — decompile and recompile. Moreover, the system can automatically track which object currently uses which underlying representation; it can therefore manage the upgrade without sentient oversight. The sentient programmers can then focus on the actual intrinsic issues of schema upgrade: non-trivial data transformations into semantically non-equivalent data types. And even then, the system provides a framework that automates a lot.

Third, and more importantly, upgrade is inevitable indeed; but the problem with Human computer systems is that since they both focus on software as a finished artifact and define things at a low-level, they drop all the data that matters about upgrade when it’s readily available, only to desperately try to reconstitute it the hard way after it’s too late. Upgrade automation is thus almost inexistent in Human computer systems, because it comes as an afterthought. By contrast, it comes naturally in Houyhnhnm computing systems, because it is part and parcel of how they conceive software development.

Change Comes from the Inside

To Humans, change happens outside the computer system that is changed: A program is an immutable object, especially so after it was invoked to start a process (a limited virtualized computer system). To change a program thus requires shutting down the processes started with the old program and starting new processes with the new program, although doing it naively can cause an interruption of service.

As for changing a program’s data types and having to preserve data, that to Humans is an exceptional situation to be dealt with using exceptional tools. Such a catastrophic event happens every so many months or years, when releasing a new “major” version of the program.

To Houyhnhnms, change happens inside the computing system that changes, because everything relevant is ipso facto inside the system, part of its ontology. What more, to Houyhnhnms, change to programs is what programming verily is: To program is to change programs, and to change programs is to program. That’s a tautological identity. Inasmuch as types are part and parcel of a program, then changing a program’s data types is part and parcel of programming and is supported by the system as a matter of course, including upgrading any existing data to use the modified types.

To a Houyhnhnm, the idea that change could happen outside the system is absurd on its face, and the notion of a programming language that would only be concerned with describing programs that never change is obviously lacking. Of course, change processes may or may not be automated — but as per the Sacred Motto of the Guild of Houyhnhnm Programmers, whatever parts of them can be automated, should be automated, must be automated, will be automated, until eventually they are automated.

Once again, Humans focus on programs as finished entities with fixed semantics, whereas Houyhnhnms think in terms of systems made of ongoing interactions. And once again, this fundamental difference in paradigm implies fundamental differences in the most basic structures of computing systems, including programming languages.

Human Languages that Support Live Upgrade

However, there are two notable exceptions amongst Human programming languages, whereby the language does — to some degree — support live upgrade, that is, changes to programs and data types in an actively running program, from within the language: Common Lisp, and Erlang.

Common Lisp lets you redefine functions, classes, types, etc., in an existing, running image, and immediately thereafter use the modified program. There are however a lot of limitations and a lot of pain in doing so, especially if the program has to be actively running while the modifications take place. For instance, you may have to declare your functions notinline to ensure the latest version is always used, or to shadow their symbol to ensure the version at time of definition is always used. Before you redefine a class, you can define methods on update-instance-for-redefined-class to ensure that all data will be properly upgraded after the class redefinition; this is somewhat low-level, there is no automated enforcement of consistency between such methods and the types, and you better keep supporting older versions of the class, because you can’t be sure that some older instances still haven’t been upgraded; but it’s there and it works. There are many limitations however to this hot upgrade support; notably, and depending on which Common Lisp implementation you use, there are many kinds of code modifications that will not be safe to issue if multiple threads are active at the same time (particularly changes to the class hierarchy); so you will have to somehow synchronize those threads before you upgrade your code. The big problem with Common Lisp is that its model of side-effects to a single global world doesn’t play well with a modern concurrent and distributed world, where a complete system is made of many processes running on many machines, or where you want to have multiple versions of the software running at the same time. Lisp hasn’t been actively developed as a self-contained system for three decades, and what remnant support it has for thinking in terms of system is a partial subset of whatever antiquated solutions existed when it was standardized. Yet that it supports at all the upgrade of code and data without resorting to magic from outside the language makes it miles ahead of all known Human languages. Except Erlang.

Erlang’s heritage and ambitions are very different from those of Lisp, and in many ways it is much more primitive — but as an upside its primitives are better defined. Through a reinvention decades later, Erlang embodies the Actor model of the early 1970s, which is what the “Object Oriented” utterly failed to be despite its own hype: a programming model where many well-encapsulated entities — called “processes” in Erlang rather than “objects” because they are active — communicate with each other by passing messages. Processes can be distributed, and reliability is achieved by assuming that individual processes will fail eventually, letting them fail, and restarting them, whereas important data will have been stored and replicated in several other processes that won’t all fail at the same time. Now, one thing where Erlang shines, far better than Lisp and far far better than anything else, is in its support of live upgrade of code and data — a topic that it alone seems to take seriously. When you upgrade code, it is always clearly defined which function calls will go to the old version of the code, and which function calls will go to the new version. And the strict evaluation model in terms of processes and messages makes it possible to reason about what state each process is in, how it will update its state, and how it will handle old or new messages. All in all, Erlang feels like a low-level language, but not a Human low-level language, that tries to track the underlying computer hardware for efficiency, and more like a Houyhnhnm low-level language, that tries to track the underlying computing model for correct and reliable behavior; what makes it less than a high-level language is that it isn’t easy enough to build higher abstractions on top of the provided abstraction level — though it may still be better at it than many alleged “high-level” Human programming languages.

Therefore, there are some precedents in Human computing systems that allude to what Houyhnhnm computing systems would be. But they are quite undeveloped compared to what would be required of a full-fledged Houyhnhnm computing system.

Automating Live Upgrade

So, I asked Ann, what do Houyhnhnm programming languages look like when it comes to supporting live upgrade of code and data in a running program?

First, Houyhnhnm programming languages have a notion of transactions for code and data upgrade, that can be composed and applied atomically, so that you can coherently upgrade a system from one version to another, without having to wonder what happens if some code is running right in between when two mutually dependent entities are being redefined. Of course, while developing, Houyhnhnm programmers will want to experiment and explore with upgrades of some part of the code or data only, that lack overall system coherence; such exploratory modifications are doomed to fail, but that’s alright, because until it is are properly vetted, each change is always run in its own virtual fork of the system, where failure is quite acceptable.

Houyhnhnm type systems track how incoherent the system is, and maintain for the programmer a to-do list for those parts of the system that haven’t been updated to follow the new types yet. At runtime, incoherent calls are intercepted before they may break the system, and the debugger offers the programmer a chance to give a new function definition before to enter (or re-enter) the function that fails — among Human programming languages, good Lisp implementations provide the latter (though they can’t undo the side-effects of bad functions), whereas good statically typed languages provide a bit of the former (though they don’t maintain to-do lists and thus gloss over functions that fail to break the type system e.g. because they have an otherwise clause that implicitly covers a newly added case when a correct code update ought to explicitly handle the new case, of at least requires the programmer to vet that the otherwise clause indeed applies to the new case).

Houyhnhnm systems, since they remember the history of type modifications, require every type modification to be accompanied by a well-typed upgrade function, taking an object in the old type and returning an object in the new type. Simple upgrade functions can be trivially expressed, such as using default values for new fields, or erroring out. The system also uses linear logic to ensure that when writing an upgrade operator, you must explicitly drop any data that you don’t care about anymore, so you can’t lose information by mistake or omission (old versions of the data will persist by default, as for all data in a Houyhnhnm system, but upgrade mistakes will still make further fixes harder, so it is very valuable that Houyhnhnm systems support writing correct upgrade procedures easily). Because the same function doesn’t necessarily apply to all data of a given type, upgrade functions can be overridden at the granularity of individual variables, and they may take some contextual information as extra parameters.

The system also tracks the difference between on the one hand renaming some variable, and on the other hand deleting some variable and adding another: though the two lead to identical source code from the limited Human point of view, they imply very different upgrade behavior from the wider Houyhnhnm point of view, and are represented differently in a Houyhnhm computing system. Branching and merging are similarly supported. These are very easy to track in an interactive development environment, but hard to reconstitute, indeed impossible to reliably reconstitute, just by looking at the code before and after the modifications, when the language doesn’t allow to explicitly express the difference.

Houyhnhnm systems allow you to edit the effective history that applies to software releases; this history is made of coherent changes that describe how to upgrade from previous released versions to the latest one while preserving all system invariants. It can be contrasted with the actual history through which the result was attained, which is often made of incoherent changes, of backing from blind alleys and of downright mistakes. By upgrading data along the effective history rather than actual history, data corruption can be avoided, and bugs introduced in the upgrade process can themselves be fixed. Now, since Houyhnhnm computing systems accommodate for situations with many distributed agents each with their own version of the code and data and their history, the modifying, branching and merging of objects also apply to these histories of code and data changes. Histories themselves constitute a monotonic algebra where it is always possible to merge histories; unhappily, merge conflicts may cause the result to be incoherent according to the criteria that the users care about; however the system can also check the coherence of the results and reject incoherent transactions, and/or require manual intervention by sentient programmers to fix whatever incoherence it tracked. Of course, this can lead to arbitrarily complex situations when merging histories in a distributed system; but in practice, there is a finite amount of data that needs to be upgraded that people actually care about, so there is a cap on how much complexity Houyhnhnm programmers need to deal with. Moreover, since Houyhnhnm computing systems, unlike Human computer systems, do track where all that data is, Houyhnhnm programmers can be fairly confident that they did (or didn’t) complete the upgrade of all relevant data, at which point no further upgrade effort is required.

Now, none of this negates the interest of having a subset of the programming language that allows to express immutable programs and reason about them. Immutability is indeed a very important property that Houyhnhnm programmers can appreciate as well as Human functional programmers. But when programs do mutate, and they do, by the very virtue of programming itself, Houyhnhnm programmers much prefer being able to reason about these changes from within the system, with all the automation that the system makes possible, including automated refactoring, detection and tracking of incoherence, formal reasoning, etc., where instead Humans have to revert to completely informal reasoning on pen and paper, and must indulge in expensive manual drudgery which robs them from being able to focus on the difficult parts that really require their attention, drowning the important details in an ocean of boring repetitive tasks. Houyhnhnm programming languages thus allow to express immutable programs written in suitably restricted programming languages that make it easy to reason about them; and simultaneously they can express meta-level programs without these expressive restrictions (but with different restrictions) that can describe changes in the former programs, from outside their universe but still within the automated part of the overall computing system, also allowing for reasoning.

In conclusion, as compared to Human computer systems, Houyhnhnm computing systems handle all the easy cases of live upgrade, and automatically ensure coherence for the hard cases, where Humans instead rely on large bureaucracies of database administrators and upgrade experts to manually manage live upgrade of code and data through excruciating slow and still error-prone unautomated processes. And this, it bears repeating, is but a necessary consequence of the simple difference of point of view, of “philosophy”, between Human computer systems and Houyhnhnm computing systems.