Watching this documentary on (building) architecture, I was fascinated (and amused) by the parallels with software engineering/construction of the pre-Internet era.
It turns out (which shouldn't be a surprise) that many of the great buildings -- those buildings that receive much press recognition for their design -- have ignored issues of maintenance in their design, and subsequently caused massive headaches and costs for the eventual occupiers and property managers. It also turns out that the architects never return to visit their "masterpieces" but move on to new things, never looking back, and never learning from actual use where their errors were. Sounds a little familiar, right?
This is a problem that plagued the software development (I'm avoiding calling it "software engineering") industry for decades and is only now giving way in the age of Internet-based application development and incremental development methodologies like Agile Programming.
In the Grand Olde Days, the "A" team of more senior (seasoned, not necessarily better) architects and developers would create a specification (system/software architecture) and hand it down from on-high to the waiting programming teams. These teams would code and deliver the first release of the software, immediately moving on to another important project that required their expertise. Maintenance would be performed by a "B" team of lesser people, who were unfamiliar with the original code base and had to reverse engineer it in order to deliver bug fixes and feature requests, and to keep the code base working as the production computing environment around the application changed. Of course, the lessons from these maintenance activities never reached the "A" teams, because they never had to maintain their code bases and never even saw the consequences of their poor decision making during the design and coding phases of the project.
This formal, unidirectional waterfall model (described in Peopleware as the "'Big M' methodology" approach) reached its apex in the late 80s, when a reductionist view of software engineering was conflated with bad management theories that refused to acknowledge the well documented [Facts and Fallacies in Software Engineering, number 2] 28-fold difference between poor and great programmers. Software development was governed by formal methodologies (essentially multi-thousand page procedure manuals) which worked on the basis of all programmers being of equal productivity and zero creativity (and, hence, readily interchangeable and replaceable, just the way HR departments like it).
It was only a matter of time before this would give way under its own weight, and the pendulum began to swing back towards less formal methodologies around 1999 with Extreme Programming. These approaches recognised the reality that more than 60% of the system life-cycle is spent in maintenance [Facts and Fallacies, Rule 41]; that the environment into which a system is deployed is fluid across its production life, and not at all like it was at the time when design began; and that even if the above were not true, the desired functionality of the system is equally fluid, governed both by the dynamic nature of the eventual marketplace and actual experience with whatever system is finally delivered.
This more organic, iterative software development model was well suited to the Internet, where the software release costs are almost zero (no media production or distribution costs), the time-to-market more critical, and the ability to release and tune features based upon usage information greatly enhanced. Also, web startups used small software teams, where there would be less reason (or cost justification) to have separate primary build and maintenance teams. The perfect storm had arrived, and as primary software teams began to maintain their own code, and learn about dynamic production environments, a new maturity began to emerge.
Alas, because this new maturity was emerging from developers (with little or no sysadmin experience), and was emerging at each Internet start-up independently, this was going to be the beginning of another long journey.
When I first wrote [Production Ready Software, 1999 (pdf)] and spoke [SAGE-AU 2001 (Adelaide), LISA 2001, OLS 2003, Release Engineering (USENIX 2005)] about these problems and the swinging methodology pendulum, I realised that whilst individual programmers were starting to gain insight into operational issues, there were two problems: firstly, each programmer was doing this individually, and thus there was lots of duplication and solutions were bespoke and rarely shared beyond an IT group; and secondly, there were actually three layers of operational controls (aka serviceability criteria) that needed to be implemented, exposed and documented for the system administrators: (1) application-level maintenance controls; (2) site-level maintenance controls, which should be common amongst all applications for a given site; and (3) industry-level controls, which should be present on all applications from all vendors, and consistent across all applications in their names and usage.
Operational Controls cover things like: installation and customisation instructions; daily production management activities (availability management, user management, log management, data management); and exception management (error conditions, failure modes, diagnostic tools). Additionally, applications must expose rather than mandate things such as filesystem locations such that they can be made to conform to local filesystem and naming conventions.
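As a rough illustration of "expose rather than mandate", here is a minimal Python sketch of an application that ships with default filesystem locations but lets the site override each one. The application name `myapp`, the `MYAPP_` environment variable prefix, the default paths and the `resolve_locations` helper are all invented for this example; they are assumptions, not part of any real product or standard.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Locations:
    """Filesystem locations the application exposes to the sysadmin."""
    config_dir: Path
    log_dir: Path
    data_dir: Path


# Hypothetical defaults and variable prefix, invented for this illustration.
DEFAULTS = {
    "config_dir": "/etc/myapp",
    "log_dir": "/var/log/myapp",
    "data_dir": "/var/lib/myapp",
}
ENV_PREFIX = "MYAPP_"


def resolve_locations(environ=None):
    """Return each location, preferring a site-supplied override to the default."""
    env = os.environ if environ is None else environ
    values = {}
    for name, default in DEFAULTS.items():
        # e.g. "log_dir" is overridden by the variable MYAPP_LOG_DIR
        values[name] = Path(env.get(ENV_PREFIX + name.upper(), default))
    return Locations(**values)
```

A site that keeps its logs somewhere else would set `MYAPP_LOG_DIR` before starting the application, and every other location would stay at its default. The point is only that the application never hard-codes a path the sysadmin cannot change.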
Wouldn't it be nice if these controls were standardised across all products, so that sysadmins only had to learn one "maintenance language"?
In future essays I'll go into details on the various controls that need to be exposed.
Oh, and it was interesting to see that one of the architects interviewed was Christopher Alexander.