You don't work in the software industry.

The software industry has been around a lot longer than ours, and it continues to thrive in parallel to ours. There's some overlap, just as the hardware and software industries have some overlap. But it's a lot less than you probably realize.

Not knowing that we're not in the software industry is hurting you every day. For one thing, it's hurting you because you're applying old models and practices to a completely new and different industry, without thinking about them. You take old ideas for granted because you haven't fully grasped how different our industry is. (I include myself in the "you" here.)

But it's also hurting us in that any competitor who does understand that it's a different industry is going to start coding circles around us, to whatever extent they've figured it out.

Our Sister Industry

So what's the software industry, and how do we differ from it?

Well, the software industry is what you learn about in school, and it's what you probably did at your previous company. The software industry produces software that runs on customers' machines — that is, software intended to run on a machine over which you have no control.

So it includes pretty much everything that Microsoft does: Windows and every application you download for it, including your browser.

It also includes everything that runs in the browser, including Flash applications, Java applets, and plug-ins like Adobe's Acrobat Reader. Their deployment model is a little different from the "classic" deployment models, but it's still software that you package up and release to some unknown client box.

It includes almost the entire game industry: every company that makes games for consoles like the PlayStation, Xbox, or Game Boy Advance, every company that produces any kind of single-player game, and every company that produces multi-player LAN-based games like Half-Life. The only exceptions to date in the game industry are companies that host games on servers. This includes (to some extent) services like Blizzard's Battle.net for Diablo, but those are typically just ports of game clients, and can't support that many players.

The only game companies that really have a foothold in your industry are the ones that produce massively multiplayer games like EverQuest or Lineage. And I can tell you this: they're struggling with it, for the same reasons that you are. But they're learning, fast. The ones who don't learn get eaten up by the ones who have figured it out.

The software industry also includes virtually all open-source software, because typically you download it and run it on your own servers. Much of it is geared towards enabling development in your industry, but it's all being produced by folks in the software industry, so it's completely different from what you're doing.

Servware

Our industry is so different from the software industry, and it's so important to draw a clear distinction, that it needs a new name. I'll call it Servware for now, lacking anything better. Hardware, firmware, software, servware. It fits well enough.

Servware is stuff that lives on your own servers. I call it "stuff" advisedly, since it's more than just software; it includes configuration, monitoring systems, data, documentation, and everything else you've got there, all acting in concert to produce some observable user experience on the other side of a network connection.

Servware is such a fundamental leap from software, at least in the traditional sense of "software", that almost none of your ideas and ways of thinking apply to it. Some do, but before you try to decide which ones apply, you should assume that none do, and work your way up.

I'm not going to try to cover all the differences in this little article, in part because I don't know all of them. I don't think anybody does yet. Instead I'll cover a few interesting ones, and reflect on how discarding a few time-honored assumptions gets you a step closer to writing good servware.

Software Lifecycle

Let's start with the obvious difference: we don't build shrink-wrapped software. There are various deployment models in the software industry:

  • You can create an image on a CD ROM or other transportable medium, and distribute that. Example: Microsoft Windows XP on a CD. Or the game "Age of Empires".

  • You can offer a binary software image for people to download from your servware, often in the form of an archive or self-extracting file. Example: the Apache http server. Or the game Nethack.

  • You can burn your software into a ROM or other writeable hardware medium for inclusion in some sort of consumer device or embedded device. Example: your cell phone's operating system and applications. Or the BIOS for your laptop.

  • You can push the software to your users via an automatic updater program. Example: Windows XP service packs via Windows Update. Or updates to your QuickTime plugin for Internet Explorer.

There are other models as well, but the common theme is that the software leaves your servers, alas, never to return. So you'd better have it right before it leaves.

We in the Servware industry, however, host software on our own servers, and that changes so many things that it could be a decade before we figure out how to do it right.

Broken/Incomplete Models

The client-side deployment problem has shaped the software industry. Through trial and error, people have come up with various models for delivering software predictably and repeatably. Those are the models that you use when you think about doing servware development. We all think about it this way.

But they're the wrong models. There may be pieces of them that work for us, but you're best off throwing the whole model away and starting over, or you're likely to take something for granted that doesn't work well for servware.

One (inappropriate) software lifecycle model is the Waterfall. It goes something like this:

  1. gather requirements and write a functional spec
  2. get buy-off on the spec, then write a technical spec
  3. estimate how long all tasks will take
  4. compute dependencies between tasks, throw into a Gantt chart
  5. start coding
  6. give lip-service to iteration while keeping original schedule
  7. finish development
  8. start testing, if time permits
  9. test and fix bugs until regression rate is low enough
  10. burn it onto a "gold" image of some sort
  11. ship it off to manufacturing/production
  12. have a big party
  13. start on version 2, if the product sold well

I've deliberately oversimplified, but you get the idea, and this should all look familiar to you. Many companies today use this exact model, with 2- to 5-year project cycles. Government agencies and big non-technical firms (e.g. the pharmaceutical and hospital industries, public-education systems) still do this, often using consulting firms who specialize in this kind of protracted development. As outdated as it may sound, the Waterfall lifecycle is The Way for many, many companies out there.

The other big, inappropriate model is Extremely Agile Development, into which I lump all other models: Extreme Programming, Agile Development, and their ilk. They differ from the Waterfall in that they acknowledge that the Waterfall totally sucks if you want to get any software delivered. Customers change their minds, and in fact never know what they want until they see it; engineers do a lousy job of estimating tasks and discovering dependencies; project managers are cowed into accepting impossible schedules, and so on.

The Extremely Agile camp claims they deliver software much faster, through a combination of techniques that mostly involves Not Doing Waterfall Stuff, or in any event doing it on 2-week iterations. They actually can deliver software much faster, but it's an open question whether these techniques will ever be accepted by the people writing the checks, because you can't tell them how much it's going to cost, or when it's going to be finished.

At first glance, software and servware development appear to be similar enough that you can pick whichever of these models you like best, discard anything obviously inapplicable (such as burning the software onto a CD), and run with it.

But then you actually try it, and you find that, well, gosh, there are a lot of things to think about that aren't really addressed by either of these models (waterfall or agile extremities).

One problem is that none of the books about software development process really talk much about data: storing it, retrieving it, searching it, updating it, replicating it, caching it, pruning it, and so on. That's a set of problems that are essentially unique to servware. To be sure, there are books that talk about these problems, but we all know the books don't really help much, except as general guides. Our data problems are too hard for the books out there today.

Another problem is that our software is never actually finished. Not ever. It's a garden that we're tending, and we can't ever stop, or it will be overrun with weeds. We're constantly expanding our servware garden. The books don't talk about how to maintain a garden that covers an entire continent. Nobody's really sure how to do that yet. Software gardens can get a lot bigger than real gardens. A lot bigger. You can grow them in the most extremely agile way possible, but you still have to tend them, and that's not documented in any software lifecycle methodology. For servware, you're on your own, at least today.

Documentation

Another problem is documentation. It's pretty much a solved problem in the Software Industry: you write a boring book that nobody reads. Or, if you're an avant-garde company with a savvy Human/Computer Interaction group, you write an exciting, friendly, web-based User Guide that nobody reads.

The game industry is way ahead on this one, because there are actually no guarantees of literacy in their target customer base of 12-year-olds. So games use tutorials, and if you've played any games in the last few years, you'll know that this is sufficient. The tutorials are amazing. They go on for hours. Tutorials are starting to make their way into software, but the only tutorials in the servware industry are the ones in multiplayer games. The rest of the servware industry doesn't even realize it's the servware industry, and has no idea who to write tutorials for, or what form they should take.

Documentation is a great example of a problem for which we in the servware business have inherited a bunch of solutions, none of which actually work. We don't know how to write documentation for software that's never finished, and we don't know where to put it. None of the old models seems to satisfy, because we don't have a software image to bundle the docs with.

Meanwhile, as the ex-Software Industry folks have been puzzling over how to map their old documentation models to the Servware industry, the problem has solved itself, in the form of Wiki, which is a kind of "documentation servware".

Wiki (without making reference to any particular implementation) is utterly different from the documentation models of the Software Industry. All the old models were more or less thrown away. Wikis are the only kind of documentation that has been able to evolve content as fast as the fastest-changing servware systems evolve code. As a result, everyone is making Wikis, even Microsoft.

Amazon's internal documentation is all going to wind up in Wiki; you might as well get used to the idea. Our Wiki implementation is probably the worst in the world, since we threw it together almost five years ago, long before anyone realized that it was where all servware documentation would wind up. But now that we know, we're going to modernize it, in the only way that we know will work: a little at a time, with the help of the people using it. And all documented in Wiki.

Wiki's interesting because it's the way a great deal of software and servware will be built in the future. For both, their usefulness will be a function of the extent to which they can be modified and programmed by end-users.

For software, this manifests itself as user-extensible systems. Some examples:

  1. Scriptable debuggers, where you can write code that knows how to traverse your custom data structures.

  2. Web Browsers, which give you full programmatic access to the document contents (and browser UI) through JavaScript or some other language.

  3. Extensible editors like Eclipse, Emacs and Vim, which allow you to write productivity extensions and little applications in various languages.

  4. Extensible games such as Quake, which provided (in QuakeC) a way to create your own monster AI behaviors, special effects and other enhancements.

  5. Macro languages in apps like Excel and Word, which give you access to the DOM models and internal APIs, giving you more flexibility than any amount of UI.

  6. Programmable Shells like Bash have been around for decades; even DOS has a batch language, albeit a terrible one.

  7. Stored Procedures in databases: it's hard to imagine a useful database that didn't provide something like Oracle's PL/SQL.

The list goes on — in fact, nearly all really useful apps and systems these days allow you to program them. Desktop environments, image editors, math and statistics packages; virtually every application domain is distinguished by one or more extensible systems, which provide open APIs and at least one programming language for manipulating those APIs.

Programmable Servware

I see no reason to believe that Servware is any different from Software and Documentation in needing to be user-extensible. Google, Amazon and Microsoft are all rushing to open their APIs up to end-users via Web Services. But we've all got a long road ahead before we're doing it right.

In order to make a software system programmable, you have to design it for programmability from the ground up. Layering an extensibility engine on top afterwards has never worked very well, because the internals are just a huge, un-splittable wad of code. Imagine trying to add an extension language on top of, say, Nethack.

Instead, you have to do what amounts to a full rewrite, because to do it properly, you have to redesign your app as two layers: a framework (with APIs, a data model, data structures, etc.) and an application built on top of that framework. That's how Eclipse is designed, and how Emacs is designed; it's how Microsoft rewrote their entire Office suite. It's how any reasonably maintainable user-extensible software system has to be written.
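To make the two-layer idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the `Framework` class, `register_command`, and the sample commands are invented for illustration, not taken from Eclipse, Emacs, or Office). The point is only that the built-in application and a user extension go through the same public API.

```python
from typing import Callable, Dict


class Framework:
    """Hypothetical framework layer: owns the data model and the public API."""

    def __init__(self) -> None:
        self.documents: Dict[str, str] = {}  # the (tiny) data model
        self._commands: Dict[str, Callable[["Framework", str], str]] = {}

    def register_command(self, name: str,
                         func: Callable[["Framework", str], str]) -> None:
        # The single extension point, used by built-ins and users alike.
        self._commands[name] = func

    def run(self, name: str, doc_id: str) -> str:
        return self._commands[name](self, doc_id)


# Application layer: built strictly on top of the public API.
def word_count(fw: Framework, doc_id: str) -> str:
    return str(len(fw.documents[doc_id].split()))


def install_builtin_app(fw: Framework) -> None:
    fw.register_command("word-count", word_count)


# A user extension: it needs no access the built-in app did not also need.
def shout(fw: Framework, doc_id: str) -> str:
    return fw.documents[doc_id].upper()
```

If the built-in features can't be written against `register_command`, that's usually a sign the framework layer isn't really a separate layer yet.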

Right now, though, we (and probably Google too) are trying to do it the hard way, by adding in our extensibility layer after the fact. Microsoft didn't even bother; they've been down this road several times before. They've finally figured out how to build a platform, and they're doing it, with the .NET framework.

I know it sucks to think about rewriting our systems; we'd rather not have to. I'll talk more about that later on in this essay.

Servware Requires Metrics

One last "obvious" difference between Servware and Software, before I go on to the more interesting and non-obvious ones, is the problem of measurement.

You'll have noticed, of course, that we seem to do a lot more metrics than they do over in the Software Industry. Over there, metrics just aren't that important, relatively speaking.

For instance, the software industry has data-mining metrics for marketing campaigns, but they don't really know who the hell is using their software, except for people who fill out surveys. Microsoft came up with the clever idea of running spyware that reports usage info back to them, but most software development shops don't have that luxury. So marketing metrics all suck, and people learn to do without them if need be.

They also have QA metrics for bug regression rates, so they can tell when the software appears to be stable enough to send out to customers. Gulp. But this is, again, a sort of special-case thing that you could in theory live without, and lots of companies do.

In Servware, though, we have metrics for everything. And because all our code gardens are effectively immortal, virtually every metric you can think of, no matter how complex, can be sampled and plotted over time. So all our metrics look like line graphs, and anything that doesn't have "real-time" metrics is something that we're operating blind, which is no fun.

In the Servware Industry, we're quickly finding that you need to instrument all systems, and you need to do it from the ground up, in part because retrofitting instrumentation onto a system that wasn't designed for it is almost as hard as adding an extensibility layer.

But there are more compelling reasons. For one, Servware is different from Software in that you can release fixes or features as often as you like. There aren't any external constraints, such as how often your users can tolerate re-installing your software (since they never have to).

Well, when you have no constraints on the length of your release cycle, you can, if you choose, grow your software bottom-up instead of constructing it top-down. You launch a small piece of functionality, get some feedback, and make the appropriate change — which is sometimes a change in direction or requirements.

To get this kind of feedback, you need everything to be instrumented as you go. Your instrumentation will tell you how well your customers like things, how well your stuff is performing, and so on, so you can make the right changes as you go, rather than later when it's hard to fix things. Metrics help you grow your system in the right direction from the outset.
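As a rough illustration of "instrumented as you go", here is a small Python sketch. The `Metrics` class and the metric names are assumptions made up for this example, not a description of any real monitoring system. Every sample is stored with a timestamp, so any metric can later be plotted as a line over time.

```python
import time
from collections import defaultdict
from functools import wraps


class Metrics:
    """Hypothetical in-memory recorder: every sample is a (timestamp, value) pair."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self.samples = defaultdict(list)  # metric name -> [(timestamp, value), ...]

    def record(self, name, value):
        self.samples[name].append((self._clock(), value))

    def timed(self, name):
        """Decorator that records a call count and a latency sample per call."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return func(*args, **kwargs)
                finally:
                    # Recorded even when the call raises, so failures show up too.
                    self.record(name + ".latency_s", time.perf_counter() - start)
                    self.record(name + ".calls", 1)
            return wrapper
        return decorator
```

Wrapping a handler with `@metrics.timed("search")` from its first launch is cheap. Adding it after the system has grown for a few years is the retrofit problem described above.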

One specific kind of metric that all Servware systems need is Availability. This one tells you if the servware is available (whatever that means).

(Note: You've probably noticed that I'm deliberately avoiding the use of the word "Service" in this essay, since it's so overloaded with old baggage, much of it originating from our mistaken belief that we're in the Software industry. The "hardened API" teller-call model, for instance, is our best effort to map the tenets of Object-Oriented software design onto the Servware domain. But we're finding that it's a lousy model for a zillion reasons beyond the scope of this essay. Suffice it to say that our current idea of the concept of a teller-call API is pretty different from the OO interface model it started out as.)

Anyway, availability: this metric is interesting in that if you add enough availability monitoring, you wind up with a unit-testing system: software QA, for all intents and purposes. Weird, eh? But logical: if you start by monitoring a system to see if it's available, you find that it may be "available" but its dependencies aren't, so you need to reach through and monitor them as well.

But the definition of "available" is fuzzy; any servware can get into a brownout state where it claims it's available, when in fact the only thing that's actually available is the part that's responding to the availability ping. And boy, that's a nasty problem to debug, since your servware looks like it's fine, and the problem appears to be downstream, but your servware is really what's horked.

So in order to test true availability, you have to start sending "smart pings" through the system, which are (in effect) testing actual use cases. And until you're actually measuring all your use cases, any one of them can potentially be unavailable without your knowledge. Hence availability monitoring always evolves into real-time QA.
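Here is a toy Python sketch of the difference between a plain availability ping and a "smart ping". The `BrownoutCatalog` class and its methods are invented for illustration. It stands in for servware that still answers pings while one of its dependencies is down.

```python
from typing import Callable, Dict


class BrownoutCatalog:
    """Hypothetical servware in a brownout: it answers pings, but its database is gone."""

    def is_listening(self) -> bool:
        return True

    def static_page(self) -> str:
        return "<html>ok</html>"

    def lookup_item(self, item_id: str) -> dict:
        raise ConnectionError("database unreachable")


def shallow_ping(service) -> bool:
    # Only proves that the part answering the ping is alive.
    return service.is_listening()


def smart_ping(use_cases: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each use case end to end; any exception counts as unavailable."""
    results = {}
    for name, check in use_cases.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def build_use_cases(service) -> Dict[str, Callable[[], bool]]:
    return {
        "static_page": lambda: "ok" in service.static_page(),
        "lookup_item": lambda: service.lookup_item("42") is not None,
    }
```

Notice that `build_use_cases` already looks a lot like a test suite. Keep adding use cases until nothing can be down without your knowledge, and you have written your QA tests.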

Anyway, metrics are yet another obvious way in which Servware is radically different from Software. There aren't any books that tell you how to do Servware metrics, because most authors (just like most programmers) haven't yet realized that Servware and Software are different disciplines. They will, though, and then maybe some decent Servware books will start to appear.

Somewhat Contentious Stuff

At this point, I may or may not have convinced you that we're not in the Software Industry, and that what we're doing is different enough to be called something else and revisited from first principles.

Even if you're not convinced (yet) that it's a truly significant difference, well, you still already knew that Amazon is pretty different from your job at ShrinkWrapSoft. I've gone through some of the obvious differences:

  • The traditional software lifecycle and software development processes don't cover our situation very well.

  • The traditional documentation process and frameworks don't handle Servware very well, but a new type of documentation Servware called a "Wiki" is filling the gap rather nicely.

  • Servware (like documentation and software) needs to be as user-extensible as possible, up to and including user-programmability. But we don't really know how to do it yet, since nobody has gone back and re-architected their Servware systems for user extensibility.

  • Servware has a deep-rooted need for real-time metrics; this has no real correspondence in Software.

Yawn. So what, you ask. I haven't said anything particularly new or useful so far.

To make it interesting, let's start throwing out some more of our time-honored traditions from the Software Industry, and see where it gets us.

Tradition #1: Using C++

Yeah, you saw that one coming. I know. Surprise! Stevey doesn't like C++. Tell us something we don't know.

Unfortunately, debunking C++ for Servware development requires an essay at least as long as this one, in part because the belief that C++ is a good choice is so deeply ingrained in people with backgrounds in the software industry, where it's still (for the most part) the only real option.

For now, I'm just mentioning it as something that Servware Industry professionals like ourselves need to revisit if we want any hope of staying ahead of our competitors.

Tradition #2: Rewriting Software

Folks in the Software industry have traditionally held that rewriting working software is a really bad idea. Joel Spolsky rants rather effectively against rewriting software in his Things You Should Never Do essay from April 2000.

But we're not in the software industry, are we?

Rewriting your code is a different proposition in the Servware industry, for a variety of reasons.

One reason is tied closely to my argument that C++ is a lousy choice for Servware development, so it'll have to wait for that essay.

The gist of it is that once you've decided you don't need to use C++, rewriting all your C++ code will eliminate huge classes of bugs that simply aren't possible in higher-level languages. Java eliminates some of them, and as you move up the scale of expressive power, even more fall by the wayside. Moreover, you open yourself up to being able to use more libraries (since code sharing is easier in virtually every other language), you shorten your build times, you shrink your code base, you eliminate non-essential system complexity, and you actually start to enjoy programming again. But really — it's another essay.

Language choice aside, rewriting Servware doesn't suffer from some of the key problems with rewriting Software.

For one thing, rewriting Software is pretty much an all-or-nothing proposition. But you can rewrite Servware a little at a time — that's how we've been migrating from Obidos to Gurupa, for example: one URL at a time, basically.
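A sketch of what "one URL at a time" could look like, in Python. This is not how the Obidos-to-Gurupa migration was actually implemented; the prefixes and the `"old"`/`"new"` labels are hypothetical. The idea is just a front-end routing decision driven by a list of paths that have already been rewritten.

```python
# Hypothetical list of URL prefixes already served by the rewritten system.
MIGRATED_PREFIXES = ("/help/", "/wishlist/")


def choose_backend(path: str, migrated=MIGRATED_PREFIXES) -> str:
    """Return "new" for paths that have been migrated, "old" for everything else."""
    for prefix in migrated:
        if path.startswith(prefix):
            return "new"
    return "old"
```

Each migration step is then a one-line change to the prefix list, and rolling one back is the same one-line change in reverse.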

Also, you can migrate smoothly by running the old and new versions in parallel, even with a complete rewrite. This is how we migrate to new order-processing systems: by routing a small percentage of orders through the new version, and gradually dialing up the percentage as we gain confidence that the new system actually works.
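The percentage dial can be sketched in a few lines of Python. This is an assumption-laden illustration, not the real order-routing code: the function names are made up, and hashing the order ID into one of 100 buckets is one reasonable way to do it, not necessarily ours.

```python
import hashlib


def bucket_for(order_id: str) -> int:
    """Map an order ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(order_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route_order(order_id: str, new_system_percent: int) -> str:
    """Send roughly new_system_percent percent of orders to the new system."""
    return "new" if bucket_for(order_id) < new_system_percent else "old"
```

Because the bucket depends only on the order ID, an order that went to the new system at 10% still goes there at 50%. Dialing up only ever moves orders from old to new, never back and forth.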

That's just not possible in the Software industry.

Software that runs on client machines is often filled with special-purpose code intended to run only on one kind of system: device drivers, for example, or other platform-dependent code. It's a nightmare to test all the combinations that you've debugged, and this is one reason Joel says you're crazy to rewrite it. But in Servware development, you don't (or at least, shouldn't) have platform-dependent code. So there aren't anywhere near as many oddball code paths to cover.

Rewriting software is often just what a system needs; re-architecting or refactoring into a new design can be significantly harder than a straight rewrite. Sometimes it's easier to rewrite it because the original authors are gone, and the code base was never really documented. Sometimes the code has simply grown so large and crufty that it's physically less effort to write a new system that has the same externally-observable behavior (also known as a "spec", for systems lacking formal specifications; i.e., most of them.)

And rewriting is usually more fun than hacking on an old system. Don't underestimate how fired up SDEs can get if they're given a chance to build something from scratch. In addition to the other advantages, it can be a powerful motivation for developers.

So avoiding software rewrites is another tradition that has little value in the Servware business.

Tradition #3: Monolithic Services

Monolithic service designs, which you can usually spot by the presence of the word "Master" in the service name, are a relic of our roots in the Software industry, where everything is delivered to customers as a Big Ball of Software.

There's no reason they need to be one giant binary that exports 1000 API calls; this probably causes more problems than it solves. About the best that can be said for it is that it's easy to roll back to a previous version in production, which is a good thing, because with 1000 calls in one binary, the thing is going to break all the time.

Obviously a better design is to factor the service into sub-services. But even that isn't really thinking in the Servware domain, where we throw out all our old Software Industry assumptions and start fresh.

What if every "API call" were hosted by a different server? Well, you'd have a lot of flexibility in fixing any given call without needing to impact clients that don't use it. And it'd be easier to distribute it over multiple machines — it seems like it'd be a better design than the BEA WebLogic design, which is to run a few copies of your MonolithService on each of a handful of machines. There would be new problems to solve, but who's to say it wouldn't be a better design overall? We won't know unless we ask the question.

Tradition #4: Apps and Libraries

In our old friend the Software Industry, there is a relatively clean delineation between library writers and application writers.

Simply put, an app is something that users use, and a library is something that programmers use. An app is often (but not always) synonymous with at least one dedicated operating system process, whereas a library is typically thought of as something that's linked in as part of an application, rather than running on its own.

Well gosh. Those things are meaningless in the Servware domain, since right off the bat, we have no corresponding analogs for the concepts of "user", "operating system", "process", or even (possibly) "application".

Is MRT (our recruiting tool) an application? Sure, it's a web application, you say. Well, if I want to add a feature to MRT, and it doesn't provide me with an extension interface (and of course it doesn't, since almost no Servware is extensible today), can I add features to it?

Sure. I pull down the HTML, parse it using any of several equally ugly techniques, and feed it into my own web application, perhaps one that sends me mail when there are pipeline resumes available for me to review.
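For the curious, here is roughly what one of those ugly techniques looks like using Python's standard `html.parser` module. The page markup is invented: I'm assuming, purely for illustration, that the tool renders each candidate name in a `<td class="candidate">` cell. The real page would need its own selectors.

```python
from html.parser import HTMLParser


class CandidateScraper(HTMLParser):
    """Collects the text of every <td class="candidate"> cell (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_cell = False
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and ("class", "candidate") in attrs:
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self.candidates.append(data.strip())


def scrape_candidates(html: str) -> list:
    parser = CandidateScraper()
    parser.feed(html)
    parser.close()
    return parser.candidates


# Made-up sample page standing in for the real tool's output.
SAMPLE_PAGE = (
    '<table>'
    '<tr><td class="candidate">A. Example</td><td>new</td></tr>'
    '<tr><td class="candidate">B. Sample</td><td>reviewed</td></tr>'
    '</table>'
)
```

It breaks the moment someone changes the page layout, of course, which is the price of treating a user interface as an API.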

We do this all the time in the Servware Industry. It's far less common in the Software Industry: people don't usually think of applications as components that you use in building other applications.

Actually, it does happen, in (of all places) Microsoft Windows, where they've taken all the major applications and exposed DOM models that can be accessed via any language through the OLE/COM interface. So, for example, you can write a little script in Visual Blub (or your favorite language) that can automate Internet Explorer, e.g. to remove advertisements from the MapQuest "driving directions" web page.

But with the exception of Windows automation, which not many people realize they can do, you don't often find Software Industry folks using apps as components.

We Servware folks do it all the time, though, because there isn't a clear distinction between a "web app" and a "web service". O'Reilly, for instance, has been scraping Amazon's site for sales-rank information for years: they started long before we began offering it as a web service interface. Making it a SOAP interface certainly made it a lot more convenient, but it didn't really enable anything new for O'Reilly. They were using Amazon the user-application as Amazon the data-service.

Darren V.'s TsTk is another great example. He's created a language that allows you to source and analyze data from any number of disparate sources. Think about having a language like that for websites - one that lets you create custom metrics (or any other view) from hundreds of websites. I guarantee you that such a language will eventually exist, probably sooner than anyone expects. By treating websites as atomic building blocks, it'll open up possibilities my poor mind has trouble even imagining, at least at 2:47 a.m.

I could give you dozens of other examples of using apps to build apps in the Servware domain, but you get the idea.

Building apps using other apps is so much easier in the Servware industry than in the Software industry that it's likely to completely change the way we think about building things. Servware will be composed in layers, probably many more layers than exist in a typical large Software application. Google and Amazon build services, I build a service on top of both of them, someone slaps a language on top of my service, someone else builds a service with the language, ad infinitum.

I don't even want to think about deadlocks. Obviously there will be automated dependency-discovery mechanisms available to people, similar to the ones we're just starting to build for our internal systems now. But who knows how it's all going to work.
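One small piece of the dependency problem is at least easy to sketch: given a map of which services call which, find a circular dependency. The Python below is a plain depth-first search over a hypothetical dependency map. It is not a description of our internal discovery tools, and the service names in the usage note are made up.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of service names, or None.

    deps maps each service name to the names of the services it calls.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in deps.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path, so we have come full circle.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

For example, `find_cycle({"mashup": ["search", "store"], "search": [], "store": ["mashup"]})` reports that `mashup` and `store` depend on each other. The hard part, discovering the map in the first place, is left entirely open here.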

Regardless, it's clear that the time-honored Software distinction between "app" and "library" is a relic of traditional Operating Systems that doesn't necessarily carry forward to a distributed world.

Tradition #N: <Your Tradition Here>

Face it. You're not in the Software Industry. This is something altogether new and different; it's as different from traditional software engineering as software design is from hardware design.

I haven't even touched on issues like versioning, deployment, configuration, distribution and parallel computing, state management, caching, service throttling, brownout detection, and a host of other problems that have no analogs in the software industry. We need to solve them, though, so we might as well own up to the fact that this is an entirely new problem space — one that doesn't necessarily have much in common with "software engineering" in the traditional sense.

So next time you have a problem to solve, stop and think about whether the "obvious" solution really is obvious, or if it's just something you've inherited from our old ways of thinking about development. You might surprise yourself.

(Published Oct 12, 2004)


Comments

I liked your post.

One thing I felt you should have talked about was the tradeoff between "time to market" and "do it right".

Should we release services and products now, while they are good enough or delay releases by weeks or months to do them right the first time?

btw: I'm in AWS so this issue is especially relevant (at least to me).

Posted by: Rahul S. at October 15, 2004 08:32 PM

Steve Yegge is my hero.

Posted by: Aaron F. at October 19, 2004 10:36 AM

Rahul — I was really shocked when I first came to Amazon and saw how seemingly flippant we were about launching half-finished stuff, with minimal QA testing.

Over time, though, I've become convinced that doing it this way has the best survival characteristics, for reasons that are outlined nicely in Dick Gabriel's famous "Worse is Better" essay. I think our approach of "launch early, launch often" (which sounds an awful lot like the Linux kernel release philosophy) is one of the main reasons for Amazon's early success.

I see a lot of people coming in and wanting to impose heavyweight processes and rigorous testing, trying to make Amazon more like a traditional software development house. I personally don't think this would be a good idea, and I favor launching early and often, even if the releases result in extra CS contacts.

It's a long discussion, though, with lots of gray areas and caveats, so I'll just leave it at that for now.

Posted by: Steve Yegge at October 19, 2004 11:28 PM