How we learned nothing from the Space Shuttle disasters

This is a work in progress, not yet linked from the index. Until the link appears there, I can't guarantee the accuracy of anything here.

After the Challenger disaster, there was a lot of investigating. After all, NASA had already earned a bad reputation over costs. The whole Space Shuttle project was questionable - it was supposed to be a reusable vehicle that would make going to space cheaper, yet large parts of the vehicle were discarded on every flight and the rest took months to refurbish. In the end, everything was even more expensive than before. A fatal accident, of course, didn't help.

Accidents in space happen. Space is hard, and sometimes nobody could possibly have anticipated the problem. But this was not one of those cases. Several causes worked together to produce the explosion, and at least some were known in advance - the rubber O-ring seals in the boosters stiffened in the unusually cold weather, which let hot gas leak out during ascent. There were far too few failsafes, since everybody believed the Shuttle was as safe as a regular airplane. How did NASA screw up so badly?

There were a couple of technical problems, but it quickly became clear that the real problem was in management, and that an accident had been waiting to happen. Richard Feynman asked the managers what the chance of failure was for a given flight. The answers were around 1:100,000. Now, there were 2 fatal accidents in 135 flights, so empirically, the chance is about 1:70.
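To see just how far apart those numbers are, here's the arithmetic as a quick sketch (in Python; the figures are the ones above, nothing official):

    # Back-of-the-envelope comparison of claimed vs. empirical failure odds.
    flights = 135          # total Space Shuttle missions flown
    fatal_accidents = 2    # Challenger (1986) and Columbia (2003)

    empirical_odds = flights / fatal_accidents   # ~67.5, i.e. roughly 1:70
    claimed_odds = 100_000                       # the management estimate Feynman was given

    print(f"empirical risk: about 1 in {round(empirical_odds)}")
    print(f"management optimism factor: ~{round(claimed_odds / empirical_odds)}x")  # ~1500x

This points to the first problem: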

Groupthink

When you have ten experts who seem to agree on everything, and especially on how great they all are, you have a problem. If one of them finds an issue, they have to fight nine others who all believe everything is fine. And for each of those nine, eight colleagues agree and there is just one dissenter, so each of the nine grows even more confident. Problems get buried, so the system is believed to be safe, so it can't have problems, so they get buried again. A vicious cycle.

In Challenger's case, this took the form of a meeting, the night before launch, with the engineers who had developed the rubber seals (the ones that later failed). They warned that they couldn't guarantee the seals would hold in the launch conditions, given the low temperature. They were overruled.

Groupthink causes problems everywhere. It is a problem in every big company. Many startup founders mention it as a huge change when they are acquired by a large company: too many people working together, and efficiency plummets. In a way, groupthink was even behind the Shuttle program itself, in that the design was objectively bad and failed to save money - which was pretty clear just from the fact that half of the vehicle had to be thrown away on every flight.

And much has already been said about it. It has a catchy name - one word, easy to put on posters...

Image source: https://www.jabra.com/blog/groupthink-kills-collaboration/

And it's quick to explain. Its causes can usually be traced to individual people, and the whole thing can seemingly be fixed by firing them and assembling a new team. But groupthink is too immediate a cause - it's like saying the problem was the seal. It was, but even if that particular technical problem hadn't happened, something else would have. And even if groupthink hadn't happened, there would still be a cause further up the chain.

Normalization of Deviance

Just look at the name. It is more informative than "groupthink". But it's just so ugly. If you make a presentation about groupthink, the word can be pasted all over and it's fine. If you make a presentation about this, the name will stand out and it will sound awkward. So what is it?

To see what NoD is, look at the second Space Shuttle disaster, the disintegration of Columbia. The oversimplified story is this: during launch, a piece of insulation foam broke off the external tank and hit a wing. Mission control was aware of this but believed there was no damage. When Columbia was returning to Earth, it had to survive enormous aerodynamic stress at extreme temperatures. Unfortunately, it turned out the wing had been damaged: the foam had punched a hole, which grew under the stress until the wing failed, at which point Columbia spiralled out of control and disintegrated. Seven astronauts died.

Now let's look at the story around the piece of foam.

1) In previous launches, this had been a common occurrence: pieces of foam fell off now and then. It had never caused any problems, so it was ignored.

2) When it happened this time, the working assumption was that it was probably fine. Questioning that implicit assumption was hard. That's the groupthink part.

3) The simulation software, though outdated, predicted possible damage. The prediction was ignored because the tool had given false alarms before. More groupthink here.

4) Other ways to find out more (like using spy satellites) were deemed "not necessary".

We do see groupthink here, but there's more. A problem shows up several times in a row and never causes immediate trouble, so it gets ignored. Later, other such problems appear, and together they add up to a fatal accident. The falling foam is the deviance: it happens many times, but fixing it seems unnecessary and hard. So it's normalised, made the standard.

Before we move to the point I want to make, let's revisit a sentence from earlier:

Accidents in space happen.

That's NoD in practice. And yet you believed me. In a way, it's true. But every fatal accident in space was easily preventable, and there have been only four: the two Space Shuttle disasters already discussed and two Soviet ones. In one, three cosmonauts suffocated when a ventilation valve jolted open during descent and vented the cabin's air into space. In the other, the parachute failed; the craft had been rushed, under-tested, and surrounded by politics, with hundreds of problems reported before the flight.

When things are done properly, accidents almost never happen. Some still do: Apollo 13 was caused by damaged insulation on wiring inside an oxygen tank. It was a random accident, and it resulted in a few redesigns of the failed systems, not in firing hundreds of people for doing terrible jobs. Notably, everybody survived.