3 Lessons from big software failures

We all learn from mistakes. Ideally, we learn from someone else’s mistakes. After the frantic glitch hunt, after the final customer update, it’s time to pick up the pieces and dissect what went wrong. Herein, a collection of more or less recent, decidedly epic software disasters. May they spark conversation that helps your shop avoid more of the same.

Software fails. Sometimes, software fails spectacularly.

Case in point: Black Friday 2019. Because of a mysterious service outage, at around midnight on the Friday after Thanksgiving, thousands of people who rely on Dexcom’s software to follow the blood sugars of their diabetic children and other loved ones were plunged into darkness. And the outage lasted for days.

Ordinarily, the Dexcom Follow app allows followers, including parents or other caregivers, to remotely view a diabetic’s glucose data directly from a smart device, whether they’re down the hall or on the other side of the country. When Dexcom’s system failed, some diabetics’ blood sugar dropped to potentially fatal levels, but alerts weren’t sent. Caregivers were oblivious to the issue, because they trusted the software to send alerts. The lack of data, and its timing in the middle of the night, could have led to dead-in-bed syndrome, in which diabetics die in their sleep due to undetected and untreated hypoglycemia. Dexcom failed to quickly pin down and mitigate the underlying problem. During the days-long outage, as Dexcom’s Follow feature flickered in and out of life, the online diabetes community seethed over the company’s failure to inform customers until hours after it had started to receive complaints.

The Dexcom failure, like all epic software failures, left a host of questions: What caused it? How could it have been prevented? How could the company better inform customers? What lessons does the failure offer for other IT shops to avoid similar software catastrophes?

This article pulls together a few recent software failures, in no particular order. They’re all serious, though.

The purpose is not schadenfreude or corporate shaming. Rather, these are opportunities for dedicated software development and IT departments to recognize the vulnerabilities in their own systems (human and technical). How could these failures have been prevented? What can you do to ensure your company avoids a similar mistake?

Mind you, we don’t have all the answers. The companies themselves might not know the genesis of the problem. For all we know, they’re still puzzling it all out. At any rate, there’s only so much detail a company can publicly release without arming cyber-attackers who might exploit known flaws once they know the components—and the corresponding security issues—of a company’s IT infrastructure.

But at the very least, these failures can spark the kind of discussion that can lead your IT group to avoid disastrous software mistakes. Perhaps they can inspire a team conversation that begins, “How can we be sure we won’t repeat their errors?”

The self-driving car that killed Elaine Herzberg

In March 2018, 49-year-old Elaine Herzberg became what was believed to be the first pedestrian killed by a self-driving car. One of Uber’s prototypes struck Herzberg as she walked her bicycle across a street in Tempe, Arizona, on a Sunday night. A human test driver was behind the wheel, but video from the car’s dash cam showed that the driver was looking down, not at the road, in the seconds leading up to the crash.

What we know about the causes

Following the fatal crash, two anonymous sources told The Information that the car’s failure to detect Herzberg was likely caused by a software bug in its self-driving car technology.

According to police, the car didn’t try to avoid hitting the woman. It was equipped with sensors, including video cameras, radar, and lidar—a laser form of radar. Given that Herzberg was dressed in dark clothes, at night, the video cameras might have had a tough time. But the other sensors should have functioned during the nighttime test.

Uber’s autonomous programming detects objects in the road. Its sensitivity can be fine-tuned to ensure that the car only responds to true threats and ignores the rest. For example, a plastic bag blowing across the road would be considered a false alarm, not something that a car should avoid by slowing down or braking. The sources who talked to The Information said that Uber’s sensors did, in fact, detect Herzberg, but the software incorrectly identified her as a “false positive” and concluded that the car didn’t need to stop for her.
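To make that tuning trade-off concrete, here is a minimal Python sketch. It is not Uber’s software: the `Detection` type, the confidence scores, and the thresholds are all invented for illustration. It only shows how a single sensitivity threshold trades needless braking against missed hazards.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str       # description of the object (toy data)
    is_hazard: bool  # ground truth: should the car have reacted?
    score: float     # hypothetical classifier confidence that this is a hazard


def evaluate(detections, threshold):
    """Count missed hazards and needless braking events at a given threshold."""
    missed = sum(1 for d in detections if d.is_hazard and d.score < threshold)
    needless = sum(1 for d in detections if not d.is_hazard and d.score >= threshold)
    return missed, needless


# Invented sample data; the scores are assumptions, not real sensor output.
SAMPLE = [
    Detection("plastic bag", False, 0.30),
    Detection("roadside bush", False, 0.45),
    Detection("pedestrian in daylight", True, 0.95),
    Detection("pedestrian walking a bicycle at night", True, 0.55),
]

for threshold in (0.4, 0.6):
    missed, needless = evaluate(SAMPLE, threshold)
    print(f"threshold={threshold}: missed hazards={missed}, needless brakes={needless}")
```

With this toy data, the lower threshold brakes once for a bush and misses nobody, while the higher threshold gives a smoother ride and misses the pedestrian at night. Real systems are far more complex, but the shape of the decision is the same.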

The incident put a brake on autonomous driving. Uber settled with Herzberg’s family, avoiding a civil suit and thereby sidestepping liability questions about self-driving cars, particularly after they’re out of the test phase and operated by private citizens. Uber chose not to renew its permit for testing self-driving vehicles in California when its permit expired at the end of March 2018. Arizona halted all of Uber’s self-driving tests following the crash. Other companies, including Toyota and Nvidia, voluntarily suspended autonomous vehicle tests in the wake of Herzberg’s death, while Boston asked local self-driving car companies to halt ongoing testing in the Seaport District.

Takeaway: How much are we ready to trust AI?

Are we ready to let artificial intelligence (AI) make choices for us, if it might choose to kill us?

Self-driving car technologies are forced to make trade-offs. The outcome can be tragic.

The AI’s fatal choice in this case was reportedly made because of a software glitch. But lives are at stake, and they are subject to an AI’s choices. Do we want a smoother ride that’s more prone to ignore potential false positives (such as plastic bags or bushes on the side of the road), or a jerky ride that errs on the side of “that object might be a human”? At the other extreme, at what point do a technological advance’s weaknesses mean that the product can’t be deployed? Who makes that decision, in your company? Who should?

The $2.1 billion HealthCare.gov cluster-fork

Millions of Americans were expected to sign up for health insurance following the Oct. 1, 2013 rollout of Obamacare. It should have come as no shocker that, on its first day, the HealthCare.gov health insurance exchange website was reportedly visited by over 4 million unique visitors—visitors who were destined for pure, banging head-desk frustration.

Unfortunately, that tidal wave was indeed a surprise to the inexperienced government employees and contractors behind the website rollout.

It’s hard to overstate how badly the site performed. According to notes taken the following day at the war room of the Center for Consumer Information and Insurance Oversight, which is part of the Centers for Medicare & Medicaid Services, the number of people who managed to sign up, in spite of the ongoing issues of “high capacity on the website, direct enrollment not working, [Veteran’s Administration] system not connecting, [and] Experian creating confusion with credit check information,” came to a grand total of six. In the words of Saturday Night Live’s Kate McKinnon as she portrayed Secretary of Health and Human Services (HHS) Kathleen Sebelius, the site was “crashing and freezing and shutting down and not working and breaking and sucking.”

HealthCare.gov crashed within two hours of launch. As McKinnon deadpanned, it was a sign of pure success: “If our website still isn’t loading properly, we’re probably just overloaded with traffic. Millions of Americans are visiting HealthCare.gov. Which is great news. Unfortunately, the site was only designed to handle six users at a time.”

Causes: Shall we count the ways?

Fingers initially pointed to inadequate site capacity. But according to a Harvard Business School analysis of the spectacularly failed launch, that was only part of the picture. Other faults included utterly lame site design, inadequate staffing, and lack of experience. Need specifics? They included:

  • Incomplete drop-down menus
  • Incorrect or incomplete user data transmitted to insurers
  • A login feature that could handle even less traffic than the main site and hence formed a sign-in bottleneck
  • HHS employees’ and managers’ inexperience in dealing with technology product launches
  • Key technical positions left unfilled
  • Project managers’ ignorance regarding the amount of work required and typical product development processes, leading to insufficient time devoted to testing and troubleshooting
  • Lack of leadership, resulting in delays in key decision making or a lack of communication when key decisions were made
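The sign-in bottleneck in that list is worth a quick illustration. When every user must pass through several stages in series, the slowest stage caps the whole site. A minimal sketch follows, with made-up capacity numbers; these are assumptions for the example, not measured HealthCare.gov figures.

```python
def end_to_end_capacity(stage_capacity):
    """Return (bottleneck stage, capacity) for stages users pass through in series."""
    name = min(stage_capacity, key=stage_capacity.get)
    return name, stage_capacity[name]


# Hypothetical capacities in requests per second, invented for illustration.
STAGES = {"landing page": 5000, "login": 400, "plan comparison": 2000}

name, capacity = end_to_end_capacity(STAGES)
print(f"bottleneck: {name} at {capacity} requests/second")
```

However much capacity the main site has, users in this toy model get through no faster than the login stage allows, which is why load testing each stage separately matters.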

Then there was the politically dictated but wrong-headed devotion to deadline instead of to results and actual preparedness levels. The Affordable Care Act mandated a launch date. Regardless of the amount of software testing that HHS employees managed to accomplish, and regardless of the results of the testing and troubleshooting they did manage to get done, the deadline ruled supreme.

A team of developers and designers took up shop in suburban Maryland, creating a virtual startup-within-government, in order to rescue HealthCare.gov and Obamacare. They devoted months to rewriting all of the site’s functions, replacing contractor-made applications with ones costing one-fiftieth of the price.

Takeaway: Know your limitations, and call in experts to resolve those limitations

Only a handful of sites handle the same traffic volume and backend complexity that HealthCare.gov requires. And they’re not within the Beltway, as The Atlantic pointed out following the disastrous rollout and the subsequent site rescue. Those websites are in Silicon Valley.

If you can’t change the schedule, you have to change the budget—and bring in the people who’ve demonstrated that they can get the work done, both well and on time. It’s up to an organization’s top management to know whether they have a team with so little experience in technology projects that they don’t even know what they don’t know. Evaluating the team’s preparedness and experience is the job of the top brass, but if the rank and file don’t see that evaluation happening, by all means, they should raise the red flag.

Disaster: Dexcom’s fainting follow

Ten-year-old John Coleman-Prisco has worn a Dexcom glucose monitor since he was 6. Normally, if he gets into trouble at night, the monitor sends an alert to his mother’s smartphone, which relies on the app’s “Follow” feature to tell caregivers what the diabetic’s blood sugar is, whether it’s trending up or down, and when it’s veering into the red zone. But that’s not what happened in the early morning hours of the Saturday after Thanksgiving. Instead, as John’s mother told the New York Times, his brother heard him moaning and screamed for their parents. They jumped out of bed, managed to rouse him, and fed John apple juice and candy to restore his glucose levels to a safe range.

How does it come to pass that an application on which people’s lives depend can fail, without warning?

Causes: Your guess is as good as theirs

As of December 2, Dexcom was still scratching its head over the failure’s root cause, though the company did determine that a server overload occurred due to “an unexpected system issue that generated a massive backlog, which our system was unable to sufficiently handle.” At the time, it was working 24/7 with its partners at Microsoft to address the problem.

On December 3, Dexcom said that it had identified the problem, but the company didn’t provide details. A December 10 app upgrade in the Google Play store summed it up as “performance enhancements and bug fixes.” However, following the update, numerous reviewers said that they still weren’t getting any data in the app. In short, Dexcom failed to quickly pin down and mitigate the underlying problem.

Takeaway(s): Maybe you can’t prevent software defects. But you can communicate responsibly.

Server overload is one thing. Leaving people uninformed for hours at a time is another, completely avoidable thing. Dexcom didn’t announce the outage until 8 a.m. Pacific time on Saturday—11 a.m. on the East Coast. It informed users with a brief notice on its Facebook page. During the days-long outage, with the Dexcom app’s data-transmission Follow feature flickering in and out of life, the online diabetes community seethed over the company’s failure to inform customers until hours after it began to receive complaints.

Dexcom says that the incident “revealed some areas for improvement, both with our system and in how we communicate with our users.” Its mea culpa: “Once we have solved the issue immediately at hand, we will follow our standard assessment procedure to learn from what happened and help prevent issues like this from happening again. Additionally, we are committed to creating a more optimal customer communication experience moving forward.”

Translation: Oops. Maybe we should have let you all know as soon as we learned that we’d left you high and dry. Dexcom is far from the first company to be taken to task not only for a disaster, but for the lack of transparency and prompt communication to users that followed.

In the realm of cybersecurity, it’s one thing when an unpatched vulnerability comes to light. In such a case, it’s smart to keep the problem close to the vest until a patch is ready to be applied, so attackers aren’t alerted about opportunities to exploit compromised systems. But in an application failure like Dexcom’s, potential cyberattack isn’t the issue. Does your company have a policy about how much information it will share about major, non-security-related outages, and a timeline for how promptly to do so? Should it?
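Communication can also be built into the software itself, so that silence becomes a signal rather than a reassurance. Here is a minimal sketch of a staleness check, assuming a client that records when it last received a reading. The function name, the 15-minute limit, and the message text are all hypothetical, and nothing here describes how Dexcom’s app actually works.

```python
from datetime import datetime, timedelta


def staleness_alert(last_reading_at, now, max_silence=timedelta(minutes=15)):
    """Return a warning string if no reading arrived within max_silence, else None."""
    silence = now - last_reading_at
    if silence > max_silence:
        minutes = int(silence.total_seconds() // 60)
        return f"No data for {minutes} minutes; check on the person directly."
    return None


# Example timestamps chosen for illustration only.
now = datetime(2019, 11, 30, 2, 0)
print(staleness_alert(now - timedelta(minutes=5), now))
print(staleness_alert(now - timedelta(minutes=40), now))
```

The design choice is that the check runs on the follower’s device and depends only on the local clock, so it can still fire when the server side is the part that has failed.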

Disaster Etcetera

This list obviously could go on and on. We could have included, say, Toyota’s unintended acceleration problems, which software experts attributed in part to defective source code. Then, too, there was the spectacular crash of Knight Capital Group, a company with nearly $400 million in assets that went bankrupt in 45 minutes because of a failed deployment. Or the times British Airways’ IT systems nosedived, stranding or rerouting tens of thousands of passengers in 2017 and yet again in 2019.

A full list of every software disaster ever would be long. A full list of every truly epic software failure would make a hefty paperweight. If you were presenting that paperweight to your IT department, which disaster would be etched onto its surface, its lessons most particularly suited for your industry and your organization’s strengths and weaknesses?

Who’s responsible for this conversation? Perhaps you should consider establishing a Chief Quality Officer. We’ve got a free, downloadable white paper about that very topic.

by Lisa Vaas

Lisa Vaas has been writing about technology, careers, science and health since 1995. She rose to the lofty heights of Executive Editor for eWEEK, popped out with the 2008 crash and joined the freelancer economy. Her main gig is writing for Naked Security, the blog run by the cybersecurity company Sophos. She’s also written for CIO Mag, ComputerWorld, PC Mag, IT Expert Voice, Software Quality Connection, Time, and the US and British editions of HP’s Input/Output.
