The pod bay doors didn’t open. Why didn’t someone test for that?
One of the most famous bits of movie dialog, one that’s become part of popular culture, is in the film 2001: A Space Odyssey. The protagonist, Dr. Dave Bowman, says to the voice-driven AI, “Open the pod bay doors, HAL.” To which HAL responds, “I’m sorry, Dave, I’m afraid I can’t do that.”
(Spoilers abound. You’ve been warned.)
Most people only pay attention to the story. But if you’re a software QA expert, what you heard was an obvious failure in software quality testing.
The AI should have obeyed the order immediately. But obviously, the HAL 9000 computer went off the rails. In the process, the computer killed all but one of the crew of Discovery, the spaceship HAL was operating, and it failed in its mission to learn more about a mysterious monolith. That qualifies as a serious software failure, though happily a fictional one; in the real world, NASA does a better job.
It’s equally obvious that thorough testing of the computer and its AI software had missed HAL’s homicidal bent.
If you accept the premise that HAL was a mission-critical application, then there are actual lessons one can learn about ensuring important software meets expectations – particularly in an environment with a lot of unknowns.
So what happened? One reason is that testing an AI is complex. It’s also likely that the makers of the HAL 9000 didn’t think testing was necessary. Why? Just ask HAL. “The 9000 series is the most reliable computer ever made. No 9000 computer has ever made a mistake or distorted information,” HAL said as an introduction.
HAL didn’t view its actions as a mistake. Its programming didn’t include sufficient safeguards to prevent things like, say, killing your passengers. That is clearly a programming failure: while it’s unlikely the programming specifically permitted such an action, nothing in it flagged such an action as contrary to mission success, either.
However, there are also other issues. The HAL 9000 computer, and its AI software, clearly used machine learning to modify its programming. That meant that when news reached the computer about a potential alien civilization, the AI changed its programming to help achieve what it now viewed as mission success. But as in many cases of AI failures, it had insufficient data on which to base such a conclusion. Likewise, it lacked (and did not create) safeguards to prohibit some types of actions, regardless of what the machine learning indicated.
By now, science fiction fans are muttering about the “Three Laws of Robotics,” but those don’t apply in the universe of the film. Those laws, governing how robots are to interact with humans, were invented by author Isaac Asimov and were intended to prevent robots (or other AI devices) from hurting people. They may not have been part of Arthur C. Clarke’s 2001 premise. There is something to be said in favor of incorporating them into HAL’s programming, but if nothing else, that would have ruined the story.
“If there’s one thing we’ve learned from HAL 9000 and other major software glitches in the automobile, health, and safety sectors, it is to improve testing procedures,” says Eric Carrell, a DevOps engineer at RapidAPI. “It may sound counter-intuitive, but the best way to minimize bugs in software is to make sure another machine is not doing the job.”
“Testing automation looks extremely profitable in terms of cost-effectiveness and high productivity levels for development teams,” Carrell explains. “However, the truth is that a machine can barely make sure that another machine can be passed as safe or complete.”
Carrell suggests that what’s really needed for testing an AI is a well-thought-out combination of automated and manual techniques. “It is true that human bias also exists. But, for the most part, the learning activities and judgment of a human will still try to emotionally as well as practically vet a particular choice,” Carrell says.
“There is no single recipe to describe how to properly test an AI-based system, just like there isn't a single right method for testing any kind of system,” says Denis Leclair, VP of engineering at Trellis. “The core principles that make for a successful QA program are the same, by and large, whether or not the product employs AI.”
“One such principle is the need for a thorough understanding of the system's requirements,” Leclair says. “This must include a thorough understanding of tolerances and the pass/fail criteria. For example, an AI system designed to drive an autonomous vehicle must be bound to a much lower tolerance for error than a recommender system used to recommend the next series for you to watch on your favorite streaming service.”
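Leclair’s point about tolerances can be made concrete. In a minimal sketch (the application names and thresholds below are invented for illustration, not drawn from any real QA program), the pass/fail criteria become explicit, per-application error tolerances, so every test run has an unambiguous verdict:

```python
# Hypothetical per-application error tolerances: an autonomous vehicle
# must meet a far stricter bound than a streaming recommender.
TOLERANCES = {
    "autonomous_driving": 0.0001,  # at most 1 error per 10,000 decisions
    "recommender": 0.05,           # 5% poor recommendations is acceptable
}

def passes_qa(application: str, observed_error_rate: float) -> bool:
    """A test run passes only if the observed error rate is within
    the tolerance agreed for that application."""
    return observed_error_rate <= TOLERANCES[application]

# The same observed error rate passes one application and fails the other.
recommender_ok = passes_qa("recommender", 0.03)          # True
driving_ok = passes_qa("autonomous_driving", 0.03)       # False
```

Writing tolerances down this way forces the team to agree on the pass/fail criteria before testing starts, rather than arguing about them after a run fails.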
In addition, you have to find a way to test the AI’s decision-making. Is it making the right choice given the information it has available? “The determination of the right decision in and of itself is not always as clear cut as one might think,” Leclair explains.
“Furthermore, the appropriate test strategy will depend a lot on the AI and machine learning (ML) techniques used in the solution,” Leclair says. “With ML, we can broadly classify any system into one of the following groups: fully supervised learning systems, semi-supervised learning systems, unsupervised learning systems, and reinforcement learning systems. The differences between these groupings are fundamentally how much information is given to the algorithm for it to train and possibly even evaluate its own correctness.”
A primary reason for using AI is so the machine can make decisions and take action based on the inputs it receives and the range of allowed responses contained in the AI programming. But in an AI that also uses machine learning, those allowed responses may be altered. So the question then is just how much can the AI change, or what are the limits of its authority?
“The ‘authority’ given to the AI system can be thought of as the range of possible output values that the black box [the AI] might emit under any conceivable set of inputs and then any follow-on effects that those outputs might be connected to,” explains Leclair.
“For many AI applications, particularly with deep learning systems, it won't be practical to analyze every micro-calculation being performed by the algorithm,” Leclair says. “Instead, a more statistical approach to testing these algorithms is generally the way to go. The idea is to exercise the system across a range of inputs and verify that the system's inferences are correct (within tolerances) for those inputs.”
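That statistical approach can be sketched in a few lines. This is only an illustration, with a toy stand-in for a trained model (the model, labels, and tolerance are all assumptions for the example): exercise the system across a labeled sample of inputs, measure the observed error rate, and check it against the agreed tolerance.

```python
def observed_error_rate(model, labeled_inputs):
    """Exercise the system across a range of inputs and return the
    fraction of inferences that disagree with the expected label."""
    errors = sum(1 for x, expected in labeled_inputs if model(x) != expected)
    return errors / len(labeled_inputs)

# Toy stand-in for a trained classifier (purely illustrative).
def toy_model(temperature):
    return "overheat" if temperature > 90 else "normal"

# Labeled sample spanning the input range; one label disagrees with the
# model's output, mimicking the occasional wrong inference of a real system.
sample = [(20, "normal"), (50, "normal"), (91, "overheat"),
          (95, "overheat"), (89, "overheat")]  # model calls the last "normal"
error_rate = observed_error_rate(toy_model, sample)   # 1 error in 5 = 0.2
within_tolerance = error_rate <= 0.25
```

The verdict is statistical, not exhaustive: no one inspected the model’s internals, yet the test still says whether its behavior stays within tolerance over a representative sample.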
“Even if the AI models employed are deeply nested, rendering them effectively opaque, the system designer can always apply limits on the model's outputs, effectively shaping the output to ensure that the answers remain within acceptable bounds appropriate for the target application. The QA engineers, for their part, must exercise and validate those limit checks by conducting boundary testing at the inputs and outputs of the system,” Leclair says.
Finally, make sure that the AI is producing data and making decisions as it’s supposed to. Leclair says the way to do this is through cross-validation. “Cross-validation is the act of testing a trained AI model against a series of previously unseen input data and evaluating the level of correctness of the output coming from the algorithm. From the QA standpoint, the objective is to validate the algorithm across a sample of input data sets representative of the range of inputs the system might expect to see in production.”
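A minimal sketch of that idea, reduced to a hold-out evaluation on previously unseen data (the spam example, threshold, and data are invented for illustration):

```python
def evaluate_on_unseen(model, unseen_cases):
    """Score a trained model against labeled inputs it never saw during
    training; return the fraction of correct outputs."""
    correct = sum(1 for x, expected in unseen_cases if model(x) == expected)
    return correct / len(unseen_cases)

# Toy "trained" model: a word-count threshold learned elsewhere.
def spam_model(word_count):
    return "spam" if word_count > 100 else "ham"

# Previously unseen, labeled inputs chosen to be representative of the
# range the system might see in production, including near-threshold cases.
holdout = [(20, "ham"), (150, "spam"), (99, "ham"), (101, "spam")]
accuracy = evaluate_on_unseen(spam_model, holdout)   # 1.0 on this sample
```

Full cross-validation repeats this with several different held-out partitions of the data, so the correctness estimate doesn’t depend on one lucky (or unlucky) split.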
“We are talking about creating systems that will end up being smarter than their masters,” Carrell says, “and we’re doing this by feeding them data. And so, in order to fully control how they behave or evolve over time, the key is to optimize training data. How this data is regressed and classified by the AI system will then eventually be responsible for the machine’s behavior.”
Look for patterns that indicate bias, including any human bias, and then determine whether the machine is ready to process real-world data. “These are the questions we want to be looking at before the machine is pushed out into the wild, to process data and dish out an entirely new result,” Carrell explains. “And so, new QA tests need to take into account security and governance measures that make an AI model completely prepared for real-world situations.”
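One simple bias pattern to look for before release is a gap in outcome rates between groups. The sketch below is illustrative only (the group labels and decisions are made up): it tallies the model’s positive-outcome rate per group, a gap that QA should investigate and explain before the model goes into the wild.

```python
from collections import defaultdict

def positive_rate_by_group(decisions):
    """Tally the model's positive-outcome rate per group from a log of
    (group, outcome) pairs, where outcome is 1 (positive) or 0 (negative).
    A large gap between groups is a pattern worth investigating for bias."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, outcome in decisions:
        totals[group] += 1
        positives[group] += outcome
    return {g: positives[g] / totals[g] for g in totals}

# Hypothetical decision log from a pre-release test run.
decisions = [("A", 1), ("A", 1), ("A", 0),
             ("B", 0), ("B", 0), ("B", 1)]
rates = positive_rate_by_group(decisions)
# Group A is approved ~67% of the time, group B ~33% -- a gap the team
# must be able to explain before the model ships.
```

A gap by itself doesn’t prove the model is biased, but it is exactly the kind of pattern that should trigger a closer look at the training data.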
Some of those tests must take into account the type of work the AI is being tasked to perform. As Leclair notes, there’s a big difference between an autonomous vehicle and recommendations for a streaming service. As the level of responsibility grows, so does the need to confirm the limits on what the AI can do without either intervention or approval. While it may be useful for a defense AI to determine when a missile launch might be hostile, it’s critical that the AI receive approval before launching a retaliatory strike.
This means that even though the AI may have great latitude to learn from its inputs, it must still have limits. Perhaps, for HAL, those limits might have included Asimov’s laws.
Our testing software is built on AI and machine learning, and we’re awfully proud of it. For example, Functionize uses AI to learn how your UI really works. Renaming or restyling a button, even moving it on the page, won’t break your tests. And we promise to close the pod bay doors when requested to do so.