As applications grow more complex, companies have moved away from the traditional approach where new features are launched on a red-letter day to great fanfare. Nowadays it’s more common to release features incrementally, both to gauge how users react to them and to test how they perform in the technical sense. Two approaches are widely used for this: Canary Testing and Dark Launching. In this blog, we look at both of them in detail.
The need for a new approach
In the old days, applications were relatively simple, often monolithic, and could be tested fairly completely prior to launch. Companies such as Microsoft tried hard to make sure applications like Word and Excel were reliable and bug-free before release. (Although they were less good at gauging user reaction to new features, as the now-infamous Office Assistant shows!) One reason for this approach was that Microsoft’s core market has always been large corporate customers. Such businesses care more about stability and reliability than they do about features. And having chosen a software suite, they will stick with it for many years (this is even more true of government agencies!).
By contrast, most software nowadays is aimed at the consumer market. Consumers are far more fickle. What they care about is usability, UX and new features. Consequently, developers are under pressure to constantly improve their applications and launch new features. The problem is that with consumer applications it’s never easy to know which features will prove popular. And given the complexity of most applications, it’s hard to predict how a change will affect the end-user experience or the stability of the backend infrastructure. Canary Testing and Dark Launching are two common ways to get answers to these questions.
Canary Testing

Canary Testing refers to the practice of releasing a code change to a subset of users and then looking at how that code performs relative to the old code that the majority of users are still running. This is achieved by setting up a set of Canary servers (or containers) which run the new code. As new customers arrive, a subset of them is diverted to those canary servers.
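The traffic split itself can be as simple as a weighted, sticky routing rule in front of your server pools. Here is a minimal sketch in Python; the server names and the 5% canary fraction are illustrative assumptions, not a real production configuration.

```python
import random

# Hypothetical server pools; a real deployment would pull these from
# service discovery rather than hard-coding them.
STABLE_POOL = ["stable-1", "stable-2", "stable-3"]
CANARY_POOL = ["canary-1"]

def assign_server(session_id: str, canary_fraction: float = 0.05) -> str:
    """Route a new session to the canary pool with probability
    canary_fraction. Seeding the RNG with the session id makes the
    routing sticky: the same session always lands on the same server."""
    rng = random.Random(session_id)
    pool = CANARY_POOL if rng.random() < canary_fraction else STABLE_POOL
    return rng.choice(pool)
```

Because the assignment is deterministic per session, a user doesn’t bounce between old and new code mid-session, which keeps the comparison between the two code bases clean.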
The idea is that you can then use standard performance and monitoring tools in your infrastructure to detect whether the new code is working correctly. For instance, you might monitor the compute load of the Canary servers relative to those servers running the old code. If the load increases substantially you know that’s a potential issue. Equally, if you see a much higher rate of I/O that might also indicate an issue.
The great thing about this approach is that it’s easy to automate the process using tools such as Spinnaker to assign a suitable proportion of users to the new code. Even better, if you spot any issues it’s easy to roll back simply by redirecting all arriving connections back to the old code and steadily migrating the customers off the Canary servers.
According to their technology blog, Netflix further refines this process. Rather than compare the performance of the canary servers with the existing production servers, they create fresh instances running the existing code alongside the canary servers. This so-called Baseline cluster is the same size as the Canary cluster, and the performance of the Canary cluster is compared against it. This gives them a directly comparable set of results from a clean setup, with no potential issues caused by long-running processes in the production cluster.
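A baseline-versus-canary comparison can be sketched as a simple per-metric ratio check. The metric names, sample values and 20% tolerance below are illustrative assumptions; real systems apply proper statistical tests over many more samples.

```python
from statistics import mean

def evaluate_canary(baseline: dict[str, list[float]],
                    canary: dict[str, list[float]],
                    max_ratio: float = 1.2) -> dict[str, bool]:
    """For each metric, pass if the canary-cluster mean is no more
    than max_ratio times the baseline-cluster mean."""
    results = {}
    for metric, base_samples in baseline.items():
        ratio = mean(canary[metric]) / mean(base_samples)
        results[metric] = ratio <= max_ratio
    return results

checks = evaluate_canary(
    baseline={"cpu_load": [0.42, 0.45, 0.40], "io_ops": [120, 130, 125]},
    canary={"cpu_load": [0.44, 0.46, 0.43], "io_ops": [210, 220, 205]},
)
```

In this toy data, cpu_load on the canary is within tolerance, but io_ops has roughly doubled relative to the baseline, so that metric fails the check.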
One important thing when doing Canary Testing is to be cognizant of the expected impact of the new code. The new changes may be known to increase the I/O in the system, in which case seeing increased I/O is not a sign of a problem. In other words, you need to carefully identify which metrics matter for each test and define acceptable thresholds for them. Of course, things such as crashes, stuck processes or timeouts are almost always signs of an issue with the new code.
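In code, this amounts to giving each test its own per-metric budget. The budgets below are hypothetical: this particular change is expected to add I/O, so io_ops gets a wider allowance than cpu_load, while timeouts are given no headroom at all.

```python
# Hypothetical per-test metric budgets, expressed as the maximum
# allowed ratio of canary mean to baseline mean.
METRIC_BUDGETS = {
    "cpu_load": 1.10,   # at most +10% vs. baseline
    "io_ops":   1.50,   # +50% allowed: this change is expected to add I/O
    "timeouts": 1.00,   # no increase tolerated
}

def within_budget(metric: str, baseline_mean: float, canary_mean: float) -> bool:
    """True if the canary stays inside the budget defined for this metric."""
    return canary_mean <= baseline_mean * METRIC_BUDGETS[metric]
```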
Dark Launching

Dark Launching is in many ways similar to Canary Testing. The difference is that you are looking to assess the response of users to new features in your frontend, rather than testing the performance of the backend. Rather than launch a new feature for all users, you release it to a small set of users. Usually, these users aren’t aware they are being used as guinea pigs for the new feature, and often you don’t even highlight the new feature to them, hence the term “Dark” Launching.
You can then use the UX instrumentation in your app to monitor things like whether users actually find the new feature, whether they interact with it, whether it seems to improve their experience, and whether it increases your revenue (for example, the new feature may encourage them to spend longer in your app and thus consume more ads or make more in-app purchases). This is essentially what any product manager does when assessing how well an app is performing; the only difference is that you are now looking at the performance of a single new feature.
If you decide to adopt the Dark Launching approach to feature testing, it can be done in several ways. Probably the most powerful is to ensure that every feature in your application can be toggled on or off. This allows you to use your API to enable the exact set of features you wish to test. It also allows you to do classic A/B Testing, where you compare two versions of your new feature across different groups of users.
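A minimal sketch of such a toggle system, assuming a hash-based bucketing scheme so that each user keeps the same variant across sessions. The flag names and rollout fractions are invented for illustration; real systems typically back this with a flag service rather than a hard-coded dictionary.

```python
import hashlib

# Hypothetical flags: each maps to the fraction of users who see it.
FLAGS = {
    "new_search_ui": 0.05,   # dark-launch to 5% of users
    "checkout_v2":   0.50,   # 50/50 A/B test
}

def bucket(user_id: str, flag: str) -> float:
    """Deterministically map (user, flag) to [0, 1) so a given user
    always lands in the same variant for a given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000

def is_enabled(user_id: str, flag: str) -> bool:
    """True if this user falls inside the flag's rollout fraction."""
    return bucket(user_id, flag) < FLAGS[flag]
```

Hashing the flag name together with the user id means different flags slice the user base independently, so the 5% seeing new_search_ui is not simply a subset of the 50% seeing checkout_v2.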
Google provides a good example of Dark Launching. Google Maps frequently gets new features that are initially only seen by a very small number of users before steadily being rolled out across the user base. Some of these features are never rolled out globally. Users of Google Maps who live in or near Zurich are particularly likely to see such new features, since the core development team for Google Maps is based there.
Automated Canary Testing with AI
Here at Functionize, we are working on a project with Google to use Artificial Intelligence to automate the process of Canary Testing. As with normal Canary Testing, the new code is released to a subset of users and the relative performance of the two code bases is compared. The difference is that the comparison is made automatically using machine learning to create a model of expected behavior and then using anomaly detection to identify whether the new app version is performing as expected. This is shown in the diagram below:
At the heart of this project is a form of machine learning with the rather confusing name of Long Short-Term Memory Recurrent Neural Networks (LSTMs). In this form of neural network, each recurrent unit carries a memory cell, with gates that control what is remembered and what is forgotten as a sequence is processed. The benefit of this architecture is that it can learn sequences of actions and thus predict the next action in the sequence. This makes it ideal for modeling how users interact with applications.
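To make the gating concrete, here is a didactic single LSTM cell in pure Python with scalar state. The weights are arbitrary placeholders; a real model learns them from user-interaction data, and a production system would use an ML framework rather than hand-rolled code.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step with scalar input and state.
    w holds scalar weights for the three gates and the candidate value."""
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate value
    c = f * c_prev + i * g       # long-term cell state
    h = o * math.tanh(c)         # short-term hidden state
    return h, c

# Run a short input sequence through the cell with placeholder weights.
w = {k: 0.5 for k in
     ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg"]}
h, c = 0.0, 0.0
for x in [1.0, 0.0, 1.0]:
    h, c = lstm_step(x, h, c, w)
```

The forget and input gates decide how much of the long-term cell state c survives each step, which is what lets the network remember patterns over long action sequences instead of only the most recent input.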
We have undertaken extensive testing of this approach. The LSTM models were trained on the current version of a test app using 80% of the user base. The models were then verified using the remaining 20% of the users. We found that this process was able to accurately predict the next user actions 65% of the time, which is pretty good given the huge number of potential user interactions in such a web application.
Having created the models, we can use them for Canary Testing. The behavior of users of the new version of the application is compared with the expected behavior from the LSTM model. We allow for some variation in behavior, but significant outliers are identified as anomalous and reported back. The developers are then able to explore the issue and release a new version of the code. If there are no anomalies, then the updated code can be released as a new version and the models can be retrained using that new code.
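The anomaly check itself can be as simple as flagging any observed action to which the model assigned very low probability. In the sketch below, predict_proba stands in for the trained LSTM's next-action distribution; the toy model, action names and 5% threshold are assumptions for illustration.

```python
def session_is_anomalous(actions, predict_proba, threshold=0.05):
    """Flag a session as anomalous if any observed action had a model
    probability below threshold given the actions that preceded it."""
    history = []
    for action in actions:
        probs = predict_proba(history)  # dict: action -> probability
        if probs.get(action, 0.0) < threshold:
            return True
        history.append(action)
    return False

def toy_model(history):
    """Toy stand-in for an LSTM: always expects browsing or buying."""
    return {"browse": 0.9, "buy": 0.1}
```

With this toy model, a session of browse/buy actions passes, while a session containing an action the model never expects is flagged for the developers to investigate.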
A refinement of the approach is to create several LSTMs, one for each type of interaction with the system (e.g. login interactions, account interactions, etc.). Users are assigned to the correct group using clustering. This will generally improve the accuracy of the models.
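As a crude stand-in for that clustering step, one can group users by their dominant interaction type and route each group to its own model; a real deployment would run a proper clustering algorithm over behavioral features. The "type:detail" action naming scheme here is an invented convention.

```python
from collections import Counter

def assign_group(user_actions):
    """Assign a user to a group by their most frequent interaction type.
    Actions are assumed to be tagged as "type:detail" strings."""
    counts = Counter(action.split(":")[0] for action in user_actions)
    return counts.most_common(1)[0][0]

assign_group(["login:ok", "login:retry", "account:view"])  # → "login"
```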
Canary Testing and Dark Launching are widely used for testing new features in complex applications. Canary Testing is ideal when you wish to test the performance of your backend, while Dark Launching focuses on testing new features in your frontend. Both approaches lend themselves to automation and can even be coupled with AI to produce a completely automated testing approach.