In the first part of this blog, we explored what AI is, looked at some history and some of the applications of AI in business. I left you with the thought that the application of AI to vision is one of the big areas for growth. This blog will explore this in more detail and includes links to cool things that have been done recently.
Visual applications of AI can be split into three: machine vision, the creation of artificial scenes and the visualization of big data. Each of these uses different techniques and they have different applications. But before we go into detail I want to give some more background on neural networks, especially Convolutional Neural Networks (CNNs), which are widely used in computer vision and AI applications.
Typically, neural networks are 1-dimensional constructs. They take a vector of input data, transform it via one or more interconnected hidden layers of neurons and give a vector of outputs. At each stage, each node computes a weighted sum of its inputs, usually applies a simple nonlinear function to it, and sends the result on to the next layer. For a classification task, the final outputs are normalized so that they sum to 1, and each one represents the probability that the input matched a given pattern. During the learning process, the weights are adjusted using a method called backpropagation. The aim is to get the outputs as accurate as possible.
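To make that a bit more concrete, here is a minimal sketch of a single forward pass through a tiny network, written in plain Python so it runs without any libraries. The layer sizes and weights are made up for illustration, and the choice of ReLU for the hidden layer and softmax for the output is my assumption rather than the only option.

```python
import math

def dense(inputs, weights, biases):
    """Each node computes a weighted sum of its inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Simple nonlinearity: negative sums are clipped to zero."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Rescale the raw outputs so they are positive and sum to 1."""
    largest = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights: 3 inputs -> 2 hidden nodes -> 3 outputs.
hidden_w = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0], [-0.5, 0.5], [0.2, 0.2]]
output_b = [0.0, 0.0, 0.0]

x = [1.0, 2.0, 3.0]
hidden = relu(dense(x, hidden_w, hidden_b))
probs = softmax(dense(hidden, output_w, output_b))
print(probs)
```

Training would then nudge the numbers in `hidden_w` and `output_w` until `probs` matches the right answers more often. That part is left out here.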
In computer vision, the network may be trained to recognize hand-drawn numerals. The input is a simplified version of the drawing (reduced to something like a 32×32 matrix and then flattened into a vector of 1,024 values). There would be 10 outputs reflecting the probability that the input was the numeral 0, 1, 2, etc. In an ideal world, a numeral 3 input would give an output of [0,0,0,1,0,0,0,0,0,0]. But in practice, the probability will never be quite as high as 1.
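Here is a small sketch of the data shapes involved, again in plain Python. The blank image and the `example_output` probabilities are invented purely for illustration; a real network would produce its own values.

```python
def flatten(grid):
    """Turn a 2D grid of pixels into one long vector, row by row."""
    return [pixel for row in grid for pixel in row]

def one_hot(digit, classes=10):
    """The ideal output: 1 for the correct numeral, 0 everywhere else."""
    return [1 if i == digit else 0 for i in range(classes)]

def predict(probs):
    """Pick the numeral with the highest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])

image = [[0] * 32 for _ in range(32)]   # placeholder 32x32 drawing
vector = flatten(image)                  # 1,024 input values
target = one_hot(3)                      # ideal output for a numeral 3

# A made-up, more realistic output: confident, but not certain.
example_output = [0.01, 0.01, 0.02, 0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
print(len(vector), target, predict(example_output))
```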
In a Convolutional Neural Network, the input picture is processed in small overlapping pieces. A small filter slides across the picture, and each tile it covers undergoes a convolution operation (a form of 2D filter). The results of this convolution are then passed through one or more pooling layers. These further simplify the result by collapsing each small square of values to a single value using a simple function such as max, min or average. The example below shows max pooling with a stride length of 2. This convolve and pool process may be repeated in several stages until you have extracted a compact set of features that identifies the item you are looking for.
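If you want to see the two operations in code, here is a rough sketch in plain Python. The tiny image, the edge-detecting kernel and the feature map are all made up for illustration. Like most deep learning libraries, the "convolution" here does not flip the kernel, and in a real CNN the kernel values are learned rather than written by hand.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and take a weighted sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(feature_map, size=2, stride=2):
    """Collapse each size x size window to its largest value."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled

# A dark-to-light vertical edge, and a kernel that responds to it.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1]]
edge_kernel = [[-1, 1]]
edges = convolve2d(image, edge_kernel)    # [[0, 1, 0], [0, 1, 0]]

# Max pooling with a 2x2 window and a stride length of 2.
feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [3, 4, 5, 6]]
pooled = max_pool(feature_map)            # [[6, 4], [7, 9]]
print(edges, pooled)
```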
Because CNNs operate in 2 dimensions, they are much better at identifying visual features such as edges or shapes. For instance, when they are used to perform the hand-written digit task described above, they perform much better because they can rapidly identify features like the cross shape in the center of a figure 8. They are also simpler and require fewer neurons to achieve the required accuracy.
Nowadays we’re all familiar with the concept of computer vision, especially applications like facial recognition or handwriting recognition as described above. However, there are other applications that are less familiar. One of these is called image segmentation. CNNs like those described above are ideally suited to identifying a specific object within a larger scene. But typically, a scene will contain multiple objects. This is particularly true of something like the view of a road as seen from a self-driving car. Image segmentation is the process of identifying and classifying all the objects in a scene, pixel by pixel. In the example below from NVIDIA, the system has been trained to identify several categories of object including cars, pedestrians, street furniture, road, and sidewalk.
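One common way to think about segmentation is that the network produces a score for every category at every pixel, and each pixel is then given the label with the highest score. The sketch below shows only that last step. The category list and the scores are invented for illustration and have nothing to do with NVIDIA's actual system.

```python
CLASSES = ["road", "car", "pedestrian"]   # hypothetical categories

def segment(scores):
    """scores[r][c] holds one score per class; keep the best class per pixel."""
    return [[max(range(len(pixel)), key=lambda k: pixel[k]) for pixel in row]
            for row in scores]

# Made-up network output for a 2x2 image: one score per class per pixel.
scores = [
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]],
    [[0.9, 0.05, 0.05], [0.1, 0.2, 0.7]],
]
label_map = segment(scores)
named = [[CLASSES[k] for k in row] for row in label_map]
print(named)
```

The result is a label map the same size as the picture, which is what gets drawn as colored regions over the original scene.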
Obviously, this technique has direct application to autonomous driving. However, it also has the potential to be applied in other fields, particularly medicine. This is the basis of the deep learning system for skin cancer identification that I described in the previous blog.
Video is one of the big areas of AI development. As hardware gets faster, and as cloud providers like Amazon bring more and more powerful machines online, such as the new C5 and P2 instances, it becomes increasingly easy to analyze video in real-time and at scale.
One key area of research is identifying and tracking human figures within video. This allows you to do some amazing things. For instance, by analyzing people’s feet as they walk around a shopping mall, you can recognize their gait and use this to track them from shop to shop. This is now being used in malls to track footfall more accurately. This technique can also allow you to construct “stick-men” figures that follow the movements of people in a crowd. This has real potential for improving CGI in movies as well as for identifying suspicious actions such as pickpocketing or bag snatching.
You can also use similar techniques to identify overcrowding in subway stations. This is difficult since the video feeds of the platforms are usually severely foreshortened, making it hard to identify how densely packed the passengers are. By identifying individual heads and their relation to known landmarks on the platform, the system can work out whether passengers simply need to be moved down the platform, or whether more passengers should be prevented from entering the station until the situation improves.
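The decision logic at the end of that pipeline could be as simple as the sketch below. It assumes the hard part (detecting heads and mapping them to a distance along the platform) has already been done. The zone length, the per-zone limit and the positions are hypothetical numbers I picked for illustration, not values from any real system.

```python
import math

def zone_counts(head_positions_m, platform_length_m, zone_length_m):
    """Count detected heads in each fixed-length zone along the platform."""
    zones = math.ceil(platform_length_m / zone_length_m)
    counts = [0] * zones
    for pos in head_positions_m:
        index = min(int(pos // zone_length_m), zones - 1)
        counts[index] += 1
    return counts

def recommend(counts, max_per_zone):
    """Decide what to do based on which zones are over the limit."""
    crowded = [c > max_per_zone for c in counts]
    if all(crowded):
        return "hold passengers at the entrance"
    if any(crowded):
        return "move passengers down the platform"
    return "no action"

# Made-up head positions (metres from the platform entrance).
positions = [1.0, 2.5, 3.0, 4.2, 5.5, 12.0, 25.0]
counts = zone_counts(positions, platform_length_m=30, zone_length_m=10)
action = recommend(counts, max_per_zone=4)
print(counts, action)
```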
One of the exciting new fields is that of using AI to construct artificial images. There are a number of approaches to this. One approach is to take a segmented image like the one above and fill each segment with artificial content that matches its label. Another really powerful approach is using Generative Adversarial Networks (GANs). Without going into detail, these pit two networks against each other: one generates pictures and the other tries to tell them apart from real ones. The result is incredibly realistic artificial pictures that combine features learned from many input images. The two photos below are an example of this. While many of you may feel you recognize these actors, they have in fact been created artificially from a database of celebrity photos.
Clearly, this approach has application in things like computer gaming (hence NVIDIA’s interest). But it could also be used to significantly enhance the world of augmented reality (where artificial content is overlaid onto the real world).
Another interesting application is converting text to images. On Google image search you can input something like “white and pink flower with petals that have veins”. That returns lots of results where images have been accurately labeled, but the results will also contain quite a few unrelated pictures. But if you gave the same instructions to a GAN, it would artificially generate results like this:
This has potential in many fields, and it could be used for more than simply constructing images. It also opens the possibility of constructing videos simply by describing them and of allowing new approaches to industrial design.
If you’ll forgive the pun, Big Data is Big Business. But Big Data is useless without good analysis and visualization. This is where AI can come in. Let’s take a relatively simple example. As you know, when you take a photo with a smartphone, it is usually geotagged, so you can see where the photo was taken. If you use an image hosting site, then all those photos have been uploaded to a central location. Imagine if you could take the geotags from every single photo, anonymize them and plot them as a heat map.
That looks cool, right? But what can you actually learn from this? Well, it turns out you can use this technique to identify the location of landmarks in a city. Even cooler than that, you can use the same data to do the process in reverse: given a photo of a scene, Google has shown how you can use a CNN to work out the geolocation without needing a geotag.
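The heat map part of this is easy to sketch: drop everything except the coordinates, snap each one to a grid cell and count. The cells with the most photos are your landmark candidates. The coordinates and the cell size below are made up for illustration, and this is just simple counting rather than the CNN approach Google used for the reverse problem.

```python
import math
from collections import Counter

def heat_map(coords, cell_deg=0.01):
    """Count photos per grid cell; only latitude and longitude are used."""
    counts = Counter()
    for lat, lon in coords:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts

# Hypothetical geotags: two photos taken close together, one further away.
photos = [(51.5007, -0.1246), (51.5009, -0.1241), (51.5014, -0.1419)]
cells = heat_map(photos)
hottest_cell, photo_count = cells.most_common(1)[0]
print(hottest_cell, photo_count)
```

Plotting `cells` with a color scale gives you the heat map, and the top few entries of `cells.most_common()` point you at the places people photograph most.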
Big Data allows a business to extract previously unknown insights into customer behavior and to visualize information in new ways to improve its Business Intelligence. AI can help here in several ways. Firstly, you can use AI to automatically find best-fit curves for highly complex and large datasets. Secondly, AI can automatically clean and analyze your data. When combined with IoT, this can transform industry, for instance allowing you to identify flaws as goods come off the line. Thirdly, by using tools like Jupyter, businesses can create collaborative AI projects that dynamically display visualizations alongside the code, allowing technical and non-technical team members to work more closely. AI can even be applied to traditionally non-technical industries like farming, combining data from satellite or drone images with sensors on the farm equipment to optimize the productivity of the land.
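To give a feel for what "finding a best-fit curve" means, here is the simplest possible version: fitting a straight line to a handful of points using least squares. The data is made up, and real tools fit far more complex curves to far bigger datasets, but the idea of choosing the parameters that minimize the error is the same.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up measurements that happen to lie on the line y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```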
The use of AI and deep learning to analyze and construct video and still images is one of the most exciting developments in recent years. It has now become so mainstream that Amazon has released a product called DeepLens, allowing anyone to learn and play with applying deep learning techniques to images. Over the coming months, the expectation is that a huge number of startups will begin to leverage and develop these techniques further.