Unlike a conventional product or solution, an AI offering has three pieces to it: the product, the data, and the algorithms/architectures. When one thinks of product strategy in the domain of AI, one needs to think about algorithm and data strategy in addition to the usual product strategy principles. Historically there is plenty of management theory on product strategy and very little on the other two. In this article, we will focus on one of them: data.
There is also a lot of research on Deep Neural Network algorithms, and multiple algorithms and architectures are available as open source. But in most cases, data is much harder and more expensive to collect than it is to develop and apply the algorithms that run on it.
Companies like Google, Amazon and Microsoft (LinkedIn) have a tremendous lead in various contexts due to the enormous amounts of data they are already sitting on. As I spend time building various AI-driven products, I have found a gap in thinking about how to build AI products with less data. My previous blog was on this topic, and a lot of people asked me what other methodologies are out there. Here I compile what I have learned in the form of a framework. The compilation reflects my learning in a tremendously fast-moving technology area and hence is not necessarily comprehensive. The framework, however, is structured for extrapolation.
How does one reduce the need for data? Three interventions are explored as shown in the picture below…
Play around with whatever data access one has to increase the number of data points (this assumes one has access to known publicly available data sources, from a simple Google search to databases like ImageNet).
Recast existing known architectures in a way that helps in the generation of data. Basically, use deep learning architectures to generate the data.
Build models in such a way that they inherently require less data. Here the need for data is traded off against the architecture of the product. This involves more effort in product development, which is well understood, compared with data strategies, which are relatively new.
Of course, in real conditions a mashup of these three approaches is seen. One would do data augmentation along with transfer learning, for example. Each technique assumes a baseline data availability on which it builds. The two dimensions, intervention versus base data availability, make each method suited to different problems, so the method has to be chosen carefully.
Note: The positions of the boxes are relative, not absolute.
Each approach is illustrated below. This is a compilation and hence detailed explanations of how the method works are not provided but each section gives pointers and sources for further exploration. For ease of understanding the article focuses on images as data points but the same is applicable to other forms of data. One non-image data example is also illustrated to show that these methods can be made to work across multiple data streams.
It’s named method “0” on purpose, as it amounts to a basic Google search and downloading of data. Surprisingly, not many companies do this. The approach is to download images off the internet and annotate them using fast annotation tools or the tags already associated with each image. Images from the internet may carry watermarks, copyrights and varied license requirements. Be mindful and respectful of copyrights. An alternative is to implement a service (GitHub) that downloads images from a stock photo site such as Pexels. These images can then be annotated using tools that themselves use AI, like Supervise.ly.
Source: Supervise.ly — AI based fast annotation
Where direct data is not available, images can be combined (clubbed) as illustrated below. Here numbers are added to an image of a car for Automated Number Plate Recognition (ANPR).
Remember that Deep Learning models, if trained appropriately, still work on watermarked images and on images that have been combined this way.
Face recognition when a watermark exists
Data augmentation is a set of techniques where you “augment” your data via a number of random transformations so that the deep learning model never sees exactly the same picture twice. This helps prevent over-fitting and helps the model generalize better. Techniques include rotation, flipping, zooming, width or height shift etc., applied to an image in combination. The sheer number of combinations pushes up the number of possible images tremendously. The picture below illustrates how data augmentation can multiply a single cat image into a multitude of images that the deep learning model sees.
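As a rough illustration of the idea, here is a minimal sketch that applies random combinations of flips, rotations and shifts to a tiny image stored as a list of rows. The function names (`flip_horizontal`, `rotate_90`, `shift_right`, `augment`) and the 3x3 "image" are made up for this example; a real project would more likely use the augmentation utilities that ship with a deep learning library.

```python
import random

def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]

def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def shift_right(img, pixels, fill=0):
    """Shift every row right by `pixels`, padding the left edge with `fill`."""
    if pixels <= 0:
        return [list(row) for row in img]
    return [[fill] * pixels + list(row[:-pixels]) for row in img]

def augment(img, rng):
    """Apply a random combination of the transformations above."""
    out = [list(row) for row in img]
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    for _ in range(rng.randrange(4)):  # rotate by 0, 90, 180 or 270 degrees
        out = rotate_90(out)
    return shift_right(out, rng.randrange(2))  # shift by 0 or 1 pixel

# Hypothetical 3x3 grayscale "image" used only for illustration.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
rng = random.Random(0)  # fixed seed so runs are repeatable
variants = [augment(image, rng) for _ in range(8)]
```

Even with only these three transformations there are up to 2 x 4 x 2 = 16 distinct variants of a single image; adding zooms and finer-grained shifts multiplies that number further.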
In practice, very few people train an entire Deep Learning Network from scratch (with random initialization), because it is relatively rare to have a data set of sufficient size and even more rare to have a data-set very different from the publicly available data-sets of real-world images, videos, music, text etc. Instead, it is common to pre-train a Deep Learning network on a very large data-set (e.g. ImageNet, which contains 1.2 million images with 1000 categories), and then use this model either as an initialization or a fixed feature extractor for the task of interest.
This is the most amazing thing about building learning models today: you don’t need to do it from scratch. Just as we humans hire an expert and learn from them, one can find a pre-trained model on the internet that has already learned to see real-world images from ~1.2M images (ImageNet), and build on its learning. This model already knows how to detect edges, curves, faces and various objects, for example. One has to make it undo some of its learning (delete a few layers) and make it learn the new task of interest (by adding new layers), let’s say the “distracted driver” challenge of detecting drivers who are talking, drinking, reaching backward, texting etc. You suddenly have a ~20-layer deep learning model trained on GPUs. That’s it… it can now detect distracted drivers with good accuracy. The size of the data-set needed to learn your task of interest falls dramatically, from millions to thousands.
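The "keep the learned layers, add and train new ones" idea can be sketched in a few lines. This is a toy under loud assumptions: `pretrained_features` is a hand-written stand-in for the frozen layers of a real pre-trained network, the new "layer" is a single logistic unit, and the four samples are made up. It only shows the mechanics of training a new head on top of frozen features, not a real image model.

```python
import math

def pretrained_features(x):
    """Stand-in for the frozen layers of a pre-trained network (hypothetical).
    Nothing in this function is updated during training."""
    a, b = x
    return [a + b, a - b]

def predict(head, feats):
    """New task-specific layer: a logistic unit on top of the frozen features."""
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], feats))
    return 1.0 / (1.0 + math.exp(-z))

def train_head(samples, labels, epochs=200, lr=0.1):
    """Train only the new head with plain stochastic gradient descent."""
    head = {"weights": [0.0, 0.0], "bias": 0.0}
    feats = [pretrained_features(x) for x in samples]  # frozen, so computed once
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            error = predict(head, f) - y
            head["weights"] = [w - lr * error * fi
                               for w, fi in zip(head["weights"], f)]
            head["bias"] -= lr * error
    return head

# A handful of made-up labeled examples for the new task.
samples = [(2.0, 1.0), (3.0, 0.5), (1.0, 2.0), (0.5, 3.0)]
labels = [1, 1, 0, 0]
head = train_head(samples, labels)
predictions = [round(predict(head, pretrained_features(x))) for x in samples]
```

Because only the small head is trained, a few labeled examples are enough here; with a real pre-trained network the same principle is what lets the task-specific data-set shrink.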
Till now we have played around with data-sets and a little with the architecture of the model, like changing a few layers in Transfer Learning. What if we need data tagged differently from what is directly available? For example, suppose one needs a data-set of images where people are walking, with some coming towards you and some walking away from you. How do you get this? Here we step into a class of methods where we need to play with Deep Learning architectures and recast them to get to our objective.
A key to this method is to train oneself in converting a given problem into a series of classification problems. To illustrate the method I have simplified the experiment; there could be other ways to do it.
Here I convert the above need into applying multiple levels of classifiers to a publicly available data-set like ImageNet. You first run a person classifier over the 1.2M images to arrive at a data-set of images with persons. Now apply a face classifier to detect faces in this subset. Remember that if a person is walking away, the face detector will not detect a face. Create a subset of images where the number of faces detected is less than the number of persons detected. Fewer faces than persons implies that some people in the image are facing the other direction. The image below illustrates this. You have a new data-set by recursively applying different types of classifiers to the same set of images.
Recursive Classification (Picture source — https://www.nytimes.com/2017/11/24/nyregion/pedestrians-new-york-walk-signals.html)
Of course, you can now apply a pose classifier to determine if people are walking. You can apply a traffic light classifier and a vehicle classifier to create a data-set of pedestrian crossing images versus, let’s say, images of a pedestrian walking lane. Recursive classification is thus a great way to generate new labeled data.
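The person-then-face filtering described above can be sketched as follows. The two detector functions are hypothetical placeholders: in this sketch they simply read precomputed counts from a dictionary, whereas in practice each would wrap a trained model. The file names and counts are made up.

```python
def count_persons(image):
    """Hypothetical person detector: returns the number of people found."""
    return image["persons"]

def count_faces(image):
    """Hypothetical face detector: returns the number of visible faces."""
    return image["faces"]

def select_walking_away(images):
    """Keep images in which at least one detected person shows no face."""
    # Level 1: keep only images that contain people.
    with_people = [img for img in images if count_persons(img) > 0]
    # Level 2: keep images with fewer faces than persons.
    return [img for img in with_people
            if count_faces(img) < count_persons(img)]

# Made-up detector outputs for three images.
images = [
    {"name": "street_1.jpg", "persons": 3, "faces": 1},
    {"name": "street_2.jpg", "persons": 2, "faces": 2},
    {"name": "empty_road.jpg", "persons": 0, "faces": 0},
]
walking_away = select_walking_away(images)
```

Further levels (pose, traffic light, vehicle) would be added the same way, as extra filters applied to the output of the previous one.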
Imagine a situation where you have X labeled images and 3X unlabeled images. How do you use the unlabeled data for building your model? This is where semi-supervised learning techniques come in. Instead of being constrained by the X labeled data, one first builds a classifier on it using any technique. Let’s say the accuracy of the classifier is 80%. Now one can run this classifier on the unlabeled data and classify (annotate) it. These annotations are also called pseudo-labels. Now you have 4X labeled data at >80% label accuracy (the original X labels are correct and the pseudo-labels are right about 80% of the time), which can be used to build a deep learning classifier. Having a prior sense of how the data is set up and classified will help in choosing this approach. It’s called semi-supervised because a supervised learning technique is used for only part of the data (here X).
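A minimal sketch of the pseudo-labeling loop, assuming a deliberately simple nearest-centroid classifier in place of a deep learning model. The function names and the toy 2-D points are made up for illustration; here X = 2 labeled points and 3X = 6 unlabeled points.

```python
def fit_centroids(points, labels):
    """Train a nearest-centroid classifier: one mean point per class."""
    sums, counts = {}, {}
    for (x, y), label in zip(points, labels):
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {lab: (sx / counts[lab], sy / counts[lab])
            for lab, (sx, sy) in sums.items()}

def predict(centroids, point):
    """Return the label of the closest centroid."""
    x, y = point
    return min(centroids,
               key=lambda lab: (centroids[lab][0] - x) ** 2
                               + (centroids[lab][1] - y) ** 2)

# Made-up toy data: X = 2 labeled points, 3X = 6 unlabeled points.
labeled_points = [(0.0, 0.0), (10.0, 10.0)]
labeled_classes = ["a", "b"]
unlabeled_points = [(1.0, 0.0), (0.0, 2.0), (2.0, 1.0),
                    (9.0, 9.0), (8.0, 10.0), (10.0, 8.0)]

# Step 1: supervised training on the small labeled set.
initial_model = fit_centroids(labeled_points, labeled_classes)
# Step 2: annotate the unlabeled data with predictions (pseudo-labels).
pseudo_labels = [predict(initial_model, p) for p in unlabeled_points]
# Step 3: retrain on the combined 4X data set.
final_model = fit_centroids(labeled_points + unlabeled_points,
                            labeled_classes + pseudo_labels)
```

One common refinement, not shown here, is to keep only the pseudo-labels the first classifier is confident about, so that its mistakes are less likely to be baked into the second model.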
Let’s say you need a data-set of paintings. Is it hard to obtain? Not anymore. See below how a new, previously non-existent painting image is generated. This method is known as artistic style transfer and was popularized by applications like Prisma. The technology behind it has more serious applications, like “generating” new data. A Deep Learning model is built that separates the content and the style of images and then mixes the content of one image with the style of another. One can generate a huge amount of data through such a method. This is an example of a “Generative Model”.
Source: Gatys et al. https://arxiv.org/abs/1508.06576
There are multiple classes of “Generative Models”. One really powerful approach to real-life data generation is Generative Adversarial Networks (GANs). None of the bedrooms pictured in the illustration below exist in any house. They were generated using GANs.
Picture from Alec Radford’s original DCGAN paper
GANs are a kind of generative model designed by Goodfellow et al. In a GAN setup, two differentiable functions, represented by Deep Learning models, are locked in a game. Think game theory applied to two Deep Learning models. The two players, the generator and the discriminator, have different roles in this framework. The generator tries to produce data that looks like the training data (here, what a bedroom would look like), while the discriminator acts like a judge. It gets to decide whether its input comes from the generator or from the true training set. This results in amazingly realistic images.
As you will have realized by now, one is recasting deep learning models in different ways to generate data, which is different from the initial approaches.
Another approach to Generative Models is LSTMs (Long Short-Term Memory networks). Given a large corpus of sequence data, such as text documents, LSTM models can be designed to learn the general structural properties of the corpus and, when given a seed input, can generate new sequences that are representative of the original corpus. Below is a handwritten set of notes generated by an LSTM.
Automatic Handwriting Generation Source: https://arxiv.org/abs/1308.0850
The approach has also been applied to different domains where a large corpus of existing sequence information is available and new sequences can be generated one step at a time, such as:
Wiki articles, poetry generation etc
Let me give an example of how data for a non-image domain can be generated. Suppose you need speech data. A recasting would involve using a generative model to generate text, which you would then convert to speech using a text-to-speech converter.
Automated speech synthesis using multiple Deep Learning Models
To make this real, here is my voice generated by a Deep Learning model, where my input was the text “Hello there. This is my voice synthesized using Deep Learning”.
My voice synthesized using Deep Learning (AI)
Not all forms of data can easily be generated with the above methods. That’s when one evaluates whether synthetic data generation can help. Think of all the computer games you have played. That’s an example of synthetic data. Depending on what the deep learning system is going to learn, synthetic data can enormously boost your data set. Synthetic data can be generated using gaming engines, for example. Significant data sets can be generated, but this requires understanding gaming engines and writing code to synthesize the data. To show that this approach is real, I ran an object classifier on the “Need For Speed” game image that can be seen below. The Deep Learning model does not distinguish the synthetic image from a real one and identifies the cars.
One more example of a Deep Learning model recognizing synthetic images.
As the name suggests, here one learns by playing against oneself. We have all played chess against ourselves. This is exactly that. It’s best illustrated through an example. AlphaGo is the first computer program to defeat a world champion at the ancient Chinese game of Go.
“AlphaGo Zero” has since been introduced; it is even more powerful and is arguably the strongest Go player in history. Previous versions of AlphaGo initially trained on thousands of human amateur and professional games to learn how to play Go. AlphaGo Zero skips this step and learns to play simply by playing games against itself, starting from completely random play. In doing so, it quickly surpassed the human level of play and defeated the previously published champion-defeating version of AlphaGo by 100 games to 0. The chart below shows how, within days, the AI system surpasses a champion’s play via self-play.
It is able to do this by using a novel form of reinforcement learning, in which AlphaGo Zero becomes its own teacher. The system starts off with a neural network that knows nothing about the game of Go. It then plays games against itself, by combining this neural network with a powerful search algorithm. As it plays, the neural network is tuned and updated to predict moves, as well as the eventual winner of the games.
To be more specific, this is what AlphaGo Zero has been able to accomplish:
Beat the previous version of AlphaGo (Final score: 100–0).
Learn to perform this task from scratch, without learning from previous human knowledge (i.e. recorded game-play).
Reach world-champion-level Go play in just 3 days of training.
Do so with an order of magnitude less computing hardware (4 TPUs vs 48 TPUs).
Do this with less training data (3.9 million games vs 30 million games).
This is truly disruptive, both as a dramatic reduction in data and as an approach to building an AI system. Architecturally it is a fundamentally different way of building an AI system, and hence it falls under the third bucket of methods.
Usage of this technique has been covered in one of my previous blogs: “Making AI learn like humans… with less data”.
This is an approach where you architect the product in such a way that it requires less data. You trade off between a large data-set and the effort to build a new system.
The approach is to make two identical twin copies of a face recognition Deep Learning model (Siamese networks/one-shot learning) and come up with a distance metric between their outputs. Think of it as a metric that detects similar/dissimilar faces. One model is fed one person’s face. The second is fed random faces, and only when its output is close to the output of the first model will it show green. At an abstract level, that’s what it does: the deep learning model represents the person’s face as an embedding vector, and when any face is shown to the second model it generates an embedding vector in the same way. If that vector is close to the first one, then it’s that person; otherwise it is somebody else. Very few images are required to construct such a face-representing vector. With as few as two images, here is a person (me) recognizer that detects the presence or absence of a person. This is super useful in multiple use cases like security, door alarms, computer/mobile logins, identifying the driver of a car etc.
This is one of many programmatic approaches to changing the way the system is built so that it inherently needs less data.
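The comparison step can be sketched as below. This assumes the embedding vectors have already been produced by the shared (twin) network; the vectors, the `enroll` helper that averages a few embeddings into a reference, and the `threshold` value of 0.5 are all made up for illustration, and Euclidean distance is just one possible choice of metric.

```python
import math

def euclidean(u, v):
    """Distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def enroll(embeddings):
    """Average a few embeddings of one person into a reference vector."""
    n = len(embeddings)
    return [sum(values) / n for values in zip(*embeddings)]

def is_same_person(reference, candidate, threshold=0.5):
    """Declare a match when the candidate is close to the reference."""
    return euclidean(reference, candidate) < threshold

# Made-up embeddings, as if produced by the same network for each face image.
my_faces = [[0.9, 0.1, 0.3], [1.1, 0.1, 0.5]]  # two enrollment images
reference = enroll(my_faces)
me_again = [1.0, 0.2, 0.4]   # a new photo of the enrolled person
stranger = [0.1, 0.9, 0.8]   # a photo of somebody else

match_me = is_same_person(reference, me_again)
match_stranger = is_same_person(reference, stranger)
```

The design choice that saves data is that recognizing a new person requires no retraining at all, only a couple of enrollment images to compute a reference vector.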