How do you recognize a person? You see them once, maybe twice. How does a kid recognize a bird? They see one once, maybe twice, and then they recognize birds in general. Can we make products using AI (Deep Learning) do the same? Nah, one of the big barriers to building AI (Deep Learning) systems is that they need thousands to millions of images just to recognize a cat or a dog. It’s incredible how difficult it is to teach a machine.
Neuroscience and genetics tell us that some of this learning is hereditary. Years of evolution have “coded in” concepts of physics, for example. Kids learn about gravity in their first couple of years: objects don’t float, they fall. Object permanence typically starts to develop between 4 and 7 months of age and involves a baby’s understanding that when things disappear, they aren’t gone forever. A toy hidden behind the baby is still there, and the baby can reach back and find it without looking. The mother who leaves the room still exists and will come back. Before a baby grasps this concept, things that leave its view are gone, completely gone. Babies also learn shape constancy… a ball stays spherical, and they interact with it that way. They don’t need to see thousands of footballs to learn this; they may just see the same ball multiple times.
The target was to build a system that recognizes a person doing a particular activity using a minimal number of images. These are repetitive tasks done by the same person (or a closed group, such as the employees of a company).
The images below show exactly this. The system recognizes a person (person identifier), which is me in this case, doing multiple actions on a computer, like typing, working, using a mobile, or reading (action classifier).

A person identified by name and activity classified
To train the person identifier, 2 images taken on a mobile were used. To train the action classifier, images from a Google search were used; “people typing” was an example search string. Deep learning in such contexts is no longer a “big data” problem. With 2 selfies and approximately 50 images from a Google search, a system was built that recognizes me in any picture on the internet or captured using a webcam or mobile. This work excited me because the entry barriers to Deep Learning and AI based products can be lowered; they no longer need to sit only with the data-rich Google, Amazon or Facebook.
How was this done? The diagram below illustrates the system as layers of deep learning blocks.

Representative Model of a Person-Action Recognition System
Just like kids seem to be encoded with aspects of life that they build on, a pre-trained deep learning classifier (VGG16) was used here to provide those aspects. This classifier had already been trained on millions of real-world images. The last couple of layers were removed and thousands of faces were fed in to convert it into a face detector, a method called transfer learning. The whole model was taken off the shelf from open source; this machine didn’t do that part, it just built on top of it. It’s as if the machine was born with a face detector that had been trained on millions of natural-world images.
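A minimal transfer-learning sketch of this idea in Keras, assuming face crops organised in one folder per identity; the layer sizes and class count are illustrative assumptions, not the exact configuration used here:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Load VGG16 pre-trained on ImageNet and drop its classification head
# (the "last couple of layers").
base = tf.keras.applications.VGG16(weights="imagenet",
                                   include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # keep the "inherited" visual knowledge frozen

num_identities = 10  # hypothetical number of face classes in the fine-tuning set

# Add a small new head so the network can be re-purposed for faces.
face_model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(num_identities, activation="softmax"),
])
face_model.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```

Only the small new head needs to learn anything; the frozen VGG16 base is the part the machine was “born with”.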
What the machine did was create two identical twin copies (Siamese networks / one-shot learning) of this face detection network and learn a distance metric between the two. Think of it as a metric that tells similar faces from dissimilar ones. One copy was shown one person’s face. The other was fed random faces, and only when its output is close to the first copy’s output does it show green. At an abstract level that’s what it does… the deep learning model represents the person’s face as an embedded vector; when any face is shown to the second model it also generates such a vector, and if that vector is close to the first one it’s that person, otherwise it’s somebody else (see the sketch after the images below). Constructing such a face-representing vector requires very few images. With as few as two images, here is a person recognizer that detects the presence or absence of a person. This is super useful in many scenarios, like security, door alarms, computer or mobile logins, identifying the driver of a car and so on. Below are a few more pictures of it identifying me in various contexts.

A person clearly identified in multiple contexts
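Here is a minimal sketch of the embedding comparison described above: a shared-weight network embeds the enrolled face and a candidate face, and a simple distance threshold decides “same person” or not. The embedding model, its dimensionality and the threshold value are illustrative assumptions, not the exact network used here.

```python
import numpy as np
import tensorflow as tf

def build_embedder():
    # Shared backbone: VGG16 features reduced to a compact face embedding.
    base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                       input_shape=(224, 224, 3), pooling="avg")
    out = tf.keras.layers.Dense(128)(base.output)
    return tf.keras.Model(base.input, out)

embedder = build_embedder()  # in practice this would be fine-tuned on face pairs

def embed(image):
    """image: HxWx3 RGB array with values in [0, 255]."""
    x = tf.image.resize(image, (224, 224)).numpy()
    x = tf.keras.applications.vgg16.preprocess_input(x[np.newaxis])
    v = embedder.predict(x, verbose=0)[0]
    return v / np.linalg.norm(v)  # unit-normalise so distances are comparable

def same_person(enrolled_vec, candidate_img, threshold=0.6):
    # Small distance between embeddings -> same person ("green"), else not.
    return np.linalg.norm(enrolled_vec - embed(candidate_img)) < threshold
```

Enrolment is just `my_embedding = embed(selfie)`; every later frame is compared against that one vector, which is why two images are enough.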
Do the same for activities. Train a classifier that categorizes activities (it can be done as above, or as a plain classifier; also see my previous blog, Deep Learning fails Hollywood drivers). This again leverages transfer learning on the same VGG16 deep learning model as the person identifier. An action classifier can be abstracted as an object classifier overlaid with some rules: typing implies a hand on the keyboard, washing implies foam, soap and wetness on the hands, and so on. Train a classifier to detect actions (a sketch follows below).
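A minimal sketch of such an action classifier, assuming roughly 50 web images per action saved under an `actions/<label>/` folder structure; the folder names, class count and hyper-parameters are illustrative assumptions, not the exact recipe used here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # reuse the same frozen VGG16 backbone

action_model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(4, activation="softmax"),  # e.g. typing / on mobile / reading / washing hands
])
action_model.compile(optimizer="adam", loss="categorical_crossentropy",
                     metrics=["accuracy"])

# Heavy augmentation stretches ~50 Google-search images per class a long way.
gen = tf.keras.preprocessing.image.ImageDataGenerator(
    preprocessing_function=tf.keras.applications.vgg16.preprocess_input,
    rotation_range=15, zoom_range=0.2, horizontal_flip=True)
train = gen.flow_from_directory("actions/", target_size=(224, 224), batch_size=8)
action_model.fit(train, epochs=10)
```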
What we now have is a machine that can identify a particular person doing a specific job. It’s not a product yet; read my blog on what it takes to build a Deep Learning product (Mirror Mirror on the wall… AI (Deep Learning) needs to learn it all). I was recently in discussion with two of the biggest eye hospitals by volume in the world, and one of their asks was: how do we monitor, for example, whether a person has washed their hands before surgery? The approach above can, in a simple way, start addressing each step in such a workflow.
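A hypothetical sketch of wiring the two blocks together into such a workflow check. It assumes the `same_person`, `embed` and `action_model` names from the sketches above and an illustrative set of action labels; the log format is also an assumption.

```python
import datetime
import numpy as np
import tensorflow as tf

ACTION_LABELS = ["typing", "on_mobile", "reading", "washing_hands"]  # must match the classifier's classes

def _prep(frame):
    # Resize an HxWx3 RGB frame and apply VGG16 preprocessing.
    x = tf.image.resize(frame, (224, 224)).numpy()
    return tf.keras.applications.vgg16.preprocess_input(x[np.newaxis])

def check_step(frame, enrolled_vec, required_action):
    """Log who was seen and whether the required workflow step was observed."""
    person_ok = same_person(enrolled_vec, frame)               # person identifier
    probs = action_model.predict(_prep(frame), verbose=0)[0]   # action classifier
    action = ACTION_LABELS[int(probs.argmax())]
    return {
        "time": datetime.datetime.now().isoformat(),
        "person_recognised": bool(person_ok),
        "action": action,
        "step_completed": bool(person_ok and action == required_action),
    }

# e.g. check_step(webcam_frame, my_embedding, required_action="washing_hands")
```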

A plethora of use cases opens up when you can identify a person in a workflow or process, from Industry 4.0 to self-driving cars to even know-your-customer (KYC) checks.
I want to leave you with two thoughts…
Add to this GPS location information and the time of the act, and it opens up what I call the Next Generation of AI-enabled Operations Excellence.
Are pre-coded deep learning models (similar to genes), mashed up in a hierarchy or as a workflow of AI models, a way for machines to replicate human intelligence, or Artificial General Intelligence as it is called?
More on those two later. I would love to hear your thoughts. Till then… the old adage… seeing is believing… and Deep Learning is today’s best tool for machines to see.