We humans can see that Vin Diesel is looking away as he drives at high speed! Can machines do the same?
Distracted driving is a factor in roughly 20% of road accidents, and given the increased use of mobile phones in vehicles, it has become a serious concern. Top automotive OEMs and suppliers have had “driver distraction detection” among their top innovation programs for years. But where are the products? It is clearly a nontrivial problem to make a machine detect that Vin Diesel is distracted and raise an alert on that basis.
Rules-based systems have been attempted for years, involving big teams, and detecting distractions accurately in real time remains tough. Deep learning on GPU infrastructure is changing all that.
This note aims to show the incredible magnitude of the shift that deep learning is bringing to problems like this. As an individual (contrast with an automotive OEM innovation team of engineers) who is a business leader (contrast with a techie), using internet resources (~20,000 images of drivers in various positions), I could build a learning system that detects distracted drivers with ~90% accuracy.
The pictures above (obtained from the Internet) show some results. Liam Hemsworth is in a position where he is not driving, and the deep learning model recognizes from his posture that he must be talking to a passenger. Jason Statham is talking on his phone, doing something with his head, and his left-hand position also suggests he is drinking something; clearly, neither hand is on the wheel. Dwayne Johnson is on a radio, which the machine detected as drinking. That classification is incorrect, but it still flags a distracted case.
The most amazing thing about building such learning models today is that you don’t need to start from scratch. Just as we humans hire an expert and learn from them, I found a pre-trained model on the internet (VGG16, trained on ~1.3M images) that already knew how to see real-world images, and built on its learning. This model could already detect faces and various objects, for example. I had to make it undo some of its learning (delete a few layers) and then learn the “distracted driver” classes, such as talking, drinking, reaching backward, and texting (by adding new layers). The result is a ~20-layer deep learning model trained on GPUs. That’s it… it can now detect distracted drivers with good accuracy.
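For the technically curious, the “delete a few layers, add new layers” step can be sketched in a few lines of Keras. This is a minimal illustration, not the author’s exact code: the number of target classes (10), the head sizes, and the image shape are assumptions for the sketch.

```python
# Transfer-learning sketch: reuse VGG16's pre-trained vision layers and
# replace its classification head with one for distracted-driver classes.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Model

NUM_CLASSES = 10  # assumption: safe driving plus several distraction types


def build_distraction_model(num_classes=NUM_CLASSES, pretrained=True):
    # "Deleting layers": load VGG16 without its original classification top.
    # In practice pretrained=True reuses the ~1.3M-image ImageNet weights;
    # pretrained=False skips the download so the sketch also runs offline.
    base = VGG16(weights="imagenet" if pretrained else None,
                 include_top=False, input_shape=(224, 224, 3))
    # Freeze the convolutional layers so only the new head learns.
    for layer in base.layers:
        layer.trainable = False
    # "Adding new layers": a small head that learns the distraction classes.
    x = Flatten()(base.output)
    x = Dense(256, activation="relu")(x)
    x = Dropout(0.5)(x)
    out = Dense(num_classes, activation="softmax")(x)
    model = Model(inputs=base.input, outputs=out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training would then only fit the new head on the ~20,000 driver images, which is why a single person with a GPU can do this in reasonable time.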
Movie buffs may remember “Over The Top”, where Sylvester Stallone installed a pulley-weight set right inside his truck so that he could exercise his arm-wrestling arm. The modern-age machine deciphers this as a distracted driver talking and texting with the right hand. It identifies a distraction but classifies it wrong, since pulleys were not part of its training, so there is further room for deeper interpretation. Contrast Jason Statham and Ryan Gosling: both seem to be driving intently, yet the results differ sharply. That is where the opportunity to push towards higher accuracy presents itself. Jason’s hand position suggests he is reaching for a drink in the cup holder, while Ryan, with a prominent seat belt and upright position, comes out as driving safely. Oh yes, we didn’t see seat belts in the earlier pictures… that’s the intense beauty of deep learning systems. They recognize such nuances and learn as we humans do; no rules were built in to recognize this.
How can this be improved, beyond training with more classes like the pulley? A couple of ways jump out:
Add a camera that points outwards, showing the road context, and a deep learning model that learns from it
Augment with accelerometer data (or GPS, etc.) from the mobile to get a sense of the vehicle’s speed and movement
These “hints” can be modeled and their learned features concatenated into the earlier model, the way minds combine. The picture below is self-explanatory about what the learning model would do based on the front-camera picture merged with the distracted-driver image.
When machines see and learn as we do, the magnitude of change in solving problems can be mind-boggling: a significant reduction in the inputs (the people and resources required) and a manifold increase in output accuracy.