Introducing Steve O’Hara, a Principal Research Engineer leading the Sensing and Perception team in the Highly Automated Driving division at HERE, based in Boulder, Colorado.
Steve, what are you and your team working on?
The Sensing and Perception team works to extract useful, comprehensive information from video and other data sources, such as GPS, inertial measurement units (IMUs), and odometry, collected by cameras and sensors mounted in or on top of vehicles.
Our focus is on developing techniques suitable for edge processing to enable real-time road-scene understanding. The idea is that consumer-owned devices can help us generate and maintain HERE’s HD Live Map, which is part of HERE’s crowd-sourced mapping initiative. In this context, a consumer-owned device might be a dash camera, a smartphone, or the sensing systems built into future automobiles. We’ve been pushing the limits of running deep neural networks on constrained, embedded hardware.
What’s an example of a road-scene comprehension problem you’re trying to solve?
Determining if a sign that we have in our HD Live Map was not seen because it’s no longer there or because the view was blocked by a truck. To figure this out, we develop semantic segmentation models using deep learning. Semantic segmentation is the process of assigning every pixel in an image a semantically meaningful class label. For example, we might classify each pixel as road, vehicle, sky, terrain, building, lane, sign, etc. This is an important input to understanding the observed road scene, so that we can detect not only features useful for mapping but also the context of the observation.
Segmentation results from a HERE SpotCam image
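The per-pixel classification Steve describes can be sketched in a few lines. This is a hypothetical illustration, not HERE’s model: assume a network has already produced a per-class score map of shape (classes, height, width); the segmentation label map is then just the highest-scoring class at each pixel.

```python
import numpy as np

# Illustrative class list borrowed from the examples in the interview.
CLASSES = ["road", "vehicle", "sky", "terrain", "building", "lane", "sign"]

def scores_to_labels(scores):
    """Convert per-class score maps (C, H, W) to an (H, W) label-index map."""
    return np.argmax(scores, axis=0)

# Toy example: random scores for 7 classes over a 2x2 image.
rng = np.random.default_rng(0)
scores = rng.random((len(CLASSES), 2, 2))
labels = scores_to_labels(scores)          # (2, 2) array of class indices
named = [[CLASSES[i] for i in row] for row in labels]  # human-readable labels
```

In a real system the score maps come from a trained segmentation network, but the final labeling step is exactly this argmax over the class axis.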
The classification process you describe reminds me of a toddler learning language, pointing at things and saying aloud “sky, grass, dog.” I’m looking forward to congratulating my parent friends on their child’s semantic segmentation next time I witness this! Speaking of language, Sensing and Perception is an interesting team name. It sounds more like something from a Myers-Briggs test than a corporate org chart. Could you talk a little bit about this confluence of language between psychology and technology?
During the first few years of my PhD studies, we were focused on biomimetic vision. We studied the literature from cognitive psychology on how the mammalian visual pathway works (to the best of our knowledge) and we sought to mimic the functional decomposition of biological vision for computer vision systems.
As long as the field of AI has existed, there have been cross-over efforts with cognitive psychology: how we learn, how we perceive, how we compress and store experiences, how we move about the environment and build mental maps, and so forth. Deep learning builds on artificial neural networks (ANNs), which started as abstract models of how a collection of neurons, connected layer-wise, can act as a general-purpose learning mechanism.
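The layer-wise structure mentioned above can be shown concretely. This is a minimal toy sketch of an ANN forward pass, with made-up random weights rather than anything trained:

```python
import numpy as np

def relu(x):
    """A common neuron activation: pass positives through, clamp negatives to zero."""
    return np.maximum(x, 0.0)

def forward(x, layers):
    """Apply each (weights, bias) layer in sequence: the layer-wise ANN idea."""
    for W, b in layers:
        x = relu(W @ x + b)
    return x

# Toy two-layer network: 3 inputs -> 4 hidden units -> 2 outputs.
rng = np.random.default_rng(1)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),
          (rng.standard_normal((2, 4)), np.zeros(2))]
out = forward(np.ones(3), layers)  # shape (2,)
```

“Deep” learning is essentially this pattern stacked many layers deep, with the weights learned from data instead of drawn at random.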
What have been the biggest successes?
On the deep learning side of things, one of our big breakthroughs was being able to show 14-class (e.g., sky, terrain, car, road, sign, bridge, pedestrian, etc.) segmentation running at 30 frames per second on a development board that was drawing less than eight watts of power. This came about from thinking critically about the design of popular deep net architectures and which aspects of those designs were unnecessary to achieve sufficient quality within the computational budget of the device. Hats off to Brad Keserich on our team for this achievement, but also to all our team members who were involved in the many discussions relating to this topic.
This sounds like a good example of what you meant earlier when you said, “We’ve been pushing the limits of running deep neural networks on constrained, embedded hardware.” Could you talk a bit more about these constraints?
Conventional wisdom holds that deep neural networks require an enormous amount of compute, so if you want to use them in the cloud, that’s great. You can spin up as many cloud computers as you want. You want to do it in the car without a trunk full of computers, a second battery, and a custom alternator? Then what do you do? You create highly optimized, efficient, small deep neural networks that give you the performance you need.
Beyond hardware constraints, what have been the biggest challenges?
One of the biggest challenges is generating training data for this task. It takes about an hour per image for a human to generate full scene segmentation labels. We have a training set of about 20,000 images. Considering there are roughly 2,000 working hours in a year, this data set reflects about 10 person-years of effort to generate! There is tremendous value in lowering the cost or effort required to generate high quality training and evaluation data.
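The labeling-cost arithmetic above is worth spelling out, since it’s what motivates cheaper annotation:

```python
# Figures quoted in the interview.
hours_per_image = 1            # ~1 hour of human effort per fully segmented image
images = 20_000                # size of the training set
working_hours_per_year = 2_000 # rough working hours in a year

person_years = images * hours_per_image / working_hours_per_year
# person_years == 10.0
```

At that rate, every reduction in per-image labeling time translates directly into person-years saved, which is why tooling and semi-automated labeling matter so much.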
What do you know now that you wish you knew when you started working on this problem?
I have no regrets. We learn new things every day. We all have those head-slapping moments where we realize something that in hindsight maybe should have been obvious all along. But that’s the nature of fast-paced innovative work.
What do you see as the next big thing in machine learning?
Returning to the topic of training data, I think we need a big breakthrough in unsupervised/self-supervised learning before we achieve the next level of machine learning. Unlocking the information contained in our massive archive of images without having to label them all will be a key to significant advancement.
Any big misconceptions about machine learning out there that you’d like to clear up?
As with all trending technology, hype can lead to unrealistic expectations. It is true that deep learning, combined with massive data sets, has enabled major advances in computer vision, robotics, and related fields. But there is still significant expertise involved in fielding a robust and comprehensive product based on deep learning models, and the business processes required to support, maintain, and improve data-driven products are something that many companies struggle with.
Tell me about your side projects.
I have a smart home system and a set of outdoor security cameras that monitor the perimeter of my property. I play around with trying to make the cameras smarter and to reduce false positive alerts. My interest here dates to when I worked for a small business defense contractor developing AI support systems for force protection applications (i.e., monitoring the perimeter of bases to detect threats). This led me to computer vision for the first time when I was tasked with developing an algorithm to detect fast-moving small boats from the video streams of unmanned aerial vehicles (UAVs). It was then that I decided computer vision was my career passion, and, in my mid-thirties, enrolled in the computer science (computer vision/machine learning) Ph.D. program at Colorado State University.
I also maintain an open source Python library called PyVision3. It’s mostly a set of convenience functions for helping prototype and evaluate computer vision algorithms, like annotating images and displaying results in a montage. I find it saves me from having to write a bunch of boilerplate code for every project I work on.
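To give a flavor of the boilerplate such a library replaces, here is a minimal montage sketch in plain NumPy. This is a hypothetical illustration, not PyVision3’s actual API: tile a list of equal-sized images into a grid for side-by-side review.

```python
import numpy as np

def montage(images, cols):
    """Arrange equal-shape (H, W) images into a rows x cols grid canvas."""
    h, w = images[0].shape
    rows = -(-len(images) // cols)  # ceiling division
    canvas = np.zeros((rows * h, cols * w), dtype=images[0].dtype)
    for idx, img in enumerate(images):
        r, c = divmod(idx, cols)
        canvas[r * h:(r + 1) * h, c * w:(c + 1) * w] = img
    return canvas

# Five 2x2 tiles laid out in a 2x3 grid; the unused slot stays zero-padded.
tiles = [np.full((2, 2), i) for i in range(5)]
grid = montage(tiles, cols=3)
```

Writing (and debugging) small helpers like this for every project is exactly the repetition a shared convenience library eliminates.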
For fun, I practice and teach a Japanese martial art called Aikido, which I’ve been involved in for about 20 years.
Has Aikido influenced your work?
It helps in trying to stay calm when you’re facing frustration. Research and development is mostly failure, so you have to persevere, you have to keep trying different things, and come up with new hypotheses. But most of your hypotheses are probably wrong, and most things you’re going to try to make that next breakthrough are going to fail. So being able to keep your cool is beneficial to the machine learning practitioner.
Other than Aikido, what resources would you recommend for someone just starting out with machine learning?
There is no substitute for putting in the time to learn the theory and math and then to put it into practice. There are some classic texts, such as Christopher Bishop’s Pattern Recognition and Machine Learning and Michael Kirby’s Geometric Data Analysis: An Empirical Approach. I like these two because they approach the broad topic of machine learning from different points of view. Bishop uses Bayesian formulations to derive and motivate classic algorithms, while Kirby applies linear algebra and geometry to uncover patterns in large data sets. Both predate the rise of deep learning, but there are numerous deep learning tutorials and mature frameworks to help a newcomer get started.
How do you stay on top of developments in machine learning?
The top conferences in computer vision and machine learning are great to follow, especially CVPR and NIPS. Even if you can’t afford the time or money to attend, you can access the papers published at these conferences and learn a lot. For the earliest access to new developments in the field, I monitor arXiv’s computer vision (cs.CV), A.I. (cs.AI), and machine learning (cs.LG, cs.NE) posts.
Learn more about Steve and connect with him here.