A popular self-driving car dataset for training machine-learning systems – one that’s used by thousands of students to build an open-source self-driving car – contains critical errors and omissions, including missing labels for hundreds of images of bicyclists and pedestrians.
Machine learning models are only as good as the data on which they’re trained. But when researchers at Roboflow, a firm that writes boilerplate computer vision code, hand-checked the 15,000 images in Udacity Dataset 2, they found problems with 4,986 – that’s 33% – of those images.
From a writeup of Roboflow’s findings, which were published by founder Brad Dwyer on Tuesday:
Amongst these [problematic data] were thousands of unlabeled vehicles, hundreds of unlabeled pedestrians, and dozens of unlabeled cyclists. We also found many instances of phantom annotations, duplicated bounding boxes, and drastically oversized bounding boxes.
Perhaps most egregiously, 217 (1.4%) of the images were completely unlabeled but actually contained cars, trucks, street lights, and/or pedestrians.
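Some of the problems Dwyer lists can be flagged mechanically before a human reviews the images. The following is a minimal sketch of that idea, not Roboflow's actual tooling: the `Box` type, the `audit_image` function, and both thresholds (`dup_iou`, `max_area_frac`) are hypothetical names and values chosen for illustration. It flags images with no annotations at all, near-identical boxes with the same label, and boxes that cover most of the frame.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical annotation record: a label plus pixel corner coordinates."""
    label: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def area(self) -> float:
        return max(0.0, self.xmax - self.xmin) * max(0.0, self.ymax - self.ymin)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes (0.0 when they do not overlap)."""
    ix = max(0.0, min(a.xmax, b.xmax) - max(a.xmin, b.xmin))
    iy = max(0.0, min(a.ymax, b.ymax) - max(a.ymin, b.ymin))
    inter = ix * iy
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def audit_image(boxes, img_w, img_h, dup_iou=0.95, max_area_frac=0.8):
    """Return human-readable flags for one image's annotations.

    The thresholds are assumptions for this sketch, not values from the article.
    """
    flags = []
    if not boxes:
        flags.append("no annotations (needs a human look)")
    img_area = img_w * img_h
    for i, a in enumerate(boxes):
        # A box covering most of the frame is suspicious in street scenes.
        if a.area() > max_area_frac * img_area:
            flags.append(f"box {i} ({a.label}) covers most of the frame")
        # Same label and near-total overlap suggests a duplicated box.
        for j in range(i + 1, len(boxes)):
            b = boxes[j]
            if a.label == b.label and iou(a, b) >= dup_iou:
                flags.append(f"boxes {i} and {j} ({a.label}) look duplicated")
    return flags
```

Checks like these only narrow the search. An unlabeled pedestrian in an otherwise labeled image leaves no trace in the annotation file, which is why that kind of error can only be found by looking at the images, as Roboflow did.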
Junk in, junk out. In the case of the AI behind self-driving cars, junk data could literally lead to deaths. Here is how Dwyer describes the way bad or unlabeled data propagates through a machine learning system:
Generally speaking, machine learning models learn by example. You give it a photo, it makes a prediction, and then you nudge it a little bit in the direction that would have made its prediction more ‘right’. Where ‘right’ is defined as the ‘ground truth’, which is what your training data is.
If your training data’s ground truth is wrong, your model still happily learns from it, it’s just learning the wrong things (eg ‘that blob of pixels is *not* a cyclist’ vs ‘that blob of pixels *is* a cyclist’)
Neural networks do an Ok job of performing well despite *some* errors in their training data, but when 1/3 of the ground truth images have issues it’s definitely going to degrade performance.
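To make the "nudge" Dwyer describes concrete, here is a toy sketch under heavy simplifying assumptions: a one-feature logistic regression stands in for a real object detector, and the feature value, learning rate, and function names are all made up for illustration. It takes a single training step on the same input twice, once with the correct label and once with a wrong one.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(w: float, b: float, x: float, y: float, lr: float = 0.5):
    """One gradient step of logistic regression on a single example."""
    p = sigmoid(w * x + b)
    grad = p - y  # derivative of the cross-entropy loss w.r.t. the logit
    return w - lr * grad * x, b - lr * grad


x = 1.0          # stand-in feature for "that blob of pixels"
w, b = 0.0, 0.0  # untrained model: predicts 0.5 for "is a cyclist"

# Correct ground truth: the blob IS a cyclist (y = 1).
w_ok, b_ok = sgd_step(w, b, x, y=1.0)
# Wrong ground truth: the cyclist was left unlabeled (y = 0).
w_bad, b_bad = sgd_step(w, b, x, y=0.0)

p_ok = sigmoid(w_ok * x + b_ok)
p_bad = sigmoid(w_bad * x + b_bad)
print(f"after correct label: P(cyclist) = {p_ok:.3f}")
print(f"after wrong label:   P(cyclist) = {p_bad:.3f}")
```

The update rule is identical in both cases. Only the label differs, so the wrong label pushes the prediction away from "cyclist" just as confidently as the right one pushes toward it.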
Self-driving car engineers, please use the fixed dataset
Thanks to the permissive licensing terms of the open-source data, Roboflow has fixed and re-released the Udacity self-driving car dataset in a number of formats. Dwyer is asking those who were training a model on the original dataset to please consider switching to the updated dataset.
Dwyer hasn’t looked into any other self-driving car datasets, so he’s not sure how much bad data is sitting at the base of AI training in this nascent industry. But he has looked at datasets in other domains, and he told me that Udacity Dataset 2 compared badly:
Of the datasets I’ve looked at in other domains (eg medicine, animals, games), this one stood out as being of particularly poor quality.
Could crappy data quality like this have led to the death of 49-year-old Elaine Herzberg? She was killed by a self-driving car as she walked her bicycle across a street in Tempe, Arizona in March 2018. Uber said that her death was likely caused by a software bug in its self-driving car technology.
Dwyer doesn’t think bad data quality had anything to do with the tragic crash. According to a federal report released in November, the self-driving Uber SUV involved in the crash couldn’t figure out if Herzberg was a jaywalking pedestrian, another vehicle, or a bicycle, and it failed to predict her path. Its braking system wasn’t designed to avoid an imminent collision, the federal report concluded.
I’ve reached out to Vincent Vanhoucke, principal scientist and Director of Robotics at Google, who teaches the Udacity course on becoming a self-driving car engineer, to get his take on the bad data and to find out if he plans to update to the fixed dataset. I’ll update the article if I hear back.
Over the coming weeks, Roboflow will be running some experiments with the original dataset and the fixed dataset to see just how much of a problem the bad data would have been for training various model architectures.
For now, Dwyer’s hoping that Udacity updates the dataset it’s feeding self-driving car engineering students, and that the companies actually putting cars on the road are more diligent about cleaning up their AI training data than this open-source dataset might suggest:
I would hope that the big companies who are actually putting cars on the road are being much more rigorous with their data labeling, cleaning, and verification processes.