The autonomous vehicle (AV) industry is focused on imitating intelligent human behavior: “driving” as we know it. To achieve this goal, the industry is leveraging Artificial Intelligence (AI) and Machine Learning (ML), which allow computer systems to learn from experience without explicit programming. ML models are constructed from a set of data points and trained through mathematical and statistical approaches that ultimately enable predictions on new, previously unseen data.
Applying AI/ML to replace human driving, one of the most cognitively complex behavioral competencies that we as humans exhibit, is hard! NVIDIA’s GPU hardware and CUDA software have been at the heart of the AV revolution and of AI in general. Microsoft Azure brings AI-inspired, secure, and flexible high-performance computing and big compute environments to the cloud. We at TCS, leveraging our Neural Automotive Framework and TCS Autoscape™ suite of solutions, are delighted to be in a strategic partnership with world-class companies like Microsoft and NVIDIA. Combined with the scale and technical merits of TCS, the partnership enables us to offer connected and autonomous capabilities to serve AV customers and enable autonomous technologies to power tomorrow’s enterprises.
Importance of AV Safety
Autonomous Vehicle Safety is of paramount importance to ensure societal acceptance and to make autonomous driving a reality. AV safety determines the very existence of the AV industry. With the increased use of AI and ML, AV safety needs to complement functional automotive safety so that the perception and decision systems can perform in diverse environments and scenarios while sometimes encountering rare events that are typically not found in everyday driving.
In 2016, the National Highway Traffic Safety Administration (NHTSA) investigated the first fatality of its kind, in which a driver was killed in a Tesla Model S with its Autopilot mode engaged. The driver, Joshua Brown, had his car’s Autopilot mode activated while driving down a highway in Florida when a tractor-trailer made a left turn in front of the vehicle. The car didn’t stop; it went under the trailer, then hit a fence and a power pole. On the other hand, humans driving regular cars are not good at avoiding accidents either. NHTSA published a report that found that 94% of all car crashes between 2005 and 2007 resulted from driver error and could have been prevented if one or all of the drivers involved in the crash had paid more attention and reacted accordingly.
The objective of AV safety is to reduce the number of incidents caused by autonomous vehicles compared to those caused by human-driven vehicles. We believe AV safety is everyone’s responsibility and is a challenge that demands a collaborative approach from the AV industry. Looking at the ecosystem, we are addressing the AV safety concern, one model at a time, so that autonomous vehicles will be involved in far fewer accidents than their human-driven counterparts.
AV Safety Strategy
The strategy is to build safety models around best driving practices and human intuition. Our roads will only become safer when AVs commit to safety by utilizing proven protective driving techniques. Accurate and timely prediction of other drivers’ behavior is an integral component of any autonomous vehicle (AV) control system. In this article, we’ll explore one of the risky driving patterns that we as humans exhibit, design the machine learning-based safety model, and much more.
We will describe the methodology and techniques used in building the safety model and, by taking a specific use case and sharing performance metrics, highlight how the solution leverages the features of the NVIDIA A100 chip architecture to achieve significant performance gains. We used Azure’s ND A100 v4 product family, available on the Azure HPC and AI software platform at incredible scale in the cloud. It is powered by NVIDIA’s flagship A100 GPU, which offers more memory bandwidth and NVLink bandwidth than before. With features like Multi-Instance GPU (MIG), we could run more CUDA kernels in parallel and better utilize resources. Docker and Singularity containerized workloads, with metrics and logs, enabled us to understand the environment better, making the process more transparent.
Driving is inherently risky when vehicles aren’t always obeying the traffic rules and are making aggressive maneuvers. When navigating the most complex environments in a human world, we humans rely on human intuition and common sense. We use implicit rules when explicit traffic rules don’t seem to apply in complex but common everyday scenarios.
The four types of unsafe situations that lead to critical crashes are:
- run under or run over
- loss of control
- lane change and lane merge
When vehicles change lanes and move in front of a vehicle closer than the typical headway distance, it is considered a cut-in maneuver. We’re going to focus on lane merge or cut-in crashes in the multi-agent environment that we live in today.
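The cut-in definition above can be sketched as a simple rule. This is only an illustration, not the model described in this article: the `LaneChange` record, the use of time headway (gap divided by ego speed) as the measure of "headway distance", and the 2-second default threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LaneChange:
    """Hypothetical record of another vehicle moving into the ego lane."""
    gap_m: float          # longitudinal gap to the ego vehicle after the move
    ego_speed_mps: float  # ego vehicle speed at that moment


def time_headway_s(event: LaneChange) -> float:
    """Seconds the ego vehicle needs to cover the gap at its current speed."""
    if event.ego_speed_mps <= 0:
        return float("inf")  # a stationary ego vehicle never closes the gap
    return event.gap_m / event.ego_speed_mps


def is_cut_in(event: LaneChange, typical_headway_s: float = 2.0) -> bool:
    """Flag a lane change that lands closer than the typical headway.

    The 2.0 s default is an assumed threshold, not a value from the article.
    """
    return time_headway_s(event) < typical_headway_s


# A vehicle merging 20 m ahead at 25 m/s leaves 0.8 s of headway.
print(is_cut_in(LaneChange(gap_m=20.0, ego_speed_mps=25.0)))
```

A learned model replaces this fixed threshold with a prediction made before the maneuver completes, but the rule is a useful way to label or sanity-check events.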
In a mixed-mode environment of autonomous and human-driven vehicles, an AV’s aggressive cut-in behavior may put pressure on its surrounding human-driven vehicles. Conversely, a seemingly unpredictable move by a human driver could put the AV control system outside its normal operating region. Any resultant defensive behavior of an AV may reduce the efficiency of the traffic. AV safety systems should track the vehicles around the ego vehicle and make an online prediction to decelerate, apply emergency brakes or take an evasive maneuver.
Safety Model Development
We took an iterative approach to developing such capabilities, and the time and effort required depended on the systems we used to build the models. We benchmarked our training and inference process on Azure against both the previous-generation Volta-based V100 GPU and the current-generation NVIDIA A100 Ampere GPU platforms.
The data we used consisted of 7,000 trips made by ego vehicles that traveled about 100,000 miles, with over 10,000 hours of data collected. The drivers annotated the data by marking the unsafe lane changes they observed other vehicles making around the ego vehicle. We used LiDAR, camera, GPS, and CAN bus read-outs to build the underlying time-series data, consisting of the ego vehicle’s engine, steering angle, and brake pedal positions. The processed data includes the position of the ego vehicle within a lane and its yaw rate. We tracked up to eight obstacle vehicles, along with their position, speed, and yaw rate relative to the ego vehicle. External weather data, such as precipitation, temperature, and visibility, was also included.
Using statistical analysis of the data and human behavior models, we categorized the features into spatial, temporal, and spatio-temporal data. We used a combination of INT16, FP32, and TF32 numeric formats. The objective was to take a sliding window of the prior state (from T0-α to T0 seconds) and predict a cut-in at T0+β seconds. A multi-layer ensemble of models was built and trained on the processed data.
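The sliding-window framing can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `make_windows`, the representation of each time step as a list of feature values, and expressing α and β as whole numbers of time steps rather than seconds are all choices made for the example, not details from the actual pipeline.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one time step of features (hypothetical layout)


def make_windows(
    frames: Sequence[Frame],
    cut_in_labels: Sequence[int],
    alpha_steps: int,
    beta_steps: int,
) -> List[Tuple[List[Frame], int]]:
    """Pair each history window [t0 - alpha, t0] with the label at t0 + beta."""
    if alpha_steps < 0 or beta_steps < 1:
        raise ValueError("alpha_steps must be >= 0 and beta_steps >= 1")
    if len(frames) != len(cut_in_labels):
        raise ValueError("frames and cut_in_labels must have the same length")

    samples = []
    # t0 must leave room for alpha steps of history and beta steps of lookahead.
    for t0 in range(alpha_steps, len(frames) - beta_steps):
        window = list(frames[t0 - alpha_steps : t0 + 1])
        samples.append((window, cut_in_labels[t0 + beta_steps]))
    return samples


# Six time steps with one feature each; a cut-in is annotated at step 4.
frames = [[float(i)] for i in range(6)]
labels = [0, 0, 0, 0, 1, 0]
for window, target in make_windows(frames, labels, alpha_steps=2, beta_steps=2):
    print(window, "->", target)
```

Each training sample then contains only information available at T0, so the model learns to anticipate the cut-in rather than detect it after the fact.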
Wrangling the data and building the model with over 95% cut-in prediction accuracy was riddled with challenges. The data scale was one of the challenges, and the corresponding hyperparameter space when we trained the model was also significant. We had to also deal with a cost vs. time-bound decision due to the multi-layer ensemble model computation load. The NVIDIA A100 platform scaled to meet these challenges by providing a significant raw performance improvement over the earlier architectures that we had considered. In addition to the model-building complexity, the process also added a lot of challenges around logistics. The availability in Azure HPC and the HPC workload management provided the necessary flexibility. The model development team could effectively make go-no-go design decisions and focus on incremental model performance improvements.
The second challenge was that the model’s size had to be reasonably small to be more portable. The third challenge was the need to quickly iterate over several architectures as there was a long pipeline of training and validation workloads. There was a need to provide high throughput inference with a parallel stream of vehicle data coming in from the field. Finally, the development team needed to show improved utilization of computation resources to ensure other groups working in adjacent areas such as human behavior, vehicle modeling, road dynamics modeling, and such were able to contribute effectively.
Model Optimization and Scaling
With NVIDIA A100 on Azure, we saw around a 2x performance improvement over the V100 architecture even without TF32 optimizations. Enabling TF32 on the NVIDIA A100 improved the model training time by an average of 1.5x without compromising accuracy.
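To see why TF32 can speed up training with little effect on accuracy, it helps to look at the format itself: TF32 keeps the 8-bit exponent of FP32, so the numeric range is unchanged, but keeps only 10 of the 23 mantissa bits. The sketch below emulates that precision loss in plain Python. The function name `to_tf32` is made up for this example, and the rounding rule used here (round to nearest, ties away from zero) is an assumption for illustration; it is not a statement of how the hardware rounds.

```python
import math
import struct


def to_tf32(x: float) -> float:
    """Approximate TF32 precision: FP32 exponent, top 10 mantissa bits only."""
    if math.isnan(x) or math.isinf(x):
        return x
    # Reinterpret the value as its 32-bit IEEE 754 single-precision pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the dropped range, then clear the low 13 mantissa bits.
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1.0, 0.1, 1.0 + 2 ** -12, 12345.678):
    rounded = to_tf32(value)
    print(f"{value!r:>22} -> {rounded!r:<22} abs err {abs(rounded - value):.3e}")
```

The relative error stays near 2^-11, which is often small compared with the noise already present in gradient-based training. Whether that holds for a given model still has to be checked empirically, as was done for the accuracy comparison described above.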
The MIG-enabled NVIDIA A100 allowed us to train multiple model variants in parallel, increasing the GPU utilization by pushing 3x the number of training cycles with only a 4–10% increase in energy utilization.
From an energy management (kWh per training cycle) and data center optimization perspective, we saw 25–30% less energy used for identical training workloads on NVIDIA A100 than on V100. On the inference front, we were able to scale the throughput with the NVIDIA A100’s Multi-Instance GPU technology by running seven parallel inference streams with TF32 enabled, outperforming V100 by 2.2x. Beyond the raw performance gain, we were able to track and infer for 7x as many vehicle streams per GPU, which scaled up our inference testing by 7x.
In summary, the NVIDIA A100 architecture added compelling value to our model development and testing of AV safety systems, reducing IT costs and time-to-market.
The AV safety model development leveraged the TCS intelligent transportation research and innovation group, which works in AV passenger safety, prediction, and optimization of multi-modal transportation systems, as well as the Neural Automotive Framework and TCS Autoscape™, a comprehensive suite of solutions and services to help customers accelerate their autonomous vehicle development.
This article covered the cut-in maneuver, one of the risky but rarely occurring driving patterns; the design of a machine learning-based safety model; and how model development and testing leverage the features of the NVIDIA A100 chip architecture running on the Azure HPC and AI software platform.
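One way to picture the seven-streams-per-GPU setup is as a scheduling problem: each A100 exposes up to seven MIG instances, and incoming vehicle streams are spread across them. The sketch below shows a simple round-robin assignment. The function name, the stream identifiers, and the round-robin policy itself are assumptions for illustration; the article does not describe how streams were actually dispatched.

```python
from typing import Dict, List, Sequence, Tuple

MIG_INSTANCES_PER_GPU = 7  # an A100 can be partitioned into up to seven instances

Slot = Tuple[int, int]  # (GPU index, MIG instance index)


def assign_streams(stream_ids: Sequence[str], num_gpus: int) -> Dict[Slot, List[str]]:
    """Spread vehicle streams round-robin over every MIG instance."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    slots = [
        (gpu, instance)
        for gpu in range(num_gpus)
        for instance in range(MIG_INSTANCES_PER_GPU)
    ]
    plan: Dict[Slot, List[str]] = {slot: [] for slot in slots}
    for position, stream_id in enumerate(stream_ids):
        plan[slots[position % len(slots)]].append(stream_id)
    return plan


# Nine hypothetical vehicle streams on a single GPU: two slots get a second stream.
plan = assign_streams([f"vehicle-{i}" for i in range(9)], num_gpus=1)
for slot, streams in plan.items():
    print(slot, streams)
```

In a real deployment, each worker process would typically be pinned to one MIG instance, for example by setting `CUDA_VISIBLE_DEVICES` to that instance's identifier before the process starts.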
Azure’s HPC & AI Platform
The Azure ND A100 v4 VM series is our most powerful and massively scalable AI VM, available on-demand from eight to thousands of interconnected NVIDIA GPUs across hundreds of VMs. The ND A100 v4 VM series starts with a single virtual machine (VM) and eight NVIDIA Ampere A100 Tensor Core GPUs, but just as the human brain is composed of interconnected neurons, our ND A100 v4-based clusters can scale up to thousands of GPUs with an unprecedented 1.6 Tb/s of interconnect bandwidth per VM. Each GPU is provided with its own dedicated 200 Gb/s NVIDIA Mellanox HDR InfiniBand connection supporting GPUDirect RDMA. Tens, hundreds, or thousands of GPUs can work together, in a seamless-to-deploy, topology-agnostic fashion, to achieve training goals at any level of AI ambition.
As TCS experienced, most customers and partners will see an immediate boost of 2x to 3x in compute performance over the previous-generation GPU products with no engineering work. New A100 features like multi-precision Tensor Cores with sparsity acceleration and Multi-Instance GPU (MIG), when layered, can achieve a boost of up to 20x for specific workloads. And this is without taking into account the massive 16x increase in interconnect bandwidth that drives the unique scalability of ND A100 v4 over the previous, already class-leading Azure NDr v2 offering.
The supercomputer-class interconnect capabilities come with the amenities our customers expect from networking in a public cloud: Azure VM Scale Sets can transparently configure clusters of any size automatically and dynamically. This will allow anyone, anywhere, to instantiate AI supercomputing capabilities on par with the best national labs and institutional customers on-demand in minutes. You can access VMs independently or launch and manage training jobs across the cluster using services like the Azure Machine Learning service.
We believe that Azure’s commitment to delivering AI at scale and on-demand, combined with our partner TCS’s Neural Automotive Framework, will not only deliver a safer future but also accelerate its realization. The ND A100 v4 VM series and clusters are now in preview, with a new high-memory cluster option based on the latest 80 GB A100 GPUs entering preview in H1CY21.
NVIDIA GTC 2021
To access the NVIDIA GTC session schedule, please register free of charge at GTC 2021 Registration.
To learn more from experts firsthand, and see full details of the talk, log into GTC and look up “s31518” in the catalog to access “Amping Up” Autonomous Vehicle Safety Design: Benchmarking on the NVIDIA Ampere Architecture [S31518].
To learn more about the TCS Autoscape™ suite of solutions for Autonomous Vehicles, access https://www.tcs.com/tcs-autoscape.
Tata Consultancy Services
Tata Consultancy Services (TCS) is an IT services, consulting, and business solutions organization that has partnered with many of the world’s largest businesses in their transformation journeys for the last 50 years. TCS offers a consulting-led, cognitive-powered, integrated portfolio of IT, business and technology services, and engineering. This is delivered through its unique Location Independent Agile delivery model, recognized as a benchmark of excellence in software development. TCS has more than 448,000 employees in 46 countries with $22 billion in revenues as of March 31, 2020.
TCS Automotive Industry Group focuses on delivering solutions and services to address the CASE (Connected, Autonomous, Shared, and Electric) ecosystem and partners with industry-leading players to address the opportunities evolving from disruption in the automotive industry.
About the Authors
Sanjay Dulepet is the global head of product development and a technology leader at TCS, focusing on autonomous and connected vehicle strategic initiatives. Sanjay is in charge of driving innovation with an entrepreneurial product mindset through strategic partnerships with industry leaders, startups, and academia. He holds master’s and bachelor’s degrees in computer science.
Arvind Ramanujam is a Senior Scientist at TCS Research and Innovation and is a member of the Data and Decision Sciences Research group. Arvind’s areas of interest include large-scale traffic simulation, Electric Vehicle modeling, and Autonomous Vehicles. His current focus is on data-driven methods to predict transport network behavior and to design AV passenger safety systems. Arvind’s team has been working with IIT-Madras’s Center for Excellence in Urban transportation to monitor and predict traffic parameters using frugal instrumentation.
Ian Finder is the program and product lead for Azure’s high-end AI and GPU-compute accelerated products, including the ND A100 v4 AI supercomputer. A firm believer in the promise of novel architectures to solve complex real-world problems, he previously worked on Azure’s FPGA-accelerator offerings. A lifelong hardware enthusiast, Ian has two former would-be Top500 supercomputers in his garage (from 1988 and 1994), and enough power to run one of them, if temperatures dip below freezing. He holds a Bachelor of Science in Computer Engineering from the University of Washington’s Paul G. Allen School of Computer Science & Engineering, and still lives in the area.