Insights from NVIDIA Research
March 22, 2022
Bill Dally, Chief Scientist and SVP of Research, NVIDIA Corporation

NVIDIA RESEARCH
[Diagram: research groups arranged between Demand and Supply. Labels: Graphics, Perception & Learning, Applied DL Research, Robotics, Autonomous Vehicles, Tel Aviv AI, Toronto AI, AI Algorithms, AI Architecture, Architecture, Networks, Programming Systems, GPU Storage Systems, VLSI, Circuits, Integration/Moonshots]

TECHNOLOGY TRANSFER GREATEST HITS
• RTX
• NVSwitch
• cuDNN

Autonomous Vehicle Research

NVIDIA DRIVE AV PLATFORM
End-to-end, Open & Modular
[Diagram labels: Data Collecting, Mapping, Training, Testing, Simulation; DRIVE Chauffeur; DRIVE Concierge; DRIVE Hyperion; DGX A100; DRIVE Constellation; AV on DRIVE AGX; IX on DRIVE AGX]

AUTONOMOUS VEHICLE RESEARCH GROUP
• Next-generation AV Autonomy Stacks (Perception, Prediction, Planning, Mapping, Control)
• Data-driven Modular AV Stacks
• Safety Assurances
• Robust & Human-centered Autonomy
For more information about the Autonomous Vehicle Research Group: https://nvr-avg.github.io/

TOWARDS ROBUST AUTONOMY
Addressing uncertainty and error propagation throughout the autonomy stack
Pipeline: Sensor Data, Detection, Tracking, Prediction, Planning & Control
• State uncertainty from object detection
• Class uncertainty from object classification
• Planning with multimodal predictions

TOWARDS ROBUST AUTONOMY
Making prediction robust to tracking errors
Pipeline: Sensor Data, Detection, Tracking, Prediction, Planning & Control
Object tracking errors: Fragmentation, ID Switch
[Figure legend: Tracked Trajectory, Ground Truth]

MAKING PREDICTION MORE ROBUST
Object tracking is not perfect
[Figure: Fragmentation and ID Switch examples; legend: Tracked Trajectory, Ground Truth]

MAKING PREDICTION MORE ROBUST
Tracking errors detrimentally affect prediction
• Predictions with accurate tracking: accurate tracking yields reasonable predictions.
• Predictions after an ID switch: a false positive detection causes an ID switch, yielding errant predictions.
Prediction errors increase by 10-30x in the presence of tracking errors

MAKING PREDICTION MORE ROBUST
Is tracking necessary to make predictions?
Key Insight: Directly use detections and affinity as prediction inputs
[Diagram labels: Detections; Past object tracklets; Data Association; Affinity Information; Transformer; Predictions; "Removed, no more tracking"]
Prediction errors decrease by up to 80% when using our affinity-based prediction method

MAKING PREDICTION MORE ROBUST
Is tracking necessary to make predictions?
[Figure panels: Standard Tracking-Prediction Pipeline; Our Affinity-based Prediction Method]

HISTORY OF DRIVING SIMULATORS
• 1990: CarSim. High-fidelity vehicle dynamic simulator
• 2016: Virtual KITTI. Synthetic sensor image
• 2017: CARLA. Urban driving simulation with UE4
• 2021: NVIDIA DriveSim. Photo-realistic rendering as part of Omniverse
✓ high-fidelity physics simulation
✓ high-fidelity rendering (UE, Omniverse)
? intelligence that generates realistic, human-like driving behavior

TOWARDS BETTER DRIVING SIMULATION
Grounding traffic behavior synthesis in real human driving behaviors
[Diagram: Real-world driving log, Learn, ML-based behavior model, Deploy, High-fidelity simulator]

TOWARDS BETTER DRIVING SIMULATION
Grounding traffic behavior synthesis in real human driving behaviors
Desiderata
• Fidelity: can we synthesize agent group behaviors that resemble real-world traffic?
• Diversity: can we generate a wide range of realistic traffic patterns?
• Controllability: can we steer the traffic model to generate a specific scenario?
Challenges
• Stability: prevent the simulation from diverging out of nominal behaviors
• Long-horizon: run simulation with horizon much longer than training data horizon
• Adaptivity: adapt to new regions / traffic patterns with little to no tuning

MODELING HUMAN BEHAVIORS BEYOND PREDICTION
Closed-loop decision making requires rethinking behavior modeling
[Figure panels: Ground Truth Trajectory; Rolling out a Prediction Model]

CLOSED-LOOP DRIVING BEHAVIOR SIMULATION
Hierarchical Decision Making: Decoupling Where from How
Hierarchical Policy = Spatial Map Planner + Goal-conditional Controller
[Figure panels: Rollout; Spatial Probability Map]

SYNTHESIZE NEW AND DIVERSE TRAFFIC BEHAVIORS
Sample behavior model to synthesize counterfactual scenarios
[Figure panels: Stay in the same lane; Switch to the right lane]

SYNTHESIZE NEW AND DIVERSE TRAFFIC BEHAVIORS
Decentralized Decision Making: Emergence of Complex Interactive Behaviors
All agents in the scene are controlled by a learned traffic model

AI For Electronic Design

IR DROP ESTIMATION WITH 3D-UNET
[Diagram labels: Power maps; Skip connections; Bottleneck; Coefficient (β) maps; Predicted IR maps; Ground Truth]
$$IR_{inst} = \beta_1 P_i + \beta_2 P_s + \beta_3 P_l + \beta_4 P_{sca} + \beta_5 P_{tot} + \beta_6 R + \beta_7 P_{ires}$$
• Best of both worlds: transferable, and the ML model learns the coefficients
• 94% accuracy with 3-second model inference
• 18 minutes including feature extraction vs. 3 hr in commercial tools
V. A. Chhabria et al., “MAVIREC: ML-Aided Vectored IR-Drop Estimation and Classification”, DATE 2021

SWITCHING ACTIVITY ESTIMATION WITH GNNS
Netlist to Graph Conversion
[Figure: netlist of flip-flops (D, Q) and a half adder (HA; inputs A, B; outputs CO, S) converted to a graph with source nodes and sink nodes]
• edge features: state_cor=0.25, trans_cor_0_0=0, pin_order=0, …
• node features: prob_0=0.8125, prob_1=0.1875, cell_type=42, …
Y. Zhang et al., “GRANNITE: Graph Neural Network Inference for Transferable Power Estimation”, DAC 2020

PARASITICS PREDICTION WITH GNNS
Circuit Schematics to Graph Conversion
[Plot: Cap Prediction (F) vs. Ground Truth]
• MAE=0.852fF
• MAPE=15%
• Simulation error reduced to <10%
H. Ren et al., “ParaGraph: Layout Parasitics and Device Parameter Prediction using Graph Neural Networks”, DAC 2020. (Research funded under Cadence’s DARPA IDEA contract)

ROUTING CONGESTION PREDICTION WITH GNNS
R. Kirby et al., “CongestionNet: Routing Congestion Prediction Using Deep Graph Neural Networks”, VLSI-SoC 2019

NVCELL: AUTOMATING STDCELL LAYOUT
(Ren et al., DAC 2021; available in On-Demand Content)
Placement: Simulated Annealing
Placement representation:
• PMOS sequence: 3, 4, NA, 1, 2
• NMOS sequence: 1, 3, 2, 5, 4
• PIN sequence: A, NA, B, C, NA
Transforms: Swap PMOS/NMOS pair segment, etc.
Placement: Reinforcement learning (in progress)
[Diagram: Graph Neural Net over node embeddings with pooling, feeding an Action Net and a Value Net; RL Placement Game Sequence]
RL placement is as good as SA on 97% of cells

NVCELL: AUTOMATED STDCELL LAYOUT
Routing (Ren et al., ASPDAC 2020)
[Flow diagram labels: Unrouted design; Genetic Algorithm; Maze Routing; Routed, DRCs; Fix DRC with RL; DRC free route]
• Routing with RL
• DRC Fixing with RL
• Transferable RL model trained on only one single cell

NVCELL: AUTOMATED STDCELL LAYOUT
Production use in latest technology nodes, achieving 2X overall productivity improvement
[Charts: NVCell Success Rate; Area Comparison with Manual Design. Axes: Cell Count vs. Cell Complexity. Legend: Total, NVCell, Smaller, Same, Larger]
• 92% of single-row cells are DRC/LVS clean
• 12% of cells are smaller
• Support multirow.
• High success rate with tuning
• Similar delay/power (±1-3%)
Presentations: Special Session (Research Track): NVCell: Standard Cell Layout in Advanced Technology Nodes with Reinforcement Learning

AI-DESIGNED DATAPATH CIRCUITS
Smaller, Faster and Efficient Circuits using Reinforcement Learning
Goal: Move the Pareto frontier for datapath circuits
[Plot axes: DELAY vs. AREA/POWER]
Today:
• EDA Tools
• General Purpose
• Heuristics
AI designed circuits:
• Individually crafted
• Hyper-optimized for cell library, circuit options, surrounding logic
• Lower area/power/delay

PREFIXRL: RL FOR PARALLEL PREFIX CIRCUITS
Adders, priority encoders, custom circuits
[Diagram: the PrefixRL Agent (Q-network) reads state s_t, a prefix graph over nodes (0,0) through (3,3), and outputs Q values; the action a_t shown is "add (3,2)". The PrefixRL Environment runs Circuit Synthesis on s_t and s_{t+1} to obtain (area, delay)_t and (area, delay)_{t+1}. Reward r_t: -∆(area), -∆(delay)]

PREFIXRL: RESULTS
64b adders, commercial synthesis tool, latest technology node
550 ps to 320 ps at 51 µm²
Wednesday, December 8th, 3:30pm - 3:50pm PST
PrefixRL: Optimization of Parallel Prefix Circuits using Deep Reinforcement Learning
Authors: Rajarshi Roy, Jonathan Raiman, Neel Kant, Ilyas Elkin, Robert Kirby, Michael Siu, Stuart Oberman, Saad Godil, Bryan Catanzaro

GPU Storage Systems

Emerging DC Apps Demand Fast, Programmatic Access to Massive Structured Data
• Graph Analytics: 10GB to 1TBs
• Recommender Systems: 100GB to 100TBs
• Data Analytics: 10GB to 10TBs

Targeted Access Patterns
Data-dependent gather/scatter of segments of massive (array) datasets
• Recommenders: embedding tables
• Graph Analytics: edge (neighbor) list, node embeddings
• Data Analytics: selected elements
Enabling “context-enhanced” AI decision. Poorly served by current systems.

Data Analytics: Data Loading and I/O Amplification
Query: Get average distance of taxi rides from Williamsburg, NYC
[Table sketch: 1.7B rows; columns dropoff_loc_id, payment_type, pickup_loc_id, total_amount, trip_distance, id]
NYC Taxi cab dataset [1]
• 1.7B trip records from 2009-2021
• Many metrics: distance, pickup location, total amount, taxes, surcharge, …
• ~200GB ORC encoded file
[1] https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Data Analytics APP in Current System
Query: Get average distance of taxi rides from Williamsburg, NYC
Relevant columns are loaded in chunks (Chunk 1, Chunk 2, ...). Each chunk is deserialized into buffer arrays before query processing.
[Timeline phases: Dataframe Init; Per Stripe; Query Compute. Steps shown: select_stripes, gather_streams_info, start_streams_reads, prepare_chunks_meta, wait_for_reads, parse_index, decode_data, ….]

Comparison with Current CPU-centric Solutions
[Timeline diagrams across CPU, GPU, and SSD]
CPU-centric Design (with GDS):
• Metadata Transfer, Data Transfer, InitChunk(), Query Compute On Chunk 0
• Metadata Transfer, Data Transfer, InitChunk(), Query Compute On Chunk 1
BaM Design:
• Metadata Transfer, InitFullDF(), Query Compute On Entire Data with On-demand Data Transfer

BaM Software Stack Prototype: Overview
• Enables threads in GPU to directly access storage devices/services
• Leverages GPU memory and optimizes storage bandwidth utilization
• Provides abstraction to ease integration with applications
• Builds a prototype using off-the-shelf components to show it is possible to provide cost-effective high memory capacity for GPUs
[Stack diagram, top to bottom: CUDA User Space Applications; User Abstractions (BamArray, …); Software Data Cache and High Throughput Storage I/O Queues; OS Driver (Managing I/O Mapping); Scalable Prototype Hardware]

BaM Access by GPU Threads
[Diagram: a GPU thread executes val = data[tid]; the request flows through the BaM Lookup and IO Stack to the NVMe Controller]
GPU side:
• Coalescer
• Offset Calc: read(off,tid)
• Cache lookup: hit(data) or miss(off)
• ❺ Prepare NVMe CMD
• ❻ Submit CMD to SQ (sqdb)
• ❼ Poll for completion
• ❽ Update cache state
• ❾ Update SQ/CQ head
NVMe Controller side:
• 🅐 Wait for doorbell
• 🅑 Read SQ from GPU
• 🅒 Process CMD
• 🅓 DMA write to GPU IO buffer
• 🅔 Write CQ in GPU mem

BaM Hardware Prototype
Example Configuration
[Diagram labels: Samsung 980pro Drawer (4 x Samsung 980pro); Intel Optane Drawer (1 x Intel Optane 5800X); PCIe Switch #1; PCIe Switch #2; NVIDIA A100 (x2); PCIe X16 Cable; Supermicro 4122GS-TNR; H3 Falcon 4016]
Supports any SSD types and form factors: Optane SSDs, Z-NAND SSDs, NAND SSDs

All trademarks, service marks and company names are the property of their respective owners.
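
The BaM access path above (offset calculation, cache lookup, and a storage read only on a miss) can be illustrated with a small CPU-side toy. This is a minimal sketch under stated assumptions, not the BaM API: `ToyBamArray`, `line_elems`, `io_commands`, and `average_selected` are hypothetical names, the "SSD" is a Python list, and the miss path (prepare command, submit to SQ, poll for completion) is collapsed into one synchronous read.

```python
class ToyBamArray:
    """Toy, single-threaded model of on-demand cached access to a backing store.

    All names are hypothetical; this is not the BaM API.
    """

    def __init__(self, backing, line_elems=4):
        self.backing = backing        # stands in for data resident on the SSD
        self.line_elems = line_elems  # elements per cache line (assumed value)
        self.cache = {}               # cache line index -> list of elements
        self.io_commands = 0          # number of reads issued to the backing store

    def _fetch_line(self, line):
        # Miss path: the real stack prepares an NVMe command, submits it to a
        # submission queue, and polls for completion. Here it is one slice.
        start = line * self.line_elems
        self.io_commands += 1
        return self.backing[start:start + self.line_elems]

    def __getitem__(self, idx):
        line, off = divmod(idx, self.line_elems)       # offset calculation
        if line not in self.cache:                     # cache lookup
            self.cache[line] = self._fetch_line(line)  # update cache state
        return self.cache[line][off]                   # hit: return the data


def average_selected(array, indices):
    """Data-dependent gather: touch only the requested elements."""
    values = [array[i] for i in indices]
    return sum(values) / len(values)


if __name__ == "__main__":
    store = list(range(1000))
    arr = ToyBamArray(store, line_elems=4)
    print(average_selected(arr, [0, 1, 2, 500, 501]), arr.io_commands)
```

In this toy, gathering five elements that fall in two cache lines issues two reads, whereas a chunked loader would transfer the whole column first. That is the I/O amplification contrast the comparison slide draws, shown here only qualitatively.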