NVIDIA Corporation
FULLY-FUSED NEURAL NETWORK EXECUTION
Abstract:
A fully-connected neural network may be configured for execution by a processor as a fully-fused neural network by limiting slow global memory accesses to reading the inputs to, and writing the outputs from, the fully-connected neural network. The computational cost of a fully-connected neural network scales quadratically with its width, whereas its memory traffic scales linearly. Modern graphics processing units typically have much greater computational throughput than memory bandwidth, so for narrow fully-connected neural networks the linear memory traffic is the bottleneck. The key to improving the performance of the fully-connected neural network is to minimize traffic to slow "global" memory (off-chip memory and high-level caches) and to fully utilize fast on-chip memory (low-level caches, "shared" memory, and registers), which is achieved by the fully-fused approach. A real-time neural radiance caching technique for path-traced global illumination is implemented using the fully-fused neural network to cache scattered radiance components of global illumination.
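The scaling argument above can be illustrated with a rough cost model. The following Python sketch is not taken from the filing; the function name `mlp_cost`, the 2-byte value size, the example width and depth, and the simplifications (input and output as wide as the hidden layers, weight traffic ignored) are all assumptions chosen for illustration. It counts multiply-add work against global memory traffic for a layer-by-layer execution and for a fully-fused execution that touches global memory only for the network input and output.

```python
def mlp_cost(width, depth, batch, bytes_per_value=2):
    """Rough cost model for a fully-connected network (illustrative assumptions only).

    Assumes `depth` weight layers, each width x width, with network input and
    output also `width` values wide. Weight traffic is ignored, on the
    assumption that the weights of a narrow network stay in on-chip memory.
    """
    # Each layer multiplies a width-vector by a width x width matrix:
    # width^2 multiply-adds, i.e. 2 * width^2 FLOPs per sample (quadratic in width).
    flops = 2 * width * width * depth * batch

    # Layer-by-layer execution: every layer reads its input activations from
    # global memory and writes its outputs back (linear in width).
    unfused_bytes = 2 * width * depth * batch * bytes_per_value

    # Fully-fused execution: only the network input is read and the final
    # output is written; intermediate activations stay on chip.
    fused_bytes = 2 * width * batch * bytes_per_value

    return flops, unfused_bytes, fused_bytes


# Hypothetical narrow network: 64 neurons wide, 5 weight layers, one sample.
flops, unfused_bytes, fused_bytes = mlp_cost(width=64, depth=5, batch=1)
print("FLOPs per byte, layer by layer:", flops / unfused_bytes)
print("FLOPs per byte, fully fused:   ", flops / fused_bytes)
```

Under these assumptions, layer-by-layer execution performs only `width / bytes_per_value` FLOPs per byte of global memory traffic, which is low for a narrow network, while fusing raises that ratio by a factor equal to the number of layers.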
Utility
7 Jun 2021
8 Sep 2022