NVIDIA Corporation
FINE-GRAINED PER-VECTOR SCALING FOR NEURAL NETWORK QUANTIZATION
Last updated:
Abstract:
Today neural networks are used to enable autonomous vehicles and improve the quality of speech recognition, real-time language translation, and online search optimizations. However, operation of the neural networks for these applications consumes energy. Quantization of parameters used by the neural networks reduces the amount of memory needed to store the parameters while also reducing the power consumed during operation of the neural network. Matrix operations performed by the neural networks require many multiplication calculations, so reducing the number of bits that are multiplied reduces the energy that is consumed. Quantizing smaller sets of the parameters using a shared scale factor improves accuracy compared with quantizing larger sets of the parameters. Accuracy of the calculations may be maintained by quantizing and scaling the parameters using fine-grained per-vector scale factors. A vector includes one or more elements within a single dimension of a multi-dimensional matrix.
Utility
30 Oct 2020
3 Mar 2022