NVIDIA Corporation
FINE-GRAINED PER-VECTOR SCALING FOR NEURAL NETWORK QUANTIZATION

Last updated:

Abstract:

Today neural networks are used to enable autonomous vehicles and improve the quality of speech recognition, real-time language translation, and online search optimizations. However, operation of the neural networks for these applications consumes energy. Quantization of parameters used by the neural networks reduces the amount of memory needed to store the parameters while also reducing the power consumed during operation of the neural network. Matrix operations performed by the neural networks require many multiplication calculations, so reducing the number of bits that are multiplied reduces the energy that is consumed. Quantizing smaller sets of the parameters using a shared scale factor improves accuracy compared with quantizing larger sets of the parameters. Accuracy of the calculations may be maintained by quantizing and scaling the parameters using fine-grained per-vector scale factors. A vector includes one or more elements within a single dimension of a multi-dimensional matrix.

Status:
Application
Type:

Utility

Filling date:

30 Oct 2020

Issue date:

3 Mar 2022