Microsoft Corporation
Pre-Training With Alignments For Recurrent Neural Network Transducer Based End-To-End Speech Recognition
Abstract:
Techniques performed by a data processing system for training a Recurrent Neural Network Transducer (RNN-T) herein include encoder pretraining by training a neural network-based token classification model using first token-aligned training data representing a plurality of utterances, where each utterance is associated with a plurality of frames of audio data and tokens representing each utterance are aligned with frame boundaries of the plurality of audio frames; obtaining a first cross-entropy (CE) criterion from the token classification model, wherein the CE criterion represents a divergence between expected outputs and reference outputs of the model; pretraining an encoder of an RNN-T based on the first CE criterion; and training the RNN-T with second training data after pretraining the encoder of the RNN-T. These techniques also include whole-network pretraining of the RNN-T. An RNN-T pretrained using these techniques may be used to process audio data that includes spoken content to obtain a textual representation of that content.
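The encoder pretraining step above hinges on a frame-level cross-entropy criterion computed against tokens aligned to frame boundaries. The sketch below is a minimal illustration of that criterion, assuming (for simplicity, not as the patented implementation) a single linear token classifier in place of the neural encoder and NumPy in place of a deep learning framework:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ce_and_grad(W, frames, labels):
    """Frame-level cross-entropy between the model's per-frame token
    posteriors and the frame-aligned reference tokens, plus its gradient."""
    probs = softmax(frames @ W)            # (T, num_tokens)
    T = len(labels)
    loss = -np.log(probs[np.arange(T), labels] + 1e-12).mean()
    onehot = np.eye(W.shape[1])[labels]
    grad = frames.T @ (probs - onehot) / T
    return loss, grad

# Toy token-aligned data: 6 audio frames (4-dim features), each frame
# carrying the id of the token it is aligned to (hypothetical values).
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 4))
labels = np.array([0, 0, 1, 1, 2, 2])

W = np.zeros((4, 3))                       # stand-in "encoder" parameters
for _ in range(200):                       # minimize the CE criterion
    loss, grad = ce_and_grad(W, frames, labels)
    W -= 0.5 * grad

# After pretraining, W would seed the corresponding encoder weights of
# the RNN-T before the second-stage training on transducer loss.
```

In the actual technique the classifier would be a multi-layer recurrent encoder and the aligned labels would come from a forced alignment of the transcripts, but the CE objective being minimized has this same per-frame form.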
Utility
3 Apr 2020
7 Oct 2021