Microsoft Corporation
CONVOLUTIONAL NEURAL NETWORK WITH PHONETIC ATTENTION FOR SPEAKER VERIFICATION

Abstract:

Embodiments may include: determination, for each of a plurality of speech frames associated with an acoustic feature, of a phonetic feature based on the associated acoustic feature; generation of one or more two-dimensional feature maps based on the plurality of phonetic features; input of the one or more two-dimensional feature maps to a trained neural network to generate a plurality of speaker embeddings; and aggregation of the plurality of speaker embeddings into a speaker embedding based on respective weights determined for each of the plurality of speaker embeddings, wherein the speaker embedding is associated with an identity of the speaker.
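
The final aggregation step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the respective weights are determined, so this example assumes they come from a softmax over dot-product scores against a query vector; the names `softmax`, `aggregate_embeddings`, and `query` are hypothetical and do not appear in the application. The sketch covers only the weighted pooling, not the phonetic feature extraction or the convolutional network.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax: outputs are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_embeddings(
    embeddings: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool several speaker embeddings into one using per-embedding weights.

    Assumption (not stated in the abstract): each weight is the softmax of
    the dot product between an embedding and a hypothetical query vector.
    """
    # One scalar score per embedding.
    scores = [sum(e_i * q_i for e_i, q_i in zip(e, query)) for e in embeddings]
    # Normalize the scores into weights that sum to 1.
    weights = softmax(scores)
    # Weighted sum of the embeddings, computed dimension by dimension.
    dim = len(embeddings[0])
    pooled = [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    frame_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    pooled, weights = aggregate_embeddings(frame_embeddings, query=[10.0, 0.0])
    print("weights:", weights)
    print("pooled embedding:", pooled)
```

With an all-zero query every score is equal, so the weights are uniform and the result reduces to a plain average of the embeddings; a non-zero query shifts weight toward the embeddings that align with it.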

Status:

Application

Type:

Utility

Filing date:

7 Feb 2022

Issue date:

19 May 2022