Adobe Inc.
Active learning for large-scale semi-supervised creation of speech recognition training corpora based on number of transcription mistakes and number of word occurrences

Last updated:

Abstract:

Techniques are disclosed for generating automatic speech recognition (ASR) training data. According to an embodiment, impactful ASR training corpora are generated efficiently, and the quality or relevance of the corpora being generated is increased by leveraging knowledge of the ASR system being trained. An example methodology includes: selecting a word or phrase based on knowledge and/or content of said ASR training corpora; presenting a textual representation of said word or phrase; receiving a speech utterance that includes said word or phrase; receiving a transcript for said speech utterance; presenting said transcript for review (to allow for editing, if needed); and storing said transcript and an audio file of said speech utterance in an ASR system training database. The selecting may include, for instance, selecting a word or phrase that is under-represented in said database, and/or selecting based upon an n-gram distribution of a language, and/or selecting based upon known areas that tend to incur transcription mistakes.
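The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (count_words, update_mistakes, select_prompt), the scoring formula, and the sample data are all assumptions made for illustration. The sketch favors words that occur rarely in the stored transcripts and words that reviewers have had to correct often.

```python
import difflib
from collections import Counter


def count_words(transcripts):
    """Count how often each word occurs across the stored transcripts."""
    counts = Counter()
    for text in transcripts:
        counts.update(text.lower().split())
    return counts


def update_mistakes(mistakes, asr_transcript, corrected_transcript):
    """Add one mistake per word the reviewer had to replace or insert.

    Word-level diffing with difflib is an assumption of this sketch;
    the abstract does not say how mistakes are counted.
    """
    asr_words = asr_transcript.lower().split()
    ref_words = corrected_transcript.lower().split()
    matcher = difflib.SequenceMatcher(a=asr_words, b=ref_words, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            mistakes.update(ref_words[j1:j2])


def select_prompt(candidates, occurrences, mistakes, mistake_weight=1.0):
    """Return the candidate word with the highest priority score.

    Hypothetical score: (1 + weight * mistakes) / (1 + occurrences), so
    under-represented and frequently mis-transcribed words rank first.
    Ties are broken alphabetically to keep the result deterministic.
    """
    def score(word):
        errors = mistake_weight * mistakes.get(word, 0)
        return (1.0 + errors) / (1.0 + occurrences.get(word, 0))

    return min(candidates, key=lambda word: (-score(word), word))


# Illustrative data only.
stored_transcripts = ["open the file", "close the file", "open the layer"]
occurrences = count_words(stored_transcripts)

mistakes = Counter()
update_mistakes(mistakes, "open the lair", "open the layer")
update_mistakes(mistakes, "hide lair", "hide layer")

next_word = select_prompt(["file", "layer", "mask", "the"], occurrences, mistakes)
```

With this sample data, "layer" is selected because it was corrected twice and appears only once in the stored transcripts. With no recorded mistakes, the unseen word "mask" would be chosen instead, which corresponds to the under-represented case.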

Status:

Grant

Type:

Utility

Filing date:

13 Nov 2018

Issue date:

18 May 2021