Microsoft Corporation
Multilingual Model Training Using Parallel Corpora, Crowdsourcing, and Accurate Monolingual Models

Last updated: 20 Jul 2022

Abstract:

A data processing system for generating training data for a multilingual NLP model obtains a corpus including a plurality of first content items and a plurality of second content items, where the first content items are English-language textual content and the second content items are translations of the first content items into one or more non-English target languages. The system selects a first content item from the plurality of first content items; generates a plurality of candidate labels for the first content item by analyzing the first content item with a plurality of first English-language NLP models; selects a first label from the plurality of candidate labels; generates first training data by associating the first label with the first content item; generates second training data by associating the first label with a second content item of the second content items; and trains a pretrained multilingual NLP model with the first training data and the second training data.
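The labeling steps in the abstract can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The names `EnglishModel`, `select_label`, and `build_training_data` are hypothetical, and the English-language models are stood in for by plain callables. The abstract does not say how the first label is chosen from the candidates, so majority voting is assumed here. The sketch also assumes that the second content item is a translation of the selected first content item. The final training step on a pretrained multilingual model is left out.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical type: an English-language NLP model maps a text to a label.
EnglishModel = Callable[[str], str]
# One corpus entry: English text plus its translations keyed by language code.
ParallelItem = Tuple[str, Dict[str, str]]
# One training example: (text, label).
Example = Tuple[str, str]


def select_label(candidates: List[str]) -> str:
    """Select one label from the candidates.

    Assumption: majority vote. The source does not specify the rule.
    Ties resolve to the candidate that was encountered first.
    """
    return Counter(candidates).most_common(1)[0][0]


def build_training_data(
    parallel_corpus: List[ParallelItem],
    english_models: List[EnglishModel],
) -> Tuple[List[Example], List[Example]]:
    """Label English items and copy each label onto the item's translations."""
    first_training_data: List[Example] = []   # English text with label
    second_training_data: List[Example] = []  # translated text with same label

    for english_text, translations in parallel_corpus:
        # Each English-language model contributes one candidate label.
        candidates = [model(english_text) for model in english_models]
        label = select_label(candidates)

        first_training_data.append((english_text, label))
        # Assumption: the translations share the English item's label.
        for translated_text in translations.values():
            second_training_data.append((translated_text, label))

    return first_training_data, second_training_data
```

Both returned lists would then be passed together to fine-tune a pretrained multilingual model, so that non-English examples inherit labels produced by the English-language models.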

Status:

Application

Type:

Utility

Filing date:

22 Dec 2020

Issue date:

23 Jun 2022
