Microsoft Corporation
Multilingual Model Training Using Parallel Corpora, Crowdsourcing, and Accurate Monolingual Models
Abstract:
A data processing system for generating training data for a multilingual NLP model implements obtaining a corpus including first and second content items, where the first content items are English-language textual content, and the second content items are translations of the first content items in one or more non-English target languages; selecting a first content item from the plurality of first content items; generating a plurality of candidate labels for the first content item by analyzing the first content item with a plurality of first English-language NLP models; selecting a first label from the plurality of candidate labels; generating first training data by associating the first label with the first content item; generating second training data by associating the first label with a second content item of the second content items; and training a pretrained multilingual NLP model with the first training data and the second training data.
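The data-generation steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not say how the first label is selected from the candidates, so majority voting is an assumption here. The keyword-based labelers stand in for the English-language NLP models, and every name (`select_label`, `build_training_data`, `keyword_labeler`) is hypothetical. The final step, training a pretrained multilingual model, is omitted.

```python
from collections import Counter
from typing import Callable, List, Set, Tuple

# Hypothetical stand-in type for an English-language NLP model: text in, label out.
Labeler = Callable[[str], str]
Example = Tuple[str, str]  # (text, label)


def select_label(candidates: List[str]) -> str:
    """Pick the most frequent candidate label (assumed selection rule: majority vote)."""
    return Counter(candidates).most_common(1)[0][0]


def build_training_data(
    parallel_corpus: List[Tuple[str, List[str]]],
    labelers: List[Labeler],
) -> Tuple[List[Example], List[Example]]:
    """Label each English item, then reuse that label for its translations."""
    english_data: List[Example] = []
    translated_data: List[Example] = []
    for english_text, translations in parallel_corpus:
        # Generate one candidate label per English-language model.
        candidates = [labeler(english_text) for labeler in labelers]
        label = select_label(candidates)
        # First training data: label paired with the English item.
        english_data.append((english_text, label))
        # Second training data: the same label paired with each translation.
        for translated_text in translations:
            translated_data.append((translated_text, label))
    return english_data, translated_data


def keyword_labeler(positive_words: Set[str]) -> Labeler:
    """Toy sentiment labeler used only for illustration."""
    def label(text: str) -> str:
        words = set(text.lower().split())
        return "positive" if words & positive_words else "negative"
    return label


labelers = [
    keyword_labeler({"great"}),
    keyword_labeler({"great", "love"}),
    keyword_labeler({"love"}),
]
corpus = [
    ("I love this phone", ["J'adore ce téléphone", "Me encanta este teléfono"]),
    ("The battery is bad", ["La batterie est mauvaise"]),
]
english_data, translated_data = build_training_data(corpus, labelers)
```

In this toy run, two of the three labelers mark the first sentence as positive, so that label is attached to the English sentence and to both of its translations. The non-English items therefore receive labels without any non-English labeling model.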
Utility
22 Dec 2020
23 Jun 2022