SAP SE
SELF-SUPERVISED PRETRAINING THROUGH TEXT ALIGNMENT

Last updated:

Abstract:

Machine learning models trained on labeled data may be used to categorize documents. To convert data from human-readable text to a form usable by a machine learning model, a mapping of words to vectors is performed. Learning this mapping is often part of training a machine learning model that operates on text input. Before training on the labeled data, a self-supervised pretraining step is performed that aligns the vectors for two or more fields of each document. In this way, when training on the labeled data begins, the vectors used for transforming the text will already be pretrained to give similar values for the two fields. In applications where the two fields are expected to have similar meanings, this pretraining can improve the quality of the resulting model, reduce the amount of training needed, or both.
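
The alignment step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how a field's vector is formed or which loss is minimized, so the choices here are assumptions made only for illustration: a field is represented as the mean of its word vectors, and the loss is the squared Euclidean distance between the two field vectors. The field names `title` and `body`, the sample documents, and all function names are hypothetical. No labels are used, which is what makes the step self-supervised.

```python
import random

def init_embeddings(vocab, dim=4, seed=0):
    """Random starting vector per word (stand-in for an untrained mapping)."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for w in sorted(vocab)}

def field_vector(emb, words):
    """Assumption: a field is represented by the mean of its word vectors."""
    dim = len(emb[words[0]])
    return [sum(emb[w][i] for w in words) / len(words) for i in range(dim)]

def alignment_loss(emb, field_a, field_b):
    """Assumption: squared Euclidean distance between the two field vectors."""
    va, vb = field_vector(emb, field_a), field_vector(emb, field_b)
    return sum((a - b) ** 2 for a, b in zip(va, vb))

def pretrain_step(emb, field_a, field_b, lr=0.1):
    """One gradient-descent step that pulls the two field vectors together."""
    va, vb = field_vector(emb, field_a), field_vector(emb, field_b)
    diff = [a - b for a, b in zip(va, vb)]
    # Coefficient of each word's vector inside (mean_a - mean_b).
    weight = {}
    for w in field_a:
        weight[w] = weight.get(w, 0.0) + 1.0 / len(field_a)
    for w in field_b:
        weight[w] = weight.get(w, 0.0) - 1.0 / len(field_b)
    # Gradient of the loss with respect to emb[w] is 2 * weight[w] * diff.
    for w, c in weight.items():
        emb[w] = [x - lr * 2.0 * c * d for x, d in zip(emb[w], diff)]

def total_loss(emb, docs):
    return sum(alignment_loss(emb, d["title"], d["body"]) for d in docs)

# Hypothetical unlabeled documents, each with two fields expected to agree.
docs = [
    {"title": ["invoice", "overdue"], "body": ["payment", "is", "late"]},
    {"title": ["password", "reset"], "body": ["cannot", "log", "in"]},
]
vocab = {w for d in docs for words in d.values() for w in words}
emb = init_embeddings(vocab)

before = total_loss(emb, docs)
for _ in range(50):
    for d in docs:
        pretrain_step(emb, d["title"], d["body"])
after = total_loss(emb, docs)
print(f"alignment loss before: {before:.4f}, after: {after:.2e}")
```

After this loop, `emb` would serve as the starting word-to-vector mapping for supervised training on the labeled data. A loss that only rewards agreement can also be driven down by degenerate solutions, such as every vector becoming identical, so a practical system would likely pair it with some additional constraint. That is outside what the abstract states.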

Status:

Application

Type:

Utility

Filing date:

4 Jan 2021

Issue date:

7 Jul 2022