International Business Machines Corporation
DATASET MANAGEMENT IN MACHINE LEARNING

Abstract:

A method, a computer system, and a computer program product are provided for managing a dataset of training samples, labeled by class, during training of a machine learning model. Embodiments of the present invention may include training the model on a sequence of sets of the training samples of increasing size and testing performance of the model after training with each set to obtain class-specific performance metrics corresponding to each set size. Embodiments of the present invention may include generating class-specific learning curves from the performance metrics for a plurality of classes. Embodiments of the present invention may include extrapolating the learning curves to obtain predicted performance metrics. Embodiments of the present invention may include optimizing a function of the predicted performance metrics to identify a set of augmentation actions to augment the dataset for further training of the model. Embodiments of the present invention may include providing an output indicative of the set of augmentation actions.
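The steps named in the abstract (per-class learning curves, extrapolation, and selection of augmentation actions) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the application: the power-law curve form error = a * n^(-b), the use of per-class sample counts as the curve's x-axis, the greedy allocation that maximizes the summed predicted error reduction, and all names and numbers (`fit_power_law`, `plan_augmentation`, the example error rates).

```python
import math

def fit_power_law(sizes, errors):
    """Fit error = a * n**(-b) by least squares in log-log space; returns (a, b).
    The power-law form is an assumed curve model, not one named in the abstract."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(my - slope * mx)
    return a, -slope

def predict_error(curve, n):
    """Extrapolate a fitted curve to a sample count n."""
    a, b = curve
    return a * n ** (-b)

def plan_augmentation(curves, current_sizes, budget, step):
    """Greedy allocation (an assumed optimizer): repeatedly give `step` new
    samples to the class whose predicted error would drop the most."""
    sizes = dict(current_sizes)
    plan = {c: 0 for c in curves}
    for _ in range(budget // step):
        best = max(
            curves,
            key=lambda c: predict_error(curves[c], sizes[c])
            - predict_error(curves[c], sizes[c] + step),
        )
        sizes[best] += step
        plan[best] += step
    return plan

# Hypothetical class-specific error rates measured after training on
# sets of increasing size (values invented for illustration).
set_sizes = [100, 200, 400, 800]
class_errors = {
    "cat": [0.40, 0.30, 0.22, 0.17],
    "dog": [0.20, 0.18, 0.17, 0.16],
}

curves = {c: fit_power_law(set_sizes, errs) for c, errs in class_errors.items()}
plan = plan_augmentation(curves, {"cat": 800, "dog": 800}, budget=1000, step=100)
print(plan)  # the output indicating how many samples to add per class
```

In this sketch the only augmentation action is "add samples of class c". Other action types, a different curve model, or a different objective would slot into `plan_augmentation` and `fit_power_law` without changing the overall flow.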

Status:

Application

Type:

Utility

Filing date:

7 May 2020

Issue date:

11 Nov 2021