NetApp, Inc.
METHODS AND SYSTEMS FOR AUTOMATED DOCUMENT CLASSIFICATION WITH PARTIALLY LABELED DATA USING SEMI-SUPERVISED LEARNING

Last updated:

Abstract:

A method, a computing device, and a non-transitory machine-readable medium for classifying documents. A document collection is sorted into a plurality of categories. A classifier corresponding to a category of the plurality of categories is trained to output a probability that a document associated with the category is of a selected type (e.g., confidential). The training includes determining, by a processor, that a cardinality of a set of negative samples in a train set is not above a pipeline threshold but is at least one, and training the classifier via a first pipeline and a second pipeline using a training group that includes a first portion of a group of positive samples in the train set, a second portion of the set of negative samples in the train set, and a third portion of a group of unlabeled samples in the train set.
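The routing condition and the composition of the training group described in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the names `TrainSet`, `uses_both_pipelines`, `take_portion`, and `build_training_group`, the fraction parameters, and the use of random sampling to pick each portion are all hypothetical. The abstract does not say what the first and second pipelines do, nor how the other cases (zero negatives, or more negatives than the threshold) are handled, so the sketch leaves those out.

```python
import random
from dataclasses import dataclass


@dataclass
class TrainSet:
    """Hypothetical container for the three sample groups in a train set."""
    positive: list
    negative: list
    unlabeled: list


def uses_both_pipelines(num_negative: int, pipeline_threshold: int) -> bool:
    """True when the negative count is at least one but not above the threshold.

    This is the only case the abstract describes; behavior outside this
    range is not specified there and is not modeled here.
    """
    return 1 <= num_negative <= pipeline_threshold


def take_portion(samples: list, fraction: float, rng: random.Random) -> list:
    """Pick a portion of the samples (random sampling is an assumption)."""
    count = int(len(samples) * fraction)
    return rng.sample(samples, count)


def build_training_group(train_set: TrainSet,
                         pos_fraction: float,
                         neg_fraction: float,
                         unl_fraction: float,
                         seed: int = 0) -> dict:
    """Assemble a training group from portions of all three sample groups."""
    rng = random.Random(seed)
    return {
        "positive": take_portion(train_set.positive, pos_fraction, rng),
        "negative": take_portion(train_set.negative, neg_fraction, rng),
        "unlabeled": take_portion(train_set.unlabeled, unl_fraction, rng),
    }


if __name__ == "__main__":
    train_set = TrainSet(
        positive=[f"pos_{i}" for i in range(10)],
        negative=[f"neg_{i}" for i in range(4)],
        unlabeled=[f"unl_{i}" for i in range(20)],
    )
    # The threshold value 5 is an arbitrary choice for the example.
    if uses_both_pipelines(len(train_set.negative), pipeline_threshold=5):
        group = build_training_group(train_set, 0.5, 0.5, 0.5)
        print({name: len(items) for name, items in group.items()})
```

Keeping the routing test separate from the group assembly makes the condition from the abstract easy to check on its own, independent of how the portions are chosen.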

Status:

Application

Type:

Utility

Filing date:

31 Jul 2020

Issue date:

3 Feb 2022