International Business Machines Corporation
Method for re-aligning corpus and improving the consistency

Abstract:

Vocabulary consistency for a language model may be improved by splitting a target token in an initial vocabulary into a plurality of split tokens, calculating an entropy of the target token and an entropy of the plurality of split tokens in a bootstrap language model, and determining whether to delete the target token from the initial vocabulary based on at least the entropy of the target token and the entropy of the plurality of split tokens.
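The decision described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration, not the patented method: the bootstrap language model is stood in for by a hypothetical table of unigram probabilities (`probs`), a token's entropy is taken as its contribution -p * log2(p), the entropy of the split tokens is taken as the sum of their contributions, and the deletion rule (delete when the split entropy is lower than the target entropy by at least `margin`) is one possible reading of "based on at least" the two entropies. The function names are hypothetical.

```python
import math


def token_entropy(prob):
    """Entropy contribution -p * log2(p) of a single token probability."""
    return -prob * math.log2(prob) if prob > 0 else 0.0


def should_delete(target, split_tokens, probs, margin=0.0):
    """Return True if the target token should be dropped from the vocabulary.

    Assumed rule (not taken from the patent text): delete the target when
    the summed entropy of its split tokens, plus a margin, is lower than
    the entropy of the target itself.
    """
    target_h = token_entropy(probs[target])
    split_h = sum(token_entropy(probs[t]) for t in split_tokens)
    return split_h + margin < target_h


def prune_vocabulary(vocab, splits, probs, margin=0.0):
    """Filter the initial vocabulary, keeping tokens that are not deleted.

    `splits` maps a target token to its list of split tokens; tokens with
    no entry in `splits` are always kept.
    """
    return [
        tok
        for tok in vocab
        if not (tok in splits and should_delete(tok, splits[tok], probs, margin))
    ]


# Hypothetical probabilities from a toy bootstrap model (illustrative only).
probs = {"newyork": 0.5, "new": 0.05, "york": 0.05, "cat": 0.01, "c": 0.2, "at": 0.2}
splits = {"newyork": ["new", "york"], "cat": ["c", "at"]}
vocab = ["newyork", "new", "york", "cat", "c", "at"]

pruned = prune_vocabulary(vocab, splits, probs)
```

With these toy numbers, "newyork" is removed because its split tokens carry less total entropy than it does, while "cat" is kept because its splits carry more. A real system would obtain the probabilities from the bootstrap language model and would likely use a tuned margin rather than zero.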

Status:

Grant

Type:

Utility

Filing date:

30 Jan 2020

Issue date:

15 Mar 2022