International Business Machines Corporation
Method for re-aligning corpus and improving the consistency
Abstract:
Vocabulary consistency for a language model may be improved by splitting a target token in an initial vocabulary into a plurality of split tokens, calculating an entropy of the target token and an entropy of the plurality of split tokens in a bootstrap language model, and determining whether to delete the target token from the initial vocabulary based on at least the entropy of the target token and the entropy of the plurality of split tokens.
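The comparison described in the abstract can be illustrated with a minimal sketch. It is not the patented method: the abstract does not define the entropy measure or the decision rule, so the code below assumes a unigram bootstrap model given as a plain probability table, takes each token's entropy to be its term -p * log2(p), and uses a hypothetical rule with a hypothetical `margin` parameter. The names `entropy_term`, `should_delete`, and the example tokens are illustrative only.

```python
import math


def entropy_term(prob):
    """Return -p * log2(p), one token's term in a unigram entropy sum.

    Assumption: the abstract does not say how entropy is computed;
    this per-token term is used here purely for illustration.
    """
    if prob <= 0.0:
        return 0.0
    return -prob * math.log2(prob)


def should_delete(target, splits, probs, margin=0.0):
    """Decide whether to drop `target` from the vocabulary.

    `probs` stands in for the bootstrap language model (assumed to be
    a unigram probability table). Hypothetical rule: delete the target
    when the combined entropy of its split tokens exceeds the target's
    entropy by no more than `margin`.
    """
    target_h = entropy_term(probs.get(target, 0.0))
    split_h = sum(entropy_term(probs.get(s, 0.0)) for s in splits)
    return split_h <= target_h + margin


# Toy probabilities, chosen only to make the arithmetic easy to follow.
probs = {"newyork": 0.5, "new": 0.125, "york": 0.125}
keep_case = should_delete("newyork", ["new", "york"], probs)
drop_case = should_delete("newyork", ["new", "york"], probs, margin=0.5)
```

With these toy values the target's entropy term is 0.5 and the split tokens sum to 0.75, so the target is kept at the default margin and deleted once the margin is raised to 0.5. A real system would take the probabilities from the bootstrap language model and would follow the decision rule given in the patent claims.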
Status:
Grant
Type:
Utility
Filing date:
30 Jan 2020
Issue date:
15 Mar 2022