International Business Machines Corporation
BOOTSTRAPPING OF TEXT CLASSIFIERS
Last updated:
Abstract:
Computer-implemented methods and systems are provided for generating training datasets for bootstrapping text classifiers. Such a method includes providing a word embedding matrix. This matrix is generated from a text corpus by encoding words in the text as respective tokens such that selected compound keywords in the text are encoded as single tokens. The method includes receiving, via a user interface, a user-selected set of the keywords a nearest neighbor search of the embedding space is performed for each keyword in the set to identify neighboring keywords, and a plurality of the neighboring keywords are added to the keyword-set. The method further comprises, for a corpus of documents, string-matching keywords in the keyword-sets to text in each document to identify, based on results of the string-matching, documents associated with each text class. The documents identified for each text class are stored as the training dataset for the classifier.
Utility
10 Sep 2020
10 Mar 2022