International Business Machines Corporation
Automatic extraction of domain specific terminology from a large corpus

Abstract:

A method of extracting jargon from a document corpus stored in a database using a processor and a user interface is described herein. A sub-domain input is entered through the user interface to initiate a review of the document corpus stored in the database. The processor separates the document corpus into at least one sub-corpus and a remainder corpus. The at least one sub-corpus is defined by the sub-domain input. A first topic model and a second topic model are built to generate respective topic similarity scores for at least one term extracted from the at least one sub-corpus and at least one corresponding term extracted from the remainder corpus. The respective topic similarity scores are compared by the processor to identify jargon terms and thereby provide a list of jargon terms through the user interface.
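
The flow in the abstract (split the corpus, model each part, score shared terms, compare) can be illustrated with a minimal sketch. It rests on several assumptions that are not in the abstract: documents carry a domain label that the sub-domain input is matched against, a simple per-document co-occurrence profile stands in for each topic model, cosine similarity stands in for the topic similarity comparison, and the 0.5 threshold is arbitrary. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def split_corpus(docs, sub_domain):
    """Split (label, text) pairs into the sub-corpus and the remainder corpus."""
    sub = [text for label, text in docs if label == sub_domain]
    rest = [text for label, text in docs if label != sub_domain]
    return sub, rest


def context_profiles(texts):
    """Stand-in for a topic model (assumption): for each term, count the
    other words that appear in the same document."""
    profiles = defaultdict(Counter)
    for text in texts:
        words = set(text.lower().split())
        for word in words:
            for other in words:
                if other != word:
                    profiles[word][other] += 1
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def find_jargon(docs, sub_domain, threshold=0.5):
    """Flag terms present in both corpora whose contexts differ sharply."""
    sub, rest = split_corpus(docs, sub_domain)
    sub_profiles = context_profiles(sub)
    rest_profiles = context_profiles(rest)
    shared_terms = set(sub_profiles) & set(rest_profiles)
    return sorted(
        term for term in shared_terms
        if cosine(sub_profiles[term], rest_profiles[term]) < threshold
    )


# Toy labeled corpus (invented for illustration only).
docs = [
    ("software", "bug crashed server code"),
    ("software", "bug broke server code"),
    ("software", "team met monday morning"),
    ("general", "bug crawled leaf garden"),
    ("general", "bug ate leaf garden"),
    ("general", "team met monday morning"),
]
print(find_jargon(docs, "software"))  # prints ['bug']
```

In this toy corpus, "bug" appears in both parts but keeps entirely different company in each, so its similarity is zero and it is flagged. "team" is used identically in both parts and is not flagged. A real implementation would replace the co-occurrence profiles with trained topic models.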

Status:

Grant

Type:

Utility

Filing date:

26 Jan 2018

Issue date:

7 Sep 2021