International Business Machines Corporation
Automatic extraction of domain specific terminology from a large corpus

Last updated: 10 Sep 2021

Abstract:

A method of extracting jargon from a document corpus stored in a database using a processor and a user interface is described herein. A sub-domain input is entered through the user interface to initiate a review of the document corpus stored in the database. The processor separates the document corpus into at least one sub-corpus and a remainder corpus. The at least one sub-corpus is defined by the sub-domain input. A first topic model and a second topic model are built to generate respective topic similarity scores for at least one term extracted from the at least one sub-corpus and at least one corresponding term extracted from the remainder corpus. The respective topic similarity scores are compared by the processor to identify jargon terms and thereby provide a list of jargon terms through the user interface.
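The flow in the abstract (split the corpus, build one model per part, compare a term's scores across the two) can be illustrated with a minimal sketch. This is not the patented implementation: the abstract does not specify the topic model or the scoring rule, so the code below swaps in simple co-occurrence vectors as a stand-in for a topic model and uses cosine similarity as the score. The function names (`split_corpus`, `context_vectors`, `find_jargon`), the labeled-document input format, and the `threshold` value are all hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def split_corpus(docs, sub_domain):
    """Split (label, text) pairs into a sub-corpus and a remainder corpus.

    Assumption: each document carries a label that the sub-domain input
    can be matched against.
    """
    sub = [text for label, text in docs if label == sub_domain]
    rest = [text for label, text in docs if label != sub_domain]
    return sub, rest


def context_vectors(texts):
    """Stand-in for a topic model (assumption, not the patented model).

    Maps each term to a Counter of the other terms it shares a document with.
    """
    vectors = defaultdict(Counter)
    for text in texts:
        tokens = text.lower().split()
        for term in set(tokens):
            for other in tokens:
                if other != term:
                    vectors[term][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def find_jargon(docs, sub_domain, threshold=0.5):
    """Flag sub-corpus terms whose usage differs from the remainder corpus.

    A term scores 0.0 if it never appears in the remainder; the threshold
    is an arbitrary illustrative value.
    """
    sub, rest = split_corpus(docs, sub_domain)
    sub_model = context_vectors(sub)    # first model: sub-corpus
    rest_model = context_vectors(rest)  # second model: remainder corpus
    jargon = []
    for term, vec in sub_model.items():
        score = cosine(vec, rest_model.get(term, Counter()))
        if score < threshold:
            jargon.append(term)
    return sorted(jargon)
```

Calling `find_jargon(docs, "medicine")` on a list of `(label, text)` pairs returns the sub-corpus terms whose contexts diverge from, or are absent in, the remaining documents. On a tiny corpus this also flags common words, so a real pipeline would need stop-word filtering and a proper topic model.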

Status:

Grant

Type:

Utility

Filing date:

26 Jan 2018

Issue date:

7 Sep 2021
