SAP SE
DATA-DRIVEN STRUCTURE EXTRACTION FROM TEXT DOCUMENTS

Last updated: 22 Dec 2021

Abstract:

Methods and apparatus are disclosed for extracting structured content, as graphs, from text documents. Graph vertices and edges correspond to document tokens and pairwise relationships between tokens. Undirected peer relationships and directed relationships (e.g. key-value or composition) are supported. Vertices can be identified with predefined fields, and thence mapped to database columns for automated storage of document content in a database. A trained neural network classifier determines relationship classifications for all pairwise combinations of input tokens. The relationship classification can differentiate multiple relationship types. A multi-level classifier extracts multi-level graph structure from a document. Disclosed embodiments support arbitrary graph structures with hierarchical and planar relationships. Relationships are not restricted by spatial proximity or document layout. Composite tokens can be identified interspersed with other content. A single token can belong to multiple higher level structures according to its various relationships. Examples and variations are disclosed.
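
The pipeline outlined in the abstract can be illustrated with a minimal sketch. It classifies every pair of tokens, collects the non-empty relationships as directed or undirected graph edges, and maps key-value edges to database column names. Everything named here is an assumption for illustration only: the label names, the `lookup_classifier` stand-in (a hand-written table used in place of the trained neural network classifier), the sample tokens, and the column names `invoice_date` and `total_amount`. None of them is taken from the application.

```python
from itertools import combinations

# Hypothetical relationship labels (assumed names, not from the application).
NONE, PEER, KEY_VALUE = "none", "peer", "key_value"

def lookup_classifier(table):
    """Stand-in for the trained classifier: a hand-written lookup table.

    Returns (label, direction) for a pair of token indices (i, j) with i < j.
    direction > 0 means i -> j, direction < 0 means j -> i, 0 means undirected.
    """
    def classify(i, j):
        return table.get((i, j), (NONE, 0))
    return classify

def extract_graph(tokens, classify):
    """Classify all pairwise token combinations and collect the edges."""
    directed, undirected = [], []
    for i, j in combinations(range(len(tokens)), 2):
        label, direction = classify(i, j)
        if label == NONE:
            continue
        if direction == 0:
            undirected.append((i, j, label))
        elif direction > 0:
            directed.append((i, j, label))
        else:
            directed.append((j, i, label))
    return {"vertices": list(tokens), "directed": directed, "undirected": undirected}

def to_row(graph, field_of_key):
    """Map key-value edges whose key token is a known field to column names."""
    row = {}
    for src, dst, label in graph["directed"]:
        key = graph["vertices"][src]
        if label == KEY_VALUE and key in field_of_key:
            row[field_of_key[key]] = graph["vertices"][dst]
    return row

# Keys and values are deliberately not adjacent: the relationships come from
# the pairwise classification, not from token order or layout.
tokens = ["Date:", "Total:", "03.06.2020", "42.00", "EUR"]
table = {(0, 2): (KEY_VALUE, 1), (1, 3): (KEY_VALUE, 1), (3, 4): (PEER, 0)}

graph = extract_graph(tokens, lookup_classifier(table))
row = to_row(graph, {"Date:": "invoice_date", "Total:": "total_amount"})
```

In this toy input, `42.00` takes part in two relationships at once: it is the value of `Total:` and a peer of `EUR`. This mirrors the abstract's point that a single token can belong to several structures. A multi-level variant would repeat the extraction with composite tokens as the vertices of the next level.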

Status:

Application

Type:

Utility

Filing date:

3 Jun 2020

Issue date:

9 Dec 2021

Full patent description

Patent application document

Abstract:
