Oracle Corporation
DATA EXTRACTION AND ORDERING BASED ON DOCUMENT LAYOUT ANALYSIS

Abstract:

The embodiments disclosed herein relate to identifying phrases in an electronic document from tokens, where each token is one or more characters. Phrases are formed from the tokens based on the position of each token relative to other tokens in the document. If the horizontal space between two tokens is less than a threshold, the two tokens are identified as a phrase. Information identifying phrases and tokens can be stored in a marked-up document. Value phrases can be identified by the content of the phrase. Thereafter, a label phrase can be identified based on proximity to the value phrase and/or the presence of an association symbol in the phrase. The label phrase and value phrase can be identified as a label-value pair, where the label identifies the type of content in the value phrase. A reading order of the document can be determined through the use of a binary tree.
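
As a rough illustration of the phrase-forming and label-value steps described in the abstract, the following Python sketch groups tokens on a line by horizontal gap and then pairs a date-like value phrase with the phrase immediately before it. It is only a sketch under stated assumptions, not the claimed method: the `Token` fields, the `max_gap` parameter, the date pattern used to recognize a value, and the choice of ":" as the association symbol are all hypothetical and do not come from the application.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    x0: float  # left edge of the token (hypothetical coordinate)
    x1: float  # right edge of the token (hypothetical coordinate)
    line: int  # index of the text line the token sits on (assumption)


def group_into_phrases(tokens, max_gap):
    """Join neighbouring tokens on one line when the gap between them is below max_gap."""
    ordered = sorted(tokens, key=lambda t: (t.line, t.x0))
    groups, current = [], []
    for tok in ordered:
        same_line = bool(current) and tok.line == current[-1].line
        if same_line and tok.x0 - current[-1].x1 < max_gap:
            current.append(tok)
        else:
            if current:
                groups.append(current)
            current = [tok]
    if current:
        groups.append(current)
    return [" ".join(t.text for t in g) for g in groups]


# Assumed value pattern: a date such as "30 Jan 2020".
DATE = re.compile(r"\d{1,2} [A-Za-z]{3} \d{4}")


def pair_labels(phrases, symbol=":"):
    """Pair each date-like value phrase with the preceding phrase if it ends with the symbol."""
    pairs = []
    for prev, cur in zip(phrases, phrases[1:]):
        if DATE.fullmatch(cur) and prev.endswith(symbol):
            pairs.append((prev[: -len(symbol)], cur))
    return pairs


tokens = [
    Token("Invoice", 0, 40, 0),
    Token("Date:", 45, 70, 0),
    Token("30", 120, 130, 0),
    Token("Jan", 134, 150, 0),
    Token("2020", 154, 175, 0),
]
phrases = group_into_phrases(tokens, max_gap=10)
pairs = pair_labels(phrases)
```

With these sample coordinates, the gap of 50 units between "Date:" and "30" exceeds the threshold, so the line splits into a label phrase and a value phrase. Proximity is approximated here simply by adjacency in the ordered phrase list, and the binary-tree reading order is not covered.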

Status:

Application

Type:

Utility

Filing date:

30 Jan 2020

Issue date:

5 Aug 2021