International Business Machines Corporation
EXTRACTING NON-TEXTUAL DATA FROM DOCUMENTS VIA MACHINE LEARNING

Last updated:

Abstract:

An approach for extracting non-textual data from an electronic document is disclosed. The approach includes receiving a request to extract a file and converting the file into pixels. The approach creates a pixel map of the converted file and determines one or more density clusters of the pixel map based on an image clustering method. Furthermore, the approach determines one or more coordinates of the one or more density clusters and determines one or more candidate information regions based on the one or more coordinates and the density of the one or more density clusters. Finally, the approach extracts one or more textual data based on the one or more candidate information regions and outputs the extracted one or more textual data.
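
The steps from pixel map to candidate information regions might look roughly like the following. This is a minimal sketch, not the method claimed in the application: the abstract does not name the image clustering method, so 8-connected component grouping is swapped in as a simple stand-in, and the function name `find_candidate_regions`, the parameters `min_pixels` and `min_density`, and the binary list-of-lists pixel map are all assumptions made for illustration.

```python
from collections import deque


def find_candidate_regions(pixel_map, min_pixels=4, min_density=0.5):
    """Return bounding boxes of dense clusters of 'ink' pixels (value 1).

    Hypothetical helper: clusters are 8-connected components, used here
    as a stand-in for the unspecified image clustering method.
    """
    rows = len(pixel_map)
    cols = len(pixel_map[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    regions = []

    for r in range(rows):
        for c in range(cols):
            if pixel_map[r][c] != 1 or seen[r][c]:
                continue

            # Breadth-first search collects one cluster of touching pixels.
            queue = deque([(r, c)])
            seen[r][c] = True
            cluster = []
            while queue:
                y, x = queue.popleft()
                cluster.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and pixel_map[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))

            # Coordinates of the cluster: its bounding box.
            top = min(y for y, _ in cluster)
            bottom = max(y for y, _ in cluster)
            left = min(x for _, x in cluster)
            right = max(x for _, x in cluster)

            # Density: fraction of the bounding box covered by ink pixels.
            area = (bottom - top + 1) * (right - left + 1)
            density = len(cluster) / area

            # Keep the cluster only if it is large and dense enough
            # (assumed thresholds, not taken from the source).
            if len(cluster) >= min_pixels and density >= min_density:
                regions.append({
                    "top": top, "left": left,
                    "bottom": bottom, "right": right,
                    "density": density,
                })
    return regions


# Toy pixel map: one 2x3 block of ink plus one isolated pixel.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
regions = find_candidate_regions(page)
print(regions)
```

In this sketch the isolated pixel falls below `min_pixels` and is discarded, while the solid block is kept as a candidate region. A real pipeline would then pass each region's coordinates to an extraction step (for example OCR), which is outside the scope of this example.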

Status:

Application

Type:

Utility

Filing date:

13 Mar 2020

Issue date:

16 Sep 2021