International Business Machines Corporation
Synthetic deidentified test data
Abstract:
Embodiments include a method in which one or more processors receive an organic dataset and a domain knowledge base. The one or more processors identify private data entities present within the organic dataset. The one or more processors determine statistical properties of the private data entities identified within the organic dataset. The one or more processors create a plurality of test data templates by removing the private data entities from the organic dataset. The one or more processors select, from the domain knowledge base, synthetic data entities that match the data types of the respective removed private data entities and align with the statistical properties of the private data entities. The one or more processors then generate synthetic test data by inserting the synthetic data entities of the matching data type into the test data templates in place of the respective removed private data entities.
Utility
16 Nov 2020
19 Jul 2022
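
The steps named in the abstract (identify private entities, measure their statistical properties, build templates, select matching synthetic entities, fill the templates) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the regex detectors in PATTERNS, the use of mean character length as the only statistical property, the nearest-length selection rule, and all function names are hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical detectors: one regex per private data type (an assumption).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_entities(record):
    """Return (start, end, type, text) for each private entity, ordered by position."""
    hits = []
    for dtype, pattern in PATTERNS.items():
        for m in pattern.finditer(record):
            hits.append((m.start(), m.end(), dtype, m.group()))
    return sorted(hits)


def entity_stats(dataset):
    """Illustrative statistical property: mean character length per data type."""
    lengths = defaultdict(list)
    for record in dataset:
        for _, _, dtype, text in find_entities(record):
            lengths[dtype].append(len(text))
    return {dtype: mean(vals) for dtype, vals in lengths.items()}


def make_template(record):
    """Remove private entities, leaving typed placeholders such as {EMAIL}."""
    parts, pos = [], 0
    for start, end, dtype, _ in find_entities(record):
        parts.append(record[pos:start])
        parts.append("{" + dtype + "}")
        pos = end
    parts.append(record[pos:])
    return "".join(parts)


def pick_synthetic(dtype, knowledge_base, stats):
    """Pick the same-type candidate whose length is closest to the observed mean."""
    return min(knowledge_base[dtype], key=lambda s: abs(len(s) - stats[dtype]))


def generate(dataset, knowledge_base):
    """Build templates from the organic records and fill them with synthetic entities."""
    stats = entity_stats(dataset)
    results = []
    for record in dataset:
        filled = make_template(record)
        for dtype in stats:  # only types that were actually observed
            placeholder = "{" + dtype + "}"
            filled = filled.replace(placeholder, pick_synthetic(dtype, knowledge_base, stats))
        results.append(filled)
    return results


organic = ["Contact alice@example.com about SSN 123-45-6789"]
knowledge_base = {
    "EMAIL": ["a@b.io", "user0001@test.invalid"],
    "SSN": ["000-00-0000"],
}
synthetic = generate(organic, knowledge_base)
```

In this sketch the sentence structure of each organic record survives in the template, while every detected private value is replaced by a knowledge-base entry of the same type. A fuller treatment would likely use richer statistics (for example, value distributions rather than a single mean) and sample from the candidates instead of always taking the nearest one.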