Tabular data generation

Example

Real data

This example shows how to use PiShield to synthesize realistic tabular data that complies with a set of requirements written as linear inequalities.

Task: training a deep generative model (DGM) for tabular data generation.

Training data: the WiDS dataset, which is used to predict whether a patient is diagnosed with a particular type of diabetes, Diabetes Mellitus, using data from the first 24 hours of intensive care. The models here were trained on 22K samples, each with 109 features.

Requirements: 31 linear inequalities capturing relations between the data features.

Example requirement: the value in the "maximum level of hemoglobin recorded" column must be greater than or equal to the value in the "minimum level of hemoglobin recorded" column.
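Written as a linear inequality, this requirement reads x_max - x_min >= 0. The following is a minimal sketch of checking it on individual records. The column names `hemoglobin_max` and `hemoglobin_min` are hypothetical and chosen for illustration; they are not taken from the WiDS dataset or from PiShield's requirement syntax.

```python
# Hypothetical column names, used for illustration only.
MAX_COL = "hemoglobin_max"
MIN_COL = "hemoglobin_min"


def satisfies_requirement(row):
    """Return True when the record satisfies max - min >= 0."""
    return row[MAX_COL] - row[MIN_COL] >= 0


rows = [
    {MAX_COL: 13.2, MIN_COL: 11.8},  # maximum above minimum: satisfied
    {MAX_COL: 9.4, MIN_COL: 10.1},   # maximum below minimum: violated
]
results = [satisfies_requirement(r) for r in rows]
```

Here `results` holds one Boolean per record, so the second record is flagged as violating the requirement.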

Before

The DGM used to generate the samples violates the background knowledge: many of the generated samples cross the boundary marked in red.

In other words, such samples have a lower maximum hemoglobin level than minimum hemoglobin level and are therefore not realistic.
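The share of unrealistic samples can be quantified by counting how many generated records break at least one requirement. The sketch below assumes each requirement is stored as a pair of coefficients and a bias, meaning sum(coefficient * feature) + bias >= 0. This representation, the helper names, and the column names are assumptions made for illustration, not PiShield's own format.

```python
def violates(sample, requirements):
    """Return True if the sample breaks at least one requirement.

    Each requirement is a (coeffs, bias) pair and stands for
    sum(coeffs[name] * sample[name]) + bias >= 0.
    """
    for coeffs, bias in requirements:
        total = bias + sum(w * sample[name] for name, w in coeffs.items())
        if total < 0:
            return True
    return False


def violation_rate(samples, requirements):
    """Fraction of samples that break at least one requirement."""
    if not samples:
        return 0.0
    return sum(violates(s, requirements) for s in samples) / len(samples)


# Hypothetical encoding of: hemoglobin_max - hemoglobin_min >= 0
requirements = [({"hemoglobin_max": 1.0, "hemoglobin_min": -1.0}, 0.0)]

# Made-up records standing in for generated samples.
generated = [
    {"hemoglobin_max": 13.0, "hemoglobin_min": 11.0},
    {"hemoglobin_max": 9.0, "hemoglobin_min": 10.0},
    {"hemoglobin_max": 12.0, "hemoglobin_min": 12.0},
    {"hemoglobin_max": 8.0, "hemoglobin_min": 9.5},
]
rate = violation_rate(generated, requirements)
```

With these made-up records, two of the four samples fall on the wrong side of the boundary, so `rate` is 0.5.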

After

All samples generated by constraining the DGM with PiShield (denoted as the C-DGM model) satisfy the background knowledge.
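One way to obtain such a guarantee is to correct every generated record so that it lands inside the region the requirements allow. The sketch below illustrates the idea for the single hemoglobin requirement only. It is an assumption made for illustration, with hypothetical function and column names, and it is not PiShield's API or its actual constraint mechanism.

```python
def enforce_max_ge_min(sample, max_col="hemoglobin_max", min_col="hemoglobin_min"):
    """Return a copy of the sample in which max_col is at least min_col.

    The minimum is left untouched and the maximum is raised when needed,
    so the output always satisfies max_col - min_col >= 0.
    """
    repaired = dict(sample)
    repaired[max_col] = max(sample[max_col], sample[min_col])
    return repaired


raw = {"hemoglobin_max": 9.0, "hemoglobin_min": 10.0}        # violates the requirement
fixed = enforce_max_ge_min(raw)                               # maximum raised to 10.0

already_valid = {"hemoglobin_max": 13.0, "hemoglobin_min": 11.0}
unchanged = enforce_max_ge_min(already_valid)                 # returned as an equal copy
```

Records that already satisfy the requirement pass through unchanged, so the correction only affects samples that would otherwise cross the boundary.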

This results in realistic data that also resembles the real data more closely than the samples generated by the baseline DGM.