Primary symptom and diagnosis. The principal symptom in conventional ML workflows is that developers repeatedly carry out the same data-wrangling tasks from scratch. From an information-theoretic perspective, this reflects excessively low alphabet compression (AC) across all three abstract components: interaction (I), algorithm (A), and statistics (S), because no shared, compressed representation of prior transformation effort is available. Every developer must independently sift through the raw data and reproduce derivations that colleagues may already have performed. The root cause is therefore diagnosed as (I, A, S) Low-AC. The remedy is to store transformation outputs centrally so that downstream developers can reuse them. Storing pre-processed datasets increases AC across I, A, and S (denoted (I, A, S) High-AC) by creating a shared processed dataset that can be queried on demand. This remedy reduces the interaction costs borne by individual developers; however, materializing every explored transformation permutation as an explicit physical copy imposes an algorithmic storage cost (A High-Ct).
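To make the remedy concrete, the following Python sketch shows one way a central store of pre-processed datasets could be organized. The names (ProcessedDatasetStore, publish, query) and the use of Parquet files are illustrative assumptions, not part of any system described here.

```python
# Minimal sketch of a shared store for pre-processed datasets.
# Hypothetical API: ProcessedDatasetStore, publish, query are assumptions.
from dataclasses import dataclass, field
from typing import Dict
import pandas as pd

@dataclass
class ProcessedDatasetStore:
    """Central registry mapping a dataset key to a materialized file path."""
    _index: Dict[str, str] = field(default_factory=dict)

    def publish(self, key: str, df: pd.DataFrame, path: str) -> None:
        # Materialize the transformation output once so colleagues can reuse it.
        df.to_parquet(path)
        self._index[key] = path

    def query(self, key: str) -> pd.DataFrame:
        # Downstream developers read the shared copy instead of re-deriving it.
        return pd.read_parquet(self._index[key])
```

In this sketch, every publish call creates another physical copy of a derived dataset, which is exactly the algorithmic storage cost (A High-Ct) addressed next.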
Storage overhead. This cost is a direct consequence of explicit duplication of derived datasets (A Low-AC): because each derived dataset is stored as a discrete physical file, the storage footprint grows with every new transformation variant explored during experimentation. In some cases, for example when a long time series is segmented with a sliding window, the derived dataset can grow to many times the size of the original source. Data virtualization resolves this storage side effect by storing only lightweight specifications, such as paths to source data and identifiers of transformation functions, rather than the materialized outputs. This shift from physical to virtual representation raises algorithmic AC (A High-AC) by collapsing multiple redundant files into a single specification entry. The cost introduced is a marginal computational overhead (A Low-Ct), incurred when a data loader resolves a virtual dataset by executing the required transformations at query time (I High-Ct). In practice, however, this overhead is negligible, particularly given the Data Processing Cache (DPC) within the data virtualization service (DVS).
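The sketch below illustrates the idea of a virtual dataset specification and its lazy, cached resolution. VirtualDataset, TRANSFORMS, resolve, and the in-memory _dpc dictionary are hypothetical stand-ins for the DVS and its DPC, not their actual interfaces.

```python
# Minimal sketch of a virtual dataset: a lightweight specification that is
# resolved at query time, with results reused via a processing cache.
from dataclasses import dataclass
from typing import Callable, Dict, Tuple
import pandas as pd

# Registry of named transformation functions, referenced by identifier.
TRANSFORMS: Dict[str, Callable[..., pd.DataFrame]] = {
    "drop_na": lambda df: df.dropna(),
    "sliding_window": lambda df, size: df.rolling(size).mean().dropna(),
}

# In-process stand-in for the Data Processing Cache (DPC): spec key -> result.
_dpc: Dict[Tuple, pd.DataFrame] = {}

@dataclass(frozen=True)
class VirtualDataset:
    """Specification only: a source path plus transformation call links."""
    source_path: str
    calls: Tuple[Tuple[str, Tuple], ...]  # e.g. (("drop_na", ()), ("sliding_window", (5,)))

    def resolve(self) -> pd.DataFrame:
        # Replay the transformation chain at query time, reusing cached results.
        key = (self.source_path, self.calls)
        if key not in _dpc:
            df = pd.read_csv(self.source_path)
            for name, args in self.calls:
                df = TRANSFORMS[name](df, *args)
            _dpc[key] = df
        return _dpc[key]
```

Only the specification is stored persistently; the materialized result exists transiently or in the cache, which is what collapses the redundant physical copies into a single entry.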
Loss of provenance and transparency. Virtual transformations bring a second side effect: developers who use virtual datasets produced by others lack visibility into the transformation history. The sequence of path links, transformation functions, and parameter choices that constitutes the virtualization pipeline is not easily interpretable. This situation corresponds to high AC at the interaction component (I High-AC): the developer's interaction with the data has lost access to the intermediate information states needed to understand, verify, or reproduce the derivation. It is important to note that the virtualization infrastructure does not discard provenance information. On the contrary, the complete data lineage is structurally preserved within the specification of each virtual dataset, which records all path links and transformation call links. The problem is one of accessibility, not loss. The appropriate remedy is therefore to lower AC through provenance visualization (V Low-AC), rendering the transformation graph as an interactive, navigable structure. Developers can then audit, verify, and reproduce data transformations with minimal effort, yielding low interaction cost (I Low-Ct) across the full ML development lifecycle.
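As a minimal illustration of provenance visualization, the sketch below walks a virtual dataset specification of the hypothetical form used above and emits a Graphviz DOT description of its lineage; the interactive viewer and the exact link structure of the DVS are assumed rather than reproduced.

```python
# Minimal sketch: render the lineage of a virtual dataset (path link plus
# transformation call links) as a Graphviz DOT graph for inspection.
from typing import Sequence, Tuple

def lineage_to_dot(source_path: str, calls: Sequence[Tuple[str, Tuple]]) -> str:
    """Emit a DOT graph: source node -> transformation nodes -> virtual dataset."""
    lines = ["digraph lineage {",
             f'  source [label="{source_path}", shape=folder];']
    prev = "source"
    for i, (name, args) in enumerate(calls):
        node = f"t{i}"
        label = f"{name}{args}" if args else name
        lines.append(f'  {node} [label="{label}", shape=box];')
        lines.append(f"  {prev} -> {node};")
        prev = node
    lines.append('  out [label="virtual dataset", shape=oval];')
    lines.append(f"  {prev} -> out;")
    lines.append("}")
    return "\n".join(lines)

# Example: lineage_to_dot("s3://bucket/raw.csv",
#                         [("drop_na", ()), ("sliding_window", (5,))])
```

Rendering this DOT output with a graph viewer gives the developer a navigable view of the derivation chain without reading the raw specification.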