Primary symptom and diagnosis. The principal symptom in conventional ML workflows is that developers repeatedly carry out the same data-wrangling tasks from scratch. From an information-theoretic perspective, this reflects excessively low alphabet compression (AC) across all three abstract components: interaction (I), algorithm (A), and statistics (S), because no shared, compressed representation of prior transformation effort is available. Every developer must independently sift through the raw data and reproduce derivations that colleagues may already have performed. The root cause is therefore diagnosed as (I, A, S) Low-AC. The remedy is to store transformation outputs centrally so that downstream developers can reuse them. Storing pre-processed datasets increases AC across I, A, and S (denoted (I, A, S) High-AC) by creating a shared processed dataset that can be queried on demand. This remedy reduces the interaction costs borne by individual developers; however, materializing every explored transformation permutation as an explicit physical copy imposes an algorithmic storage cost (A High-Ct).
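To make the remedy concrete, the following Python sketch shows one way a central store of pre-processed datasets could be organized. The names (ProcessedDatasetStore, publish, query) and the use of Parquet files are illustrative assumptions, not part of any system described here.

```python
# Minimal sketch of a shared store for pre-processed datasets.
# Hypothetical API: ProcessedDatasetStore, publish, query are assumptions.
from dataclasses import dataclass, field
from typing import Dict
import pandas as pd

@dataclass
class ProcessedDatasetStore:
    """Central registry mapping a dataset key to a materialized file path."""
    _index: Dict[str, str] = field(default_factory=dict)

    def publish(self, key: str, df: pd.DataFrame, path: str) -> None:
        # Materialize the transformation output once so colleagues can reuse it.
        df.to_parquet(path)
        self._index[key] = path

    def query(self, key: str) -> pd.DataFrame:
        # Downstream developers read the shared copy instead of re-deriving it.
        return pd.read_parquet(self._index[key])
```

In this sketch, every publish call creates another physical copy of a derived dataset, which is exactly the algorithmic storage cost (A High-Ct) addressed next.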
Storage overhead. This cost is a direct consequence of explicit duplication of derived datasets (A Low-AC): because each derived dataset is stored as a discrete physical file, the storage footprint grows with every new transformation variant explored during experimentation. In some cases, for example when a long time series is segmented with a sliding window, the derived dataset can grow to many times the size of the original source. Data virtualization resolves this storage side effect by storing only lightweight specifications, such as paths to source data and identifiers of transformation functions, rather than the materialized outputs. This shift from physical to virtual representation raises algorithmic AC (A High-AC) by collapsing multiple redundant files into a single specification entry. The cost introduced is a marginal computational overhead (A Low-Ct), incurred when a data loader resolves a virtual dataset by executing the required transformations at query time (I High-Ct). In practice, however, this overhead is negligible, particularly given the Data Processing Cache (DPC) within the data virtualization service (DVS).
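The sketch below illustrates the idea of a virtual dataset specification and its lazy, cached resolution. VirtualDataset, TRANSFORMS, resolve, and the in-memory _dpc dictionary are hypothetical stand-ins for the DVS and its DPC, not their actual interfaces.

```python
# Minimal sketch of a virtual dataset: a lightweight specification that is
# resolved at query time, with results reused via a processing cache.
from dataclasses import dataclass
from typing import Callable, Dict, Tuple
import pandas as pd

# Registry of named transformation functions, referenced by identifier.
TRANSFORMS: Dict[str, Callable[..., pd.DataFrame]] = {
    "drop_na": lambda df: df.dropna(),
    "sliding_window": lambda df, size: df.rolling(size).mean().dropna(),
}

# In-process stand-in for the Data Processing Cache (DPC): spec key -> result.
_dpc: Dict[Tuple, pd.DataFrame] = {}

@dataclass(frozen=True)
class VirtualDataset:
    """Specification only: a source path plus transformation call links."""
    source_path: str
    calls: Tuple[Tuple[str, Tuple], ...]  # e.g. (("drop_na", ()), ("sliding_window", (5,)))

    def resolve(self) -> pd.DataFrame:
        # Replay the transformation chain at query time, reusing cached results.
        key = (self.source_path, self.calls)
        if key not in _dpc:
            df = pd.read_csv(self.source_path)
            for name, args in self.calls:
                df = TRANSFORMS[name](df, *args)
            _dpc[key] = df
        return _dpc[key]
```

Only the specification is stored persistently; the materialized result exists transiently or in the cache, which is what collapses the redundant physical copies into a single entry.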
Loss of provenance and transparency. Virtual transformations bring a second side effect: developers who use virtual datasets produced by others lack visibility into the transformation history. The sequence of path links, transformation functions, and parameter choices that constitutes the virtualization pipeline is not easily interpretable. This situation corresponds to high AC at the interaction component (I High-AC): the developer's interaction with the data has lost access to the intermediate information states needed to understand, verify, or reproduce the derivation. It is important to note that the virtualization infrastructure does not discard provenance information. On the contrary, the complete data lineage is structurally preserved within the specification of each virtual dataset, which records all path links and transformation call links. The problem is one of accessibility, not loss. The appropriate remedy is therefore to lower AC through provenance visualization (V Low-AC), rendering the transformation graph as an interactive, navigable structure. Developers can then audit, verify, and reproduce data transformations with minimal effort, yielding low interaction cost (I Low-Ct) across the full ML development lifecycle.
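As a minimal illustration of provenance visualization, the sketch below walks a virtual dataset specification of the hypothetical form used above and emits a Graphviz DOT description of its lineage; the interactive viewer and the exact link structure of the DVS are assumed rather than reproduced.

```python
# Minimal sketch: render the lineage of a virtual dataset (path link plus
# transformation call links) as a Graphviz DOT graph for inspection.
from typing import Sequence, Tuple

def lineage_to_dot(source_path: str, calls: Sequence[Tuple[str, Tuple]]) -> str:
    """Emit a DOT graph: source node -> transformation nodes -> virtual dataset."""
    lines = ["digraph lineage {",
             f'  source [label="{source_path}", shape=folder];']
    prev = "source"
    for i, (name, args) in enumerate(calls):
        node = f"t{i}"
        label = f"{name}{args}" if args else name
        lines.append(f'  {node} [label="{label}", shape=box];')
        lines.append(f"  {prev} -> {node};")
        prev = node
    lines.append('  out [label="virtual dataset", shape=oval];')
    lines.append(f"  {prev} -> out;")
    lines.append("}")
    return "\n".join(lines)

# Example: lineage_to_dot("s3://bucket/raw.csv",
#                         [("drop_na", ()), ("sliding_window", (5,))])
```

Rendering this DOT output with a graph viewer gives the developer a navigable view of the derivation chain without reading the raw specification.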