The revolutionary breakthrough of foundation models has made foundation model fine-tuning (FMF) workloads prevalent in modern GPU datacenters. However, existing deep learning schedulers are primarily tailored for training models from scratch and cannot adapt well to FMF workloads because they do not account for their unique characteristics.
We propose YMIR, the first scheduler to improve the efficiency of FMF workloads in large-scale GPU datacenters. The key insight of our design is that many FMF workloads share the same architecture backbone, giving us the opportunity to merge multiple workloads for higher efficiency. YMIR addresses two challenges of FMF workload scheduling: (1) it reduces the GPU resource consumption and context switch overhead of FMF workloads; (2) it exploits the task transferability among different FMF workloads to improve cluster-wide efficiency. We conduct 32-GPU physical experiments and 240-GPU trace-driven simulations to validate the effectiveness of YMIR. YMIR reduces the average job completion time by up to 4.3× compared to existing state-of-the-art schedulers. It also promotes scheduling fairness by fully exploiting task transferability.
Figure 6 shows the workflow of YMIR. It contains three key components: YMIREstimator is responsible for predicting the execution time of FMF workloads in various transfer learning modes (§ 4); YMIRSched assigns resources to FMF workloads and identifies the optimal transfer learning modes (§ 5); YMIRTuner improves the efficiency of FMF workloads with novel data-parallelism and lightweight context switch mechanisms (§ 6).
Specifically, (1) a user submits an FMF workload to the scheduler in a YAML format. The YAML file specifies the model, the path of the fine-tuning dataset, fine-tuning hyperparameters (e.g., batch size, learning rate), whether to accept parameter sharing with other tasks, and the target computing metrics (e.g., accuracy, F1-score, BLEU score) (①). For a new workload, YMIREstimator requests profiling resources (e.g., 1 GPU) from YMIRSched to collect statistical information (e.g., loss, gradient) and calls the transferability estimator to compute the transferability score and predict the transfer gain (defined in Eqn. 1) between the new workload and other FMF workloads (②). The iteration estimator then uses the transfer gain to predict the number of iterations (defined in Eqn. 2) needed to reach the target computing metrics in different learning modes (③). The time estimator uses the estimated number of iterations to predict the duration of each task under any resource allocation (defined in Eqn. 5). Importantly, this estimation process is performed only once for each new workload, significantly reducing the computational overhead (④). (2) In YMIRSched, the task merger uses the predicted execution information from YMIREstimator to decide the optimal transfer learning modes for potential task combinations and makes the resource (re-)allocations for each task (⑤). (3) In YMIRTuner, the task constructor instantiates the FMF workloads based on the given transfer learning modes and other hyperparameters, and executes them on the allocated resources (⑥). Pipeswitch efficiently reduces the context switch overhead and enables more flexible scheduling by YMIRSched (⑦).
Figure 6: Workflow of YMIR
Table 8: Dataset Description
We present the full suite of FMF tasks in Table 8. We use 9 vision datasets, 9 language understanding datasets, and 9 language generation datasets for ViT-Base, RoBERTa-Base, and Vicuna-7B, respectively. We conducted a hyper-parameter sweep to search for the optimal learning rate and batch size of each task. We also report the validation metric, scale, and size (the total number of samples) of each dataset.
Workload. Our adopted FMTrace contains 18,471 jobs over a period of 3 months on a cluster of 88 nodes with a total of 704 NVIDIA V100 GPUs. We select a subset of jobs whose durations range from 5 minutes to 10 hours to synthesize our evaluated workloads. We use one workload for YMIR's physical experiments and three workloads for its simulation experiments. We follow FMTrace's patterns of job arrival, resource request, and job duration. We assign the sampled datasets in Table 8 to each job based on its GPU time. We set the probabilities of generating Small (0–0.5 GPU-hours), Medium (0.5–10 GPU-hours), and Large (10–64 GPU-hours) jobs to 0.3, 0.6, and 0.1, respectively. We sample 240, 180, and 120 jobs for ViT-Base, RoBERTa-Base, and Vicuna-7B, respectively, for the physical evaluation. Moreover, we select 3,000, 2,000, and 1,500 jobs for ViT-Base, RoBERTa-Base, and Vicuna-7B, respectively, as the 1× job load to demonstrate the scalability of YMIR.
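As a minimal sketch of this size sampling (the uniform draw within each bucket is our own assumption; only the bucket boundaries and probabilities come from the description above), the synthesis step could look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Job-size buckets (GPU-hours) and their sampling probabilities, as described above.
buckets = {
    "Small":  (0.0, 0.5),
    "Medium": (0.5, 10.0),
    "Large":  (10.0, 64.0),
}
probs = [0.3, 0.6, 0.1]

def sample_gpu_hours(n_jobs: int):
    """Draw a GPU-time budget for each synthesized job; uniform within a bucket
    is an assumption, while the bucket probabilities follow the text."""
    names = rng.choice(list(buckets), size=n_jobs, p=probs)
    return [rng.uniform(*buckets[name]) for name in names]

print(sample_gpu_hours(5))
```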
Simulator Construction. We directly use our LUT to simulate job throughput under different resource allocations. We collect job throughput data on V100-32GB GPUs for the simulation experiments with ViT-Base and RoBERTa-Base, and on A100-80GB GPUs for the simulation experiments with Vicuna-7B. For unseen configurations, we use linear interpolation to estimate the throughput. In addition, we feed the actual profiling cost of transferability estimation and the measured context switch overhead into our simulator. A similar simulator construction method is also adopted by Pollux [1].
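A minimal sketch of how such LUT interpolation can work (the GPU counts and throughput numbers below are illustrative placeholders, not measured values):

```python
import numpy as np

# Hypothetical LUT: measured throughput (samples/s) at profiled GPU counts.
profiled_gpus     = np.array([1, 2, 4, 8])
profiled_thruput  = np.array([310.0, 590.0, 1150.0, 2200.0])  # illustrative numbers

def estimate_throughput(num_gpus: int) -> float:
    """Linearly interpolate throughput for an unprofiled allocation,
    mirroring how the simulator fills gaps in the LUT."""
    return float(np.interp(num_gpus, profiled_gpus, profiled_thruput))

print(estimate_throughput(3))   # falls between the 2- and 4-GPU measurements
```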
We provide a comprehensive explanation of our YAML format for submitting FMF workloads in Figure 14; a hypothetical example follows the template below.
Figure 14: YAML Template
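For illustration, a hypothetical submission covering the fields described in the workflow (the exact keys of the real template are those shown in Figure 14) might be parsed as follows:

```python
import yaml  # pip install pyyaml

# Hypothetical submission mirroring the fields described in the workflow:
# model, dataset path, fine-tuning hyperparameters, parameter-sharing flag,
# and target computing metric.
submission = yaml.safe_load("""
model: RoBERTa-Base
dataset_path: /data/glue/rte
hyperparameters:
  batch_size: 64
  learning_rate: 4.0e-5
allow_parameter_sharing: true
target_metric:
  name: accuracy
  value: 0.78
""")

print(submission["model"], submission["target_metric"])
```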
Figure 15: Transfer learning performance: (a) RTE accuracy in temporal transfer learning on RoBERTa-Base; (b) FashionMNIST accuracy in spatial transfer learning on ViT-Base.
In fact, temporal and spatial transfer learning do not always bring beneficial JCT speedups. Figure 15a shows that RTE accuracy degrades when we adopt QQP→RTE on RoBERTa-Base compared with using normal transfer to fine-tune the RTE task. Similarly, Figure 15b shows that FashionMNIST accuracy drops significantly when we leverage ImageNet25 || FashionMNIST on ViT-Base. Both negative examples demonstrate that we cannot naively combine two tasks and expect a positive transfer gain. Instead, we need an effective transferability estimator to filter out negative task combinations and direct the scheduler toward positive task combinations that improve cluster-wide efficiency.
We present how YMIR computes the transferability score for different transfer learning modes using Task2Vec in Alg. 1. In particular, we introduce a function VECTORIZATION to compute the diagonal entries of the Fisher Information Matrix (FIM), which represent a task with a fixed-dimension embedding (Lines 5–8). This process is computationally efficient because it calls SUBSET to compute the FIM on a data subset instead of the full dataset. We also introduce another function SIMSCORE to measure the distance between two task embeddings FA and FB (Lines 9–10). With these functions, we can calculate the transferability score (Line 17). (1) For temporal transfer learning, it is computed as the embedding distance between tasks A and B minus the embedding distance between two runs of task B with different seeds. (2) For spatial transfer learning, we compute the bidirectional transferability scores d(A, B) and d(B, A) and take their average as the final transferability score.
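A minimal NumPy sketch of this scoring logic (function names mirror Alg. 1, but the cosine distance and normalization below are our assumptions; the real SUBSET-based FIM computation and SIMSCORE may differ):

```python
import numpy as np

def vectorization(fim_diag: np.ndarray) -> np.ndarray:
    """VECTORIZATION (sketch): the Task2Vec embedding is the diagonal of the FIM
    computed on a data SUBSET; here we assume the diagonal is precomputed and
    simply normalize it into a fixed-dimension embedding."""
    return fim_diag / np.linalg.norm(fim_diag)

def simscore(fa: np.ndarray, fb: np.ndarray) -> float:
    """SIMSCORE (sketch): cosine distance between two task embeddings.
    The exact distance used in Alg. 1 may differ; this is an assumption."""
    return 1.0 - float(fa @ fb) / (np.linalg.norm(fa) * np.linalg.norm(fb))

def temporal_score(fa, fb, fb_reseeded) -> float:
    # d(A, B) minus the self-distance of B under a different seed.
    return simscore(fa, fb) - simscore(fb, fb_reseeded)

def spatial_score(fa, fb, fa_reseeded, fb_reseeded) -> float:
    # Average of the two directional scores d(A, B) and d(B, A).
    return 0.5 * (temporal_score(fa, fb, fb_reseeded) +
                  temporal_score(fb, fa, fa_reseeded))
```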
We leverage Eqn. 3 to estimate the number of iterations that an FMF workload needs to reach convergence. We extend Optimus's curve-fitting model by introducing a new term. To validate the effectiveness of curve fitting, Figure 16 plots the normalized actual loss (y-axis) and the fitted values over training progress (x-axis). We find that the additional term helps better fit the normalized training loss.
Figure 16: Prediction performance of our proposed Eqn. 3 and Optimus’s curve fitting formulation.
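As a hedged sketch of this kind of loss-curve fitting, the snippet below fits the published Optimus-style form loss(k) ≈ 1/(β0·k + β1) + β2 to profiled loss values and inverts it to estimate the iterations needed for a target loss; YMIR's additional term in Eqn. 3 is not reproduced here, and the numbers are illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

def optimus_loss(k, b0, b1, b2):
    """Optimus-style convergence curve: loss ~ 1 / (b0 * k + b1) + b2,
    where k is the training progress (iterations or epochs)."""
    return 1.0 / (b0 * k + b1) + b2

# Profiled (iteration, normalized loss) pairs from the first few steps of a job.
k_obs = np.array([1, 2, 4, 8, 16, 32], dtype=float)
loss_obs = np.array([0.95, 0.71, 0.52, 0.38, 0.30, 0.26])

params, _ = curve_fit(optimus_loss, k_obs, loss_obs, p0=[0.1, 1.0, 0.1], maxfev=10000)
b0, b1, b2 = params

# Invert the fitted curve to predict the iterations needed for a target loss
# the profile has not reached yet (must be above the fitted asymptote b2).
target = 0.25
k_needed = (1.0 / (target - b2) - b1) / b0
print(params, k_needed)
```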
Table 9 illustrates the impact of the transfer gain and the resource allocation on determining the transfer learning modes. To ease understanding, we directly use the transfer gain to represent the JCT improvement brought by task transferability. The table shows the JCT of fine-tuning two jobs A and B in four scenarios with different resource capacities and transfer gains. Both jobs require 10 fine-tuning epochs with at most two GPUs, where A and B need one and two time units per epoch on a single GPU, respectively. Normal training uses the Shortest Remaining Time First (SRTF) algorithm for scheduling; A is scheduled before B as it completes faster. We observe two insights: (1) the optimal learning mode depends on the transfer gain between tasks (1st, 2nd, and 3rd rows); (2) the resource allocation affects the optimal transfer learning mode (3rd and 4th rows). This analysis implies that we should consider both resource allocations and transfer gains to determine the optimal transfer learning mode.

To ease the reader's understanding of Table 9, we walk through several representative cases; a short arithmetic check follows the table. First, consider the JCT computation for normal training. We adopt the SRTF algorithm to schedule tasks A and B. When the cluster capacity is 1, the JCTs of tasks A and B are 10 and 30 (= 10 + 20), respectively, giving a total JCT of 40. When we increase the cluster capacity to 3, we allocate 2 GPUs to task A and 1 GPU to task B, and then allocate 2 GPUs to task B when task A completes. The JCTs of tasks A and B are 5 and 12.5 (= 5 + 7.5), so the total JCT is 17.5.

Second, the computation for temporal transfer learning is similar to normal training, except that the transfer gain reduces the amount of fine-tuning needed to reach convergence. In the 3rd row, A → B and B → A can each reduce training iterations by 30%. We first run task A, which gains nothing from temporal transfer, and then run task B initialized from task A's weights, which reduces task B's training epochs from 10 to 7. The JCTs of tasks A and B are therefore 10 and 24 (= 10 + 14), respectively.

Third, we explain how to compute the JCTs of tasks A and B under spatial transfer learning. In the 2nd row, tasks A and B can mutually reduce the number of fine-tuning epochs by up to 60%. Under spatial transfer learning, the number of fine-tuning epochs decreases to 4, but the time per epoch increases to 3 (= 2 + 1) time units, so the JCT of both tasks is 12. In the 4th row, when the cluster capacity is 3, spatial transfer learning can increase the number of allowed GPUs because of the accumulated batch size. Hence we can allocate 3 GPUs to A || B, achieve a 3× throughput speedup, and derive the JCT of both tasks as 7.
Table 9: Comparison of JCT in normal, temporal and spatial transfer learning modes.
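The arithmetic behind these cases can be checked directly (a sketch mirroring the walkthrough above, assuming the stated linear speedup with GPU count; no scheduling logic beyond the stated rules is implied):

```python
# Time units per epoch on a single GPU: A = 1, B = 2; both need 10 epochs.

# Normal training, cluster capacity 1 (SRTF: A before B).
jct_a = 10 * 1                  # A alone on 1 GPU
jct_b = jct_a + 10 * 2          # B starts after A finishes
assert (jct_a, jct_b) == (10, 30) and jct_a + jct_b == 40

# Normal training, capacity 3: A gets 2 GPUs, B gets 1, then B gets 2 GPUs.
jct_a = 10 * 1 / 2                             # 5 time units
epochs_b_done = jct_a / 2                      # 2.5 epochs of B finished on 1 GPU
jct_b = jct_a + (10 - epochs_b_done) * 2 / 2   # remaining 7.5 epochs on 2 GPUs
assert (jct_a, jct_b) == (5, 12.5) and jct_a + jct_b == 17.5

# Temporal transfer A -> B with 30% iteration reduction, capacity 1.
jct_a = 10 * 1
jct_b = jct_a + 7 * 2           # B needs only 7 epochs after warm-starting from A
assert (jct_a, jct_b) == (10, 24)

# Spatial transfer A || B with 60% reduction, capacity 1 (epoch time 2 + 1 = 3).
assert 4 * 3 == 12

# Spatial transfer A || B with 30% reduction, capacity 3 (3x throughput speedup).
assert 7 * 3 / 3 == 7
```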
We can build the LUT in an offline manner, but this is still challenging because the number of potential configurations is extremely large. For example, for RoBERTa-Base, the sizes of the configuration dimensions {s, m, amp, ℓ, ckpt, pipeline} are (32, 2, 64, 2, 12, 2, 2). The total profiling time is up to 1,092 hours, assuming that profiling each configuration combination takes 10 seconds. The configuration space scales exponentially when considering larger parameter spaces.
To address this challenge, we zoom into the details of these training configurations. First, we can ignore some configurations without affecting the effectiveness of the LUT. For example, we can prune batch-size settings because job speed correlates linearly with the batch size within a certain range, and gradient checkpointing is only beneficial when the batch size is large. Second, our resource scheduler only considers numbers of allocated GPUs from {0, 1, 2, 3, 4m | m ∈ Z+}, so we only need to profile a smaller set of resource allocations. Third, we do not need to consider layer freezing when using parameter-efficient transfer learning. With the above considerations, we reduce the number of profiled configurations to around 2,000 and the profiling time to about 5 hours per FM, as the sketch below shows. Additionally, we can profile many configurations in parallel to further reduce the profiling cost.
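The profiling-cost arithmetic above can be reproduced as follows (the pruned count of roughly 2,000 configurations is taken from the text):

```python
import math

# Full configuration space for RoBERTa-Base as listed above (dimension sizes).
dim_sizes = (32, 2, 64, 2, 12, 2, 2)
seconds_per_config = 10

full_configs = math.prod(dim_sizes)
full_hours = full_configs * seconds_per_config / 3600
print(full_configs, round(full_hours))        # 393216 configurations, ~1092 hours

# After pruning (restricted batch sizes and GPU counts, no layer freezing under
# PETL), roughly 2,000 configurations remain.
pruned_configs = 2000
print(round(pruned_configs * seconds_per_config / 3600, 1))   # ~5.6 hours per FM
```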
Considering 2-GPU resource allocations, we compute TranWt for different task combinations on ViT-Base, RoBERTa-Base, and Vicuna-7B under different transfer learning modes. Note that we quickly filter out negative combinations for spatial transfer using Task2Vec, considering the significant training overhead of Vicuna-7B. Figures 17 (a), (c), and (e) show TranWt for ViT-Base, RoBERTa-Base, and Vicuna-7B under temporal transfer, and Figures 17 (b), (d), and (f) show TranWt under spatial transfer. We observe that spatial transfer exhibits higher TranWt than temporal transfer. This is because spatial transfer can exploit mutual positive transferability, whereas temporal transfer learning enforces a sequential order and only takes advantage of one-way positive transferability.
Figure 18: Graphical illustration of existing PETL methods. "FM module" denotes a frozen sublayer of the foundation model (e.g., attention or FFN).
Figure 17: TranWt of ViT-Base (a-b), RoBERTa-Base (c-d) and Vicuna-7B (e-f) across different datasets.
Figure 18 illustrates the differences among existing parameter-efficient transfer learning (PETL) methods. They share a similar architecture, which is why we can adopt a unified architecture to express any PETL architecture. Users can specify one of these architectures, and we extend it into the unified architecture by adding certain layers and freezing the corresponding parameters.
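As a minimal PyTorch-style sketch of this unified view (module and argument names are ours, and the real implementation follows [4] rather than this code), a frozen FM sublayer can be wrapped with a trainable low-rank branch whose insertion form covers adapter- and LoRA-style variants:

```python
import torch
import torch.nn as nn

class UnifiedPETLLayer(nn.Module):
    """Sketch of the unified PETL view [4]: a frozen FM sublayer plus a trainable
    low-rank bottleneck branch; the insertion form (parallel vs. sequential) and
    the scaling factor distinguish adapter-, LoRA-, and prefix-style variants."""

    def __init__(self, fm_sublayer: nn.Module, dim: int, bottleneck: int = 16,
                 scaling: float = 1.0, parallel: bool = True):
        super().__init__()
        self.fm_sublayer = fm_sublayer          # frozen "FM module" (e.g., attention or FFN)
        for p in self.fm_sublayer.parameters():
            p.requires_grad = False
        self.down = nn.Linear(dim, bottleneck)  # trainable down-projection
        self.up = nn.Linear(bottleneck, dim)    # trainable up-projection
        self.act = nn.ReLU()
        self.scaling = scaling
        self.parallel = parallel

    def forward(self, x):
        if self.parallel:
            # Parallel insertion: frozen output plus a bottleneck of the input.
            return self.fm_sublayer(x) + self.scaling * self.up(self.act(self.down(x)))
        # Sequential insertion: bottleneck applied to the frozen output (adapter-style).
        h = self.fm_sublayer(x)
        return h + self.scaling * self.up(self.act(self.down(h)))

# Example: wrap a toy frozen sublayer of width 768.
layer = UnifiedPETLLayer(nn.Linear(768, 768), dim=768, bottleneck=16, parallel=True)
out = layer(torch.randn(2, 10, 768))
```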
Given the number of datasets, we follow a workload synthesis process similar to that in Appendix A (Dataset and Workload) and assign datasets based on GPU time. For each experimental setting, we sample 3 traces to confirm the advantages of our task merger.
Previous studies [2, 3] point out the importance of batch size and learning rate for the performance of FMF workloads. For language and vision tasks, we search the learning rate over {10^-5, 2×10^-5, 4×10^-5, 10^-4, 4×10^-4, 10^-3, 10^-2}. The batch size search spaces are {64, 256, 512, 768}, {16, 64, 128, 192}, and {24, 32, 64, 96} for vision, language understanding, and language generation tasks, respectively. We search for the optimal combination of hyper-parameters in a brute-force manner and apply the found optimal hyper-parameters in our experiments. We use the default values of the Transformers official examples for other optimization hyper-parameters, including the optimizer, weight decay, learning rate scheduler, and sequence length. The optimal validation metric is used as the stopping criterion for each FMF workload.
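A minimal sketch of this brute-force sweep (the `evaluate` callback standing in for an actual fine-tuning run is hypothetical):

```python
import itertools

# Search grids as described above.
learning_rates = [1e-5, 2e-5, 4e-5, 1e-4, 4e-4, 1e-3, 1e-2]
batch_sizes = {
    "vision":                 [64, 256, 512, 768],
    "language_understanding": [16, 64, 128, 192],
    "language_generation":    [24, 32, 64, 96],
}

def sweep(task_type, evaluate):
    """Brute-force sweep: `evaluate(lr, bs)` is a hypothetical callback that
    fine-tunes with the given hyper-parameters and returns the validation metric."""
    best = max(itertools.product(learning_rates, batch_sizes[task_type]),
               key=lambda cfg: evaluate(*cfg))
    return best  # (learning_rate, batch_size) with the best validation metric

# Example with a dummy evaluator standing in for an actual fine-tuning run.
print(sweep("vision", lambda lr, bs: -abs(lr - 4e-5) - abs(bs - 256) / 1e4))
```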
Our scheduler supports data parallelism for PETL and pipeline parallelism for full fine-tuning. Both initialize the weights of the task-specific head layers with Gaussian noise. For PETL, we leverage a unified technique [4] to build the PETL architecture. It integrates Adapters [5], LoRA [6], Prompt [7], and their variants into a single architecture while achieving comparable performance. We follow the official implementation and do not modify the architecture hyper-parameters for any specific task. For pipeline parallelism, we adopt FairScale [8] and use the LUT to determine the model partition and the number of pipelines offline for different resource allocations. More advanced implementations such as FTPipe [9] could further improve the system throughput of pipeline-parallel jobs, but this is not the focus of YMIR.
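As a rough sketch of the pipeline-parallel path (assuming FairScale's GPipe-style `Pipe` wrapper with `balance`/`chunks` arguments and two visible GPUs; the toy model and the partition below stand in for the LUT-derived choices):

```python
import torch.nn as nn
from fairscale.nn import Pipe

# Toy stand-in for a transformer: 12 blocks plus a head, expressed as nn.Sequential
# so it can be partitioned by layer.
blocks = [nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
          for _ in range(12)]
model = nn.Sequential(*blocks, nn.Linear(768, 2))

# Partition chosen offline (e.g., looked up from the LUT for a 2-GPU allocation):
# 7 modules on the first stage, 6 on the second.
balance = [7, 6]
pipe_model = Pipe(model, balance=balance, chunks=8)   # 8 micro-batches per step
```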
We implement three transferability estimation methods: Task2Vec [10], Task2Feat [11], and LEEP [12]. We implement Task2Vec based on the released code. In detail, we fix the number of samples used to compute the gradient to 500 and use the user-provided hyper-parameters, including the learning rate and batch size, to compute the gradient vector. For Task2Feat and LEEP, we adopt the released implementations. They compute the feature vector over the entire dataset and require no handcrafted hyper-parameters.
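For concreteness, a minimal NumPy sketch of the LEEP measure [12], computed from a source model's soft predictions on the target dataset (our own illustration, not the released implementation used above):

```python
import numpy as np

def leep(source_probs: np.ndarray, target_labels: np.ndarray, num_target_classes: int) -> float:
    """LEEP [12]: log expected empirical prediction.
    source_probs: (n, Z) soft predictions of the pre-trained source model on target data.
    target_labels: (n,) integer target labels in [0, num_target_classes)."""
    n, num_source_classes = source_probs.shape
    # Empirical joint distribution P(y, z) over target labels and source classes.
    joint = np.zeros((num_target_classes, num_source_classes))
    for theta, y in zip(source_probs, target_labels):
        joint[y] += theta / n
    cond = joint / joint.sum(axis=0, keepdims=True)      # P(y | z)
    # Expected empirical prediction of each sample's true label, then mean log.
    eep = np.array([cond[y] @ theta for theta, y in zip(source_probs, target_labels)])
    return float(np.mean(np.log(eep)))

# Toy usage: 4 samples, a 3-class source head, 2 target classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.2, 0.6]])
labels = np.array([0, 1, 0, 1])
print(leep(probs, labels, num_target_classes=2))
```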
[1] Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, and Eric P. Xing. Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning. In 15th USENIX Symposium on Operating Systems Design and Implementation, OSDI ’21, 2021.
[2] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021.
[3] Conglong Li, Minjia Zhang, and Yuxiong He. The stability-efficiency dilemma: Investigating sequence length warmup for training GPT models. In Advances in Neural Information Processing Systems, 2022.
[4] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. In International Conference on Learning Representations, 2022.
[5] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
[6] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[7] Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
[8] Mandeep Baines, Shruti Bhosale, Vittorio Caggiano, Naman Goyal, Siddharth Goyal, Myle Ott, Benjamin Lefaudeux, Vitaliy Liptchinsky, Mike Rabbat, Sam Sheiffer, Anjali Sridhar, and Min Xu. FairScale: A general purpose modular PyTorch library for high performance and large scale training, 2021.
[9] Saar Eliad, Ido Hakimi, Alon De Jagger, Mark Silberstein, and Assaf Schuster. Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism. In 2021 USENIX Annual Technical Conference, USENIX ATC ’21, 2021.
[10] Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless C Fowlkes, Stefano Soatto, and Pietro Perona. Task2Vec: Task embedding for meta-learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6430–6439, 2019.
[11] Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler, Andrew Mattarella-Micke, Subhransu Maji, and Mohit Iyyer. Exploring and predicting transferability across nlp tasks. arXiv preprint arXiv:2005.00770, 2020.
[12] Cuong Nguyen, Tal Hassner, Matthias Seeger, and Cedric Archambeau. LEEP: A new measure to evaluate transferability of learned representations. In International Conference on Machine Learning, pages 7294–7305. PMLR, 2020.