Multiple machine learning (ML) models are often incorporated into real-world ML systems. However, updating an individual model in these ML systems frequently results in regression errors, where the new model performs worse than the old model for some inputs. While model-level regression errors have been widely studied, little is known about how regression errors propagate at the system level. To address this gap, we propose RegTrieve, a novel retrieval-enhanced ensemble approach to reduce regression errors at both the model and system levels. Extensive experiments across various model update scenarios show that RegTrieve significantly reduces system-level regression errors with almost no impact on system accuracy, outperforming all baselines by 20.43% on average.
The intuition behind our retrieval-enhanced ensemble approach is to use similar training samples as a surrogate for the accuracy of the old and new models on a given test sample, which in turn determines the proper ensemble weight for that sample. As the figure shows, given a test audio file 𝑎_𝑡 and a partially predicted question 𝑞ˆ, we need to predict the next token. First, based on the vector representation 𝑟_𝑡 of the current context, the 𝐾 nearest neighbors (KNN) are retrieved from the datastore. Then, the ensemble weight 𝜆_𝑡 is calculated as a weighted average over the neighbors' old losses, new losses, and distances. Finally, the ensemble logits 𝑧_𝑒 are obtained by combining the old and new models' logits with 𝜆_𝑡. This approach is training-free and only requires model forward passes to build the indexed datastore.
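The per-token ensemble step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact implementation: we assume squared-L2 distances, a softmax over negative distances as the neighbor weighting, and a loss-ratio form for 𝜆_𝑡; the function and variable names are hypothetical.

```python
import numpy as np

def ensemble_weight(query_vec, keys, old_losses, new_losses, k=20):
    """Estimate the per-sample ensemble weight lambda_t by retrieving the
    K nearest datastore entries and distance-weighting their recorded losses.
    (Sketch only; the exact weighting scheme is an assumption.)"""
    # Squared L2 distances from the query context vector to all datastore keys.
    dists = np.sum((keys - query_vec) ** 2, axis=1)
    idx = np.argsort(dists)[:k]                  # indices of the K nearest neighbors
    # Softmax over negative distances: closer neighbors get larger weights.
    # Shift by the minimum distance for numerical stability.
    w = np.exp(-(dists[idx] - dists[idx].min()))
    w /= w.sum()
    # Distance-weighted surrogate losses of the old and new models.
    old_loss = np.dot(w, old_losses[idx])
    new_loss = np.dot(w, new_losses[idx])
    # Weight toward the model whose neighbors exhibit lower loss.
    return old_loss / (old_loss + new_loss + 1e-12)

def ensemble_logits(z_old, z_new, lam):
    # Ensemble logits z_e combine the two models' logits with lambda_t.
    return lam * z_new + (1.0 - lam) * z_old
```

A larger surrogate loss for the old model pushes 𝜆_𝑡 toward 1, i.e., toward the new model's logits.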
• RQ1 System-Level Regression Error Investigation: Do system-level regression errors exist, and can they be indicated by model-level regression errors?
• RQ2 Effectiveness Evaluation: Does RegTrieve reduce more regression errors than the baseline approaches without harming accuracy?
• RQ3 Ablation Study: How does the CDF-based loss ensemble function contribute to RegTrieve?
• RQ4 Hyper-Parameter Analysis: What is the impact of hyper-parameters on RegTrieve?
• RQ5 Efficiency Evaluation: What is the efficiency of RegTrieve compared with baselines?
• RQ6 Generalization Evaluation: Does RegTrieve generalize to various ML systems?
• RQ1 Summary: Only a small portion of system-level regression errors are caused by model-level regression errors, and vice versa. Therefore, for ML systems involving multiple ML models, the reduction of system-level regression errors requires separate investigation.
• RQ2 Summary: RegTrieve consistently outperforms baseline approaches in reducing regression errors while maintaining system accuracy. Numerically, RegTrieve reduces more regression errors than all baselines by 20.43% on average. RegTrieve achieves Pareto Improvement in all but one scenario, with System+ and System demonstrating the best performance in different scenarios.
• RQ3 Summary: RegTrieve without the CDF-based ensemble function (M. Loss and S. Loss) still achieves Pareto Improvement in more scenarios than all baseline approaches. However, the CDF-based ensemble function is important for further reducing system-level regression errors.
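The summary above refers to the CDF-based loss ensemble function. Its exact formulation is not reproduced here; one plausible reading, shown as a hedged sketch, is that each model's raw loss is mapped to its percentile rank under that model's empirical training-loss distribution, so that old and new losses become scale-comparable before the ensemble weight is computed. The function name `empirical_cdf` is hypothetical.

```python
import numpy as np

def empirical_cdf(train_losses):
    """Return a function mapping a loss value to its percentile rank
    within the given model's training-loss distribution (sketch)."""
    sorted_l = np.sort(np.asarray(train_losses))
    def cdf(x):
        # Fraction of training losses <= x, in [0, 1].
        return np.searchsorted(sorted_l, x, side="right") / len(sorted_l)
    return cdf

# Hypothetical usage: normalize old/new losses through their own CDFs
# before computing lambda_t, so models with different loss scales
# contribute comparably to the ensemble weight.
```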
• RQ4 Summary: Increasing 𝛾_𝑚 and 𝛾_𝑠 consistently reduces system-level regression errors, while the F1 score first increases and then decreases. Besides, increasing 𝜌 in System+ reduces both the F1 score and system-level regression errors due to the growing influence of System.
• RQ5 Summary: The datastores of RegTrieve range from 2 GB to 8 GB and can be fully loaded into memory for inference. RegTrieve needs only about 20% of the inference time required by uncertainty-based approaches, while introducing twice the overhead of the Max/Avg approaches.
• RQ6 Summary: Compared with baselines, RegTrieve requires similar effort for adaptation, and shows promising generalization capability, reducing 14.06% more system-level regression errors on average in the multi-sensor fusion perception system.
Note that the first five RQs are conducted on the spoken QA system, while RQ6 is conducted on the multi-sensor fusion perception system.
Update scenarios for the spoken QA system:
Update scenarios for the multi-sensor fusion perception system:
Only the 2D object detection model was updated, while other components remained unchanged. Two update scenarios were considered: (1) model architecture update, i.e., replacing R-CNN with YOLOv5 (Scenario 8); and (2) training step update, i.e., replacing R-CNN trained for 1 epoch with R-CNN trained for 12 epochs (Scenario 9).
[Figure: impact of 𝛾_m on the system-level regression error rate, Scenarios 1–7]
Generally, as 𝛾_m increases, the system-level regression error rate continues to decrease. In Scenario 3, the system-level regression error rate first increases and then decreases, which may be caused by the large F1 score gap between the old and new systems. In Scenario 6, the system-level regression error rate remains flat because it is already relatively low (1%) and thus hard to reduce further.
[Figure: impact of 𝛾_s on the system-level regression error rate, Scenarios 1–7]
Generally, as 𝛾_s increases, the system-level regression error rate continues to decrease. In Scenario 3, the system-level regression error rate first increases and then decreases, which may be caused by the large F1 score gap between the old and new systems. In Scenario 6, the system-level regression error rate remains flat because it is already relatively low (1%) and thus hard to reduce further.
[Figure: impact of 𝜌 on the system-level regression error rate and F1 score, Scenarios 1–7]
Generally, as 𝜌 increases, the system-level regression error rate and F1 score continue to decrease. In Scenario 3, the system-level regression error rate increases slightly, which may be caused by the large F1 score gap between the old and new systems. In Scenario 6, the system-level regression error rate and F1 score remain flat because the Model and System settings in this case exhibit the same regression error rates and F1 scores.
Scenario 1. The impact of K on the performance under Model setting of RegTrieve
Scenario 1. The impact of K on the performance under System setting of RegTrieve
Scenario 1. The impact of K on the performance under System+ setting of RegTrieve
Scenario 2. The impact of K on the performance under Model setting of RegTrieve
Scenario 2. The impact of K on the performance under System setting of RegTrieve
Scenario 2. The impact of K on the performance under System+ setting of RegTrieve
Scenario 5. The impact of K on the performance under Model setting of RegTrieve
Scenario 5. The impact of K on the performance under System setting of RegTrieve
Scenario 5. The impact of K on the performance under System+ setting of RegTrieve
Scenario 6. The impact of K on the performance under Model setting of RegTrieve
Scenario 6. The impact of K on the performance under System setting of RegTrieve
Scenario 6. The impact of K on the performance under System+ setting of RegTrieve
Scenario 7. The impact of K on the performance under Model setting of RegTrieve
Scenario 7. The impact of K on the performance under System setting of RegTrieve
Scenario 7. The impact of K on the performance under System+ setting of RegTrieve
Under the three settings of RegTrieve, the F1 score initially increases with 𝐾 and then stabilizes. Similarly, the system-level regression error rate decreases with increasing 𝐾 and then stabilizes. This trend is expected: a larger 𝐾 means more neighbors, leading to more accurate estimates of model accuracy and ensemble weights. However, beyond 15 or 20 neighbors, the newly added neighbors are increasingly distant from the current sample and contribute less to the final result. Therefore, setting 𝐾 to 20 in the previous experiments is a reasonable choice.
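The diminishing-returns behavior described above can be illustrated with toy data. The sketch below uses synthetic context vectors and a hypothetical softmax-over-negative-distance weighting (not RegTrieve's actual datastore or weighting) to show how neighbor weight mass concentrates on the nearest few samples, so neighbors beyond 𝐾 ≈ 20 contribute little:

```python
import numpy as np

rng = np.random.default_rng(0)
keys = rng.normal(size=(1000, 16))   # toy datastore of context vectors
query = rng.normal(size=16)

# Sorted squared-L2 distances from the query to every datastore key.
dists = np.sort(np.sum((keys - query) ** 2, axis=1))

# Softmax weights over negative distances for the 30 nearest neighbors
# (shifted by the minimum distance for numerical stability).
w = np.exp(-(dists[:30] - dists[0]))
w /= w.sum()

# Cumulative weight captured by the first K neighbors: it grows quickly,
# then saturates, mirroring the K = 15-20 plateau observed in RQ4.
cum = {K: float(w[:K].sum()) for K in (5, 10, 15, 20, 30)}
```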