Model evaluation and selection are fundamental tasks in machine learning and data science. From a theoretical perspective, we are interested in bounds on the generalization error of predictive models. From a practical perspective, we need to compare a set of candidate models and select the most suitable one for the task at hand. In practical applications, particularly in a biomedical context, model interpretability is also crucially important.
For many years, model evaluation and selection in machine learning have emphasized the role of significance tests. As a consequence, the p-value has become firmly embedded in current evaluation practice. Our research has revealed several critical problems with this prevailing practice; for example, the now widely used Friedman test is not suitable for comparing multiple classifiers over diverse benchmark data sets [1,2,3]. We therefore work on alternative methods, such as Bayesian methods and graphical tools for performance assessment.
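To make the prevailing practice concrete, the sketch below runs a Friedman test on classifier accuracies collected over several benchmark data sets. It is a minimal, hypothetical example: the accuracy values are invented placeholders and SciPy is assumed to be available. It illustrates the conventional p-value-based workflow that our work critiques, not the Bayesian or graphical alternatives we develop.

```python
# Minimal sketch of the conventional workflow: a Friedman test comparing
# three classifiers over five benchmark data sets (hypothetical accuracies).
import numpy as np
from scipy.stats import friedmanchisquare

# Rows: benchmark data sets; columns: classifiers A, B, C.
accuracies = np.array([
    [0.85, 0.83, 0.80],
    [0.78, 0.79, 0.75],
    [0.91, 0.88, 0.87],
    [0.66, 0.69, 0.64],
    [0.73, 0.72, 0.70],
])

# The Friedman test ranks the classifiers within each data set and tests the
# null hypothesis that all classifiers perform equally well on average.
statistic, p_value = friedmanchisquare(*accuracies.T)
print(f"Friedman chi-square = {statistic:.3f}, p = {p_value:.3f}")
```

In the common pipeline, a significant Friedman test is followed by post-hoc pairwise comparisons; the references below discuss why this p-value-centered procedure can be misleading when the benchmark data sets are diverse.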
References
[1] Berrar, D. (2017) Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers. Machine Learning 106(6):911–949.
[2] Berrar, D. and Dubitzky, W. (2019) Should significance testing be abandoned in machine learning? International Journal of Data Science and Analytics 7(4):247–257.
[3] Berrar, D. (2021) Significance testing for the comparison of classifiers over multiple data sets: pitfalls and alternatives. Under review.