Abstract: In credit markets, screening algorithms aim to discriminate between good-type and bad-type borrowers. However, in doing so, they can also discriminate between individuals sharing a protected attribute (e.g., gender, age, racial origin) and the rest of the population. This can be unintentional and originate from the training dataset or from the model itself. We show how to formally test the algorithmic fairness of scoring models and how to identify the variables responsible for any lack of fairness. We then use these variables to optimize the fairness-performance trade-off. Our framework provides guidance on how algorithmic fairness can be monitored by lenders, controlled by their regulators, and improved for the benefit of protected groups, all while maintaining a high level of forecasting accuracy.
Hué S., Hurlin C., Pérignon C. and Saurin S. (2026), Measuring the Driving Forces of Predictive Performance: Application to Credit Scoring. Management Science, forthcoming.
Abstract: As they play an increasingly important role in determining access to credit, credit scoring models are under growing scrutiny from banking supervisors and internal model validators. These authorities need to monitor model performance and identify its key drivers. To facilitate this, we introduce the XPER methodology to decompose a performance metric (e.g., AUC, R²) into specific contributions associated with the various features of a forecasting model. XPER is theoretically grounded in Shapley values and is both model-agnostic and performance-metric-agnostic. Furthermore, it can be implemented either at the model level or at the individual level. Using a novel dataset of car loans, we decompose the AUC of a machine-learning model trained to forecast the default probability of loan applicants. We show that a small number of features can explain a surprisingly large part of the model's performance. Notably, the features that contribute the most to the predictive performance of the model may not be the ones that contribute the most to individual forecasts (SHAP). Finally, we show how XPER can be used to address heterogeneity issues and improve performance.
Python package: https://github.com/hi-paris/XPER
Media Coverage: Blog article on Towards Data Science
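The decomposition idea behind XPER can be illustrated with a toy example. The sketch below is not the XPER package API: it computes exact Shapley contributions of three features to the AUC of a fixed linear scorer, where a feature's "absence" is simulated by freezing it at its sample mean (one common convention; all data and weights below are made up).

```python
from itertools import combinations
from math import factorial

# Toy data: 3 features per loan, binary default label (hypothetical values).
X = [[0.9, 0.2, 0.5], [0.1, 0.8, 0.4], [0.7, 0.7, 0.9],
     [0.2, 0.1, 0.3], [0.8, 0.4, 0.6], [0.3, 0.9, 0.2]]
y = [1, 0, 1, 0, 1, 0]
w = [2.0, -1.0, 0.5]                      # a fixed, pre-trained linear scorer
d = len(w)
means = [sum(col) / len(col) for col in zip(*X)]

def auc(scores, labels):
    """Mann-Whitney AUC: share of (default, non-default) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def perf(S):
    """AUC when features outside coalition S are frozen at their sample mean."""
    scores = [sum(w[j] * (x[j] if j in S else means[j]) for j in range(d))
              for x in X]
    return auc(scores, y)

# Exact Shapley contribution of each feature to the AUC.
phi = []
for j in range(d):
    others = [k for k in range(d) if k != j]
    val = 0.0
    for r in range(d):
        for S in combinations(others, r):
            weight = factorial(r) * factorial(d - r - 1) / factorial(d)
            val += weight * (perf(set(S) | {j}) - perf(set(S)))
    phi.append(val)

base = perf(set())     # benchmark AUC with no informative feature: 0.5
print("base:", base, "contributions:", phi, "total:", base + sum(phi))
```

By the efficiency property of Shapley values, the benchmark AUC plus the three contributions sums exactly to the full-model AUC.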
Saurin S. (2025), Homogeneity Test for Credit Scoring Models: A Conformal-Prediction Approach.
Abstract: Since the signing of the Basel II Accords in 2004, most international banks have implemented the Internal Ratings-Based (IRB) approach to determine their capital requirements for credit risk. This approach involves estimating the default risk of each loan on the bank's balance sheet through a credit scoring model and then allocating these loans to homogeneous risk grades (or risk classes) that group credits with similar default risk. Yet, effective solutions for testing this homogeneity remain lacking. In response, we introduce the Risk Homogeneity Coefficient (RHC), a novel measure that quantifies the degree of homogeneity within risk grades. The RHC is derived from confidence intervals for the difference between each credit's estimated probability of default and the risk grade's default probability. The key insight is that a wider confidence interval suggests lower homogeneity within the risk grade. To build these confidence intervals, we adapt to our context the conformal prediction approach, a modern framework widely used in machine learning to construct confidence intervals for predictions without relying on distributional assumptions or asymptotic convergence results. Through numerical illustrations, we demonstrate that the RHC effectively measures homogeneity within risk grades. Applying our methodology to data simulated under the IRB framework, we observe significant variation in homogeneity across risk grades, with the overall level of homogeneity remaining moderate, even with a seemingly optimal credit segmentation. This finding raises important questions about the feasibility of achieving perfectly homogeneous risk grades in practice.
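The interval construction can be illustrated with a bare-bones split-conformal sketch. This is a generic split conformal interval for absolute residuals, not the paper's exact RHC procedure, and all numbers are hypothetical.

```python
import math

# Hypothetical calibration set: model-estimated PDs vs. realized values,
# used only to calibrate the interval width (illustrative numbers).
cal_pred = [0.02, 0.05, 0.03, 0.08, 0.04, 0.06, 0.07, 0.05, 0.03, 0.09]
cal_true = [0.03, 0.04, 0.05, 0.07, 0.04, 0.08, 0.06, 0.06, 0.02, 0.10]
alpha = 0.1                                   # target miscoverage level

# Split conformal: nonconformity score = absolute calibration residual.
scores = sorted(abs(p - t) for p, t in zip(cal_pred, cal_true))
n = len(scores)
k = math.ceil((n + 1) * (1 - alpha))          # conformal quantile rank
q = scores[min(k, n) - 1]

def conformal_interval(pred):
    """Distribution-free interval around a new credit's estimated PD."""
    return pred - q, pred + q

lo, hi = conformal_interval(0.05)
print(lo, hi, "width:", hi - lo)  # a wider interval signals lower homogeneity
```

No distributional assumption enters: the guarantee comes only from exchangeability of the calibration residuals, which is what makes the approach attractive for validating credit scoring models.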
Abstract: We study the problem of deciding whether, and when, an organization should replace a trained incumbent model with a challenger relying on newly available features. We develop a unified economic and statistical framework that links learning-curve dynamics, data-acquisition and retraining costs, and discounting of future gains. First, we characterize the optimal switching time in stylized settings and derive closed-form expressions that quantify how horizon length, learning-curve curvature, and cost differentials shape the optimal decision. Second, we propose three practical algorithms: a one-shot baseline, a greedy sequential method, and a look-ahead sequential method. Using a real-world credit-scoring dataset with gradually arriving alternative data, we show that (i) optimal switching times vary systematically with cost parameters and learning-curve behavior, and (ii) the look-ahead sequential method outperforms the other methods and approaches the value achieved by an oracle with full foresight. Finally, we establish finite-sample guarantees, including conditions under which the look-ahead sequential method achieves sublinear regret relative to that oracle. Our results provide an operational blueprint for economically sound model transitions as new data sources become available.
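A one-shot version of the switching decision can be sketched as a discounted cost-benefit comparison. The learning curve, discount factor, cost, and value scaling below are all hypothetical, chosen only to illustrate the trade-off, and do not come from the paper.

```python
# Hypothetical setup: a challenger whose accuracy improves with newly
# collected data, an incumbent with flat accuracy, and a one-off switch cost.
T = 24              # planning horizon (months)
beta = 0.99         # monthly discount factor
cost = 5.0          # one-off data-acquisition / retraining cost
value = 100.0       # economic value of one accuracy point per month
a_incumbent = 0.80  # incumbent accuracy (flat over time)

def a_challenger(t):
    """Saturating learning curve for the challenger's accuracy at month t."""
    return 0.90 - 0.15 * 0.8 ** t

def npv_switch(s):
    """Discounted net gain from switching to the challenger at month s."""
    gains = sum(beta ** t * value * (a_challenger(t) - a_incumbent)
                for t in range(s, T))
    return gains - beta ** s * cost

# One-shot rule: pick the switch date with the highest net present value
# (and switch only if that NPV beats the do-nothing value of 0).
best = max(range(T + 1), key=npv_switch)
print("switch at month", best, "NPV:", npv_switch(best))
```

The tension the framework formalizes is visible even here: switching too early locks in the challenger while it still underperforms, while switching too late discounts away the gains.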
Guerrier S., Hurlin C., Karemera M., Pérignon C. and Saurin S. (2025), Equivalence Testing for Algorithmic Fairness: A Case Study in Credit Scoring.
Abstract: We introduce the concept of fairness equivalence by building on both the regulatory paradigm in drug development and the fairness literature in machine learning. Equivalence tests allow banks and fintechs to prove the fairness of the predictions of their credit scoring models, much as drug manufacturers prove the effectiveness of their drug products. In the equivalence approach, an algorithm is said to be fair if all individuals have an equivalent probability of a positive outcome, regardless of their group membership; that is, the difference between group probabilities remains below a given tolerance level. The equivalence approach brings four main advantages. First, it makes it possible to formally test any fairness definition through an inference test while fully controlling the risk of wrongly validating an unfair model (similar to FDA tests on new drugs). Second, by introducing a tolerance level, the equivalence approach controls for any residual heterogeneity among individuals. Third, it can accommodate any high-stakes algorithm by adjusting the tolerance level according to the societal importance of the AI application. Fourth, it allows the features at the origin of the fairness problem to be identified and their effects mitigated.
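The logic of an equivalence test can be sketched with the classical two one-sided tests (TOST) procedure applied to a demographic-parity-style criterion. The normal approximation, the tolerance `delta`, the level `alpha`, and the approval counts are all hypothetical choices for illustration, not the paper's exact procedure.

```python
import math

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tost_parity(pos_a, n_a, pos_b, n_b, delta=0.05, alpha=0.05):
    """Two one-sided tests of |p_a - p_b| < delta on positive-outcome rates.

    Rejecting both one-sided nulls (difference <= -delta, difference >= +delta)
    at level alpha supports fairness equivalence within tolerance delta.
    """
    p_a, p_b = pos_a / n_a, pos_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    p_lower = 1.0 - norm_cdf((diff + delta) / se)   # H0: diff <= -delta
    p_upper = norm_cdf((diff - delta) / se)         # H0: diff >= +delta
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha

# Hypothetical approval counts for a protected group (a) and the rest (b).
pval, equivalent = tost_parity(pos_a=480, n_a=1000, pos_b=470, n_b=1000)
print("p-value:", pval, "equivalent within tolerance:", equivalent)
```

Note the reversal relative to a standard significance test: here the burden of proof is on the model owner, who must reject non-equivalence to claim fairness, mirroring how a drug manufacturer must demonstrate effectiveness.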