When Sheer Predictive Power is not Good Enough: Towards Accountability in Machine Learning Applications
by Thies Lindenthal, CERF Fellow
The law is clear: housing-related decisions must be free of discrimination, at least with respect to gender, age, race, ethnicity, or disability. That is easier said than done for the plethora of machine learning (ML) empowered systems for mortgage evaluation, tenant screening, i-buying schemes and other ‘disruptions’. A rapidly expanding literature explores the potential of ML algorithms, introducing novel measurements of the physical environment or using these estimates to improve traditional real estate valuation and urban planning processes (Glaeser et al., 2018; Johnson et al., 2020; Karimi et al., 2019; Lindenthal & Johnson, 2019; Liu et al., 2017; Rossetti et al., 2019; Schmidt & Lindenthal, 2020). These studies have demonstrated, again and again, the undisputed power of ML systems as prediction machines. Still, it remains difficult to establish causality, or for end users to understand the internal mechanisms of the models.
An “accountability gap” (Adadi & Berrada, 2018) remains: How do the models arrive at their predictions? Can we trust them not to bend rules or cut corners? This accountability gap holds back the deployment of ML-enabled systems in real-life situations (Ibrahim et al., 2020; Krause, 2019). If system engineers cannot observe the inner workings of the models, how can they guarantee reliable outcomes? Further, the accountability gap leads to obvious dangers: flaws in prediction machines are not easily discernible by classic cross-validation approaches (Ribeiro et al., 2016). Traditional ML validation metrics, such as the magnitude of prediction errors or F1-scores, can evaluate a model’s predictive performance, but they provide limited insight for addressing the accountability gap. Training ML models is, at heart, a software development process. We believe that ML developers should therefore follow best practices and industry standards in software testing. The system testing stage of software test regimes is particularly essential: it verifies whether an integrated system performs exactly the functions required in the initial design (Ammann & Offutt, 2016). For ML applications, this system testing stage can help to close the accountability gap and improve the trustworthiness of the resulting models. After all, thorough system testing verifies that the system is not veering off into dangerous terrain but stays on the pre-defined path.
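To make the analogy concrete: a system test for an ML classifier can be written like any other automated software test. The sketch below is only an illustration of the idea, not the test developed in our paper; the helpers load_classifier, load_labelled_images and mask_regions are hypothetical stand-ins for project-specific code. It asserts, pytest-style, that predictions do not change when information we deem irrelevant is removed from the input.

```python
import numpy as np

def test_prediction_ignores_irrelevant_regions():
    model = load_classifier()                # hypothetical: a trained image classifier
    images, labels = load_labelled_images()  # hypothetical: hold-out images and labels

    for img, label in zip(images, labels):
        # Grey out the regions we consider irrelevant (e.g. cars, trees).
        masked = mask_regions(img, categories=("car", "tree"))  # hypothetical helper

        pred_original = np.argmax(model.predict(img[np.newaxis])[0])
        pred_masked = np.argmax(model.predict(masked[np.newaxis])[0])

        # The predicted class should not depend on irrelevant image content.
        assert pred_original == pred_masked
```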
System testing should be conducted before evaluating the model’s prediction accuracy, which can be considered the acceptance testing stage in the software testing framework. In recent years, several model interpretation algorithms have been developed that attempt to reduce complexity by providing an individual explanation justifying the prediction for one specific instance (Lei et al., 2018; Lundberg & Lee, 2017; Selvaraju et al., 2017; Ribeiro et al., 2016). However, most current local interpretation tools are qualitative and require human inspection of each individual sample. As tools for model verification, they therefore do not easily scale to large samples.
One example – to demonstrate the general approach
In this paper, we develop an explicit system-testing stage for an ML-powered classifier of images of residential real estate. In formalizing a novel model verification test, we first define the categories of relevant and irrelevant information in the training images that we want to test. Then we identify the elements of the input images that the ML model relies on most for its classification (i.e., which pixels matter most?), using a local model interpretation algorithm. Finally, we calculate what proportion of this interpretable information originates from our defined categories of relevant and irrelevant information, and we use this proportion as the model verification test score. High scores imply that the model bases its predictions on meaningful attributes and not on irrelevant information, e.g. in the background of the images.
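To illustrate the scoring idea (a minimal sketch, not the exact implementation in the paper): given a boolean mask of the pixels highlighted by the local interpretation algorithm and a boolean mask of the pixels belonging to the relevant object categories, the verification score is simply the share of highlighted pixels that fall inside relevant regions.

```python
import numpy as np

def verification_score(explanation_mask: np.ndarray,
                       relevant_mask: np.ndarray) -> float:
    """Share of explanation pixels that fall inside relevant object regions.

    explanation_mask : boolean array, True where the interpretation algorithm
                       marks pixels as driving the prediction.
    relevant_mask    : boolean array of the same shape, True for pixels that
                       belong to relevant objects (facade, windows, doors).
    """
    highlighted = explanation_mask.sum()
    if highlighted == 0:
        return 0.0  # nothing highlighted; treat as uninformative
    overlap = np.logical_and(explanation_mask, relevant_mask).sum()
    return float(overlap) / float(highlighted)
```

A score close to one indicates that the classifier draws almost exclusively on the building itself; the same ratio computed against masks of cars or trees flags reliance on irrelevant information.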
Specifically, we augment an off-the-shelf image classifier that has been re-trained to detect architectural styles of residential buildings in the UK (see my previous blog post for CERF). This type of computer-vision-based classifier is selected as an illustration because of its popularity in real estate and urban studies (Naik et al., 2016), although our approach extends to other ML classifiers, e.g. in text mining (Fan et al., 2019; Shen, 2018).
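For readers who want to try something similar, re-training an off-the-shelf classifier usually amounts to transfer learning: keep a pretrained backbone and replace only the final classification layer. A minimal sketch with PyTorch and torchvision follows; it is an illustration, not the specific framework or architecture used in our paper, and the number of style classes is a hypothetical placeholder.

```python
import torch.nn as nn
from torch.optim import Adam
from torchvision import models

NUM_STYLES = 6  # hypothetical number of architectural style classes

# Load an ImageNet-pretrained backbone and swap in a new classification head.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, NUM_STYLES)

# Freeze the backbone and fine-tune only the new head to start with.
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True

optimizer = Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# ...train on labelled facade images with a standard training loop...
```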
Following architects’ advice, we define the facades of houses, windows and doors as the most relevant attributes for classifying building styles, and we consider trees and cars to be irrelevant information. These objects are detected in the input images using object detection algorithms. Further, we implement the local interpretable model-agnostic explanations algorithm (LIME), one of the most popular local model interpretation tools, to find the areas of the input images that best explain the predictions.
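LIME’s reference implementation (the lime Python package) makes this step straightforward. A sketch of how an explanation mask for a single image could be obtained is shown below; the sample count, number of features and the predict_probabilities wrapper are illustrative assumptions, not our exact configuration.

```python
from lime import lime_image

explainer = lime_image.LimeImageExplainer()

# classifier_fn must map a batch of images (N, H, W, 3) to class probabilities;
# predict_probabilities is a hypothetical wrapper around the re-trained model.
explanation = explainer.explain_instance(
    image,                                # a single house image as a numpy array
    classifier_fn=predict_probabilities,  # hypothetical wrapper
    top_labels=1,
    hide_color=0,
    num_samples=1000,                     # number of perturbed images LIME evaluates
)

# Boolean mask of the superpixels that contribute most to the top predicted style.
_, mask = explanation.get_image_and_mask(
    explanation.top_labels[0],
    positive_only=True,
    num_features=5,
    hide_rest=False,
)
explanation_mask = mask.astype(bool)
```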
Finally, by comparing these interpretable areas with the areas of the detected objects, we calculate the verification test score/ratio for this exemplar model. Our results reveal that the classifier indeed selects information from the house, windows and doors when predicting building vintages, and that it excludes the irrelevant information from the trees, as we had hoped. More importantly, these findings improve the trustworthiness of the prediction results, as well as of the associated links between building vintages and real estate values (Johnson et al., 2020; Lindenthal & Johnson, 2019; Schmidt & Lindenthal, 2020). However, we also find that the model draws on information from the cars for its predictions.
Our study contributes to the growing literature applying ML in real estate and urban studies in two ways. First, we propose an ML application framework with an additional system testing stage, which aims to address the accountability gap and improve the trustworthiness of the results. Using a computer vision model commonly applied in the literature as an example, we demonstrate that our approach can check whether a model is at risk of basing its predictions on undesirable information.
Second, we extend existing qualitative model-interpretation techniques into a formal quantitative test. Methodologically, this helps to scale up model interpretation analyses to large samples, which is essential for most applications in real estate and urban studies. In summary, our proposed method extends to other ML models and, because it helps to close the accountability gap, this study has important implications for ML applications in real estate and urban studies, as well as in other fields beyond.
Fig 1: First, find areas that are relevant when e.g. describing a home’s vintage.
Fig 2: Second, compare to the image areas that actually lead to a specific classification: How good is the overlap?