Performance of GRFT across Vanilla and Repaired Models
This table presents more detailed experimental results: the number of discriminatory instances found by each fairness testing method for each combination of protected attributes, across the vanilla and repaired models. Here, M_van denotes the vanilla model, M_dis the model repaired by retraining with discriminatory instances, M_flip the model repaired by flip-based retraining, M_mt the model repaired by multitask learning, M_Faire the model repaired by the Faire method, and Ours the model repaired by our repair method. "-" indicates that the model could not be tested or that no discriminatory instances were found. To reduce the impact of randomness, we repeat each experiment ten times and report the average value.
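To make the reported quantity concrete, the following is a minimal sketch of how a discriminatory instance is checked and how the ten-run average could be computed. It assumes a generic predict interface and a single categorical protected attribute; ToyModel, the attribute index, and the random candidate generation are illustrative placeholders, not part of GRFT or any baseline implementation.

```python
import numpy as np

class ToyModel:
    """Stand-in for a trained classifier; its rule deliberately depends on the
    protected attribute (column 0) so that discrimination is detectable."""
    def predict(self, X):
        return (X[:, 0] + X[:, 1] > 1.0).astype(int)

def is_discriminatory(model, x, protected_idx, protected_values):
    """x is an (individual) discriminatory instance if changing only its
    protected attribute changes the model's prediction."""
    base = model.predict(x.reshape(1, -1))[0]
    for v in protected_values:
        if v == x[protected_idx]:
            continue
        x_prime = x.copy()
        x_prime[protected_idx] = v
        if model.predict(x_prime.reshape(1, -1))[0] != base:
            return True
    return False

def count_discriminatory(model, candidates, protected_idx, protected_values):
    return sum(is_discriminatory(model, x, protected_idx, protected_values)
               for x in candidates)

rng = np.random.default_rng(0)
model = ToyModel()
protected_idx, protected_values = 0, (0.0, 1.0)

# Each experiment is repeated ten times and the average count is reported.
counts = []
for _ in range(10):
    candidates = np.column_stack([rng.integers(0, 2, 200).astype(float),
                                  rng.uniform(0.0, 2.0, size=200)])
    counts.append(count_discriminatory(model, candidates,
                                       protected_idx, protected_values))
print("average discriminatory instances over 10 runs:", np.mean(counts))
```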
As shown in the table, GRFT consistently finds more discriminatory instances than existing fairness testing methods for most combinations of protected attributes and models. Notably, LIMI relies on separately trained GAN models; because these GANs are unstable, it often fails to find surrogate boundaries, leaving some models untestable. Compared to ADF and EIDIG, DICE discovers more discriminatory instances on the Census and Bank datasets because its global phase supplies multiple global seeds to the local phase. However, this also means DICE takes more time to complete the search than ADF and EIDIG. On the other hand, DICE finds fewer discriminatory instances on the COMPAS and LSAC datasets, likely because its global seeds are of lower quality (they have high QID values but are not themselves discriminatory instances). These results underscore the robustness of GRFT in handling diverse fairness-enhanced models. Its ability to uncover a large number of discriminatory instances, even in models that have undergone substantial repair, highlights its potential as a powerful fairness testing tool for machine learning.
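The global/local structure referred to above can be sketched as follows. This is a simplified illustration that reuses the is_discriminatory helper from the previous sketch; the uniform seed sampling and random perturbation are placeholders for the gradient-based (ADF, EIDIG) or QID-based (DICE) strategies these tools actually use.

```python
import numpy as np

rng = np.random.default_rng(1)

def global_phase(X, n_seeds):
    # Placeholder seed selection: sample uniformly from the candidate pool.
    # More seeds broaden the search but increase total search time.
    idx = rng.choice(len(X), size=n_seeds, replace=False)
    return X[idx]

def local_phase(model, seed, protected_idx, protected_values,
                n_steps=50, step=0.1):
    found = []
    x = seed.copy()
    for _ in range(n_steps):
        x = x + rng.normal(scale=step, size=x.shape)  # perturb features...
        x[protected_idx] = seed[protected_idx]        # ...keeping the protected one fixed
        if is_discriminatory(model, x, protected_idx, protected_values):
            found.append(x.copy())
    return found

def two_phase_search(model, X, protected_idx, protected_values, n_seeds=10):
    instances = []
    for seed in global_phase(X, n_seeds):
        instances.extend(local_phase(model, seed, protected_idx, protected_values))
    return instances
```

Under this structure, supplying more (or better) global seeds raises the number of discriminatory instances the local phase can reach, at the cost of additional search time, which matches the trade-off observed for DICE relative to ADF and EIDIG.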
Additionally, models repaired with our method significantly mitigate discrimination. Although existing repair methods are intended to remove discrimination, a considerable number of discriminatory instances can still be detected in M_dis, M_mt, and M_Faire. In particular, more discriminatory instances are detected in M_mt than in the vanilla models. For example, ADF detects about 315,654 discriminatory instances in M_mt, 2.24 times the number found in the vanilla models (i.e., 140,958). Notably, for M_mt, no discriminatory instances are detected on the Credit and COMPAS datasets because the degraded accuracy causes the model to produce the same output for all inputs. In comparison, our repair method is highly effective in reducing the number of discriminatory instances across all datasets. In particular, on the Census, Bank, COMPAS, and LSAC datasets, ADF, EIDIG, and NeuronFair detect fewer than 20 discriminatory instances, and ExpGA finds no discriminatory instances on the Bank and Credit datasets. While GRFT identifies a relatively higher number of discriminatory instances in our repaired models due to its more rigorous search, the overall reduction in bias achieved by our repair method surpasses that of existing repair methods. For example, GRFT on average discovers 2,850.6 discriminatory instances in the models repaired by our method, a 60.50% reduction compared to the flip-based retrained models (M_flip). These findings demonstrate that our repair method is effective in mitigating bias and improving the fairness of deep learning models, making it a valuable tool for developing equitable AI systems.
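The quoted reductions are relative decreases with respect to the corresponding baseline repaired model, i.e., for the flip-based comparison,

\[
\text{reduction} = \frac{N_{M_{\mathrm{flip}}} - N_{\mathrm{ours}}}{N_{M_{\mathrm{flip}}}} \times 100\%,
\]

where N_{M_flip} and N_ours denote the average numbers of discriminatory instances GRFT finds in M_flip and in the models repaired by our method, respectively.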