The Boston housing prices dataset has an ethical problem: as investigated in [1], the authors of this dataset engineered a non-invertible variable "B" assuming that racial self-segregation had a positive impact on house prices [2]. Furthermore the goal of the research that led to the creation of this dataset was to study the impact of air quality but it did not give adequate demonstration of the validity of this assumption.
From a product innovation perspective, it is crucial to carefully select input signals and data sources. In large organizations, we typically begin with internal data or legally compliant crowdsourced channels, such as customer feedback and complaints. However, even well-designed initiatives can encounter legal and compliance issues due to implicit or systemic biases. Below, I document issues found in the well-known Boston housing prices dataset to illustrate the challenges of choosing healthy and reliable data sources—not to question the dataset’s authenticity or credibility.
Python code Coming Soon...