Introduction
In this project, I analyze an open-source housing dataset for Cook County to understand how property features relate to housing prices and how predictive modeling can be used to estimate fair market values. The data were collected by the Cook County Assessor's Office (CCAO) to improve the accuracy of valuations for unsold properties and to address the data gaps that disproportionately affect low-income neighborhoods. Each observation records a property sale along with physical attributes, property characteristics, and transaction details, supporting a granular analysis. Following the data science lifecycle, this project moves from data cleaning and exploratory data analysis (EDA) through feature engineering to predictive modeling. Early EDA visualizations reveal a relationship between building size and sale price: a regression line suggests a linear trend, indicating that log-transformed building square footage is a strong predictor.
In retrospect, Cook County's system produced inequitable outcomes by overvaluing inexpensive homes and undervaluing expensive ones, creating a regressive tax structure that burdened lower-income homeowners while effectively subsidizing high-income homeowners. Considering the broader human context of property valuation, housing prices affect multiple stakeholders with different incentives, including investors pursuing profit, homebuyers seeking affordability, and governments relying on stable property values for tax revenue.
Motivated by this inequity, I evaluate model performance using RMSE, residual analysis, and a fairness-aware metric, Mean Absolute Percentage Error (MAPE). RMSE usefully captures overall prediction error, but it obscures how mistakes are distributed across housing segments; a percentage-based metric provides a more interpretable view of model behavior.
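The difference between the two metrics can be made concrete with a toy calculation. The prices below are hypothetical, not drawn from the dataset:

```python
import numpy as np

# Hypothetical prices: one cheap home badly overestimated, one expensive
# home with a larger absolute but smaller relative error.
y_true = np.array([100_000.0, 1_000_000.0])
y_pred = np.array([150_000.0, 1_080_000.0])

# RMSE is dominated by the expensive home's $80k absolute error.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# MAPE weights each home's error by its own price, so the cheap home's
# 50% overestimate dominates and the regressive pattern becomes visible.
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(f"RMSE: ${rmse:,.0f}")
print(f"MAPE: {mape:.1f}%")
```

Here RMSE (about $67k) is driven almost entirely by the expensive home, while MAPE (29%) reflects the much larger proportional error on the cheap one.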
Overall, this project demonstrates data wrangling, exploratory data analysis, statistical modeling, and a critical evaluation of algorithmic fairness. It reflects an analytical approach that treats models as systems whose assumptions and consequences must be examined for their impact on real-world decisions.
Visual Exploration of Skewed Economic Data
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def plot_distribution(data, label):
    fig, axs = plt.subplots(nrows=2)
    # Density curve on top (histplot with kde replaces the deprecated distplot)
    sns.histplot(
        data[label],
        kde=True,
        ax=axs[0]
    )
    # Boxplot below, with outlier fliers hidden
    sns.boxplot(
        x=data[label],
        width=0.3,
        ax=axs[1],
        showfliers=False,
    )
    # Align the x-axes of the two panels
    spacer = np.max(data[label]) * 0.05
    xmin = np.min(data[label]) - spacer
    xmax = np.max(data[label]) + spacer
    axs[0].set_xlim((xmin, xmax))
    axs[1].set_xlim((xmin, xmax))
Figure 1. To inspect skewness and outliers, the function above plots both a density curve and a boxplot for a given variable. The granularity of this dataset is the individual property sale, with each row corresponding to a single sale in Cook County.
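Because sale prices are heavily right-skewed, a log transform is applied before plotting, as elsewhere in this report. A small self-contained illustration with made-up prices (not from the dataset) shows how the transform tames the skew:

```python
import numpy as np
import pandas as pd

# Hypothetical sale prices with a long right tail, mimicking the real data.
toy = pd.DataFrame({"Sale Price": [90_000, 120_000, 150_000, 300_000, 2_500_000]})

# The log transform used throughout this report compresses the right tail,
# so distribution plots (and later the regression) behave better.
toy["Log Sale Price"] = np.log(toy["Sale Price"])

# The sample skewness drops noticeably after the transform.
print(toy["Sale Price"].skew(), toy["Log Sale Price"].skew())
```

The transformed column could then be passed to plot_distribution in place of the raw prices.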
Neighborhood-Level Comparison
Figure 2. To compare housing prices across neighborhoods, the code below pairs a boxplot of log sale prices with a count of sales in each neighborhood.
# Tukey fences: values outside Q1 - 1.5*IQR or Q3 + 1.5*IQR count as outliers
Q1 = training_data["Bathrooms"].quantile(0.25)
Q3 = training_data["Bathrooms"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - (1.5 * IQR)
upper = Q3 + (1.5 * IQR)
q5 = remove_outliers(training_data, "Bathrooms", lower, upper)
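The remove_outliers helper is not shown in this excerpt. A minimal sketch consistent with how it is called above (keep only rows whose value lies inside the fences) might look like this; the project's actual implementation may differ, e.g. in whether the bounds are inclusive:

```python
def remove_outliers(data, variable, lower, upper):
    """Return only the rows of `data` whose `variable` lies in [lower, upper].

    A plausible reconstruction of the helper called above, not the
    project's definitive implementation.
    """
    return data[(data[variable] >= lower) & (data[variable] <= upper)]
```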
def plot_categorical(neighborhoods):
    fig, axs = plt.subplots(nrows=2)
    sns.boxplot(
        x='Neighborhood Code',
        y='Log Sale Price',
        data=neighborhoods,
        ax=axs[0],
        showfliers=False,
    )
    sns.countplot(
        x='Neighborhood Code',
        data=neighborhoods,
        ax=axs[1],
    )
Feature Engineering for Modeling
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def ohe_wall_material(data):
    """
    One-hot-encodes wall material. New columns are of the form "Wall Material_MATERIAL".
    """
    cat = ["Wall Material"]
    oh_enc = OneHotEncoder()
    oh_enc.fit(data[cat])
    cat = oh_enc.transform(data[cat]).toarray()
    catdf = pd.DataFrame(data=cat, columns=oh_enc.get_feature_names_out(), index=data.index)
    return data.join(catdf)
training_data_ohe = ohe_wall_material(training_data_mapped)
# This line of code will display only the one-hot-encoded columns in training_data_ohe that
# have names that begin with "Wall Material_"
training_data_ohe.filter(regex='^Wall Material_').head(10)
Figure 3. Categorical variables prepared for regression using one-hot encoding. This step converts building features into numeric values suitable for linear and regularized models.
Additional Model
Figure 4.
Log Building Square Feet appears to be a strong candidate feature for our model because, after the transformation, the scatter plot shows a clear and meaningful relationship with Log Sale Price. The fitted regression line closely follows the overall pattern of the data, indicating a consistent upward trend. This suggests that as building size increases, sale price tends to increase in a roughly linear manner, making this feature a strong predictor of the target variable.
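The slope of such a trend line can be recovered with a simple least-squares fit. The sketch below uses synthetic log-scale data (hypothetical values, not the Cook County sample) purely to illustrate the fitting step:

```python
import numpy as np

# Synthetic log building square feet and log sale price with a roughly
# linear relationship plus noise (illustrative values only).
rng = np.random.default_rng(0)
log_sqft = rng.uniform(6.5, 9.0, size=200)
log_price = 1.2 * log_sqft + 3.0 + rng.normal(0, 0.3, size=200)

# A degree-1 polynomial fit gives the slope and intercept of the
# regression line drawn through the scatter plot.
slope, intercept = np.polyfit(log_sqft, log_price, deg=1)
print(f"fitted slope ~ {slope:.2f}")
```

With enough data, the fitted slope closely recovers the underlying trend, which is exactly what the tight regression line in the real scatter plot suggests for building size.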
Part 2 Introduction
In Part 2 of this project, I examine how predictive models can be used to estimate housing values in Cook County and how these predictions affect real people through property tax assessments. Different stakeholders have competing interests in housing prices: potential homebuyers generally prefer lower prices, real estate investors benefit from higher prices, and the government seeks stable and moderately high values for tax revenue and economic stability. Because assessment errors translate directly into financial burden, prediction accuracy alone is not sufficient; fairness across different price ranges and communities must also be considered.
Historically, Cook County’s assessment system overvalued inexpensive homes and undervalued expensive ones, creating a regressive tax structure that disproportionately harmed low-income and minority homeowners and made it harder for them to appeal unfair assessments. Motivated by this context, this project evaluates model performance not only using RMSE, but also through residual analysis and a fairness-aware metric (MAPE) to understand whether prediction errors are systematically biased across housing price segments.
m2_residuals = Y_valid_m2 - Y_predicted_m2
plt.scatter(Y_valid_m2, m2_residuals, s=1, alpha=0.2)
plt.suptitle("Model 2 Performance")
plt.title("Plot of Residuals of Model 2 vs Log Sale Price");
Figure 5. Residuals of Model 2 plotted against the true log sale price. A well-behaved model shows residuals scattered evenly around zero; systematic drift would indicate that certain price ranges are consistently over- or under-estimated.
# Overestimation plot
plt.subplot(1, 2, 2)
props = []
for i in np.arange(8, 14, 0.5):
    props.append(prop_overest_interval(preds_df, i, i + 0.5) * 100)
plt.bar(x=np.arange(8.25, 14.25, 0.5), height=props, edgecolor='black', width=0.5)
plt.title('Percentage of House Values Overestimated \n for different intervals of Log Sale Price', fontsize=10)
plt.xlabel('Log Sale Price')
plt.yticks(fontsize=10)
plt.xticks(fontsize=10)
plt.ylabel('Percentage of House Values\n that were Overestimated (%)')
plt.tight_layout()
plt.show()
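The helper prop_overest_interval is not defined in this excerpt. A plausible sketch, assuming preds_df carries the same "True Log Sale Price" and "Predicted Log Sale Price" columns as new_preds_df later in this report:

```python
def prop_overest_interval(preds_df, lower, upper):
    """Proportion of homes with true log sale price in [lower, upper)
    whose value the model overestimated.

    A plausible reconstruction of the helper called above; the real
    project code and its exact bin boundaries may differ.
    """
    in_bin = preds_df[
        (preds_df["True Log Sale Price"] >= lower)
        & (preds_df["True Log Sale Price"] < upper)
    ]
    return (in_bin["Predicted Log Sale Price"] > in_bin["True Log Sale Price"]).mean()
```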
This visualization bridges modeling into social impact. Lower-priced homes show the highest rates of overestimation, aligning with a regressive pattern in which inexpensive properties are systematically overvalued, matching scenario C from the fairness analysis.
Figure 6.
Figure 7. This reveals whether low-priced properties are systematically misestimated relative to their value, which matters for equitable taxation.
new_preds_df = pd.DataFrame({
    'True Log Sale Price': trainY,
    'Predicted Log Sale Price': trainX_with_bias @ theta_opt,
    'True Sale Price': np.exp(trainY),
    'Predicted Sale Price': np.exp(trainX_with_bias @ theta_opt)
})
plt.figure(figsize=(8, 5))
plt.subplot(1, 2, 1)
mape_values = []
for i in np.arange(8, 14, 0.5):
    mape_values.append(mape_interval(new_preds_df, i, i + 0.5))
plt.bar(x=np.arange(8.25, 14.25, 0.5), height=mape_values, edgecolor='black', width=0.5)
plt.title('MAPE of Sale Price Across\n Log Sale Price Intervals', fontsize=10)
plt.xlabel('Log Sale Price')
plt.ylabel('Mean Absolute Percentage Error (MAPE)')
plt.xticks(fontsize=10)
plt.yticks(fontsize=10);
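The mape_interval helper is likewise not shown. A sketch consistent with the columns of new_preds_df, binning by true log sale price and computing errors on the dollar-scale prices (here returned as a proportion; the real helper may scale to a percentage):

```python
import numpy as np

def mape_interval(preds_df, lower, upper):
    """MAPE over homes whose true log sale price falls in [lower, upper).

    A plausible reconstruction of the helper called above; errors are
    computed on dollar-scale prices so each home contributes its own
    relative error regardless of how expensive it is.
    """
    in_bin = preds_df[
        (preds_df["True Log Sale Price"] >= lower)
        & (preds_df["True Log Sale Price"] < upper)
    ]
    return np.mean(
        np.abs(
            (in_bin["True Sale Price"] - in_bin["Predicted Sale Price"])
            / in_bin["True Sale Price"]
        )
    )
```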
Unlike RMSE, MAPE evaluates errors proportionally. Because each error is standardized by the home's true price, expensive homes cannot skew the metric, making fairness across housing segments easier to assess.
Data Storytelling
This project demonstrates that evaluating housing price prediction models requires more than measuring overall accuracy. While traditional metrics like RMSE provide insight into average error, they can obscure systematic patterns that disproportionately affect certain segments of the housing market. By examining residual plots and the percentage of overestimated homes across price intervals, the analysis reveals that lower-priced properties are more frequently overvalued, aligning with a regressive assessment pattern in which inexpensive homes bear a higher relative tax burden.
Introducing a fairness-aware metric such as Mean Absolute Percentage Error (MAPE) allows model performance to be evaluated proportionally across housing price ranges, reducing the dominance of high-priced properties in error calculations and offering a more equitable lens for assessment. Taken together, these results highlight that fairness and accuracy are closely linked but not interchangeable: a model can perform well on average while still producing socially harmful outcomes. For real-world property valuation systems, especially those that influence taxation, responsible data science must prioritize balanced and interpretable error behavior alongside predictive performance.