Ashna Gibbons | MAT 328 Final Project | May 15, 2026
An exploratory analysis of property listings in Oregon from January-March 2026.
Source data and code listed at the bottom of this page.
This dataset describes the real estate market in Oregon in 2026. I have had to search for an apartment or a room to rent in NYC, so I have become interested in how houses and apartments are priced in other places by comparison. This drew me to the dataset. It was uploaded to Kaggle by user Kanchana1990, who scraped these real-world Oregon listings from the web using the Apify API in Q1 of 2026.
One row in the dataset corresponds to one property listing. Properties are split into types: single_family, land, condos, townhomes, multi_family, farm, and unknown. A separate sub_type column additionally records the sub-types townhome and condo. The text column records any description from the listing as a string, preserving the notes of the sellers posting the listings. The following fields are recorded as floats: the listing price (listPrice), square feet (sqft), number of floors (stories), number of bedrooms (beds), number of bathrooms (baths), number of full bathrooms (baths_full), an additional calculated number of full bathrooms (baths_full_calc), number of garages (garage), and the year in which the property was built (year_built). This dataset therefore contains substantial quantitative data about the listings in Oregon and splits the listings into seven distinct, nominal categories.
The dataset is rectangular and stored in a CSV file for ease of use. Many of the quantitative variables are discrete, and although price and square footage could be treated as continuous, most of their entries are whole numbers as well. As mentioned before, the housing type variable is nominal.
In terms of granularity, each row represents one listing of one property in Oregon. Addresses are not included in the dataset, and from the minimal general notes from the poster of the dataset, we can assume that listings are from all around the state (he mentions listings “ranging from historic Portland multi-family units to expansive, multi-acre rural land parcels”). Therefore, the scope is assumed to be around the state of Oregon.
The temporality of the data is limited to Q1 of 2026, therefore ranging from January 1st to the date the dataset was last updated (March 10th). All listings are active as of the last update, but were not necessarily first posted within Q1 of 2026. The year in which properties were built is also recorded.
The main faithfulness issue with this dataset is missing values. Most attributes of listings in the land category are missing, the sub_type column is rarely used, and many other columns still contain NaN values. Out of the over 10,000 rows in the original set, most quantitative columns have over 2,000 values missing, and the garage column has over 3,000 values missing. The data collector was aware of this, warned users of the abundance of missing values, and advised segmenting the data using conditional logic. I will discuss my process of cleaning and segmenting the data below.
The first step in cleaning the data was to address the many NaN values. Using guidance from the description of the data on Kaggle, I was able to ascertain that most of the NaN values came from properties classified as land. For the 1,872 land properties, 6 columns consisted entirely of NaN values. I therefore removed the properties categorized as land from the dataset for this project. I did conduct some separate analysis on the land properties, which will be explained below.
After removing those properties, I looked again at the NaN values remaining in the dataset. In the sub_type column, the vast majority of values were again NaN. Additionally, only two values were ever assigned to that attribute, townhome and condo, which were already values in the type column. So I removed this column from the dataset. Finally, I removed all other rows with NaN values to simplify the analysis. This resulted in a dataset with 6,200 rows and no NaN values.
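The cleaning steps described above could be sketched roughly as follows. This is a minimal illustration on a tiny invented table, not the code from the project: the column names follow the dataset description, but the values and the variable names (listings, land, non_land, clean) are assumptions.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the listings table. Column names follow the report;
# the values are invented for illustration only.
listings = pd.DataFrame({
    "type": ["single_family", "land", "condos", "townhomes", "land", "single_family"],
    "sub_type": [np.nan, np.nan, "condo", "townhome", np.nan, np.nan],
    "listPrice": [575000.0, 183500.0, 320000.0, 410000.0, 95000.0, 760000.0],
    "sqft": [1900.0, np.nan, 950.0, 1400.0, np.nan, np.nan],
    "beds": [3.0, np.nan, 2.0, 3.0, np.nan, 4.0],
})

# Missing values per column before any cleaning.
print(listings.isna().sum())

# Step 1: set the land listings aside, since most of their columns are empty.
land = listings[listings["type"] == "land"]
non_land = listings[listings["type"] != "land"]

# Step 2: drop the mostly empty sub_type column.
non_land = non_land.drop(columns=["sub_type"])

# Step 3: drop every remaining row that still has a missing value.
clean = non_land.dropna().reset_index(drop=True)

print(clean.shape)  # rows and columns left after cleaning
```

Keeping the land rows in their own table, rather than discarding them, is what allows the separate look at land prices later on.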
Distribution of list price (non-land)
The distribution of list price of the non-land properties is right-skewed, meaning there are a few outlier properties that have very high list prices compared to the majority. Most properties have a list price less than $1 million, with very few properties listed as anything more than $5 million. This shows an expected distribution of most properties at an expected price point and a few very expensive properties included in the dataset. The mean list price was $763,895 and the median list price was $575,000.
Distribution of list price (land)
I also plotted the distribution of list prices for the properties categorized as land. This distribution was similarly right-skewed, with a few abnormally expensive properties in the dataset, but its scale was different from that of the non-land properties. On the whole, land properties were cheaper than other types of properties, with the vast majority of land properties costing less than $300k. The mean price of a land property was $348,320 and the median price was $183,500. The difference between the mean and median shows a severe right skew.
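To illustrate why a mean well above the median points to a right skew, here is a small sketch with invented prices (the numbers below are assumptions, not values from the dataset). One very expensive listing pulls the mean upward while the median barely moves.

```python
import pandas as pd

# Invented list prices: six ordinary listings and one very expensive one.
prices = pd.Series(
    [250000, 400000, 520000, 575000, 610000, 800000, 5200000],
    dtype=float,
)

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # sample skewness; positive means a longer right tail

print(f"mean:     {mean_price:,.0f}")
print(f"median:   {median_price:,.0f}")
print(f"skewness: {skewness:.2f}")
```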
Distribution of square footage
The distribution of square footage of the properties was slightly right-skewed as well, but closer to a normal distribution than the distribution of prices. Most properties were 5,000 sqft or less. The mean square footage was 2,211 sqft and the median square footage was 1,912.5 sqft. There were some very high outliers; the property with the largest square footage was 40,075 sqft. As this is the dataset without the land properties, this makes me wonder about the faithfulness of this entry, as it could have been a typo. It could, however, also indicate a property in a very rural area.
Distribution of year built
The majority of properties were built relatively recently, with the frequency of properties rising as the year built increases. The distribution has a left skew because some properties are very old, going back to the mid-1800s. Still, it shows that the majority of properties being sold in the Oregon market today were built after 1975.
Distribution of number of beds
The distribution of the number of bedrooms in the properties was roughly normal, apart from some high outliers: the property with the most bedrooms reported 24. Most properties had 1 to 5 bedrooms, with 3 being the most common number.
I plotted square footage (independent variable) against list price (dependent variable) to see whether my assumption of a positive relationship between the two would hold. The data did support this. I made a regular scatterplot and a second plot with a regression line for clarity. As square footage increased, so, generally, did the list price. Most listings were concentrated below 20,000 sqft and under $1,250,000, so I filtered the data to that range and plotted it again in order to better see the relationship between square footage and price.
The two plots below of the filtered data showed again a positive relationship between square footage and price, but a seemingly steeper slope of the regression line. This indicates to me that price can actually be very variable (maybe depending on the area or neighborhood type of the property, for which there was no data), but that square footage likely does influence the listing price of a property.
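A rough sketch of how one large, relatively cheap outlier can flatten a regression line, and how filtering can steepen it again, is below. The data is synthetic and the helper name fit_slope is hypothetical; only the cutoffs (20,000 sqft and $1,250,000) and the outlier's approximate size and price are taken from the text above.

```python
import numpy as np
import pandas as pd

# Synthetic listings: price rises by about $150 per square foot, plus noise.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 6000, size=200)
price = 150 * sqft + 100000 + rng.normal(0, 50000, size=200)
df = pd.DataFrame({"sqft": sqft, "listPrice": price})

# Add one very large but relatively cheap listing.
df.loc[len(df)] = [40075.0, 2500000.0]


def fit_slope(frame):
    """Least-squares slope of listPrice on sqft (dollars per square foot)."""
    slope, _intercept = np.polyfit(frame["sqft"], frame["listPrice"], deg=1)
    return slope


slope_all = fit_slope(df)

# Keep only the region where most listings are concentrated.
filtered = df[(df["sqft"] < 20000) & (df["listPrice"] < 1250000)]
slope_filtered = fit_slope(filtered)

print(f"slope, all data:      {slope_all:.1f}")
print(f"slope, filtered data: {slope_filtered:.1f}")
```

In this toy setup the filtered slope comes out steeper because the single high-leverage point no longer drags the line down, which is one possible reading of the steeper line in the filtered plots.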
What surprised me with the plots of all the data was that there were several outlier properties with relatively little square footage but very high list prices. These properties may be in areas with high costs of living and expensive neighborhoods. Additionally, the outlier property with a square footage of over 40,000 was oddly listed at a relatively low price (about $2.5 million), which could support the hypothesis that it is in a rural area.
The second plot with multiple variables is a box plot of the number of beds versus list price. Surprisingly, there was less of a relationship between the number of bedrooms in the property and the list price. Properties with 10 bedrooms or fewer mostly had a median price below $1 million. The median prices began to differ once the number of bedrooms increased, but this was also due to the relatively small number of properties with such a high number of bedrooms. For this reason, I filtered the data to include only properties with 10 or fewer bedrooms in order to see the graph more clearly.
It shows that there were several high outliers in price only for those properties with 6 bedrooms or fewer, and one outlier price among the properties with 8 bedrooms. This shows the absurdly high pricing of some properties regardless of the number of people they are meant for. Additionally, those properties with 9 bedrooms were noticeably the most variable in price, with the higher-priced half of those properties ranging up to over $10 million.
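The numbers behind a box plot like this can be checked with a grouped summary. The sketch below uses a handful of invented listings (an assumption for illustration) and applies the same filter of 10 or fewer bedrooms.

```python
import pandas as pd

# Invented listings; only the column names follow the dataset.
listings = pd.DataFrame({
    "beds": [2, 2, 3, 3, 3, 4, 4, 9, 9, 24],
    "listPrice": [300000, 350000, 500000, 575000, 4000000,
                  650000, 700000, 900000, 10500000, 3000000],
})

# Keep properties with 10 or fewer bedrooms, as in the filtered box plot.
subset = listings[listings["beds"] <= 10]

# Count, median, and range of list price for each bedroom count.
summary = subset.groupby("beds")["listPrice"].agg(["count", "median", "min", "max"])
print(summary)
```

The count column is worth reading alongside the median, since a group with only a few listings can produce a median that swings widely.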
Milestone 2
For milestone 2, I made variations of two different kinds of models: decision tree models and linear regression models. Using the dataset without the land type properties, I constructed three different decision tree models of different depths and three different linear regression models with different sets of independent variables. Below, I will analyze the accuracy of these models and compare them.
Data Preparation
In order to prepare the data for the models, I converted the categories of properties into dummy variables and then split the independent variables and the listPrice variable into training and testing sets.
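As a sketch of this preparation step, assuming pandas and scikit-learn: the table below is invented, and the 75/25 split and the random_state value are assumptions rather than the settings used in the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented cleaned listings; only the column names follow the dataset.
clean = pd.DataFrame({
    "type": ["single_family", "condos", "townhomes", "multi_family",
             "farm", "single_family", "condos", "single_family"],
    "sqft": [1900.0, 950.0, 1400.0, 3200.0, 2600.0, 2100.0, 800.0, 2750.0],
    "beds": [3.0, 2.0, 3.0, 6.0, 4.0, 4.0, 1.0, 5.0],
    "listPrice": [575000.0, 320000.0, 410000.0, 890000.0,
                  1200000.0, 640000.0, 265000.0, 815000.0],
})

# One 0/1 column per property type, e.g. type_farm, type_single_family.
encoded = pd.get_dummies(clean, columns=["type"], dtype=float)

# Independent variables (X) and the target (y).
X = encoded.drop(columns=["listPrice"])
y = encoded["listPrice"]

# Hold out a quarter of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(list(X.columns))
print(len(X_train), len(X_test))
```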
Decision Tree Models
Depth 5
I first trained a decision tree of max depth 5 on the training data. The resulting tree structure can be viewed in the Colab file. The tree first split the properties on square footage (sqft). Further down, fields like stories, garage, beds, year_built, baths, and type_farm were used.
When using this tree to predict list price on the test data, I evaluated the mean squared error (MSE). For testing data, the MSE was about 4.9e+11. On the training data, the same tree produced an MSE of 1.96e+11. The difference between these two MSEs was 2.94e+11. This shows a significant amount of overfitting to the training data.
Depth 3
I then trained another decision tree model with a max depth of 3. I went through the same process of training it on the training data, and finding its predictions for the testing data and training data. The difference between the MSEs of its predictions was about 1.85e+11. This is less than the difference between MSEs in the previous model, so we can conclude that this tree overfitted to the training data less and is better able to generalize to data it hasn't seen before.
Depth 7
The final decision tree model I trained had a max depth of 7. After performing the same training process and finding the predictions on testing and training data, the difference between the MSEs of its predictions was 3.86e+11. This is higher than the differences of MSEs for either of the other two decision trees, showing more overfitting to the training data.
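The comparison across depths 3, 5, and 7 could be sketched as a loop like the one below. It assumes scikit-learn's DecisionTreeRegressor and uses synthetic data, so the printed values will not match the MSEs reported above; the names results and gap are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the listings: price depends on sqft and beds.
rng = np.random.default_rng(0)
n = 400
sqft = rng.uniform(500, 6000, n)
beds = rng.integers(1, 7, n)
X = np.column_stack([sqft, beds])
y = 200 * sqft + 20000 * beds + rng.normal(0, 150000, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = {}
for depth in (3, 5, 7):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    # Gap between test and training error, used here as the overfitting measure.
    results[depth] = {
        "train_mse": train_mse,
        "test_mse": test_mse,
        "gap": test_mse - train_mse,
    }
    print(f"depth {depth}: train {train_mse:.3e}, test {test_mse:.3e}, "
          f"gap {test_mse - train_mse:.3e}")
```

A deeper tree can always fit the training data at least as closely, so the training MSE falls as depth grows; whether the test MSE follows is what the gap is meant to reveal.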
Recommendation
When using a decision tree model to predict the list price of a property, I would recommend using the model with max depth equal to 3. This model had the lowest difference in MSE between its predictions on the testing data and the training data. This shows that the model with depth 3 is best able to generalize findings to data it hasn't seen before.
Below is a chart comparing the differences in MSEs on testing and training data between the three models. As we can see in the plot, the decision tree of depth 3 had the lowest difference of all three models, and therefore the least overfitting.
Linear Regression Models
I then trained linear regression models on the same training data. I formed two separate dataframes of training and testing data in order to train the linear models and then test them. The linear models were fitted only to the training data.
All Variables
I first trained a linear regression model using all the fields besides list price to predict list price. Above is the summary of the resulting model. The R-squared was not very good at only 0.444, but many variables were found to be statistically significant in the model (P-values < 0.05). Only type_single_family, type_townhomes, garage, and year_built were not statistically significant in the linear regression model.
As shown in the below charts, the residuals produced by this linear regression model largely centered around zero, with a slight right-ward skew. This is reflected in the histogram as well as in the plot of actual list prices versus residuals. As list prices increase, the residual values tend to increase as well, indicating a tendency of the model to underestimate actual list prices as list prices increase. This could be due to outlying high prices of certain properties listed in the dataset.
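The residual pattern described here can be reproduced on synthetic data. The sketch below assumes scikit-learn's LinearRegression and right-skewed (lognormal) noise standing in for the few very expensive listings; none of the numbers come from the dataset. A residual is taken as actual minus predicted, so a positive residual means the model underestimated the price.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: price grows with sqft, plus right-skewed noise.
rng = np.random.default_rng(0)
n = 300
sqft = rng.uniform(500, 6000, n)
noise = rng.lognormal(mean=11.5, sigma=1.0, size=n)
price = 200 * sqft + noise

X = sqft.reshape(-1, 1)
model = LinearRegression().fit(X, price)

# Residual = actual - predicted; positive values are underestimates.
residuals = price - model.predict(X)

# Compare the most expensive 10% of listings with the rest.
top = price >= np.quantile(price, 0.9)
mean_resid_top = residuals[top].mean()
mean_resid_rest = residuals[~top].mean()

print(f"mean residual, overall:  {residuals.mean():.2f}")
print(f"mean residual, top 10%:  {mean_resid_top:,.0f}")
print(f"mean residual, the rest: {mean_resid_rest:,.0f}")
```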
Finally, I used the equation produced by the model to predict the values of the list prices on the testing dataset. Similarly to the evaluations of the decision tree models, I also used the equation to predict the list prices on the training dataset, calculated the MSE for both predictions, and took the difference to assess overfitting.
The MSE of the model's predictions on the testing data was 4.67e+11. The MSE of its predictions on the training data was 3.72e+11. The difference between these two is 9.45e+10. While this does show a significant degree of overfitting, the linear regression model is less overfitted than any of our decision tree models thus far.
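The same train-versus-test comparison for a linear model might look like the sketch below. It assumes scikit-learn's LinearRegression rather than whichever library produced the model summary in the project, and the data is synthetic, so the output will not match the MSEs quoted above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the listings.
rng = np.random.default_rng(1)
n = 400
sqft = rng.uniform(500, 6000, n)
beds = rng.integers(1, 7, n)
X = np.column_stack([sqft, beds])
y = 200 * sqft + 20000 * beds + rng.normal(0, 150000, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training rows only.
model = LinearRegression().fit(X_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
gap = test_mse - train_mse  # larger positive gap suggests more overfitting

print(f"train MSE: {train_mse:.3e}")
print(f"test MSE:  {test_mse:.3e}")
print(f"gap:       {gap:.3e}")
```

On a small synthetic sample the gap can even come out negative by chance, which is a reminder that a single split gives only a rough estimate of overfitting.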
Statistically Significant Variables
The next linear regression model I used included only those variables which the previous model had found statistically significant. Therefore, I removed the fields type_single_family, type_townhomes, garage, and year_built.
As shown in the charts featured above, the model is largely similar to the previous linear regression model. All fields included were statistically significant, although their coefficients changed compared to the previous model. The same pattern of residuals can be found, with underestimation more likely as list prices actually increase. The R-squared value is also almost exactly the same at 0.443.
After running the equation produced by the model on both the testing and training data, the MSEs produced were 4.67e+11 and 3.72e+11 for the testing and training data respectively. The difference between these two was 9.5e+10, very slightly higher than the previous model.
Square Footage and Bedrooms
The last linear regression model I ran was one with only the variables sqft and beds. I was interested to run this model because usually on property listings, these are two of the main criteria buyers are looking at and seem to influence the list price.
The R-squared for this model was slightly lower at 0.402, which was to be expected. All the variables included in the model were statistically significant.
Similar to the other linear regression models, this model had residuals roughly centered around 0, slightly less right-skewed than the previous two models. In the plot of actual list prices versus residuals, a very similar pattern to the other two models was shown, where underestimations were made by the model more frequently if the list price was higher.
When using the equation produced by this model to predict list prices, MSEs for testing and training data respectively were equal to 4.79e+11 and 4.05e+11. The difference between these two MSEs was found to be 7.32e+10. This was less than the previous two models, indicating that this model overfit to the training data the least.
Recommendations
When comparing the three linear regression models, the model including only sqft and beds came out as the best model by the overfitting measure. Although its MSE on the testing set (4.79e+11) was slightly higher than that of the other two models (4.67e+11), it had the lowest difference in MSE between the testing set and training set (7.32e+10). This shows that the third model overfit to the training data the least, and is the model most able to generalize to data it has not seen yet.
Conclusion
Of all the decision tree models and linear regression models used, the model most suited to predicting list price was the linear regression model that used the variables sqft and beds. This model showed the least overfitting (difference in MSE between predictions on testing and training data), making it the most generalizable to other data similar to the Oregon Q1 Property Listings dataset.
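For reference, the six MSE differences reported in this milestone can be ranked in a few lines. The values below are copied from the text above; the dictionary and label names are only illustrative.

```python
# Test-minus-train MSE differences as reported in the sections above.
reported_gaps = {
    "decision tree, depth 3": 1.85e11,
    "decision tree, depth 5": 2.94e11,
    "decision tree, depth 7": 3.86e11,
    "linear, all variables": 9.45e10,
    "linear, significant variables": 9.5e10,
    "linear, sqft and beds": 7.32e10,
}

# Sort from the smallest gap (least overfitting) to the largest.
ranked = sorted(reported_gaps.items(), key=lambda item: item[1])
for name, gap in ranked:
    print(f"{name:32s} {gap:.2e}")

best_model = ranked[0][0]
print("smallest gap:", best_model)
```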
Milestone 3
Principal Component Analysis
After making dummy variables of the type column and standardizing the data, I ran PCA with 2 components on the standardized data. When plotting the PCA data colored by the type of house, this is the graph that resulted:
I was surprised to find that the data was not separated by type of home. However, two distinct clusters formed based on the PC2 value. For the most part, the houses were split into those with PC2 values below -2 and those above -2. I then ran multiple KMeans clustering algorithms and calculated the silhouette score of each to be sure of the correct number of clusters.
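A minimal sketch of the standardize-then-PCA step, assuming scikit-learn and a small synthetic table (the columns are a subset of the real ones and the values are invented). The loadings table at the end is one way to check which original variables contribute most to each component.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features, including one dummy column for property type.
rng = np.random.default_rng(0)
n = 100
features = pd.DataFrame({
    "listPrice": rng.lognormal(13, 0.5, n),
    "sqft": rng.uniform(500, 6000, n),
    "beds": rng.integers(1, 7, n).astype(float),
    "year_built": rng.integers(1900, 2026, n).astype(float),
    "type_condos": rng.integers(0, 2, n).astype(float),
})

# Standardize so every column has mean 0 and unit variance.
scaled = StandardScaler().fit_transform(features)

# Project onto the first two principal components.
pca = PCA(n_components=2)
pcs = pd.DataFrame(pca.fit_transform(scaled), columns=["PC1", "PC2"])

# Loadings: the weight of each original variable in each component.
loadings = pd.DataFrame(
    pca.components_.T, index=features.columns, columns=["PC1", "PC2"]
)

print(pca.explained_variance_ratio_)
print(loadings)
```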
Again, I was surprised that the highest silhouette score came from 6 clusters rather than 2. I decided to proceed forward with 6 clusters instead of the initial 2 that I had thought would result from the tests. Below is the graph of PCA data colored by each of the 6 clusters.
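Choosing the number of clusters by silhouette score could be sketched as below, assuming scikit-learn. The points are three synthetic, well-separated blobs, so this toy example picks 3 clusters; the range of k values tried and the n_init setting are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
points = np.vstack([c + rng.normal(0, 0.5, size=(50, 2)) for c in centers])

# Fit KMeans for several cluster counts and record each silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)
    print(f"k={k}: silhouette {scores[k]:.3f}")

# The cluster count with the highest silhouette score.
best_k = max(scores, key=scores.get)
print("best k:", best_k)
```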
Again, the houses are clustered based on PC2 score, but this is not how the KMeans algorithm clustered them. The clusters are quite mixed, especially in those houses with PC2 < -2. For those with PC2 > -2, it seemed they were grouped into clusters 0, 5, and 2 respectively. I decided I had to look at the distribution of each variable among the clusters to see the differences between them. I looked at 7 different quantitative variables: listPrice, sqft, stories, beds, baths, garage, and year_built. Below are the averages of each variable by cluster.
There are some rough patterns that emerge from the clusters determined by the KMeans algorithm. Clusters 0 and 1 are similar, but Cluster 0 has the highest number of garages of any cluster, high average square footage, and a high number of bathrooms. Cluster 1 also has a high number of bathrooms on average, but fewer stories, indicating ranch homes. Cluster 2 consists of smaller and older houses, with the lowest average values for sqft and year_built. Cluster 3 houses are on average quite newly built and the most expensive, with more rooms relative to the average square footage of a cluster 3 house. Cluster 4 houses are conversely the cheapest on average and the newest, and have the most bedrooms on average of any cluster. Cluster 5 houses are the largest, with the highest average sqft value.
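Per-cluster averages like the ones discussed above can be computed with a single groupby. The table below is invented and uses only three clusters and four variables for brevity; the cluster labels would normally come from the fitted KMeans model.

```python
import pandas as pd

# Invented houses with an assigned cluster label.
houses = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "listPrice": [900000.0, 700000.0, 600000.0, 500000.0, 350000.0, 250000.0],
    "sqft": [3200.0, 2800.0, 2400.0, 2000.0, 1100.0, 900.0],
    "garage": [3.0, 2.0, 2.0, 1.0, 1.0, 0.0],
    "year_built": [2005.0, 1995.0, 1980.0, 1970.0, 1925.0, 1915.0],
})

# Average of every variable within each cluster.
cluster_means = houses.groupby("cluster").mean()
print(cluster_means)

# Which cluster has the smallest and the oldest houses on average.
smallest = cluster_means["sqft"].idxmin()
oldest = cluster_means["year_built"].idxmin()
print("smallest houses:", smallest, "| oldest houses:", oldest)
```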
When plotting the clusters' average PC1 and PC2 results, it became clearer as to how they related to the two visible clusters in the initial plot. Clusters 0, 2, and 5 were in the top cluster with PC2 > -2 and clusters 1, 3, and 4 were in the bottom cluster with PC2 < -2. They then also seem to be subsequently divided by PC1 value, with PC1 values increasing for clusters 2, 5 and 0 respectively, and for 1, 3 and 4 respectively.
In conclusion, while confusing at first, PCA with KMeans clustering served to somewhat cluster the data into identifiable groups. PC2 seems to be influenced by listPrice, judging by the similar average listPrice values between clusters 0, 2 and 5. PC1 seems to be influenced by the number of garages as well as the year built, based on similarities in those fields between clusters 1, 3, and 4. PCA and KMeans clustering may not have been the ideal analysis to run on this data, but some groupings were identified based on price and year built, and to a lesser degree, square footage and number of rooms.
Additional Seaborn Plots
I chose to create some additional seaborn plots using the size parameter. I plotted the year a home was built against its list price and then sized each point based on the square footage of the house. In the initial graphs, I also colored by type of house.
On the left is the plot of all houses, with year_built on the x-axis and listPrice on the y-axis. The points are colored by type of house, indicating single-family homes are the most commonly built, and the points are sized based on their square footage. I noticed a lot of crowding and it was hard to see some of the data points, so I decided to focus on houses built from the year 2000 to the year 2025. That graph is displayed on the right, and we can see that single family homes are still the most common. I wanted to explore the relationship between year built and list price for each type, so I separated out the categories. Those graphs are below.
I skipped single-family homes and farms: single-family homes because they dominated the graph of all types of homes, and farms because there were only 6 farms built between 2000 and 2025 in our dataset. This is significant, as it shows an emphasis on urban and suburban life rather than a rural landscape in Oregon today. For single-family homes, prices seem to be steady no matter when the house was built, with size most likely being a bigger contributing factor to the price, as seen in our previous analysis and in the size of the outlying points we can see in the graph.
I was surprised to find that multi-family homes were also less common in the last 25 years, with none being built between 2009 and 2019, most likely due to the financial crisis at the time. Multi-family homes do seem to increase in price and size the more recently they were built, with one large outlier in both size and price having been built in 2025.
There are significantly more condos and townhomes built within the last 25 years in our dataset. Townhome price has largely stayed the same as years have passed, with only a few larger townhomes being priced higher than usual in recent years. This excludes one very large and pricey townhome built around 2005.
Condos are different, however. There is another gap in condos built after 2009, again likely because of the financial crisis. Other than that, condo prices do seem to increase the newer the condo is, especially in 2025. Condos built before the crash in 2008 have more uniform prices that clump together, but there is a steep rise in prices for recently built condos.