Decentraland NFT Analysis

The goal of this project was to analyze Decentraland's land and wearable NFT sales data, identify trends, and create models to predict the sale prices. This website goes through the entire process of this analysis project: extracting, preprocessing, and visualizing the data, as well as creating and evaluating models. The data used in the project are all transactions made through Decentraland’s marketplace and do not include ones from third-party marketplaces. All the data was extracted through APIs from The Graph, Decentraland, and CoinGecko.

Generalized additive models and random forest regressor models were created for the land and wearable data. Baseline models were also created for comparison. The baseline for the land data was the average sale price grouped by estate size, and the baseline for the wearable data was the average sale price grouped by rarity. For both datasets, the random forest model with sale price in MANA as the response was the best-performing model. Compared to the baseline models, the mean absolute error decreased by 69% for land and 71% for wearables. The median absolute error decreased by 95% for land and 99% for wearables.

The models performed quite well and could be used in a couple of different ways. First, they could identify current listings that are underpriced as potential acquisition targets. Second, they could power a tool that shows users the estimated value of their land and wearables, or of ones they are interested in buying. This would be similar to how Zillow shows users a Zestimate® of a home's value.

Decentraland Introduction

Decentraland is a decentralized virtual reality platform powered by the blockchain. Its users can own things on the platform through non-fungible tokens (NFTs).

The power of the blockchain and NFTs is that once you buy an NFT, it is completely yours. The NFT is your proof of ownership, as recorded on the blockchain, and you can sell or transfer it anytime you like. If it goes up in value, you can sell it and profit. This contrasts with older systems and games, where you were awarded different things throughout gameplay but could not profit if the game became more popular or if you obtained a rare item. Decentraland NFTs can be bought and sold through Decentraland's own marketplace or through third-party marketplaces (like OpenSea).

Four types of NFTs can be owned and used in Decentraland: land, wearables, emotes, and names. A major benefit of NFTs is that the data is publicly available on the blockchain. However, this does not mean that the data can be easily sourced. The main objective of a blockchain is to create an immutable ledger that is shared among a peer-to-peer network of nodes, not to make querying data easy.

The Decentraland land is a grid-based map with X and Y coordinates. The map is made up of plazas (green), districts (purple), roads (grey), and user-owned parcels (dark grey), as seen in the image to the left. The plazas are where users respawn and have higher traffic. The districts are themed community areas, and some examples are Vegas City, Fashion Street, and Dragon City. All the dark grey parcels are owned by users and can be bought, built on, sold, and profited from. Each X and Y coordinate pair is a parcel of land. If multiple parcels are adjacent and owned by a single user, they form an estate.

Decentraland wearables are various types of items that users can use to customize their avatars. They include clothing, hair, body features, accessories, etc. The images to the left show a couple of different examples. MANA is the currency of Decentraland and is used to buy NFTs from the Decentraland marketplace.

Datasets

Three different sources were used to create two datasets: one for land and one for wearables. The data sources were The Graph, CoinGecko, and Decentraland. The land data was retrieved on 4/27/2023, and the wearable data on 4/18/2023.

The Graph is a decentralized protocol for indexing and querying blockchain data. The Graph makes it much easier to query specific data (like a traditional database) instead of sifting through all blockchain transactions and checking whether each one is related to the specific application. Three different APIs from The Graph were used. The first was for all land transactions, which are on Ethereum. The second was for wearable transactions on Ethereum, and the third was for wearable transactions on Polygon. This data only includes transactions made through the Decentraland marketplace and does not include transactions made on third-party marketplaces.
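Subgraphs on The Graph are queried with GraphQL, and large result sets have to be retrieved in pages. The sketch below shows one common way to page through an entity with an id cursor. It is only an illustration: the entity name `sales`, its fields, and the `post` callable are hypothetical placeholders, not the schema or endpoint actually used in this project.

```python
from typing import Callable, Dict, List

# Hypothetical query: the entity `sales` and its fields are placeholders,
# not the real Decentraland subgraph schema.
QUERY = """
query ($first: Int!, $lastId: ID!) {
  sales(first: $first, orderBy: id, where: {id_gt: $lastId}) {
    id
    price
    timestamp
  }
}
"""


def fetch_all(post: Callable[[dict], dict], page_size: int = 1000) -> List[Dict]:
    """Page through a subgraph entity using an id cursor.

    `post` is assumed to send a GraphQL payload to the subgraph endpoint
    and return the decoded JSON response as a dict.
    """
    rows: List[Dict] = []
    last_id = ""  # the empty string sorts before every real id
    while True:
        payload = {
            "query": QUERY,
            "variables": {"first": page_size, "lastId": last_id},
        }
        page = post(payload)["data"]["sales"]
        rows.extend(page)
        if len(page) < page_size:  # a short page means nothing is left
            return rows
        last_id = page[-1]["id"]  # continue after the last row seen
```

In practice `post` could be a small wrapper around an HTTP client, for example `lambda payload: requests.post(url, json=payload).json()`, where `url` is the subgraph's query endpoint.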

The CoinGecko API was used to retrieve the daily historical prices for Ethereum and MANA. The MANA price is used to convert the sale prices from MANA to USD. The MANA and Ethereum prices may also indicate the popularity of Decentraland and of blockchain applications as a whole, which may correlate with NFT sale prices, making them potential variables in the models.
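The MANA-to-USD conversion amounts to looking up the MANA price on the day of each sale and multiplying. A minimal sketch is below. The field names (`date`, `price_mana`, `price_usd`) and the numbers are made up for illustration and are not taken from the project's data.

```python
from datetime import date


def add_usd_price(sales, mana_usd_by_day):
    """Return copies of the sales with a USD price added.

    Each sale is converted with the MANA price of its own sale date.
    A KeyError is raised if no price exists for that date.
    """
    converted = []
    for sale in sales:
        rate = mana_usd_by_day[sale["date"]]
        converted.append({**sale, "price_usd": sale["price_mana"] * rate})
    return converted


# Made-up illustrative values, not real market data.
mana_usd_by_day = {date(2023, 4, 1): 0.5, date(2023, 4, 2): 0.25}
sales = [
    {"date": date(2023, 4, 1), "price_mana": 1000.0},
    {"date": date(2023, 4, 2), "price_mana": 400.0},
]
sales_usd = add_usd_price(sales, mana_usd_by_day)
```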

The Decentraland API was used to extract the locations and names of all the districts and plazas. This data was used to calculate, for each sale, the closest plaza and district and the distance to each, all of which were used as variables in the land model.

Data Preprocessing

All the data were returned as JSON responses. These were converted to tables in Python and checked for missing values. For the sales data and price data, there were only simple transformations that needed to be done.

The district data had a row for every district but included one row for all plazas and one for all roads. This means that all 11 plazas were combined. To differentiate the different plazas, a K-means clustering was performed, with K = 11, to find which coordinates should be grouped together. The image shows the results from the clustering, which perfectly separated all 11 plazas.
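The clustering step can be illustrated with a small, dependency-free K-means. This is only a sketch: the project does not state which implementation was used, the example uses three made-up blocks of coordinates instead of the 11 real plazas, and the farthest-point initialization is an assumption chosen to keep the example deterministic.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain K-means on 2-D points; returns (labels, centers)."""
    rng = random.Random(seed)
    # Farthest-point initialization: start from a random point, then
    # repeatedly add the point farthest from all centers chosen so far.
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    labels = []
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return labels, centers


def block(x0, y0, width, height):
    """All grid coordinates of a rectangular block (a stand-in for a plaza)."""
    return [(x, y) for x in range(x0, x0 + width) for y in range(y0, y0 + height)]


# Three made-up, well-separated blocks of plaza coordinates.
points = block(0, 0, 3, 3) + block(50, 0, 3, 3) + block(0, 50, 2, 2)
labels, centers = kmeans(points, k=3)
```

Each label identifies which block a coordinate belongs to, which is the same role the K = 11 clustering played for the combined plaza row.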

For each parcel of land, the closest plaza and distance to plaza needed to be determined, as proximity to these public areas could potentially be important variables in the model. The data from The Graph had a distance to plaza variable, but it only had data for parcels that had a distance of less than 10, which was only 9% of the data. With 91% blank values, something had to be done with the missing data. The options were: to replace all blanks with a global constant, replace them based on other features/variables (if possible), or calculate the distance for every sale. Since this data could be critical to the model's performance, the decision was made to calculate all the distances.

The original plan was to use the K-means clustering to predict the closest plaza for every parcel, and then the distance could be quickly calculated. However, this is not 100% accurate because the plazas are not circular, and two plazas near the bottom are not the same size as the other nine. The closest plaza and distance needed to be calculated another way.

K-means clustering to differentiate the points of the 11 plazas

A custom function was created for this calculation. Each plaza’s min x, max x, min y, and max y were stored in a data frame. For each parcel, the distance to each plaza was calculated from the plaza's min and max x and y and the parcel's x and y coordinates. The closest plaza and the minimum distance were returned. This worked because the plazas were all rectangular. The image to the right shows a couple of examples of the calculated distances. Both distances in the figure are 4. For one parcel, there is one parcel of land, then two parcels of road, and then one more step to reach the plaza in green. For the parcel in the image on the left, the distance is still 4 because the green plaza can be reached diagonally in 4 parcels. The differences between the K-means clustering approach and the custom function can be seen in the images below. The red stars are the corner points of the plazas. The colors show which plaza each parcel is closest to.

Both highlighted parcels are a distance of 4 away from the plaza in green

Closest plaza as determined by K-means clustering prediction, contains errors

Closest plaza as determined by custom function, 100% accurate
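The custom plaza-distance calculation described above can be sketched as follows. This is not the project's actual function. It assumes that a diagonal step counts the same as a horizontal or vertical one (Chebyshev distance), which is inferred from the two example parcels both being a distance of 4 away. The plaza names and bounds are hypothetical.

```python
def distance_to_plaza(x, y, plaza):
    """Number of parcel steps from (x, y) to the nearest parcel of a
    rectangular plaza, with diagonal steps allowed."""
    # Gap along each axis; 0 when the coordinate is within the plaza's span.
    dx = max(plaza["min_x"] - x, 0, x - plaza["max_x"])
    dy = max(plaza["min_y"] - y, 0, y - plaza["max_y"])
    # A diagonal step closes both gaps at once, so the larger gap decides.
    return max(dx, dy)


def closest_plaza(x, y, plazas):
    """Return (name, distance) of the plaza nearest to parcel (x, y)."""
    name = min(plazas, key=lambda n: distance_to_plaza(x, y, plazas[n]))
    return name, distance_to_plaza(x, y, plazas[name])


# Hypothetical plaza bounds, for illustration only.
plazas = {
    "Plaza A": {"min_x": 10, "max_x": 15, "min_y": 10, "max_y": 15},
    "Plaza B": {"min_x": 40, "max_x": 45, "min_y": 40, "max_y": 45},
}

straight = closest_plaza(6, 12, plazas)  # approaches Plaza A along a row
diagonal = closest_plaza(6, 6, plazas)   # approaches Plaza A diagonally
```

Both example parcels come out at a distance of 4 from Plaza A, mirroring the straight and diagonal cases in the figure.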

A similar approach was used to calculate the closest district and distance. The district results are not 100% accurate because the calculation assumes the districts are rectangular, and some are not. Since there are ~40 districts, getting 100% accuracy semi-efficiently would require a dynamic programming algorithm with memoization. Because the errors are few and small, and these variables were not believed to be as important as the plaza variables, the decision was made not to pursue that approach.

Data Understanding and Visualizations

Land Data Visualizations

The distribution of sale price is extremely right skewed. There are a lot of relatively low sale prices, then some high sale prices, and then a few very high sale prices that seem like they may be outliers. The figure below shows the distribution with the x-axis cutoff at $25,000. There are sale prices that go up to $2.3 million.

87% of transactions are for a single parcel. Of the estate transactions, 70% are for estate sizes of less than five parcels. This shows that there are very few transactions involving larger estates. Large estate sizes may not be an accurate predictor for the model due to the minimal data available. The figure below shows the distribution for estate sizes.

Estate size is the number of parcels that make up the estate

The figure below shows the trends in the quarterly average sale price per parcel and in the number of parcel and estate sales. The average sale price per parcel stayed consistent from 2018 to the end of 2020, then increased and hit a high in quarter 4 (Q4) of 2021. Since then it has been declining but remains above 2018 to 2020 levels. The number of transactions hit a high in Q4 of 2021 and a low in Q1 of 2023 (Q2 of 2023 was still in progress at the time of this analysis).

The figure to the right below shows that there is a difference in the average sale price per parcel based on the closest plaza. The average sale price per parcel ranges from $4,000 to $8,500, depending on the closest plaza. The plaza with the highest average sale price is Central Genesis (the very center of the map), and the lowest are Pixel 2 (one of the smaller plazas) and Southeast Genesis.

The two charts above are combined into the interactive chart below. You can hover over a data point to get information on that point, and you can click on a color or name in the legend for "Closest Plaza," and it will grey out all other plazas on the chart. To bring all the colors back, click on the chart but off the legend.

Average sale price per parcel correlates fairly well with distance to plaza and MANA price. There is a slight negative correlation with distance to plaza and a positive correlation with MANA price. There does not seem to be much correlation between average sale price per parcel and distance to district. These relationships can be seen in the figures below.

The map below shows all the plazas (in green), districts (in purple), and the road (in grey), along with all the sales. The sales are the blue points. The lighter the point, the lower the sale price per parcel. The darker the point, the higher the sale price per parcel. The gradient for the sale price per parcel is capped at $25,000, even though some data points go up to $435,000, so that points are easier to tell apart visually. The darkest points are the ones greater than $25,000. It is an interactive chart: you can pan and zoom, filter the sales data by year by clicking on the colors next to the years on the left (if you want to select multiple years, hold shift while clicking), hover over the districts and plazas to get their names, and hover over a sale point to get more information, such as sale date, closest plaza, closest district, sale price, and estate size.

The map below is the same but has a filter for sale price instead of year. This one is interesting because you can slide the filter to the far right to see only the highest sale prices and then slowly slide it left to observe where points appear on the map.

Wearable Data Visualizations

The wearable NFT sales data is made up of sales on Ethereum and sales on Polygon. When wearables were first sold in 2020, they were only on Ethereum. Polygon sales started in Q2 of 2021 and since then have been the majority of the sales. As of this analysis, the total number of sales on Ethereum is 9,282, and on Polygon is 182,976.

The average sale price on Ethereum is much higher than the average sale price on Polygon. This is expected since Ethereum has much higher transaction fees, so transactions are only worth making for more expensive items. The trend of the average sale price and number of transactions is shown in the figure below. The distribution of sale prices is extremely right-skewed, as shown in the figure to the right below. The x-axis is cut off at $100, but the highest sale prices go up to $20,000.

The sales on Ethereum have a large impact on the average sale prices, so the following charts only use Polygon sales. The first figure, on the left, shows the average sale price and number of sales across categories. The second figure, on the right, shows the average sale price grouped by category and rarity. For this second figure, the rarity level "unique" was excluded because its average sale price was much larger than the others and made the other lines indistinguishable. These figures show a clear difference in sale price between different categories and rarities.

Modeling

For both land and wearable data, multiple models were fitted. The first type of model was a generalized additive model (GAM). The initial model used a Gaussian (normal) distribution, and then the next used a gamma distribution due to the heavily right-skewed dataset. The other type of model was a random forest regressor (RF). These models were fitted with the response as sale price in MANA and again with the response as sale price in USD to see if there are any differences.

The data was randomly split into a training set and a test set. All models were trained on the training set and evaluated on the test set.
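A random split of this kind can be sketched in a few lines. The 80/20 ratio and the fixed seed below are assumptions for illustration, since the split proportion used in the project is not stated.

```python
import random


def random_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 sale records
train, test = random_split(rows)
```

Fixing the seed means every model is trained and evaluated on exactly the same rows, which keeps the comparison between models fair.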

Land Models

After the additive models were fitted, marginal plots were made to see the impact of each variable adjusting for all the other variables (i.e. keeping all the other variables constant). These plots can give you an idea of the impact each variable has on the response. A few of the marginal plots are shown below for both GAMs using the sale price in MANA as the response. For each plot, the x-axis is in the units of the title. For the normal distribution plots, the y-axis is the marginal change to the sale price in MANA. If it is positive, the sale price is increased by that quantity for that variable. For the gamma distribution plots, the y-axis is not as easily interpretable. The gamma GAM uses a link function that links the mean response to the predictors. Even though it makes it harder to interpret, the plots still show relative variable impact. A positive number increases the response, and a larger number has a greater impact than a smaller number.

From these plots, there are several takeaways. First, estate size seems to be the most impactful variable, and as estate size increases, the sale price increases, which seems intuitive. For the model using the normal distribution, there is a large decline around an estate size of 70, but that is probably due to the minimal sales data for larger estate sizes. Second, very low distances to plaza result in higher sale prices, but there isn’t much of a difference between a distance of 10 and distances much greater than 10. Third, sale prices (in MANA) are higher when the MANA price is low and remain fairly consistent at higher MANA prices. Lastly, as the Ethereum price increases, the sale price slightly increases as well for the normal distribution model. For the gamma model, the sale price starts higher, declines quickly, and then slowly increases up to an Ethereum price of around $2,000.

Marginal plots for GAM using normal distribution

Marginal plots for GAM using gamma distribution
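One way to read the gamma plots is sketched below. It assumes the gamma GAM was fitted with a log link, which is a common choice but is not confirmed here. The symbols are introduced only for this illustration: \mu_i is the expected sale price of sale i, \beta_0 is the intercept, and f_j is the smooth function of variable j shown in each marginal plot.

```latex
\log \mu_i = \beta_0 + \sum_{j} f_j(x_{ij})
\quad\Longrightarrow\quad
\mu_i = e^{\beta_0} \prod_{j} e^{f_j(x_{ij})}
```

Under that assumption, the plotted values add on the log scale and multiply on the price scale. A plotted value of 0 leaves the expected price unchanged, and a positive value scales it up by a factor of e^{f_j}, which is consistent with positive numbers increasing the response and larger numbers having a greater impact.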

From the random forest models, you can retrieve the variable importance, which shows how important each variable is to the model. The higher the importance, the greater the impact that variable had on the model. The figure to the right shows the importance from both random forest models. The two models were very similar: estate size was the most important, followed by MANA price and Ethereum price, and the remaining variables had fairly low importance.

Wearable Models

The gamma distribution additive model could not be solved and was not pursued further due to the promising results of the random forest model. Normal distribution additive models and random forest models were created for the wearable data. Similarly to the land models, marginal plots were made for the additive models, and a variable importance plot was made for the random forest models. The first figure below shows the marginal plots. The sale price varied across different categories and rarities. The MANA price and Ethereum price had an impact when their prices were low but were consistent at higher prices.

The category and rarity factors needed their values transformed from strings to numbers to be used in the model. The category and rarity marginal plots have the numeric label on the x-axis and the table to the right of the plots shows how the labels map to the different categories and rarities.

The two random forest models had similar variable importance except for rarity and collection. For those two variables, the models had the importance essentially flipped. The model with the response variable as sale price in MANA had rarity as the most important variable. The model with the response variable as sale price in USD had collection as the most important.

Marginal plots for GAM using normal distribution

Evaluation

Regression models can be challenging to evaluate since a prediction is not simply correct or incorrect, as it is with classification models. For this reason, baseline models were created as a comparison, and several evaluation metrics were examined: mean absolute error, median absolute error, percent of predictions within 10% of the actual, and percent of predictions within a certain dollar amount of the actual.
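The four metrics can be computed directly from the actual and predicted prices. The sketch below uses made-up numbers, and the function and key names are chosen for this illustration only.

```python
from statistics import mean, median


def evaluate(actual, predicted, abs_tolerance):
    """Compute the four evaluation metrics for a set of predictions.

    `abs_tolerance` is the dollar band for the last metric, for example
    1000 for land or 1 for wearables.
    """
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    n = len(errors)
    within_10pct = sum(e <= 0.10 * abs(a) for e, a in zip(errors, actual))
    within_abs = sum(e <= abs_tolerance for e in errors)
    return {
        "mean_abs_error": mean(errors),
        "median_abs_error": median(errors),
        "pct_within_10pct": 100 * within_10pct / n,
        "pct_within_abs": 100 * within_abs / n,
    }


# Made-up prices, for illustration only.
actual = [100.0, 200.0, 1000.0, 50.0]
predicted = [105.0, 150.0, 1000.0, 80.0]
metrics = evaluate(actual, predicted, abs_tolerance=10)
```

The median absolute error is much less sensitive than the mean to a handful of very large misses, which matters for data as right-skewed as these sale prices.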

Land Models

A typical baseline model uses the overall average sale price as the prediction for all test data. The baseline used here was slightly more informed: it used the average sale price grouped by estate size as the prediction for the test data.
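A grouped-average baseline of this kind can be sketched as below. The field names and prices are made up, and falling back to the overall average for an estate size that never appears in the training data is an assumption of this sketch, since the handling of unseen groups is not described.

```python
from collections import defaultdict
from statistics import mean


def fit_baseline(train, key="estate_size", target="price"):
    """Learn the average target per group, plus the overall average."""
    groups = defaultdict(list)
    for row in train:
        groups[row[key]].append(row[target])
    group_means = {group: mean(values) for group, values in groups.items()}
    overall = mean(row[target] for row in train)
    return group_means, overall


def predict_baseline(model, rows, key="estate_size"):
    """Predict each row's price as the average of its group."""
    group_means, overall = model
    # Assumption: groups unseen in training fall back to the overall average.
    return [group_means.get(row[key], overall) for row in rows]


# Made-up training data, for illustration only.
train = [
    {"estate_size": 1, "price": 100.0},
    {"estate_size": 1, "price": 300.0},
    {"estate_size": 2, "price": 1000.0},
]
model = fit_baseline(train)
predictions = predict_baseline(
    model, [{"estate_size": 1}, {"estate_size": 2}, {"estate_size": 7}]
)
```

The same two functions would cover the wearable baseline by passing a rarity field as `key`.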

The models with the response as sale price in MANA had their absolute errors converted to USD so that all the models were directly comparable.

A table of results is below. The random forest model with sale price in MANA was the best-performing model. It had the lowest mean absolute error and median absolute error and the highest percent within 10% of actual and percent within $1,000 of actual.

Order of models, best to worst: RF (MANA), RF (USD), GAM Gamma (MANA), GAM Normal (MANA), GAM Normal (USD), Baseline

The best model, random forest model with sale price in MANA, had a mean absolute error of $2,793. At first glance, that probably doesn’t look great. However, the data has sale prices ranging from $0.03 to $2,315,502, and the model is much better than the baseline. When compared to the baseline model, the mean absolute error decreased by 69%, the median absolute error decreased by 95%, and the percentage within $1,000 increased from 7.2% to 72.9%.

Wearable Models

The baseline model for the wearables used the average sale price grouped by rarity as the predictions for the test data.

Similar to the land models, the models with the response as sale price in MANA had their absolute errors converted to USD so that all the models were directly comparable.

A table of results is below. The random forest model with sale price in MANA was the best-performing model again. It had the lowest mean absolute error and median absolute error and the highest percent within 10% of actual and percent within $1 of actual. The random forest model with sale price in USD was only slightly worse. The GAMs using normal distribution were worse than the baseline model. This is a clear indication that they are not correctly modeling the data and should not be used.

Order of models, best to worst: RF (MANA), RF (USD), Baseline, GAM Normal (USD), GAM Normal (MANA)

The best model, random forest model with sale price in MANA, had a mean absolute error of $7.86. When compared to the baseline model, the mean absolute error decreased by 71%, the median absolute error decreased by 99%, and the percentage within $1 increased from 6.1% to 70.5%.

The two random forest models had a large discrepancy in variable importance, with one having rarity as the most important and the other having collection as the most important. However, these differences ended up leading to very similar results.

Conclusion

The datasets were retrieved through APIs from The Graph, CoinGecko, and Decentraland. This data only includes transactions that occurred on Decentraland's marketplace and does not include any from third-party marketplaces. The sale price data for both wearable and land are heavily right-skewed. The average sale price and the number of sales have been declining since reaching their highs near the end of 2021/start of 2022.

Generalized additive models and random forest models were created for both datasets. The random forest models performed best for both land and wearables. The most important variable, according to the random forest model, was estate size for land and rarity for wearables (although the other random forest model for wearables had collection as the most important variable). Baselines were created as a comparison to the models. The baseline for land was the average sale price grouped by estate size, and the baseline for wearables was the average sale price grouped by rarity. The best model decreased the mean absolute error compared to the baseline by 69% for land and 71% for wearables. The median absolute error decreased by 95% for land and 99% for wearables.

This project leads to some interesting future work opportunities. The first would be to include all the Decentraland NFT transactions that occur through third-party marketplaces. This would give a complete picture of current sales and allow an analysis of whether there are any differences between marketplaces. The other opportunity would be to get data on other decentralized virtual worlds (like The Sandbox) so that the platforms can be compared and their differences analyzed.
