Final Report

Introduction

Over the span of just a few short decades, the internet has grown from a niche tool for researchers into the indispensable part of daily life it is today. Underpinning this growth, however, is a variety of services required to make it function: website hosting, content delivery, and certification, to name a few, each requiring separate providers in markets that have seen growing consolidation. In contrast, the Interplanetary File System, or IPFS, aims to provide all three of these functions and more together in one package, using the most abundant resource available on the internet: the machines that users use to connect to it. My project aims to evaluate how effectively IPFS has achieved this goal by examining its implementation, adoption, and current use.


Overview of IPFS

IPFS stands on the shoulders of many previous works, including but not limited to: P2P overlay networks such as CAN, Chord, and Pastry; operational DHT implementations such as BitTorrent's Kademlia implementation; and content-based addressing, as first proposed by information-centric networking and later taken up by named data networking.

IPFS itself relies on four key concepts, the first of which is decentralized object indexing. IPFS uses a DHT based on Kademlia to locate data objects. Each node maintains 256 buckets of size 20 that partition its known portion of the overlay network. New peers join the DHT as clients by default and are promoted to servers if they are publicly reachable. The process works as follows: a new node joins and asks its known neighbors to initiate connections back to it; if four or more are able to do so, it is considered a server. The difference between a DHT client and server is that a client only requests data items from the network, while a server also stores mapping records and provides them to any client that requests them. The distinction exists for performance, so that peers behind NAT do not clog up the routing tables of their neighbors. Thus, IPFS is able to host data items without requiring dedicated infrastructure, instead leveraging the same nodes that users operate and request data from to store it.
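The bucket structure above follows from Kademlia's XOR distance metric. The sketch below, a simplification that uses plain SHA-256 digests as stand-ins for real peer identifiers and the shared-prefix-length bucket convention, shows how a node might decide which of its 256 buckets a peer belongs in; the function names are illustrative, not IPFS APIs.

```python
import hashlib

def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia's distance metric: XOR the two identifiers and
    interpret the result as an unsigned integer."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")

def bucket_index(self_id: bytes, peer_id: bytes, id_bits: int = 256) -> int:
    """Bucket a peer falls into, taken here as the length of the
    identifier prefix it shares with us; each extra shared bit
    halves the slice of the identifier space the bucket covers."""
    d = xor_distance(self_id, peer_id)
    return id_bits - d.bit_length()

# Identifiers are hashes, so two arbitrary peers almost always share
# only a short prefix and land in a low-index (distant) bucket.
me = hashlib.sha256(b"my-node").digest()
peer = hashlib.sha256(b"some-peer").digest()
print(bucket_index(me, peer))
```

Because identifiers are uniformly distributed hashes, most of a node's 256 buckets stay nearly empty; only the low-index buckets covering distant peers fill up to their capacity of 20.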

The second concept is content-based addressing. Since data items in IPFS are served by other peers in the network rather than a central server, data items are not addressed by their location: multiple nodes should store each data item for availability, since IPFS does not force or incentivize nodes to stay online. In this context, these nodes are called data providers.

When a node requests a particular data item, it performs two separate operations in parallel. The first is to query the underlying DHT for the data providers, using the content identifier in multi-round, iterative lookups. In each successive round, the query searches for nodes with identifiers close to that of the data item, measured by XOR distance; each queried node either responds with the requested data item, if it is a data provider, or replies with its known neighbors whose identifiers are close to the content identifier, to be queried in the subsequent round. The second operation is to request the item from all known neighbors using BitSwap, a subprotocol within IPFS in which a node informs all its known peers of its wantlist, a list of all data items it is interested in. Data providers then reply without coordination, meaning they take no measures to avoid duplicating data. This increases network traffic, as in the general case multiple nodes reply to the same request. However, it has security benefits: an attack on the DHT has limited effect, since a node can still obtain data items from its neighbors as long as one of them is able to respond. This is the solution IPFS uses for content distribution, requiring no Content Delivery Networks and instead using the volume of peers to deliver content.
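The multi-round DHT lookup described above can be sketched as a toy simulation. Everything here is a stand-in for real network RPCs: `neighbors` is a hypothetical map from each peer to the closer peers it would return, `providers` is the set of peers holding the item, and identifiers are plain SHA-256 digests rather than real peer IDs.

```python
import hashlib

def xor(a: bytes, b: bytes) -> int:
    """XOR distance between two identifiers."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")

def iterative_lookup(start_peers, neighbors, providers, target, rounds=20):
    """Each round, query the closest not-yet-queried peer: it either
    serves the item (it is a provider) or returns neighbors closer to
    the target, which seed the next round."""
    known = list(start_peers)
    queried = set()
    for _ in range(rounds):
        candidates = [p for p in known if p not in queried]
        if not candidates:
            return None                    # nobody left to ask
        peer = min(candidates, key=lambda p: xor(p, target))
        queried.add(peer)
        if peer in providers:
            return peer                    # provider found; fetch item
        known.extend(neighbors.get(peer, []))  # closer peers for next round
    return None

a, b, c = (hashlib.sha256(x).digest() for x in (b"peer-a", b"peer-b", b"peer-c"))
# Contrived topology: a knows b, b knows c, and c provides the item.
print(iterative_lookup([a], {a: [b], b: [c]}, {c}, c) == c)  # True
```

The real protocol queries several closest peers concurrently in each round, but the convergence logic, always moving toward smaller XOR distance, is the same.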

The third concept is self-certification. IPFS has the property of a self-certifying file system: each data item is assigned a unique, immutable address, its content identifier, generated from the hash of its contents. When a node receives a particular data item, it verifies the hash by comparing it with the hash it generates itself from the data it received. If the two differ, some form of tampering has occurred, as the content identifier is immutable. This allows IPFS to bypass the central Certificate Authorities that the current web relies on for public key infrastructure.
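The verification step is small enough to show in full. This is a minimal sketch: real content identifiers wrap the digest in multihash and codec metadata, so plain SHA-256 stands in for that here, and `verify_block` is an illustrative name, not an IPFS API.

```python
import hashlib

def verify_block(expected_digest: bytes, data: bytes) -> bool:
    """Self-certification in miniature: recompute the hash of the
    bytes we actually received and compare it against the digest
    embedded in the content identifier."""
    return hashlib.sha256(data).digest() == expected_digest

original = b"hello ipfs"
cid_digest = hashlib.sha256(original).digest()  # the published identifier

print(verify_block(cid_digest, original))        # True: untampered
print(verify_block(cid_digest, b"hello evil"))   # False: contents changed
```

Because the check needs nothing but the identifier and the data itself, any peer can serve the item and the requester still gets an integrity guarantee without trusting that peer.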

The fourth concept is open participation. Joining the network as a node is quite simple: the aspiring peer generates a public/private key pair, and the hash of the public key is used to generate its peer ID. To accommodate the wide geographic distribution of peers, this peer ID forms the basis of a so-called multiaddress, named because it encodes the sequence of network and transport protocols needed to communicate with a peer, allowing the connectivity of the wide variety of nodes operating as peers to be determined quickly and easily. The ease with which an attacker can create an IPFS identity makes Sybil attacks a potential concern; however, due to the properties of the BitSwap protocol discussed earlier, limiting the availability of stored data is quite difficult. Users can also request data from the IPFS network without formally joining as a node, through gateways: HTTP entry points into IPFS that combine a DHT server node with an Nginx web server on a public IP address, translating GET requests containing a content identifier into DHT queries. The hope is that this ease of access will incentivize users to join as nodes, and keep existing nodes from leaving.
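The identity scheme above can be sketched in a few lines. This is a simplification under stated assumptions: real IPFS peer IDs are multihash-encoded in base58 rather than hex, the key bytes below are a stand-in rather than a real key pair, and the IP address is a documentation placeholder. Only the multiaddress layout (`/ip4/.../tcp/.../p2p/...`) follows the actual format.

```python
import hashlib

def peer_id_from_pubkey(pubkey_bytes: bytes) -> str:
    """Derive a peer identifier by hashing the public key.
    Hex output stands in for the real base58 multihash encoding."""
    return hashlib.sha256(pubkey_bytes).hexdigest()

def multiaddress(ip: str, port: int, peer_id: str) -> str:
    """A multiaddress encodes, read left to right, the protocol stack
    needed to reach a peer: IPv4, then TCP, then the libp2p peer."""
    return f"/ip4/{ip}/tcp/{port}/p2p/{peer_id}"

pub = b"(stand-in public key bytes)"     # hypothetical, not a real key
pid = peer_id_from_pubkey(pub)
print(multiaddress("203.0.113.7", 4001, pid))
```

Because the address self-describes its protocol stack, a peer reachable over, say, QUIC instead of TCP simply advertises a different multiaddress, and other nodes know immediately how to dial it.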


Previous work

Prior work has focused on the performance and topology of the underlying overlay network, often with contradictory results, with some works claiming the existence of a feature that others deny. This is perhaps to be expected given its history: the first version, released in 2015, was vastly different, and the rapid pace of development has changed many aspects of IPFS. For example, one study from 2020 reports that its crawler received responses from only 6.55% of the nodes it encountered, suggesting that many were behind NAT and operated by private individuals; notably, the study predates the distinction between client and server nodes. Authors generally agree, however, that IPFS operations exhibit high latency that grows worse with larger file sizes, though in controlled test-bed settings its performance is still much better than BitTorrent's.


Current adoption

Organizations that adopt IPFS can leverage it as distributed storage in isolated private networks, putting the free disk space of their machines to full use. These deployments are difficult to track and catalog, as the crawlers used to index public-facing IPFS nodes cannot discover them. However, given the performance characteristics of IPFS, discussed later, as well as the stringent uptime required of IPFS nodes acting as servers, it is unclear whether IPFS is well suited to this application. A more practical and arguably more useful role for IPFS is to serve as the backend of third-party applications. This way, end users do not have to interact with IPFS directly, as the only way of meaningfully doing so is with the content identifier of the data item of interest, which is difficult to acquire, as will be explained later. These services range from content delivery itself to decentralized finance, social media applications, and efforts to ensure data persistence. It is difficult to see, however, how these novel applications might grow into a critical mass that could present an alternative to HTTP, which IPFS states as one of its design goals. One reason for the success of the current web architecture is the self-sustaining cycle underlying it: if end users can find what they are looking for, they are more likely to keep using the web to find other things, giving more web pages a reason to exist to cover more obscure or higher-quality content. Similarly for BitTorrent: if users can retrieve what they are looking for, they are more likely to keep using it for data retrieval, encouraging participation and growth.
However, for IPFS, end users do not interact with its ecosystem directly, and every organization has its own goals and requirements to weigh in choosing IPFS; adoption by one's peers therefore carries little weight, as each organization's needs must be evaluated case by case.

Another prominent use of IPFS has been to store snapshots of Wikipedia for users in countries where the website has been censored for one reason or another. However, these are read-only snapshots taken by volunteers and are updated infrequently: for example, the page for the late Queen Elizabeth II still portrays her as alive. The infrequent updates can be attributed to the cumbersome process of taking a new snapshot, which requires configuring an IPFS node, downloading and extracting a snapshot from a third party, and adding the data from the snapshot to the node, a lengthy and disk-space-hungry process, seeing as the entirety of the English Wikipedia exceeds 250 GB. Though these issues need to be resolved, they are not crippling to the average user experience, as the vast majority of articles do not require constant revision. A more pressing problem is that the snapshot, being hosted on IPFS, is reached through a link to a single page and has no search function associated with it, so the only way to traverse the site is through links, which makes accessing a particular page both time consuming and unpleasant.

Another issue is the accessibility of these snapshots. IPFS supports mapping a DNS name to an IPFS address, allowing the cumbersome content identifiers to be replaced with something far more readable. However, if censorship is truly a concern, DNS resolution can be blocked when the query contains sensitive content, so users must choose an alternative. One option is to go through public gateways instead, but even then a degree of trust must be placed in their maintainers, and they represent central points of failure in a system designed to promote decentralization. The most reliable and secure way of accessing a page, then, is by its content identifier, which raises the problem, discussed in the next section, of obtaining the content identifier in the first place.



User experience

Quite simply, the user experience of IPFS is not up to the standard we have come to expect from the internet. There exists an issue with the current implementation of IPFS where the longer a node runs, the more likely it is to encounter an error that blocks incoming connections, denying the node the ability to connect to any new peers. Given the amount of churn in the DHT, this eventually leaves the node with no peers in its neighbor list. The node must therefore be periodically restarted to maintain an adequate number of neighbors, which in turn triggers a bandwidth-hungry peer discovery process in which the node attempts to initiate connections to as many peers as possible. Though necessary, this process monopolized all network traffic on my test machine, which has a 75 Mbps connection. Neither issue would be overly pressing on its own, but the combination of the two, one requiring constant restarts of the node and the other penalizing them, made running an IPFS node as a private individual an unpleasant experience. An even more crippling issue, however, is the reliance of IPFS on content identifiers to fetch data items. Whether through the command line, the GUI that composes IPFS commands for you, or a web browser pointed at an IPFS gateway, retrieving a data item requires first knowing its content identifier. Combined with the intentional lack of centralization, this means the bottleneck in using IPFS is not its retrieval delay or retrieval time, but the inability to acquire the content identifier of the data item you are interested in. Though there is plenty of data stored on IPFS nodes, this hardly matters to the end user, as retrieving a data item is difficult without knowing its content identifier.

There have been some efforts to remedy this, however: there exists a search engine for IPFS, called IPFS Search, that uses a crawler and sniffer to extract hashes from nodes, retrieves the data behind each hash, and indexes both for future queries. Compared to the functionality provided by modern search engines such as Google, however, it is sorely lacking, only allowing filters on file size, the type of data item, and when the crawler last saw the file. It then performs a regular-expression match on the search query, ignoring results that are similar but spelt differently or listed under a synonym. In essence, the state of IPFS resembles the internet of the early 90s: it existed, but the various indices that aggregated websites were only local, though rudimentary web crawlers attempting to index the whole web were beginning to appear.


Then there is the matter of the content identifiers themselves. Being hashes, they are unreadable by design and difficult to keep track of, as they bear no relation to the title of a data item or its contents. This makes content identifiers opaque to the user requesting an item, who cannot be sure of what they are requesting until they have already received it. Of course, IP addresses once had this problem as well, and it was solved by the introduction of DNS. IPFS likewise aims to address the issue with DNS through what it calls DNSLink, where DNS records map to IPFS addresses, allowing a more human-readable address to be used as a query.
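Concretely, DNSLink works through an ordinary DNS TXT record on a `_dnslink` subdomain whose value points at an IPFS path. The fragment below is a hypothetical zone entry: the domain is a placeholder, and the content identifier is deliberately left as one rather than invented.

```
; Hypothetical DNSLink entry: a TXT record under the _dnslink
; subdomain maps a readable name to an IPFS path.
_dnslink.docs.example.com.  300  IN  TXT  "dnslink=/ipfs/<content identifier>"
```

A DNSLink-aware resolver looks up this record, extracts the `/ipfs/...` path, and then fetches the item by its content identifier as usual, so the human-readable name is a layer of indirection, not a replacement for the identifier.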


Strengths

Compared to the current web architecture, an organization looking to host a website or any other data item has no upfront cost beyond the machine used to store it. This has the potential to usher in an explosion of web traffic, as people with good ideas are more likely to implement them when the cost of doing so is nothing rather than small but measurable. The uptake of IPFS is also quite widespread, both in the number of nodes, over 300,000, and in their geographic distribution, with a relatively even spread of peers around the world and the largest concentration, 28%, in the United States. This achieves one of the design goals of IPFS, decentralization, and provides resilience to network fragmentation and disruption attempts. Another strength is that IPFS, having learned from previous DHT implementations, is well aware of the effect of churn on peer routing tables and data availability, and so sets a high replication factor of 20 to ensure that departing nodes do not remove the network's ability to provide content. This is in contrast to BitTorrent, where it is common for obscure data items to have no seeds and be impossible to obtain, and to the dead links on websites today, caused by website providers being motivated by economic incentives rather than user interest.

The performance of IPFS, especially its retrieval times, is quite competitive with HTTP, with operations taking around four times longer. This seems poor on paper, but it includes the DHT walk past unreliable peers, and the benchmark is data delivered over HTTP by mature Content Delivery Networks. In contrast, previous DHT implementations such as BitTorrent have a median lookup latency of over a minute due to excessive dead nodes, while IPFS sidesteps the issue of punching through NAT by not adding such peers to its routing table during peer discovery, with over half of probes completing in under a second. IPFS does have worse publication performance, but given its stated goal of data item permanence, publication times are of far less significance; they are a byproduct of requiring multiple hash operations.


Weaknesses

As it currently stands, the largest obstacle to the success of IPFS is the proliferation and ubiquity of the current web architecture. When the internet was still a niche application for enthusiasts and researchers, much as IPFS is now, it had the potential to revolutionize the way humans interact with each other, even if people did not know it at the time, and any company able to carve out a portion of that market more likely than not became a technology giant. But what IPFS is trying to do is fundamentally the same as what the internet has already done: it still aims to deliver data items to end users, admittedly through different methods, but the data items themselves are the same. Of course, IPFS need not revolutionize our lives the way the internet has, but because it is fundamentally a peer-to-peer distributed file system, its speed and reliability are tied to the number of nodes online at any given time, which in turn is tied to its popularity. This matters because the average end user is unlikely to use the current incarnation of IPFS directly without a robust search engine, especially as BitTorrent boasts a larger user population and can therefore provide data items faster, with more seeds/data providers, than IPFS. The growth of IPFS nodes is thus capped at a relatively small fraction of all the internet-capable devices in the world, meaning the space for growth is limited. As it exists today, the most promising use of IPFS is to help organizations host their own content, which negates many of its disadvantages: users do not need to use IPFS directly, and content identifiers can be indexed and stored by a central authority within the organization's records.
Additionally, the effect of churn on data availability, a concern for all distributed hash tables, is somewhat mitigated by the fact that an organization can dedicate some of its own machines to being IPFS data providers. Even so, IPFS faces hurdles in adoption, as it enters an established market with large players such as HTTP and BitTorrent already dominating market share. Organizations must ask why they would choose IPFS over these competitors, and the answer may not be straightforward: though the strengths of IPFS are numerous, so are its weaknesses.


Conclusion and future work

IPFS is a late player to an already well-established game. This has the downside of a greater barrier to entry, as the existing players already hold market share, but the benefit of incorporating the myriad file distribution techniques invented since the inception of its competitors to boost its reliability and performance. As it stands, IPFS holds the most potential for organizations looking for a secure and decentralized way to distribute their product to their users, bypassing the need for those users to interact with the complicated IPFS ecosystem directly.

Though there are a number of different directions for future work, the most promising avenue would be to build an application that stores its data using IPFS as the backend, and compare its ease of use and retrieval times with a more typical technology stack to determine in depth how well suited IPFS, in its current incarnation, is for developers.




References


Trautwein, Dennis, et al. "Design and evaluation of IPFS: a storage layer for the decentralized web." Proceedings of the ACM SIGCOMM 2022 Conference.

Abdullah Lajam, Omar, and Tarek Ahmed Helmy. "Performance evaluation of IPFS in private networks." 2021 4th International Conference on Data Storage and Data Engineering.


Henningsen, Sebastian, et al. "Mapping the interplanetary filesystem." 2020 IFIP Networking Conference (Networking). IEEE, 2020.


Chen, Yongle, et al. "An improved P2P file system scheme based on IPFS and Blockchain." 2017 IEEE International Conference on Big Data (Big Data).


Rhea, Sean, et al. "Handling churn in a DHT." Proceedings of the USENIX annual technical conference. Vol. 6. 2004.


de Bruin, Mathis. IPFS Search Documentation, 2021. https://ipfs-search.readthedocs.io/en/latest/.


Uncensorable Wikipedia on IPFS. https://blog.ipfs.tech/24-uncensorable-wikipedia/.


About the author

Larry Xu is a fourth year Computer Science student at the University of Victoria. He has done two Co-Op terms at Allsalt Maritime working to refine wave impact sensors for naval vessels and is expecting to graduate Fall 2023.