P2P Security in the Emergent Internet

ECE1548 - Advanced Network Architectures 

Abstract

This paper introduces several concepts and high-level elements that will form the base architecture of the Emergent Internet. At the highest level, we examine the advantages and disadvantages of specific overlay network technologies used in P2P systems in order to decide on a model for the overlay networks that will be used in P2P communities. We then examine the cross-layer issues of distributed storage, transitive trust systems, and a modified version of RSA for use in securing the global P2P infrastructure.

Introduction

In this project I examine the general properties of the Emergent Internet and predict how it will evolve over the next 10 years. I focus on security and the high-level components from the field of P2P security that the Emergent Internet will require. Specifically, I propose the characteristics and requirements of P2P security, within the context of the Emergent Internet, needed to support the services it will provide.

The Emergent Internet

Emergent behaviour, otherwise known as emergence, is a phenomenon in which a system of simple or irreducible elements that are easily described and analysed at the microscopic layer work together to create very complex patterns at the macroscopic layer. Systems that exhibit emergent behaviour resemble distributed systems: each node in a distributed system runs a local algorithm that uses little or no communication with other nodes, and the local algorithm is designed to be self-synergetic. The result is that the local algorithm, when run on all nodes, cooperates within the context of the system and can perform kinds of distributed computing that may or may not be evident when looking only at the local algorithm. A very good example of emergent behaviour is observed in the human brain. Thought patterns are an emergent property of the individual synapses, which are not in themselves capable of thought. The synapses do not exhibit any properties that would indicate that their function is to facilitate the thought process, yet thought processes are what they produce at the macroscopic layer, when the community of synapses is viewed as a whole.

The Emergent Internet is an emergent phenomenon of the Internet. Specifically, the nodes within the Internet function as elements of a community. These nodes can be altruistic, cooperative or freeloading, which likens them to individuals as defined in evolutionary biology. This is a rather striking observation: the Internet itself possesses properties when viewed as a whole that are not as obvious when the individual nodes are examined. What properties do we observe at the macroscopic level, in this case the global level? There are several, and to better understand the concept of the Emergent Internet, it is instructive to look at a few of them.

Emergent Behaviour in the Emergent Internet

Emergent behaviour in the Internet ranges across the entire spectrum from very simple to very complex. Some of these behaviours are due to the explicit imposition of rules and structure, while others are the result of individual tools introduced for one purpose that incorporate themselves into larger systems.

Google Search and Google Sets

Google drives one part of the Emergent Internet. In particular, its contribution to the Emergent Internet is due to the development of one technology that in itself has emergent properties: Google search. The Google search engine is successful because it indexes pages in a revolutionary way compared to other search engines.

Before Google search, search engines indexed pages based on specific keywords and their relative frequency. If a given page's keywords matched every word in a user's query, the page would have a high rank. The problem with this approach to search is that a web administrator has the power to be dishonest about the content of his page. For instance, suppose that Microsoft wanted a search engine to rank their page highly for the search query "open source operating system". In order to do this, Microsoft could put the keywords "open source" and "operating system" in their metadata keywords. The page content could also use the keywords extensively without actually saying that Microsoft Windows is an open source operating system. Microsoft might for instance say: "Microsoft Windows is more economical than an open source operating system such as Linux because administration costs on Windows are lower than on Linux." With several sentences like this one, Microsoft Windows would end up with a high page rank for the search query "open source operating system", even though Windows is not an open source operating system. More importantly, while the result is to a certain extent dishonest, it was not done in an underhanded way: Microsoft was making a legitimate claim about the administration costs of Windows with respect to the alternative open source operating system. Indeed, Windows administrators tend to be cheaper than Linux administrators. There is no easy way to fix this problem without making specific exceptions in the search engine algorithm itself, which, to a certain extent, defeats the purpose of the web crawler.

Google takes a different approach to search. Instead of indexing pages based on the keywords specified by the page administrator, page A is indexed based on what other pages say about page A. If many pages say that Red Hat Linux is the open source operating system of choice, then Google will index Red Hat Linux with a high rank for the search query "open source operating system". This approach to search is similar to a credit rating system and is inherently democratic. A person's credit is based on what other companies have to say about that individual's credit. If a set of companies trust a given individual and those companies in turn trust one another, then it is assumed that the individual is trustworthy. Of course this does not necessarily mean that the individual is in fact trustworthy: if the companies themselves are not trustworthy but trust one another, then the individual may not be trustworthy even though the companies vouch for him. This particular flaw in the system has been exploited in the past. Try searching for "failure" on Google. The first result is "Biography of President George W. Bush". The keyword failure does not appear anywhere on the page or in the page source. How is this possible? The attack is known as a Google bomb. From Wikipedia:

“Due to the way that Google's PageRank algorithm works, a page will be ranked higher if the sites that link to that page all use consistent anchor text. A Google bomb is created if a large number of sites link to the page in this manner.”

The difference between traditional search engines and the Google search engine in terms of how page ranking is performed is significant. A page's rank in the Google search engine can only be affected by the set of individuals who reference that page. Thus, it is significantly harder to change a page's rank in Google than in a traditional search engine. The algorithm that Google search uses for page indexing is what gives the technology its emergent behaviour. Specifically, because a page's rank is based on what other pages have to say about it, there always exists some correlation between any two pages. This fact has been exploited in Google Sets, where we find some striking characteristics reminiscent of artificial intelligence.
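
To make the link-based ranking idea concrete, the following is a minimal sketch of the power-iteration form of PageRank over a toy link graph. The toy graph, the damping factor of 0.85, and the function names are illustrative assumptions for this paper, not Google's actual implementation.

# Minimal PageRank sketch: a page's score depends on the scores of the
# pages that link to it, not on its own keywords.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:            # dangling page: spread its rank evenly
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy link graph: redhat.com is linked to by several pages, so it ends up
# with the highest rank regardless of its own keyword choices.
toy_web = {
    "redhat.com": ["linux.org"],
    "linux.org": ["redhat.com"],
    "blog-a": ["redhat.com", "linux.org"],
    "blog-b": ["redhat.com"],
    "microsoft.com": [],
}
print(sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]))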

Google Sets is a computer-science curiosity in which the user inputs a set of common items and Google Sets attempts to grow the set with other related items. For instance, if the input set was {Orange, Apple}, Google Sets would grow the original set by appending other fruits such as {Banana, Strawberry, Cherry}. Google Sets works by using the page rank system to infer a context. Searching for Orange in Google search finds pages related to the keyword Orange based on references from other pages. If we then search for Apple, we find pages related to the keyword Apple. If we now look at the similarities between these two result sets, we will find some pages that refer to both. The keywords held in common among the overlapping pages have a high probability of being items that could extend the original set. If the original input was {Orange, Red}, then the set would be grown with other colours instead of fruits. This ability to infer a context from instances is an example of emergent behaviour. The irreducible microscopic-level components are simply web pages that can be about any variety of subjects. By examining the vast amounts of seemingly chaotic web data (the macroscopic level), we can infer some order. It is important to note that this particular emergent behaviour depends on the existence of a fast and reliable search mechanism such as the one developed by Google.
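
As a rough illustration of the set-expansion idea (not Google's actual method), the sketch below grows a seed set by counting which other terms co-occur on the pages that mention all of the seed terms. The miniature page index is a made-up assumption.

from collections import Counter

# Tiny made-up index: page -> terms that appear on it.
page_terms = {
    "page1": {"orange", "apple", "banana", "fruit"},
    "page2": {"orange", "apple", "cherry", "fruit"},
    "page3": {"orange", "red", "blue", "colour"},
    "page4": {"apple", "strawberry", "fruit"},
}

def expand_set(seed, index, top_n=3):
    """Grow a seed set with terms that co-occur with every seed term."""
    # Keep only pages that mention all of the seed terms (the "overlap").
    overlap = [terms for terms in index.values() if seed <= terms]
    counts = Counter()
    for terms in overlap:
        counts.update(terms - seed)
    return [term for term, _ in counts.most_common(top_n)]

print(expand_set({"orange", "apple"}, page_terms))  # fruit-like terms
print(expand_set({"orange", "red"}, page_terms))    # colour-like terms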

P2P Applications

Of course, Google search is not the only internet technology that exhibits emergent behaviour. P2P applications also have some emergent characteristics in their ability to perform certain tasks such as file sharing. This particular ability is only possible when many P2P clients participate in the process.

There is a significant difference in terms of emergent behaviour between P2P applications and other technologies. In Google's case, the system is maintained by Google. Their database is private and they have exclusive access to the raw data. Because they control the system, they need not be concerned with issues such as the security or validity of the data within their database. Of course, they rely on the assumption that the majority of web pages are in fact trustworthy; this is an inherent assumption in the Google search algorithm. In a P2P system, by contrast, there is no central authority, so the validity and security of shared data must be established by the peers themselves.

The Semantic Web

The Semantic Web is an emergent phenomenon that will likely be one of the most important products of the Emergent Internet. As it stands, Google has a vast database covering a wide variety of information, ranging from academic material to the usage data of individual people. All of the data that Google has indexed has also been labelled. This "data about data", or metadata, is the basis by which Google Sets and Google search can infer context. It is indeed the product of the emergent behaviour of Google's search technology. However, despite the large database that Google maintains, there still exists a great deal of data that Google does not have access to. While it may be that Google's goal is to index the world, the fact remains that companies such as eBay have their own databases of data pertaining to individuals. There is no doubt that Google would like to own this data, but due to privacy issues, no one company could possibly be allowed to keep such a large amount of personal data about people. Nevertheless, the data stored on eBay could potentially be very valuable, and in some instances it may be beneficial for the individual to share some of this data with another company.

In order for the data to be meaningful between companies, a standard must exist by which the data can be automatically parsed and interpreted. For instance, eBay maintains a database of goods and services that individuals are trying to sell. Items for sale can be physical objects, such as books or videos, or can be services. If metadata is stored about every item in the database, then information can be acquired about the item from other sources and, more importantly, it becomes possible to use this information in emergent technologies since the context has already been inferred beforehand.
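
As an illustration of what such machine-readable item metadata might look like, the small record below is exchanged as a structured document that any party can parse automatically. The schema and field names are invented for this example and are not an eBay or Semantic Web standard.

import json

# Hypothetical item record with explicit metadata; the field names are
# illustrative, not part of any existing standard.
item = {
    "id": "auction-1842",
    "type": "physical/book",
    "title": "Advanced Network Architectures",
    "seller": {"id": "user-77", "trust_rating": 0.94},
    "price": {"amount": 25.00, "currency": "CAD"},
    "keywords": ["networking", "P2P", "overlay networks"],
}

# Because the structure is standardized, another company's software can
# interpret the record without human intervention.
record = json.dumps(item)
parsed = json.loads(record)
print(parsed["type"], parsed["price"]["currency"])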

Motivation for Security: Authentication and Authorization

eBay maintains a database of individuals and a rating for each individual. The rating is based on the feedback of other users and is used by members of the eBay community to determine how trustworthy another eBayer is, i.e. will this individual try to steal my money or will I get the item I paid for? In order to quickly assess the trustworthiness of an individual, one should look at the positive and negative feedback of the user as well as the positive feedback of the individuals who left feedback for that user. This trust system is highly similar to the way that Google search works, and it resembles a credit rating system except that instead of banks and credit companies assessing an individual, it is other peers that maintain the credit system. Again, the underlying assumption is that most peers are trustworthy.

Within the context of the Emergent Internet, security in terms of authentication and authorization will become an issue. Currently, companies do not share or divulge their information about an individual, but this information might be useful and beneficial to the individual if it were shared with other companies, perhaps with some discretion. If it becomes possible to share this information with others, it will also become necessary to verify the identity of the individual in a secure manner. Some companies will be responsible for some individuals, but there is also the issue of whether or not these companies are trustworthy. The concept of credit, as discussed before, will be an integral part of trust. If several companies {C} trust company A's assertion as to the identity of user X, and company B trusts the companies in {C}, then from B's perspective company A is trustworthy and user X may be authenticated.

Basic Security Components and Concepts

Today, there exist several security components, each of which has properties that can be used to implement a variety of technologies. In this section, we describe the functionality of the most important of them and how they may be used in the Emergent Internet.

Secret Key Cryptography

Secret key cryptography is the simplest cryptographic scheme. Given the secret key, a block of plaintext can be encoded using the key to produce the ciphertext. The encoding process typically applies many rounds of key-dependent substitutions and permutations of the bits, so that the ciphertext appears essentially random. The ciphertext is unintelligible and can only be decoded by someone who shares the secret key, by running the same procedure in reverse.

There are indeed many secret key cryptographic schemes. Modern schemes such as AES use keys of up to 256 bits and encrypt data in fixed-size blocks (128 bits in the case of AES). With keys of this length, an exhaustive search of the key space would take longer than the age of the universe. The biggest problem with secret key cryptography is that the secret key must be shared between all parties that wish to communicate.
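
As a quick illustration of the shared-key idea, the sketch below uses the third-party Python cryptography package (assuming it is installed); both parties must hold the same key, which is exactly the key-distribution problem noted above.

# Requires the third-party "cryptography" package (pip install cryptography).
from cryptography.fernet import Fernet

# Both parties must somehow share this secret key in advance.
shared_key = Fernet.generate_key()

sender = Fernet(shared_key)
ciphertext = sender.encrypt(b"meet at the overlay rendezvous node")

# Only a holder of the same key can recover the plaintext.
receiver = Fernet(shared_key)
plaintext = receiver.decrypt(ciphertext)
print(plaintext.decode())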

Public Key Cryptography

The RSA public key cryptographic scheme is a remarkable encryption scheme. It is based on the fact that it is very difficult to factor large integers. In this scheme, there are two keys, the public key and the private key. A message encoded using the public key can only be decoded using the private key. This property makes it possible to send a confidential message to a particular user without first sharing a secret key. Suppose Bob wants to send a message to Alice. Bob simply encodes his message using Alice's public key and sends it to her. When Alice receives Bob's message, she decodes it using her private key to obtain the original message. The scheme also makes it possible for Alice to prove her identity to Bob, as long as Bob already knows Alice's public key. Bob sends Alice a random message; Alice encrypts it using her private key and returns the ciphertext; Bob then decrypts the ciphertext using Alice's public key. If the decoded message equals the random message that Bob generated, he can be sure that he is talking to Alice, since only Alice could have encoded the message in the first place.
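
To make the mechanics concrete, here is a toy RSA example with deliberately tiny primes (completely insecure, for illustration only). The numbers and the challenge-response at the end follow the textbook scheme rather than any production library.

import random

# Toy RSA with tiny primes; real keys use primes hundreds of digits long.
p, q = 61, 53
n = p * q                      # public modulus
phi = (p - 1) * (q - 1)        # Euler's totient of n
e = 17                         # public exponent, coprime with phi
d = pow(e, -1, phi)            # private exponent: e * d = 1 (mod phi)

# Confidentiality: Bob encrypts with Alice's public key (e, n);
# only Alice's private key d can recover the message.
message = 42
ciphertext = pow(message, e, n)
recovered = pow(ciphertext, d, n)
assert recovered == message

# Identity check: Bob sends a random challenge; Alice "encrypts" it with
# her private key; Bob verifies the response with her public key.
challenge = random.randrange(2, n)
response = pow(challenge, d, n)          # computed by Alice
assert pow(response, e, n) == challenge  # verified by Bob
print("ciphertext:", ciphertext, "recovered:", recovered)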

Using some of the characteristics of public key cryptography, it is possible to create an entirely different encryption scheme which has the desirable property of commutative encryption and decryption. While this scheme uses many of the properties of public key cryptography, the two schemes are not compatible. Commutative encryption has the following mathematical property: Ea(Eb(m)) = Eb(Ea(m))

This property can be visualized as a chest: each encryption key is a lock that can be applied to the latch. As long as any lock remains on the latch, the chest cannot be opened; however, the locks can be applied and removed in any order. In order for a peer to be able to view the message m, every other peer must first remove their lock from the chest. This property is of particular significance in zero knowledge proofs. A zero knowledge proof can be used to prove to an arbitrary peer that an individual possesses some knowledge without teaching that peer anything new. Specifically, with a zero knowledge proof, Alice can prove to Bob that she knows some secret, without telling Bob the secret or giving him any information that he did not already know.
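
The sketch below illustrates one well-known way to obtain commutativity: exponentiation modulo a shared prime, as used in the SRA mental-poker construction. It is an illustrative stand-in for a commutative scheme, not the specific modified RSA proposed in this paper, and the prime and exponents are toy values.

# Commutative encryption sketch: E_k(m) = m^k mod p, with p a shared prime.
# Because (m^a)^b = (m^b)^a (mod p), locks can be added and removed in any order.
p = 2**61 - 1                      # shared prime (toy choice)

def keypair(e):
    """Encryption exponent e must be coprime with p - 1."""
    d = pow(e, -1, p - 1)          # matching decryption exponent
    return e, d

enc = lambda m, e: pow(m, e, p)
dec = lambda c, d: pow(c, d, p)

ea, da = keypair(65537)            # Alice's lock
eb, db = keypair(257)              # Bob's lock

m = 123456789
double_locked = enc(enc(m, ea), eb)          # Ea then Eb
also_double_locked = enc(enc(m, eb), ea)     # Eb then Ea
assert double_locked == also_double_locked   # commutativity

# The locks can likewise be removed in either order.
assert dec(dec(double_locked, da), db) == m
print("recovered:", dec(dec(double_locked, db), da))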

Hash Functions

A hash function is a function that maps arbitrary inputs onto a fixed-length bit space. The function is designed to spread inputs over this space in a statistically uniform way, so that the probability that any two inputs map to the same value is very low. There have been numerous proposed hash functions. One of the most popular was MD5, which has since been superseded by the SHA-1 and SHA-2 families of hash functions. SHA-1 produces a 160-bit digest; the probability that two arbitrary inputs collide is on the order of 2^-160, and finding any colliding pair by brute force requires on the order of 2^80 hash evaluations.
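
A quick demonstration using Python's standard hashlib module: any change to the input produces a completely different fixed-length digest.

import hashlib

# SHA-1 always yields a 160-bit (40 hex character) digest,
# regardless of the input length.
print(hashlib.sha1(b"emergent internet").hexdigest())
print(hashlib.sha1(b"emergent internet!").hexdigest())  # tiny change, new digest

# SHA-256 (from the SHA-2 family) yields a 256-bit digest.
print(hashlib.sha256(b"emergent internet").hexdigest())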

Authentication and Authorization

Given the set of security tools described in the last section, it is now possible to introduce authentication. The process of secure authentication involves a client proving knowledge of a password to a server, and the server verifying that the password is correct. This can be done in a very secure way such that the password itself is never actually sent over the channel. Authentication is effectively the process by which one system establishes the identity of a user, and it may be done in a variety of ways. If authentication is password based, then the system determines your identity based on what you know; if someone else also knows your password, the system will incorrectly authenticate that user. A system may also only allow a user to log in if they are at a specific terminal or at a computer within the company building. In this case, authentication is based not on what you know, but on where you are. Some more recent forms of authentication include voice prints and fingerprints. These types of authentication effectively authenticate who you are, rather than what you know or where you are.
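
One standard way to check a password without sending it over the channel is an HMAC-based challenge-response. The sketch below is a minimal illustration of that idea; the function names and the pre-shared password setup are assumptions for the example.

import hmac, hashlib, secrets

def respond(password: str, challenge: bytes) -> bytes:
    """Client side: prove knowledge of the password for this challenge."""
    return hmac.new(password.encode(), challenge, hashlib.sha256).digest()

# Server side: it already holds (a copy of) the user's password.
stored_password = "correct horse battery staple"

challenge = secrets.token_bytes(16)        # fresh random challenge per login
client_response = respond("correct horse battery staple", challenge)

expected = hmac.new(stored_password.encode(), challenge, hashlib.sha256).digest()
authenticated = hmac.compare_digest(client_response, expected)
print("authenticated:", authenticated)     # the password never crossed the channel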

In the context of the Emergent Internet and P2P networks, authentication cannot be based on where you are, since nodes in the Emergent Internet may be mobile. Since the devices that access this network are diverse, some may not have the hardware necessary to authenticate a user based on a voice print or fingerprint. Thus, the prevalent method of authentication will likely remain password based. There are of course advantages and disadvantages to password-based authentication. One of the major disadvantages is that the user has to choose his or her own password. This can be a problem for security, since easy-to-remember passwords tend to be easy-to-guess passwords as well.

Once a user has been authenticated, the system must know what that user is allowed to do. This is known as authorization. Since the Emergent Internet has a large variety of users accessing a wide range of nodes, authorization lists are not tractable. One proposed mechanism for authorization is based on a P2P mechanism known as a transitive trust system.

Transitive Trust Systems

In these systems, a user is authorized to perform a given action at a node if his trust level in a given community is greater than some threshold. The trust level is computed by a locally evaluated formula that combines the trust levels other nodes within the community have established for that user. Essentially, user A is trustworthy if nodes 1 through N are trustworthy and nodes 1 through N trust A.

The concept of transitive trust has been used before. The observant reader will notice that traditional search engines are a form of transitive trust with a maximum transitive hop count of 1, while Google search is a form of transitive trust with a maximum transitive hop count of 2.

Within the context of P2P communities, when new nodes join a P2P network, they are assigned an initial trust value of zero. As a node participates in activities in the P2P community, other nodes cache a trust value reflecting the exchanges that occurred during that node's interactions with them. Nodes that have not encountered this node in the past can request its trust profile from the community. The trust profile can then be consolidated over a number of transitive hops to determine the final local trust value of the node.
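
The following sketch shows one plausible way a node might consolidate trust over transitive hops, using a depth-limited walk with per-hop decay. The decay factor, hop limit, threshold, and the small trust graph are illustrative assumptions rather than a prescribed formula.

# direct_trust[a][b]: how much node a trusts node b from its own interactions (0..1).
direct_trust = {
    "me": {"n1": 0.9, "n2": 0.8},
    "n1": {"target": 0.7, "n3": 0.6},
    "n2": {"target": 0.5},
    "n3": {"target": 0.9},
}

def transitive_trust(source, target, trust, max_hops=2, decay=0.8):
    """Consolidate trust in `target` over paths of at most `max_hops` extra hops."""
    best = trust.get(source, {}).get(target, 0.0)
    if max_hops == 0:
        return best
    for neighbour, weight in trust.get(source, {}).items():
        if neighbour == target:
            continue
        via = transitive_trust(neighbour, target, trust, max_hops - 1, decay)
        best = max(best, decay * weight * via)
    return best

score = transitive_trust("me", "target", direct_trust)
print("consolidated trust:", round(score, 3))
authorized = score >= 0.4          # example threshold for some community action
print("authorized:", authorized)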

Proposed Architecture

The Emergent Internet will be a global community in which nodes must interoperate to provide contextual services to the user. This community will consist of corporations and P2P communities. At the highest level, trust will be required between corporations and P2P communities; this inter-community trust will likely be based on a transitive trust system. The transitive trust service will be provided as a standardized web service so that any node can interoperate with the system.

Corporations will have their own security systems in use internally. These trust systems need not be accessible remotely. This will allow existing systems to continue to interoperate in the next generation of the Emergent Internet.

Within P2P communities, security will be provided using a cross-layer approach that leverages several P2P technologies. Token-based computing and load balancing will be services provided by the community to support security functions.

Transitive trust and trust levels will be used to provide authorization services within the P2P community. In this particular system, new nodes will be given an initial trust value. Other nodes will cache the trust value of the nodes they associate with during P2P interactions. The local trust value for a node will be calculated by consolidating trust over a number of transitive hops.

Some P2P networks will want to provide anonymous access to the community. To support this, the community's access rights will include an access profile for nodes with a trust level of zero. Nodes that wish to remain anonymous will assume a trust level of zero within the community; this trust level is taken as a standard trust value that does not need to be verified. Services provided by the community that allow a trust value of zero will be granted to such nodes.

Distributed storage will be used to store access rights and certificates within the community. In order to do this, trusted nodes will form a structured P2P overlay such as Chord or Tapestry that leverages a distributed hash table to store the security information and the access lists. Since only trusted nodes will be allowed to take part in the distributed storage system, there is a strong coupling between distributed storage and the transitive trust system; together they form a cross-layer challenge.
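
As a rough sketch of how a certificate or access list could be placed on such an overlay, the code below hashes keys and node identifiers onto a ring and assigns each key to its successor node, in the spirit of Chord's consistent hashing. The node names and the reduced ring size are illustrative assumptions.

import hashlib
from bisect import bisect_right

RING_BITS = 32                       # toy identifier space (Chord uses 160 bits)

def ring_id(name: str) -> int:
    """Map a node name or storage key onto the identifier ring."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)

# Only trusted nodes participate in the storage overlay.
trusted_nodes = ["nodeA", "nodeB", "nodeC", "nodeD"]
ring = sorted((ring_id(n), n) for n in trusted_nodes)
positions = [pos for pos, _ in ring]

def responsible_node(key: str) -> str:
    """A key is stored on the first node clockwise from its ring position."""
    idx = bisect_right(positions, ring_id(key)) % len(ring)
    return ring[idx][1]

print(responsible_node("cert:user-X"))
print(responsible_node("acl:community-music-sharing"))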

Finally, the P2P community will implement a modified version of RSA that allows for commutative encryption. Such an encryption scheme can leverage certificates according to the community profile. More importantly, the security system will be completely distributed, which will make it possible to exchange information securely with other nodes, with the option of zero knowledge proofs. This will serve as the basis for more complex security that is adaptable and flexible to the applications of the community.