The field of network analysis is at the same time old and new: it is old because many different disciplines were already working on questions in this field and had supplied their methods; and it is new because the broadening and also integrating perspective on networks of the last ten years required new methods and models to be designed.
Not all of them were regarded well. In general, the field has the tendency to put possible network data out of context, apply some network analytic method to it, and interpret the result of it back in its context. Of course, it is the great advantage of network analysis, that it frames everything as a network and that any method developed for any network analytic question can be applied to any other network. On the other hand we will show in the following that network analytic methods and models are not context-free. They come with many implicit assumptions that are only very seldomly made explicit. Based on the scheme of network analytic projects proposed here, we will discuss different problems in each phase of the scheme and either provide a solution or state open questions in this realm.
- Pose the Question: Defining a question to be
answered within a complex system.
- Build the Network: Reducing the complex system at
hand to a network by defining a set of actors and a relationship
between these actors.
- Analyse the Structure:
- Structure on the level of actors: centrality,
clustering coefficient, degree
- Structure on the level of the whole graph:
spectral properties, aggregated actor measures (degree distribution,
- Structure on the level of groups of actors:
- Build a random graph model: Finding structures that
are significantly different from the proposed null model.
- Build a network model: Building a model that models
the newly found structures under the constraints given by the complex
- Build a process model: Defining a process that is of
- Simulate the process on the network model:
Understanding the relationship between the structure of a network and
the process on top of it.
Posing a Question and Building the NetworkWe need to differentiate at least two different approaches in network analysis: one starts with an interesting question and the other one starts with an interesting data set. The latter requires ingenuity and luck: by applying different analysis methods, guided by temporary results, one hopes to find an unusal, unexpected structure within the network. We will not discuss this approach here, since there is no real question stated before the analysis is conducted. The first approach is very challenging. It first requires to pose a specific question, then to find a network with which this problem can be solved.
Let's assume we have the following question to answer: given a protein, can we predicts its likely biological activity by looking at the proteins with which it interacts?
Certainly we know that proteins do interact to excert some biological function. However, the reverse is not necessarily true: not every two proteins that interact must follow the same biological role. For example, a protein could be an important signal transducer in cell proliferation and then transfer the signal to proteins responsible for gene expression and proteins involved in mitosis. But the main problem is where to get the data from. There are basically two types of protein interaction data: careful specific experiments that test pairwise whether two proteins are interacting or not in a specific cell. These experiments might be very fine granular and differentiate between interactions concerning a specific cell location (e.g., nucleus), certain cell cyclus phases and other conditions (hormons, nutrition, environmental conditions). The main data provided stems from very bold high-throughput essays. In this essay, a yeast cell carries two additional genes that will express the two proteins of interest. These are genetically altered such that they carry an additional piece of protein each. Whenever the two proteins of interest come close enough to each other, the two added pieces will also be in vicinity. They are designed such that in this case, they will emit light under certain conditions. With a simple light detector, it is then possible to understand which yeast colony emits light and which do not. These can be picked and analyse to see which to genes were included into the genome.
As beautifully designed this assay is, it has some caveats. Can you spot them?
- The proteins are overexpressed in yeast. They might be expressed in different quantities in the original organism;
- They might never be expressed at the same time;
- They might not normally be at the same place in the cell;
- The shape of the protein might be changed by the piece of protein that he carries and thus induce new or hinder normal interactions.
The problem is of course that on a large scale, we only have the second kind of data since the first kind of data is way to time consuming and costly to do for any pair of proteins. So, in essence we would like to have a high-quality data set from which we would construct a network. A simple clustering of this network --- possible with an overlapping clustering algorithm since proteins can excert different biological roles --- would then answer our question whether highly interacting proteins excert the same role and whether all groups of proteins with the same biological role would also highly interact with each other. Unfortunately, we only have the low-quality data. Thus, one open question is surely: how strongly do the networks need to be correlated such that we can analyse the low-quality one to state any result on the high-quality network? Which clustering algorithms are stable enough to deal with a lot of noise? What kind of noise do we expect between the low- and high-quality version of the network?
Similar difficult questions regard data that cannot be obtained because of privacy issues or because they are proprietary. How can they be simulated? Is it enough to show something on simple network models? Can, e.g., a freely available social network like the followers of somebody on Twitter model another social network, e.g, the email contact network?
A third aspect regards the handling of bipartite graphs.
A fourth aspect regards handling networks that are derived from pairwise similarity data. As Butts [Butts2009] showed very impressively, selecting the correct threshold value is a major problem and will influence all subsequent analysis steps.
Sometimes the concepts are also not yet well enough defined to really understand which network to look at. We are currently collaborating on understanding knowledge building in wikipedia. But of course, knowledge building mainly happens in the head of a reader or a writer. How can we map the idea that new terms or a new connection between terms emerges, stabilizes and starts to live a life of its own? Do we need to look at the interconnections between the writers? The articles? The words? Or all at once?
AnalysisCentralities: we have introduced a dozens different centrality indices. What is the single best one? As seen by the article by Borgatti [Borgatti2005] it mainly depends on the process we are interested in. Depending on the process we are interested in, we need to choose the according graph. So, as we see here, not only the question at hand determines the needed graph, but also the analysis method.
Motifs: see below
Using a Random Graph Model as Null-Hypothesis
As we said earlier, random graph models are used to find those structures in a real-world network that are not explainable by the known structures of this network. E.g.: at the start of the new field of network science, we only 'knew' that networks have a certain number of nodes and edges. It was then shown that the clustering coefficient must emerge from a process that is not explainable by a random connecting process, similarly to the skewed degree distributions that are as well not explainable by random connecting processes. From that one, we knew that we had to take into account the degree distribution into any decent random graph model
So, in essence we want to put as much information into a random graph model as necessary and then compare the real-world network with it to see which parts of it are still unexplained. Another view is that we take the real-world network and perturb that part of its structure of which we think it is non-random. If the perturbation maintaines the structural feature, we could not reject the hypothesis that it is basically random.
E.g.: say we hypothesized that the diameter of real-world networks is non-random. We would then maintain the number of nodes and edges, but perturb the edges by rewiring them. We do this sufficiently often (say 10,000) times and average over the resulting diameter. As we know now, the resulting diameter would still be very similar to the one we started with and thus we could not reject the hypothesis that this feature results from a random connecting process.
However, sometimes it is not so easy to see how the best random model should look like. We will give two examples:
- The first concerns the analysis of motifs in biological networks. As we have sketched above, the frequency of any motif is compared to the frequency of that same motif in the according random network (sampled and averaged, of course). If the frequency is significantly higher or lower, it is assumed that nature (architect) put effort to create this motif (avoid it) and that such the motif has a function (would be malign). In a very interesting comment on [Milo2002] and [Milo2004], Artzy-Raderup et al. showed that the chosen random graph model is indeed crucial [Artzy2004]: they relied on a model in which nodes are placed on a 2D grid. An edge is the more likely between two nodes the closer they are. This is considered a toy network model and it is not assumed that any real-world complex system really works according to it - however, it seems that the neural system of C. elegans shows some features of this. The point of Artzy-Raderup is that this toy-network would be considered highly non-random if tested by Milo et al.'s method. In two variants of the preferential attachment model, one would overrepresent feed-forward loops and the other would underrepresent feed-forward loops. Basically they state the question of what is the best random model to test against.
Another interesting question is why the single-node charateristics should be fixed and not the motifs. Maybe a random model maintaining the exact number of motifs would result in a totally different single-node characteristic - or even in exactly the same? These questions show that it is a very difficult problem to decide to which random model a real-world network model must be compared.
- We have shown that assessing the significance of association rules is equally dependent on defining the correct random graph model.
Modeling a real-world networkA good network model should not only produce a graph with the observed structures but also be realistically implementable within the given complex system. For example, most network models are required to be local, i.e., a node is supposed to have only partial knowledge about the (evolving) network structure. This requirement is certainly necessary when social network structures are to be explained because any person has an overview over her local network, but certainly not over the whole social network on earth.
We will show two examples where the proposed network model (at first) did not fit into known structures of the complex system.
- The preferential attachment model was introduced to describe how a skewed degree distribution can evolve. In some systems, like the WWW, it is understandable that any website designer might be aware of other sites proportional to the number of links this site already gathered. In other systems, like protein-protein-interaction networks, it is not so clear how a preferential attachment can be implemented: at first we need to think about how a new protein is added to the system. Current theories believe that new proteins are mainly build by duplication processes of old, established proteins. At the beginning, the duplicated gene will have the same shape and function (and thus also the same interactions as its copy). However, there is less selection pressure on the duplicate of the gene and hence it can quite freely mutate and thus change it shape and biological function. Based on this change, new interaction with other proteins might emerge and other interactions might vanish. Based on these assumptions, Vazquez et al. introduced the following network model: starting with a small network, in each time step a single vertex is duplicated with all its edges. Subsequently, some of the edges are dropped with some probability p and other edges emerge with probability p'. What is now the probability that any (old) node gains an edge in timestep t? There are two processes by which a node can gain an edge: by probability p' (which is the same for all nodes) or because some of his neighbors is copied (and the edge to him is not dropped with probability 1-p). Of course, the probability that one of his neighbors is copied, is directly proportional to its degree. Thus, through the backdoor, preferential attachment sneaks in again [Vazquez2002]. The model of Vazquez et al. is elegant and uses the known features of the system under observation. Whereas the abstract preferential attachment model could not be explained in the evolution of protein-protein-interaction networks, the new, biological model shows the power of the more abstract model to which it can be basically reduced.
- A second model was not as lucky and to our knowledge nobody ever showed how a biological system could implement it. It was given by Ravasz et al. to show how hierarchical networks can evolve [Ravasz2000]. The model was designed to 'bring modularity [as in the sense of Ravasz et al.], the high degree of clustering and the scale-free topology under a single roof'. Their model starts with a small graph with some denoted center vertex, e.g., a five-node graph where four nodes are connected to a ring and the fifth node is connected to all others. In each step, the current graph is copied k times and one of them is the discerned center graph. All peripheral nodes of all other graphs are connected directly to the center node of the center graph. It is easy to see that the degree distribution is multiplied by k each time and that a new node emerges with k times the maximal degree of the graph in the time step before. Thus, the resulting degree distribution is scale-free. Also, we see a high clustering coefficient and a hierarchical organization of clustering coefficient since the center nodes are connecting otherwise unconnected graphs. The model was introduced in an article that concentrated on the analysis of metabolic networks. In this system, it is very difficult to assume that enzymes are copied k times in regular events. It is well known that most organisms don't cope with large copies of their genetic information - most of the times they are lethal. Thus, this model has to be rejected for modeling the evolution of metabolic networks - or it has to be shown how this effect comes into play indirectly as in the example above. Anyway, so far we know no better model for the evolution of metabolic networks, making it a worthwhile open question.