
Superlink-online

Superlink-online is a web portal that allows geneticists to perform parametric genetic linkage analysis.

Genetics
This type of analysis narrows down the genomic region(s) containing the mutation responsible for a genetic disease. It is typically used for rare Mendelian diseases caused by a single affected gene, though some more common diseases, such as cystic fibrosis and breast cancer, have also been deciphered using this technique. The analysis is called parametric because it is based on a statistical model of inheritance. The fundamental biological phenomenon behind this model is recombination, the mixing of the genetic material inherited from an individual's ancestors. The probability of recombination between two genetic locations (loci) depends on their physical genomic distance: the closer the loci, the fewer recombinations occur between them.

The input to the analysis comprises the pedigree structure with the affected individuals marked, samples of the genetic information of some pedigree members at many (up to 1M) known locations (called markers), and various model parameters, such as the disease frequency in the population. The model allows evaluation of the probability of the input data under the hypothesis that the mutation is at a certain genomic location relative to the map of markers. The ratio of this probability to the probability under the null hypothesis is usually reported as its base-10 logarithm, called the logarithm of odds (LOD) score. A higher LOD score is considered an indication that the mutation lies in the proximity of the specified location.
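To make the score concrete, here is a minimal Python sketch of the simplest textbook case: a two-point LOD score for phase-known meioses, where the likelihood depends only on the recombination fraction theta and the null hypothesis is no linkage (theta = 0.5). This is an illustration under those assumptions, not the multipoint model that Superlink-online evaluates; the function name and its parameters are hypothetical.

```python
import math


def two_point_lod(recombinants: int, meioses: int, theta: float) -> float:
    """LOD score for phase-known meioses at recombination fraction theta.

    Simplified textbook case (an assumption of this sketch): each meiosis is
    independently a recombinant with probability theta.
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    non_recombinants = meioses - recombinants
    # log10-likelihood under linkage at recombination fraction theta.
    log_l_theta = (recombinants * math.log10(theta)
                   + non_recombinants * math.log10(1.0 - theta))
    # log10-likelihood under the null hypothesis of no linkage (theta = 0.5).
    log_l_null = meioses * math.log10(0.5)
    return log_l_theta - log_l_null


if __name__ == "__main__":
    # One recombinant in ten meioses, evaluated at theta = 0.1.
    print(round(two_point_lod(1, 10, 0.1), 4))
```

Evaluating the function at theta = 0.5 always yields 0, because the hypothesis under test then coincides with the null hypothesis.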

Bayesian networks
The model is expressed as a Bayesian (probabilistic) network. Bayesian networks are used in various domains to model complex multivariate probability distribution functions. The joint function over all the variables cannot be represented explicitly because of its size. Rather, probabilistic queries are computed using the factorized representation of the joint function as a product of the conditional probabilities of the respective variables. The particular query used in genetic linkage analysis is "probability of evidence": given the general model, compute the probability that some of the variables take specific values.
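The following Python sketch illustrates a probability-of-evidence query on a toy network. The three-variable chain A -> B -> C and all of its probability tables are invented for illustration. The joint function is never stored; it is evaluated as a product of conditional probabilities and summed over every assignment consistent with the evidence. Brute-force enumeration like this is exponential in the number of variables, so it only works for toy networks.

```python
from itertools import product

# Hypothetical binary chain A -> B -> C; the numbers are made up.
P_A = {0: 0.7, 1: 0.3}
P_B_GIVEN_A = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}    # key: (b, a)
P_C_GIVEN_B = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.25, (1, 1): 0.75}  # key: (c, b)


def joint(a: int, b: int, c: int) -> float:
    """Factorized joint probability P(a) * P(b | a) * P(c | b)."""
    return P_A[a] * P_B_GIVEN_A[(b, a)] * P_C_GIVEN_B[(c, b)]


def probability_of_evidence(evidence: dict) -> float:
    """Sum the factorized joint over all assignments that match the evidence."""
    total = 0.0
    for a, b, c in product((0, 1), repeat=3):
        assignment = {"A": a, "B": b, "C": c}
        if all(assignment[name] == value for name, value in evidence.items()):
            total += joint(a, b, c)
    return total


if __name__ == "__main__":
    print(probability_of_evidence({"C": 1}))
```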

Why parallelism
Bayesian networks are used in various domains, but the distinctive feature of the networks that model genetic linkage analysis is their huge size. Computing the probability of evidence in these networks may require years of CPU time. Hence, we parallelized the analysis by breaking the problem into multiple independent parts and running them in parallel on thousands of computers around the world. More details on the parallelization are available in this paper.
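One way to obtain independent parts, sketched here in Python, is to fix the value of one variable in each task, compute the partial sum for that value, and add the partial results at the end. The toy network, the choice of the variable to split on, and the use of a local thread pool in place of remote computers are all assumptions of this sketch; the splitting scheme actually used by the system is described in the paper referenced above.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical binary chain A -> B -> C; the numbers are made up.
P_A = {0: 0.7, 1: 0.3}
P_B_GIVEN_A = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}    # key: (b, a)
P_C_GIVEN_B = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.25, (1, 1): 0.75}  # key: (c, b)


def partial_sum(a: int, evidence_c: int) -> float:
    """One independent task: A is fixed to a, C to the evidence, B is summed out."""
    return sum(
        P_A[a] * P_B_GIVEN_A[(b, a)] * P_C_GIVEN_B[(evidence_c, b)]
        for b in (0, 1)
    )


def parallel_probability_of_evidence(evidence_c: int, workers: int = 2) -> float:
    """Run one task per value of A and add the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda a: partial_sum(a, evidence_c), (0, 1))
        return sum(parts)


if __name__ == "__main__":
    print(parallel_probability_of_evidence(1))
```

The tasks share no state and never communicate, so they can run in any order and on any machine; only the final addition needs all of the results.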


Opportunistic computing
Parallelization that requires no synchronization between the participants enables the use of non-dedicated opportunistic computing systems (grids), where running jobs can be preempted or the executing computers may fail. Failed jobs are simply restarted from the beginning, and other running jobs are not affected by the failure. This approach is also very scalable, so many CPUs can be used. We managed to gain access to several grids and clusters around the world, including the Technion, EGEE, OSG, and the UW Madison Condor pool. We also built our own community grid called superlink@technion, which utilizes the power of desktop machines of thousands of volunteers in 105 countries.
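The restart policy can be sketched in a few lines of Python. The exception type, the simulated flaky task, and the attempt limit are hypothetical; the point is only that a preempted job keeps no partial state and is rerun from scratch, independently of every other job.

```python
class PreemptedError(Exception):
    """Raised by the simulated worker when a job is preempted or its host fails."""


def make_flaky_task(failures_before_success: int):
    """Build a task that fails a fixed number of times, then returns x * x."""
    state = {"calls": 0}

    def task(x: int) -> int:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise PreemptedError("worker lost")
        return x * x

    return task


def run_with_restarts(task, arg, max_attempts: int = 5):
    """Rerun the task from the beginning until it completes; return (result, attempts)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(arg), attempt
        except PreemptedError:
            continue  # nothing is checkpointed, so the next attempt starts over
    raise RuntimeError(f"job did not complete in {max_attempts} attempts")


if __name__ == "__main__":
    print(run_with_restarts(make_flaky_task(2), 7))
```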

Superlink-online portal
Using distributed computing is sometimes tricky, and for the system to be useful, all the complexity of the underlying infrastructure must be hidden. Superlink-online completely hides the internals, allowing geneticists to submit their data via a simple web interface.

Since 2006, Superlink-online has been used by hundreds of geneticists from leading research institutions. Some important genetic mutations were identified using the Superlink-online system.

There was some press coverage when the system was first released.
A more recent article appeared in iSGTW and was echoed by HPCwire and other HPC blogs.

Under the hood

Superlink-online is a typical representative of domain-specific web portals. Under the hood it uses many opportunistic environments, some of which are quite unreliable. There are two enabling technologies:
  1. Grid Execution Hierarchy, which maps shorter tasks to more reliable resources (the basic ideas are in this paper, but quite a few changes have been made since 2006)
  2. GridBot, the system that unifies resources from multiple grids/clusters and executes multiple Bags of Tasks (BOTs) on them. It implements various techniques to reduce the makespan of each BOT, such as policy-based matchmaking and replication. GridBot essentially creates a virtual cluster from all grids by establishing an overlay of slightly modified BOINC clients.
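The idea of mapping shorter tasks to more reliable resources can be sketched in Python as a lookup over tiers ordered by reliability. The tier names and runtime limits below are invented for illustration, and the real Grid Execution Hierarchy may use different criteria; this is only a minimal sketch of the mapping rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_runtime_s: float  # longest estimated task this tier accepts


# Ordered from most to least reliable; names and limits are hypothetical.
TIERS = (
    Tier("dedicated-cluster", 60.0),
    Tier("campus-pool", 3600.0),
    Tier("volunteer-grid", float("inf")),
)


def assign_tier(estimated_runtime_s: float) -> str:
    """Return the most reliable tier whose runtime limit covers the task."""
    if estimated_runtime_s < 0:
        raise ValueError("estimated runtime must be non-negative")
    for tier in TIERS:
        if estimated_runtime_s <= tier.max_runtime_s:
            return tier.name
    raise ValueError("no tier accepts this task")


if __name__ == "__main__":
    for runtime in (10, 600, 1_000_000):
        print(runtime, assign_tier(runtime))
```

Short tasks finish quickly on the scarce reliable machines, while long tasks go to the large unreliable pools, where an occasional restart costs little relative to the capacity gained.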