Superlink-online is a web portal that allows geneticists to perform parametric genetic linkage analysis.

Genetics

This type of analysis narrows down the genomic region(s) containing a mutation responsible for a genetic disease. It is typically used for rare Mendelian diseases with a single affected gene, though the genes behind some more frequent diseases, such as cystic fibrosis and breast cancer, were also deciphered using this technique. The analysis is called parametric because it is based on a statistical model of inheritance. The fundamental biological phenomenon behind this model is recombination, the mixing of the genetic material of an individual's ancestors. The probability of recombination between two genetic locations (loci) depends on their physical genomic distance: the closer the loci, the fewer recombinations occur between them.

The input to the analysis comprises the pedigree structure with the affected individuals marked, samples of the genetic information of some of its members at many (up to 1M) known locations (called markers), and various model parameters, such as the disease frequency in the population. The model allows evaluating the probability of the input data under the hypothesis that the mutation lies at a certain genomic location relative to the map of markers. The logarithm of the ratio between this probability and the probability of the data under the null hypothesis is called the LOD (logarithm of odds) score. A higher LOD score is considered an indication that the mutation lies in the proximity of the specified location.

Bayesian networks

The model is expressed as a Bayesian (probabilistic) network. Bayesian networks are used in various domains to model complex multivariate probability distributions. The joint distribution over all the variables cannot be represented explicitly because of its size. Instead, probabilistic queries are computed using the factorized representation of the joint distribution as a product of the conditional probabilities of the respective variables.
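To make the factorized representation concrete, here is a minimal Python sketch. It assumes a hypothetical three-variable chain A -> B -> C with made-up binary probability tables; none of the names or numbers come from Superlink-online. The joint distribution is never stored as a table; each query sums products of the conditional probabilities.

```python
from itertools import product

# Hypothetical toy network A -> B -> C with binary variables.
# All numbers are illustrative assumptions, not values from a genetic model.
p_a = {0: 0.7, 1: 0.3}                                      # P(A)
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}    # P(B | A), indexed [a][b]
p_c_given_b = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}    # P(C | B), indexed [b][c]


def joint(a, b, c):
    """Joint probability as a product of the conditional factors."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]


def probability(observed):
    """Sum the factorized joint over all assignments that agree with `observed`.

    `observed` maps variable names ("A", "B", "C") to fixed values.
    Brute-force enumeration is exponential in the number of variables,
    so this only illustrates the definition of the query.
    """
    total = 0.0
    for a, b, c in product((0, 1), repeat=3):
        assignment = {"A": a, "B": b, "C": c}
        if all(assignment[name] == value for name, value in observed.items()):
            total += joint(a, b, c)
    return total


if __name__ == "__main__":
    print(probability({"C": 1}))
```

Enumerating every assignment is feasible only for toy networks like this one. Practical inference algorithms exploit the factorization to avoid the full enumeration, and the sketch does not attempt to reproduce the method used by Superlink-online.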
The particular query used in genetic linkage analysis is the "probability of evidence": given the general model, compute the probability of an assignment of specific values to some of the variables.

Why parallelism

Bayesian networks are used in various domains, but the distinguishing feature of the networks that model genetic linkage analysis is their huge size. Furthermore, computing the probability of evidence in these networks may require years of CPU time. Hence, we parallelized the analysis by breaking the problem into multiple independent parts and running them in parallel on thousands of computers around the world. More details on the parallelization are available in this paper.

Opportunistic computing

Because the parallelization requires no synchronization between the participants, it enables the use of non-dedicated, opportunistic computing systems (grids), where running jobs can be preempted and executing computers may fail. A failed job is simply restarted from the beginning, and other running jobs are not affected by the failure. This approach is also very scalable, so many CPUs can be used. We managed to gain access to several grids and clusters around the world, including those at the Technion, EGEE, OSG, and the UW Madison Condor pool. We also built our own community grid called superlink@technion, which utilizes the power of the desktop machines of thousands of volunteers in 105 countries.

Superlink-online portal

Using distributed computing is sometimes tricky, and to make the system useful all the complexity of the underlying infrastructure must be hidden. Superlink-online completely hides the internals, allowing geneticists to submit their data via a simple web interface. Since 2006 Superlink-online has been used by hundreds of geneticists from leading research institutions. Some important genetic mutations were identified using the Superlink-online system.
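The restart policy described under "Opportunistic computing" can be illustrated with a small Python sketch. It is a sequential simulation, not the actual Superlink-online scheduler: the function names, the simulated failure, and the assumption that partial results are combined by summation are all hypothetical.

```python
def run_with_restarts(tasks, worker, max_attempts=5):
    """Run independent tasks, restarting a failed task from scratch.

    `tasks` maps a task id to its input. A failure of one task never
    touches the results of the others, because tasks share no state.
    """
    results = {}
    attempts = {task_id: 0 for task_id in tasks}
    pending = list(tasks)
    while pending:
        task_id = pending.pop(0)
        attempts[task_id] += 1
        try:
            results[task_id] = worker(task_id, tasks[task_id], attempts[task_id])
        except RuntimeError:
            if attempts[task_id] >= max_attempts:
                raise
            pending.append(task_id)  # requeue: the task restarts from the beginning
    return results, attempts


def flaky_partial_sum(task_id, chunk, attempt):
    """Hypothetical worker: task 1 is 'preempted' on its first attempt only."""
    if task_id == 1 and attempt == 1:
        raise RuntimeError("simulated preemption")
    return sum(chunk)


# Split one computation into two independent parts (assumed combinable by summation).
values = list(range(1, 11))
tasks = {i: values[i * 5:(i + 1) * 5] for i in range(2)}
results, attempts = run_with_restarts(tasks, flaky_partial_sum)
total = sum(results.values())

if __name__ == "__main__":
    print(total, attempts)
```

Because no task depends on another, the only cost of a failure is the repeated work of the failed task itself, which is what makes unreliable, non-dedicated machines usable.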
The more recent article was on iSGTW, echoed by HPCwire and other HPC blogs.

Under the hood

Superlink-online is a typical representative of domain-specific web portals. Under the hood it uses many opportunistic environments, some of which are quite unreliable. There are two enabling technologies: