
S/HIC classifies each window as a hard sweep (blue), linked to a hard sweep (purple), a soft sweep (red), linked to a soft sweep (orange), or neutral (gray). The classifier does this by examining the values of various summary statistics in 11 different windows in order to infer the mode of evolution in the central window (the horizontal blue, purple, red, orange, and gray brackets). Regions that are centered on a hard (soft) selective sweep are defined as hard (soft). Regions that are not centered on a selective sweep but whose diversity is impacted by a nearby hard (soft) selective sweep are defined as hard-linked (soft-linked). Remaining windows are defined as neutral. S/HIC is trained on simulated examples of these five classes in order to distinguish selective sweeps from linked and neutral regions in population genomic data.
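The five-class labeling rule described above can be sketched as a small function. This is only an illustration of the class definitions as they apply to simulated regions, not the classifier itself: the function name, the 0-based subwindow indexing, and the string labels are assumptions made for this example.

```python
from typing import Optional

N_SUBWINDOWS = 11
CENTRAL = N_SUBWINDOWS // 2  # 0-based index 5: the sixth of 11 subwindows


def label_window(sweep_subwindow: Optional[int], sweep_type: Optional[str]) -> str:
    """Return the class label of a simulated region (hypothetical helper).

    sweep_subwindow: 0-based index of the subwindow containing the sweep,
                     or None when the region evolved neutrally.
    sweep_type:      "hard" or "soft" (ignored for neutral regions).
    """
    if sweep_subwindow is None:
        return "neutral"
    if sweep_type not in ("hard", "soft"):
        raise ValueError("sweep_type must be 'hard' or 'soft'")
    if not 0 <= sweep_subwindow < N_SUBWINDOWS:
        raise ValueError("sweep_subwindow out of range")
    # A sweep in the central subwindow defines the hard/soft classes;
    # a sweep in any flanking subwindow makes the central one "linked".
    if sweep_subwindow == CENTRAL:
        return sweep_type
    return f"{sweep_type}-linked"
```

For instance, a hard sweep placed in the leftmost subwindow yields the label `hard-linked` for the central subwindow.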

We simulated data for training and testing of our classifier using our coalescent simulator, discoal_multipop ( -lab/discoal_multipop). As discussed in the Results, we simulated training sets with different demographic histories (S1 Table), and, for positively selected training examples, different ranges of population-scaled selection coefficients (2Ns, where s is the selective advantage and N is the population size). For each combination of demographic history and range of selection coefficients, we simulated large chromosomal windows that we later subdivided into 11 adjacent and equally sized subwindows. We then simulated training examples with a hard selective sweep whose selection coefficient was drawn uniformly from the specified range, U(low, high). We generated 11,000 sweeps: 1000 where the sweep occurred in the center of the leftmost of the 11 subwindows, 1000 where the sweep occurred in the second subwindow, and so on. We repeated this process for soft sweeps at each location; these simulations had an additional parameter, the derived allele frequency, f, at which the mutation switches from evolving under drift to sweeping to fixation, which we drew from U(0.05, 0.2), U(2/2N, 0.05), or U(2/2N, 0.2) as described in the Results. For our equilibrium demography scenario, we drew the fixation time of the selective sweep from U(0, 0.2)N generations ago, while for non-equilibrium demography the sweeps completed more recently (see below). We also simulated 1000 neutrally evolving regions. Unless otherwise noted, for each simulation the sample size was set to 100 chromosomes.
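As a rough illustration of the sampling design for the equilibrium scenario, the sketch below draws the per-simulation parameters only; it does not invoke discoal_multipop, whose command-line options are not given here. The `SweepParams` container, the function name, and the field names are hypothetical, and fixation times are stored in units of N generations.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SweepParams:
    subwindow: int              # 0-based subwindow holding the sweep
    two_Ns: float               # population-scaled selection coefficient
    fix_time: float             # fixation time, in units of N generations ago
    init_freq: Optional[float]  # f for soft sweeps, None for hard sweeps


def draw_sweep_params(
    rng: random.Random,
    coeff_range: Tuple[float, float],
    soft_freq_range: Optional[Tuple[float, float]] = None,
    n_per_subwindow: int = 1000,
    n_subwindows: int = 11,
) -> List[SweepParams]:
    """Draw parameters for n_per_subwindow sweeps at each subwindow."""
    params = []
    for sub in range(n_subwindows):
        for _ in range(n_per_subwindow):
            f = rng.uniform(*soft_freq_range) if soft_freq_range else None
            params.append(
                SweepParams(
                    subwindow=sub,
                    two_Ns=rng.uniform(*coeff_range),  # U(low, high)
                    fix_time=rng.uniform(0.0, 0.2),    # U(0, 0.2) N generations
                    init_freq=f,
                )
            )
    return params


# Example: hard sweeps, and soft sweeps with f drawn from U(0.05, 0.2).
hard = draw_sweep_params(random.Random(1), (2500, 25000))
soft = draw_sweep_params(random.Random(2), (2500, 25000), soft_freq_range=(0.05, 0.2))
```

Each call yields 11,000 parameter sets, 1000 per subwindow, matching the counts given in the text.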


For each combination of demographic scenario and selection coefficient, we combined our simulated data into 5 equally sized training sets (Fig 1): a set of 1000 hard sweeps where the sweep occurs in the middle of the central subwindow (i.e. all simulated hard sweeps at that location); a set of 1000 soft sweeps (all simulated soft sweeps in the central subwindow); a set of 1000 windows where the central subwindow is linked to a hard sweep that occurred in one of the other 10 subwindows (i.e. 1000 simulations drawn randomly from the set of 10,000 simulations with a hard sweep occurring in a non-central subwindow); a set of 1000 windows where the central subwindow is linked to a soft sweep (1000 simulations drawn from the set of 10,000 simulations with a flanking soft sweep); and a set of 1000 neutrally evolving windows unlinked to a sweep. We then generated a replicate set of these simulations for use as an independent test set.
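The assembly of the five balanced training sets can be sketched as follows. This is a minimal illustration under assumed data structures: each sweep simulation is represented as a `(subwindow_index, features)` pair, and the function name and dictionary keys are hypothetical.

```python
import random


def build_training_sets(hard, soft, neutral, central=5, n_per_class=1000, seed=0):
    """Split simulations into five equally sized classes.

    hard, soft: lists of (subwindow_index, features) pairs.
    neutral:    list of features for neutrally evolving regions.
    """
    rng = random.Random(seed)
    sets = {}
    for kind, sims in (("hard", hard), ("soft", soft)):
        centered = [feat for sub, feat in sims if sub == central]
        flanking = [feat for sub, feat in sims if sub != central]
        # All sweeps located in the central subwindow form the sweep class.
        sets[kind] = centered[:n_per_class]
        # The linked class is a random draw from the flanking-sweep simulations.
        sets[f"{kind}-linked"] = rng.sample(flanking, n_per_class)
    sets["neutral"] = neutral[:n_per_class]
    return sets


# Toy usage with 3 simulations per subwindow instead of 1000.
toy_hard = [(sub, f"h{sub}_{i}") for sub in range(11) for i in range(3)]
toy_soft = [(sub, f"s{sub}_{i}") for sub in range(11) for i in range(3)]
toy_neutral = [f"n{i}" for i in range(3)]
toy_sets = build_training_sets(toy_hard, toy_soft, toy_neutral, n_per_class=3)
```

Sampling the linked classes without replacement keeps every class the same size even though there are ten times as many flanking-sweep simulations as central ones.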

The difference between S/HIC and these two other methods is amplified when testing these classifiers on stronger hard sweeps (2Ns ranging from 2,500 to 25,000). Our classifier is better able to narrow down the selected region by classifying flanking windows as hard-linked, while SFselect+ and evolBoosting+ classify the vast majority of simulations even 5 windows away from the target of selection as hard sweeps (Fig 5). SFselect+ and evolBoosting+ both have more sensitivity to detect hard sweeps when examining the correct window (>99% versus 88.8%), as S/HIC misclassifies 10.9% of these stronger sweeps as hard-linked. On the other hand, S/HIC recovers 91.8% of soft sweeps versus 87.7% for SFselect+ and 73.3% for evolBoosting+, and correctly classifies the mode of selection more often than these methods. S/HIC also misclassifies relatively few regions linked to soft sweeps as sweeps themselves (~16% when one window away, versus ~50% for SFselect+ and ~20% for evolBoosting+).

Next, we examined the proportion of windows at various distances from sweeps that are assigned to each class under this scenario of demographic misspecification. We find that while S/HIC classifies hard sweeps with lower sensitivity than under the constant population size scenario (56.0% and 19.1% of test examples are classified as hard and soft, respectively), relatively few linked windows are classified as sweeps (Fig 7A). For soft sweeps S/HIC fares less well (20.7% of windows are correctly classified, and 34.7% are classified as hard sweeps), though again relatively few false positives are produced in linked or neutral regions. In contrast, evolBoosting+ classifies the majority of windows, selected or otherwise, as soft sweeps (Fig 7C): 68.5% of hard sweeps and 55.0% of neutral regions are misclassified as soft. For SFselect+ this problem is exacerbated: 68.6% of hard sweeps and 95.3% of neutral windows are classified as soft sweeps. Thus, under this scenario of demographic misspecification, S/HIC is the only method we examined that can effectively discriminate between positively selected and unselected portions of the genome.

In total, we examined 344 windows, each 200 kb in length. We classified 34 windows (9.9%) as centered around a hard sweep, 22 (6.4%) as linked to a hard sweep, 48 (14.0%) as centered around a soft sweep, 89 (25.9%) as linked to a soft sweep, and 151 (43.9%) as neutral. Surprisingly, we infer that over 56% of windows lie within regions whose patterns of variation are affected by sweeps either within the window or in linked regions. This may imply that, given the genomic landscape of recombination in humans, even if selective events are somewhat rare [58], they may nonetheless impact variation across large stretches of the genome. However, we cannot firmly draw this conclusion given the difficulty of distinguishing between linked selection and neutrality under the European demographic model (Fig 7).
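The percentages quoted above follow directly from the window counts. The short check below recomputes them; the dictionary keys are labels chosen for this example.

```python
# Window counts per class, as reported in the text.
counts = {
    "hard": 34,
    "hard-linked": 22,
    "soft": 48,
    "soft-linked": 89,
    "neutral": 151,
}

total = sum(counts.values())

# Percentage of windows in each class, rounded to one decimal place.
percent = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Windows affected by a sweep, either directly or through linkage.
affected = round(100 * (total - counts["neutral"]) / total, 1)
```

The five counts sum to the 344 windows examined, and the non-neutral share comes to 56.1%, consistent with the "over 56%" figure.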

An additional advantage of machine learning approaches such as ours is the relative ease with which the classifier can be extended to incorporate more features, potentially adding information complementary to current features that could further improve classification power. For example, our examination of linkage disequilibrium is limited to within each subwindow; including features measuring the degree of LD between subwindows could also add valuable information. In addition, we could add statistics currently omitted which capture patterns of genealogical tree imbalance (e.g. the maximum frequency of derived alleles [68]), or star-like sub-trees within genealogies (e.g. iHS [42], nSL [23]), both symptoms of various types of positive selection. Indeed, all tests for selective sweeps can be seen as methods to detect the distortions in the shapes of genealogies surrounding selected sites. Thus, if one could directly examine the ancestral recombination graph (ARG) surrounding a focal region, more powerful inference could be possible. It is now possible to estimate ARGs from sequence data [69], and summaries of these estimated trees could be incorporated as features to identify sweeps and classify their mode. These are just some of a multitude of possible features that one can use to make inferences about natural selection. The success of S/HIC, evolBoosting [40], and SFselect [37] in our tests relative to more conventional methods shows that machine learning approaches leveraging many different types of information have the potential to make far more powerful inferences than methods relying on an individual statistic.