Research

  • Research interests

    • Research interests: quantitative genetics, breeding

    • Keywords: GWAS, GS, mixed-effects model

    • Related areas: statistics, machine learning (including reinforcement learning)

    • Programming: R (also C++ and Python)

  • Background: What are GWAS (Genome-Wide Association Studies) and GP/GS (Genomic Prediction/Selection)?

  • A GWAS (Genome-Wide Association Study, see Right Figure) detects candidate genes for traits of interest by collecting many genotypes with their corresponding phenotypes and testing each marker for a significant association with the trait, exploiting the LD (Linkage Disequilibrium) between markers and causal genes.


  • As the cost of next-generation sequencing falls and its throughput rises, the number of accessions available for GWAS is increasing. Using such large sequencing datasets, GWAS has identified novel genes related to important agronomic traits and, combined with genome editing, is expected to contribute to the optimization of breeding.
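
  • The marker-by-marker test described above can be sketched as follows. This is a minimal illustration on simulated data, assuming a simple linear regression of the phenotype on each marker and a normal approximation for the p-values; it applies no correction for population structure, and the function name `single_marker_scan` and all simulation settings are hypothetical.

```python
import math
import numpy as np

def single_marker_scan(genotypes, phenotype):
    """t-statistic of the slope in y ~ marker, computed for every marker.

    Assumes every marker is polymorphic (no zero-variance columns).
    """
    G = np.asarray(genotypes, dtype=float)   # (n_individuals, n_markers), coded 0/1/2
    y = np.asarray(phenotype, dtype=float)
    n = G.shape[0]
    Gc = G - G.mean(axis=0)                  # center each marker
    yc = y - y.mean()
    sxx = (Gc ** 2).sum(axis=0)
    sxy = Gc.T @ yc
    beta = sxy / sxx                         # per-marker slope
    rss = (yc ** 2).sum() - beta * sxy       # per-marker residual sum of squares
    se = np.sqrt(rss / (n - 2) / sxx)
    return beta / se

# Simulated data (assumption): 300 individuals, 200 markers, one causal marker.
rng = np.random.default_rng(0)
n_ind, n_markers, causal = 300, 200, 50
G = rng.binomial(2, 0.3, size=(n_ind, n_markers))
y = 1.0 * G[:, causal] + rng.normal(size=n_ind)

t_stats = single_marker_scan(G, y)
# Two-sided p-values from a normal approximation to the t distribution.
p_values = np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats])
significant = np.flatnonzero(p_values < 0.05 / n_markers)  # Bonferroni threshold
top_marker = int(np.argmin(p_values))
```

    In practice GWAS is usually run with a mixed-effects model that accounts for relatedness, which this sketch leaves out.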

  • GP (Genomic Prediction), on the other hand, is a method that uses statistical or machine learning models to predict genotypic values from genome-wide marker genotypes and the corresponding phenotypes (see Right Figure).


  • GS (Genomic Selection) is a breeding method that selects individuals based on the values predicted by GP (GEBV: Genomic Estimated Breeding Value). It is expected to speed up and optimize breeding because selection can be performed on an individual basis, regardless of time and place.
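
  • As a rough illustration of GP followed by GS, the sketch below trains a ridge regression on marker genotypes, predicts GEBVs for unphenotyped candidates, and selects the top candidates. Ridge regression is only one of many possible GP models and is chosen here for brevity; the helper names, the penalty `lam`, and the simulated data are all assumptions.

```python
import numpy as np

def ridge_marker_effects(G_train, y_train, lam):
    """Estimate marker effects by ridge regression on centered genotypes."""
    G = np.asarray(G_train, dtype=float)
    y = np.asarray(y_train, dtype=float)
    g_mean, y_mean = G.mean(axis=0), y.mean()
    Gc = G - g_mean
    n_markers = G.shape[1]
    beta = np.linalg.solve(Gc.T @ Gc + lam * np.eye(n_markers), Gc.T @ (y - y_mean))
    return beta, g_mean, y_mean

def predict_gebv(G_new, beta, g_mean, y_mean):
    """Predicted values (GEBVs) for individuals with genotypes only."""
    return y_mean + (np.asarray(G_new, dtype=float) - g_mean) @ beta

# Simulated data (assumption): 400 individuals, 100 markers with small effects.
rng = np.random.default_rng(1)
n_ind, n_markers, n_train = 400, 100, 300
G = rng.binomial(2, 0.5, size=(n_ind, n_markers))
true_effects = rng.normal(0.0, 0.3, size=n_markers)
g_true = G @ true_effects                      # true genotypic values
y = g_true + rng.normal(size=n_ind)            # observed phenotypes

# GP: train on phenotyped individuals, predict the remaining candidates.
beta, g_mean, y_mean = ridge_marker_effects(G[:n_train], y[:n_train], lam=10.0)
gebv = predict_gebv(G[n_train:], beta, g_mean, y_mean)
accuracy = float(np.corrcoef(gebv, g_true[n_train:])[0, 1])

# GS: select the 10 candidates with the highest GEBV, without phenotyping them.
selected = np.argsort(gebv)[::-1][:10]
```

    The correlation between GEBVs and true genotypic values (`accuracy`) can only be computed here because the data are simulated; with real data it is typically estimated by cross-validation.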

  • Theme 1: Choosing the optimal population for GWAS: A simulation study using whole-genome sequences of rice

  • To avoid potential false positives caused by population stratification/structure (= high genetic background, see Left Figure), a GWAS population with low stratification should be selected. However, restricting the analysis to such a population may limit the sample size and thereby reduce the detection power of the GWAS. There is therefore a trade-off between population stratification and sample size.

  • We conducted simulation experiments to examine whether, when the genetic diversity of a target population is small, adding a population with higher diversity is appropriate.

  • The results showed that the GWAS power with a mixture population was generally higher than with a separate population. The optimal population for GWAS also varied depending on the fixation index FST of the quantitative trait nucleotide (QTN) and on whether the QTN was polymorphic in each population. When a QTN is polymorphic in a target population, combining the target population with a higher-diversity population improves the QTN detection power. Investigating FST and the expected heterozygosity He as factors influencing the detection power, we showed that SNPs with high FST or low He are less likely to be detected by GWAS with mixture populations. Using a subset of a sequenced/genotyped germplasm collection together with a target population can improve the GWAS detection power.
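
  • For readers unfamiliar with the two statistics, the sketch below computes He and FST per SNP from allele frequencies. It assumes biallelic SNPs, He = 2p(1 - p), and Nei's definition FST = (HT - HS) / HT with equal population weights; the estimator actually used in the study is not stated here, and the function names and frequencies are hypothetical.

```python
import numpy as np

def expected_heterozygosity(p):
    """He = 2p(1 - p) for a biallelic SNP with allele frequency p."""
    p = np.asarray(p, dtype=float)
    return 2.0 * p * (1.0 - p)

def fst_nei(p_sub):
    """Nei's FST = (HT - HS) / HT per SNP, with equal population weights.

    p_sub: array of shape (n_populations, n_snps) holding allele frequencies.
    Returns NaN for SNPs that are monomorphic in the pooled population.
    """
    p_sub = np.asarray(p_sub, dtype=float)
    hs = expected_heterozygosity(p_sub).mean(axis=0)   # mean within-population He
    ht = expected_heterozygosity(p_sub.mean(axis=0))   # He of the pooled population
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(ht > 0, (ht - hs) / ht, np.nan)

# Hypothetical allele frequencies: rows are two populations, columns are three SNPs.
freqs = np.array([
    [0.5, 1.0, 0.2],
    [0.5, 0.0, 0.2],
])
he_per_pop = expected_heterozygosity(freqs)
fst = fst_nei(freqs)
```

    The second SNP is fixed for different alleles in the two populations, so its He is zero within each population and its FST is at the maximum; this is the kind of SNP described above as hard to detect in a mixture population.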

  • Related works

  • Theme 2: Development of a novel haplotype-based GWAS method

  • An allele is a variant at a genomic position that differs between individuals, and a haplotype is a group of alleles (Figure Below).


  • Genes with complex compositions that consist of multiple rare alleles, such as haplotypes, are hard to detect with conventional SNP-based GWAS (Right Figure).

  • In this study, we developed a novel single nucleotide polymorphism (SNP) set method named “RAINBOW”, which tests multiple SNPs in each SNP-set at the same time, and applied it to haplotype-based GWAS by regarding a haplotype block as a SNP-set. By combining haplotype block estimation with SNP-set GWAS, haplotype-based GWAS can be conducted without prior information on haplotypes. We compared the power of our method with that of conventional SNP-based GWAS, conventional haplotype-based GWAS, and conventional SNP-set GWAS. Our proposed method was shown to be superior to these in three aspects: (1) controlling false positives; (2) detecting causal variants without relying on linkage disequilibrium when the causal variants were genotyped in the dataset; and (3) achieving greater power than the other methods, i.e., detecting causal variants that the others missed, primarily when the causal variants were located very close to each other and the directions of their effects were opposite. With the SNP-set approach used in this study, we expect that it will become possible to detect not only rare variants but also genes with complex mechanisms, such as genes with multiple causal variants.
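
  • To illustrate why testing a set of SNPs jointly can help when nearby causal variants have opposite effects, the sketch below compares single-SNP tests with a joint test of two tightly linked SNPs. It uses a plain fixed-effect likelihood-ratio test on simulated data, chosen for brevity; it is not the RAINBOW algorithm, and the helper `lrt_stat` and all simulation settings are assumptions.

```python
import math
import numpy as np

def lrt_stat(y, X_null, X_alt):
    """n * log(RSS_null / RSS_alt) for two nested least-squares models."""
    def rss(X):
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        return float(resid @ resid)
    return max(len(y) * math.log(rss(X_null) / rss(X_alt)), 0.0)

# Simulated data (assumption): two SNPs in strong LD with opposite effects.
rng = np.random.default_rng(42)
n = 1000
s1 = rng.binomial(2, 0.5, size=n).astype(float)
s2 = s1.copy()
redraw = rng.random(n) < 0.1                       # break the LD in ~10% of individuals
s2[redraw] = rng.binomial(2, 0.5, size=int(redraw.sum()))
y = 1.0 * s1 - 1.0 * s2 + rng.normal(size=n)       # effects largely cancel out

ones = np.ones((n, 1))
single = [lrt_stat(y, ones, np.column_stack([ones, s])) for s in (s1, s2)]
joint = lrt_stat(y, ones, np.column_stack([ones, s1, s2]))

# Asymptotic chi-square p-values in closed form (1 df and 2 df).
p_single = [math.erfc(math.sqrt(x / 2.0)) for x in single]
p_joint = math.exp(-joint / 2.0)
```

    Because the two SNPs are almost always identical, each one alone explains little of the phenotype, whereas the model containing both captures their difference. The RAINBOWR package should be consulted for the actual method.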


  • RAINBOW was implemented as an R package named “RAINBOWR” and is available from CRAN and GitHub (see the link below).

  • Theme 3: Optimizing GP for costs and accuracy over phenotyping in early growth stages