Research
Research theme I: Machine learning (including deep learning) has been successful in many fields with very large samples, but has yet to be extended to fields with moderate or small samples. Medical genomics is a typical field with high-dimensional data but limited labelled samples. By utilizing larger unlabelled samples, we conduct Representation Learning, which learns sensible representations of genomic data, paving the way for downstream analysis of a focal disease with small samples. This research enables powerful statistical learning in fields with small samples, in particular biological and medical applications.
Small sample (Picture from Math with Bad Drawings)
Research theme II: Association mining and causal inference are critical techniques in statistics. In biology, many applications involve complex structures with multi-scale big data, including DNA, RNA, protein, and epigenetic marks. We develop novel statistical models and their scalable implementations to discover associations and causal factors in multi-scale data. This research allows the prediction of important biological or medical properties such as the risk of disease and response to treatments.
Research theme III: Statistical inference based on noisy and biased data is challenging, yet is frequently encountered in practice. In particular, the emerging single-cell sequencing technology provides an unprecedented opportunity to analyze biological phenomena at single-cell resolution, but still suffers from significant noise and experimental bias due to immature experimental instruments. We develop novel algorithms to mine sensible knowledge despite noise and bias in the data. Our statistical models will bridge the gap between the ability of state-of-the-art sequencing instruments and ambitious biological applications.
Single-cell RNA-Seq data (Picture from Panoli's article at towardsdatascience.com)
Selected Works: (My trainees are underlined; * = joint first authors; # = corresponding author(s))
Statistical method development
Kossinna P, Kumarapeli S, Zhang Q# (2023+). “IBAS: Interaction-bridged association studies discovering genetic basis of complex traits with high stability”. Submitted. (Preprint and Software)
Wang D*, Perera D*, He J*, Cao C, Kossinna P, Li Q, Zhang W, Guo X, Alexander P, Wu J, Zhang Q#. (2023) “cLD: Rare-variant linkage disequilibrium between genomic regions identifies novel genomic interactions”. PLoS Genetics. 2023 Dec 18;19(12):e1011074. doi: 10.1371/journal.pgen.1011074. Online ahead of print. (Software)
He J, Li Q, Zhang Q# (2023) “rvTWAS: identifying gene-trait association using sequences by utilizing transcriptome-directed feature selection”. Genetics. 2023 Nov 24:iyad204. doi: 10.1093/genetics/iyad204. Online ahead of print. (Software)
Li Q, Yu Y, Kossinna P, Lun T, Liao W#, Zhang Q#. (2023) “XA4C: eXplainable representation learning via Autoencoders revealing Critical genes”. PLoS Computational Biology. 2023 Oct 2;19(10):e1011476. doi: 10.1371/journal.pcbi.1011476. PMID: 37782668 (Software)
Kossinna P, Cai W, Shemanko C, Lu X, Zhang Q#. (2022) “Stabilized COre gene and Pathway Election uncovers pan-cancer shared pathways and a cancer specific driver”. Science Advances. 2022 Dec 21;8(51):eabo2846. doi: 10.1126/sciadv.abo2846. PMID: 36542714 (Software)
Cao C, Kossinna P, Kwok D, Li Q, He J, Su L, Guo X, Zhang Q#, Long Q#. (2022) “Disentangling genetic feature selection and aggregation in transcriptome-wide association studies”. Genetics (Cover Feature). 2022 Feb 4;220(2):iyab216. doi: 10.1093/genetics/iyab216. PMID: 34849857. (Software)
Zhang Q, Tyler-Smith C, Long Q (2015). “An extended Tajima’s D neutrality test incorporating SNP calling and imputation uncertainties”. Statistics and Its Interface. 2015, vol.8(4), 447-456.
Zhang Q, Long Q, Ott J (2014). “AprioriGWAS, a new pattern mining strategy for detecting genetic variants associated with disease through interaction effects”. PLoS Computational Biology, Jun 5; 10(6). (Software)
Long Q*, Zhang Q*, Vilhjalmsson BJ, Forai P, Seren Ü, Nordborg M (2013). “JAWAMix5: an out-of-core HDF5-based java implementation of whole-genome association studies using mixed models”. Bioinformatics 2013 March. (Software)
Data analysis
Long Q, Rabanal FA, Meng D, Huber CD, Farlow A, Platzer A, Zhang Q, Vilhjálmsson BJ, Korte A, Nizhynska V, Voronin V, Korte P, Sedman L, Mandáková T, Lysak MA, Seren U, Hellmann I, Nordborg M (2013). “Massive genomic variation and strong selection in Arabidopsis thaliana lines from Sweden”. Nature Genetics, 45(8): 884-90.
Zhang, Q as one of the listed participants of the International HapMap 3 Consortium. (2010) “Integrating common and rare genetic variation in diverse human populations”. Nature 467(7311): 52-8.
Zhang, Q as one of the listed participants of the International HapMap Consortium. (2007) “A second generation human haplotype map of over 3.1 million SNPs”. Nature 449(7164): 851-61.
Sun T, Gao Y, Tan W, Ma S, Shi Y, Yao J, Guo Y, Yang M, Zhang X, Zhang Q, Zeng C & Lin D. (2007) “A six-nucleotide insertion-deletion polymorphism in the CASP8 promoter is associated with susceptibility to multiple cancers”. Nature Genetics 39: 605-613
Zhang, Q as one of the listed participants of the International HapMap Consortium. (2005) “A Haplotype Map of the Human Genome”. Nature 437: 1299-1320
Zhang, Q as one of the listed participants of the International HapMap Consortium. (2003) “The International HapMap project”. Nature 426: 789-796