Talks & Discussions

Opening Remarks

Andy Yates, EMBL-EBI

Closing Remarks

Julia Wilson, Wellcome Sanger Institute

Keynote: Title Coming Soon

Paolo Di Tommaso, Seqera Labs

nf-core: Title Coming Soon

Harshil Patel, Seqera Labs

Panel Discussion: Scaling Sustainably

Peter Clapham, Wellcome Sanger Institute

Leanne Haggerty, EMBL-EBI

Martin Pollard, Wellcome Sanger Institute

Andrew Menzies, Wellcome Sanger Institute

Paolo Di Tommaso, Seqera Labs

Harshil Patel, Seqera Labs

Community: Predicting the protein function impact of every single amino acid substitution in human pangenomes

Nuno Saraiva-Agostinho, Ensembl, EMBL-EBI

The Human Pangenome Reference Consortium (HPRC) has so far released 47 genome sequences representative of diverse human populations. To help interpret genetic variants from these assemblies, we created an open-source Nextflow pipeline to predict the functional impact of every single amino acid substitution (SAAS) with SIFT and PolyPhen-2.

Our pipeline uses genomic sequences (FASTA) and gene annotations (GTF) from assemblies produced by Ensembl, employing AGAT to generate protein sequences from protein-coding transcripts. For each unique protein, we massively parallelise the functional impact prediction of alternate amino acids per sequence position. The results are saved in space-efficient matrices within a SQL database. As new assemblies emerge, the pipeline will add predictions for novel proteins.

The pipeline is containerised in Docker, enabling portability and efficient SAAS data production across pangenome assemblies, and its results can be shared with the scientific community.
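As a rough illustration of what "every single amino acid substitution" means in practice, the sketch below enumerates all substitutions for one protein and stores one score per position and alternate residue in a matrix. It is not the pipeline's code: `placeholder_score`, the function names, and the in-memory list-of-lists layout are assumptions made for this example, and a real run would obtain scores from SIFT or PolyPhen-2.

```python
# The 20 standard amino acids, in a fixed column order for the matrix.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def enumerate_substitutions(protein):
    """Yield (position, reference, alternate) for every single substitution.

    Positions are 1-based. The reference residue itself is skipped.
    """
    for position, reference in enumerate(protein, start=1):
        for alternate in AMINO_ACIDS:
            if alternate != reference:
                yield position, reference, alternate


def build_score_matrix(protein, score_fn):
    """Return one row per position and one column per amino acid.

    Cells for the reference residue stay None, since there is nothing to score.
    `score_fn` is a hypothetical stand-in for a real predictor.
    """
    matrix = [[None] * len(AMINO_ACIDS) for _ in protein]
    for position, reference, alternate in enumerate_substitutions(protein):
        column = AMINO_ACIDS.index(alternate)
        matrix[position - 1][column] = score_fn(protein, position, alternate)
    return matrix


def placeholder_score(protein, position, alternate):
    # Placeholder only: a real pipeline would call SIFT or PolyPhen-2 here.
    return 0.0


substitutions = list(enumerate_substitutions("MKT"))
matrix = build_score_matrix("MKT", placeholder_score)
print(len(substitutions))  # 3 positions x 19 alternates
```

Because each position is scored independently of the others, the work divides naturally into many small tasks, which is what makes this kind of prediction easy to parallelise.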

Community: How Nextflow is a key to the digital transformation of WSI's pipeline solutions

Nathan Cunningham, Platform Solutions, Wellcome Sanger Institute

The Wellcome Sanger Institute has skilled pipeline teams that manage sequencing device pipelines using a variety of tools and programming languages. They rely on large-scale digital environments and large storage solutions. However, complications arise from the lack of a formal data taxonomy, knowledge dependency, governance issues, inefficient resource utilisation, and reliance on legacy tools, leading to sparse documentation and steep learning curves for new team members.

To address these challenges, the institute will pursue a to-be architecture that includes a consistent data taxonomy, greater standardisation, recommendations for governance models, and formalised demand and capacity planning. This future architecture will draw on best practices and tools already in use within the institute, such as Nextflow, and provide a future-proof platform that improves efficiency, standardisation, resilience, and flexibility, ultimately aligning with new operating models for the coming years.

Community: Modernizing MGnify: Nextflow and nf-core for microbiome research

Martin Beracochea, MGnify, EMBL-EBI

Within the Microbiome Informatics team, the MGnify production pipelines are currently undergoing migration to Nextflow and nf-core tools. While a selection of key pipelines have already been successfully migrated, the ongoing effort focuses on incorporating best practices. Reliability, speed of development and deployment, platform support, community engagement, and support at EBI are at the heart of this initiative. Embracing nf-core's modern development practices and tools has, step by step, bolstered the robustness and versatility of MGnify pipelines, allowing our team to make more effective use of their time and development efforts. This pragmatic approach seeks to ensure the MGnify resource's reliability and adaptability to serve the research community even more effectively.

Community: pgsc_calc: A reproducible workflow to calculate polygenic scores

Samuel Lambert, University of Cambridge & EMBL-EBI

Polygenic scores (PGS) have transformed human genetic research and have many potential clinical applications, including improved risk stratification for disease prevention and prediction of treatment response. PGS can now quantify an individual's genetic predisposition to hundreds of common diseases and clinically relevant traits, and are distributed via the PGS Catalog (the only FAIR repository of PGS). Calculating PGS requires significant bioinformatics expertise that is out of reach for many. Here we present the PGS Catalog Calculator (pgsc_calc), developed in Nextflow with the nf-core framework, to automate PGS calculation and make it reproducible. pgsc_calc is scalable and portable, supporting deployment on high-performance computing clusters and in airlocked trusted research environments that hold genetic data. Most importantly, pgsc_calc breaks down barriers to the equitable application of PGS by implementing genetic ancestry estimation and score normalisation using reference data.
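For readers new to PGS, the sketch below shows the basic arithmetic: a score is a weighted sum of effect-allele dosages, which can then be expressed relative to a reference distribution. This is an illustration only and not pgsc_calc's implementation. The variant identifiers, weights, dosages and reference scores are invented, and the plain z-score is an assumed, simplified stand-in for the ancestry-aware normalisation the abstract refers to.

```python
from statistics import mean, pstdev


def polygenic_score(dosages, weights):
    """Sum effect-allele dosage times effect weight over shared variants."""
    return sum(dosages[variant] * weight
               for variant, weight in weights.items()
               if variant in dosages)


def z_normalise(score, reference_scores):
    """Express a score in standard deviations from the reference mean."""
    return (score - mean(reference_scores)) / pstdev(reference_scores)


# Hypothetical variant identifiers, weights and dosages (0, 1 or 2 copies).
weights = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 1.0}
dosages = {"rs_a": 2, "rs_b": 1, "rs_c": 0}

raw = polygenic_score(dosages, weights)  # 2*0.5 + 1*(-0.25) + 0*1.0
normalised = z_normalise(raw, [0.0, 0.5, 1.0, 1.5, 2.0])
```

Variants missing from the genotype data are skipped here for brevity; a production tool would need to report how many scoring variants actually matched.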

Community: Migrating Ensembl pipelines to Nextflow for enhanced production-ready workflow

Disha Lodha, Ensembl Metazoa, EMBL-EBI

Ensembl Metazoa provides an open-access platform to integrate publicly available genome-scale data sets for non-vertebrate metazoan species, ranging from sponges to insects. We also collaborate with VEuPathDB to broaden our support for the research community focusing on key host, vector and pathogen species. Until recently, our pipelines relied solely on eHive, a workflow manager created and maintained by Ensembl. However, a recent strategic shift toward greater flexibility and portability led us to transition to Nextflow.

Currently, we have three pipelines that have been fully migrated to Nextflow, mainly using Python and Perl scripts. Nextflow has provided us with the opportunity to implement efficient unit testing and to create a CI/CD system for all our pipelines.

We intend to share the challenges faced and the lessons learned when using Nextflow to develop production-ready pipelines to help create user-friendly and shareable workflows for scientific research.

Community: Resource optimisation and sustainability in Sanger Tree of Life

Matthieu Muffato & Damon-Lee Pointon, Tree of Life, Wellcome Sanger Institute

At the Wellcome Sanger Institute, the Tree of Life programme aims to generate tens of thousands of gold-standard genomes over the next decade, covering all animals, plants, fungi and protists. Genome assembly is a complex, compute-, memory-, and storage-intensive task. To operate at the required scale (60 new genomes every week), we need to tightly monitor and control our usage of the IT resources provided by the Institute.

All our assembly and analysis pipelines are written in Nextflow and leverage the nf-core tooling and methodologies. Here, we will present how we gather resource usage when running the pipelines, how we’re estimating our CO2 footprint, how we analyse those data to define improved resource-usage rules, and the gains we’ve made.

Community: YASCP: A next-gen pipeline for ultra-large scale scRNA quality control & analysis

Matiss Ozols, Human Genetics, Wellcome Sanger Institute

The YASCP tool (https://github.com/wtsi-hgi/yascp) was designed with a straightforward goal: to create a massive, easily accessible bank of detailed cell data from blood samples of UK Biobank (UKBB) and East London Genes and Health (ELGH) participants, as part of the Cardinal project. YASCP was written to support Cardinal but is applicable to many other projects. We wanted to closely study and document the unique features of millions of these cells, a project of a scale not seen before. Using YASCP, the Cardinal project achieved this, analysing data from 5,000 UKBB and 1,500 ELGH individuals and giving us a deeper understanding of their health and genetics. This success sets a new standard and opens doors for other big projects. For instance, YASCP might soon help the Jaguar project to understand the detailed cellular makeup of people from Latin America. Throughout all this, we can trust the accuracy and reliability of YASCP, thanks to the strong foundation provided by Nextflow.

Community: Speeding up variant annotation in the Ensembl Variant Effect Predictor (VEP) using Nextflow

Likhitha Surapaneni, Ensembl, EMBL-EBI

Ensembl Variant Effect Predictor (VEP) employs custom algorithms and integrates key tools and reference datasets to enable efficient variant annotation, filtering and prioritisation. Analysing and annotating large variant sets using VEP can be a time-consuming process. If a VEP run fails, the job has to be restarted, which wastes compute resources. To improve speed and scalability, we created a Nextflow workflow for VEP.

This open-source pipeline parallelises variant annotation and can run across multiple species and assemblies simultaneously. The input data is validated and split into smaller chunks, which are processed in parallel using VEP and finally merged into a single output. Caching allows a run to restart from intermediate steps in case of failure. Secondary files are deleted after each step to save storage. The workflow is containerised and supports high-performance computing clusters, making VEP reusable by downstream pipelines.
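The split, process and merge pattern described above can be sketched in a few lines of Python. This is a simplified illustration and not the VEP workflow itself: `annotate_chunk` is a placeholder for a real VEP invocation, the in-memory `cache` dictionary is an assumed stand-in for the caching that Nextflow provides, and all function names are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Split a list of records into consecutive chunks of at most chunk_size."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def annotate_chunk(chunk):
    # Placeholder for running VEP on one chunk; here we only tag each record.
    return [f"{record}|annotated" for record in chunk]


def run(records, chunk_size, cache):
    """Annotate chunks in parallel, reusing cached results, then merge in order."""
    chunks = split_into_chunks(records, chunk_size)
    # Only chunks without a cached result need to be processed again.
    pending = [i for i in range(len(chunks)) if i not in cache]
    with ThreadPoolExecutor() as pool:
        results = pool.map(annotate_chunk, [chunks[i] for i in pending])
        for index, result in zip(pending, results):
            cache[index] = result
    merged = []
    for index in range(len(chunks)):
        merged.extend(cache[index])
    return merged


variants = ["v1", "v2", "v3", "v4", "v5"]
cache = {}
merged = run(variants, chunk_size=2, cache=cache)
```

Calling `run` again with the same `cache` skips every chunk that already finished, which mirrors the idea of restarting from intermediate steps. The cache is keyed by chunk index, so it is only valid when the input and chunk size are unchanged.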

Community: Using cfDNA methylation to detect and subtype cancer

Simon Pearce, Cancer Biomarker Centre, Cancer Research UK

Circulating cell-free DNA (cfDNA) is a biomarker for cancer, as tumour-released DNA is detectable in blood plasma. We have established an MBD-protein enrichment method (T7-MBD-Seq) to investigate genome-wide cfDNA methylation, and have developed two Nextflow pipelines for demultiplexing, alignment, and creation of analysis objects, applied to >3000 samples. We also created a Nextflow pipeline to build classifiers, using data augmentation to simulate the wide range of tumour content found in cfDNA, and used it to build a classifier that predicts cancer type across 29 tumour classes. When tested on 143 cfDNA samples from patients, it achieved 84% sensitivity and 96% accuracy. This classifier was applied to 41 patients with Cancer of Unknown Primary, that is, patients diagnosed with metastatic cancer whose primary cancer cannot be determined. Predictions were made in 32/41 cases (23/26 clinically consistent), showing that we can predict tissue of origin from a liquid biopsy, increasing treatment options.
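One simple way to picture the data augmentation step is linear mixing of a tumour signal with a healthy signal at varying tumour fractions. The abstract does not say how the augmentation is performed, so the mixing model, the uniform sampling of tumour fractions, the function names and the example values below are all assumptions made for illustration.

```python
import random


def mix_profiles(tumour, healthy, tumour_fraction):
    """Blend two equal-length signal profiles at a given tumour fraction."""
    if len(tumour) != len(healthy):
        raise ValueError("profiles must cover the same regions")
    return [tumour_fraction * t + (1.0 - tumour_fraction) * h
            for t, h in zip(tumour, healthy)]


def augment(tumour, healthy, n_samples, rng):
    """Create n_samples mixtures with tumour fractions drawn uniformly from [0, 1)."""
    return [mix_profiles(tumour, healthy, rng.random()) for _ in range(n_samples)]


# Hypothetical per-region methylation signal for one tumour and one healthy sample.
tumour_profile = [10.0, 0.0, 4.0]
healthy_profile = [0.0, 2.0, 4.0]

half = mix_profiles(tumour_profile, healthy_profile, 0.5)
simulated = augment(tumour_profile, healthy_profile, n_samples=5, rng=random.Random(0))
```

Training on mixtures like these exposes a classifier to low tumour fractions that may be rare among the original samples.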