Blog

An archive of all our past and present blog posts. A mixture of informative articles, announcements and course updates.

Online Core R Workshop

11/05/23 Sarah Dowsland

Join us for this free online two-hour workshop, which provides an introduction to R for complete beginners. By the end of the course you will be able to:

The workshop assumes no prior experience of coding. 

Event details

💻 Core R Workshop

🗓️ Tuesday 20th June 2023

🕑 2-4 pm

📍 Online

Registration

There are 30 places available for this online course. Priority will be given to NERC-funded students and researchers, as well as researchers from underrepresented groups, although we encourage everyone to apply.

Application deadline: 12pm Monday 5th June. Applicants will be notified by Tuesday 6th June if they have been allocated a place. Scholarships and funding for headsets and monitors will be available. 

▶️ Registration is now closed. 

🗓️ Save the date 20th June! 🗓️ NEW online workshop - Introduction to R

09/05/23 Sarah Dowsland

Save the date - registration opening soon!

Join us for this free online two-hour workshop, which provides an introduction to R for complete beginners. By the end of the course you will be able to:

The workshop assumes no prior experience of coding. We allow plenty of time for questions and we provide a high level of support for each learner. The workshop will be delivered by Emma Rand, an excellent instructor with a passion for teaching.

Event details

💻 Core R Workshop

🗓️ Tuesday 20th June 2023

🕑 2-4 pm

📍 Online

Places available

There are 30 places available for this online course.

Priority will be given to NERC-funded students and researchers, as well as researchers from underrepresented groups, although we encourage everyone to apply.

Application deadline: Friday 9th June. Applicants will be notified by Tuesday 13th June if they have been allocated a place.

Registration fee

This workshop is funded by the Natural Environment Research Council (NERC) and therefore it is free to attend for candidates based in the UK.

How to register

Registration will open shortly; an online registration form will be available on the Cloud-SPAN website. If you wish, you can also register your interest in the workshop now and we will contact you when registration opens.

Keep in touch!

LinkedIn

Twitter

Cloud-SPAN website

Congratulations to James Chong and the team!

26/04/23 Sarah Dowsland

Professor James Chong selected as an Oracle Research Fellow

Cloud-SPAN PI Prof James Chong has recently been selected as one of three Oracle for Research Fellows globally for the Spring 2023 cohort. This industrially funded fellowship includes a cash award of $100,000 (£81,000) and £132,200 of Oracle Cloud compute credits for a project entitled "Using shotgun metagenomics data to develop a synthetic anaerobic digestion (AD) community".

Although James officially holds the award, the successful application was a team effort driven by Dr Sarah Forrester and Annabel Cansdale, both part of the Centre of Excellence for Anaerobic Digestion (CEAD) and Cloud-SPAN instructors. They were heavily involved in preparing the application and will make use of the cloud compute award throughout the project.

Read more about the Spring 2023 Oracle Fellows Program cohort!

NorthernBUG Meeting

19/04/23 Sarah Dowsland

Hi everyone!

We are pleased to announce that registration is open for the 9th NorthernBUG Meeting! It's a great opportunity to meet people and network, so we hope to see you all there! Please spread the word widely among your networks so we can maximise the reach.

Meeting Details

📍 University of Liverpool
📅 16th June 2023
ℹ️  View more information here
Confirm your attendance (registration deadline 31 May 2023)

NorthernBUG is a really useful and free meeting where lots of people working in bioinformatics get a chance to come together. Importantly, because the meeting is free, early registration is essential so that the appropriate level of catering can be ordered, so if you're interested in going please register by the 31st of May. If you would like to give a poster or talk, the deadline for this is the 30th of April. There will be prizes for posters, so it is well worth submitting one. NorthernBUG is a great event to attend whether you are completely new to bioinformatics or a seasoned bioinformatician.

For further information, please contact Emily.Johnson@liverpool.ac.uk or jamie.soul@liverpool.ac.uk.

Cloud-SPAN Code Retreat

19/04/23 Sarah Dowsland

We are excited to announce that our next Code Retreat will take place from 10:30 to 15:00 on 31st May 2023 at the University of York. It provides a chance for course alumni to come together and work on their own data problems with the support of Cloud-SPAN instructors.

What happens at a Code Retreat?

At previous retreats some people took the opportunity to revisit course materials and ask questions about the topics they didn’t understand. There are plenty of helpers on hand to answer questions and test understanding.

You could also apply the workflows and analyses taught in the Genomics course to your own datasets. Helpers are also on hand to discuss topics such as:

as well as many others. 

Some participants already know what help they will need and have specific questions to address during the day. Those with a less clear understanding of their problem could benefit from talking through their data and getting guidance from our experienced instructors. You could even try out new software tools.

What do previous participants have to say?

"I had a great time at the Cloud-Span retreat. I went from Lancaster just for it, and it was worth it. I got support in analysing the data I had in hand. That was really satisfying applying what we preciously learned (using our data this time). Having the Cloud-Span Team around to support us was essential. It’s been great learning with them, and I would 100% recommend them!"

Join us for our next event!

Our next Code Retreat for Cloud-SPAN Alumni will take place 10:30-15:00 on 31st May 2023 at the University of York. Feel free to stay for the whole event or drop in when you are free. Lunch will be provided and we offer support to cover travel expenses.

Registration deadline - 12pm Wednesday 24th May.

Sign up using the registration form.

Training courses

If you are interested in attending our free training courses>> please visit our website for further information.

In depth: course feedback

02/02/23 Evelyn Greeves

We’ve finished our analysis of the feedback from our courses held at the end of last year: Prenomics, Genomics and Metagenomics. We’re pleased to see that on average participants saw an improvement across all of the learning outcomes for all of our courses - if they didn’t we’d be doing something wrong!

Methodology

We ask all our learners the same questions before and after the course, asking them to rate their level of comfort with, and understanding of, a range of topics. The options range from “I don’t understand what this involves” to “Confident”. For analysis purposes these answers are recoded as numbers with 1 being the least confident and 6 being the most confident. We’re interested in the mean scores across participants and how they differ before and after the course.

The dumbbell charts below indicate the average improvement in score for each criterion. The lighter colour circle represents the score before the course, and the darker one represents the score after the course.

We don’t currently have a way to track individual participants and how their scores change. Everything is done on averages. This gives us a good overall idea of how learners’ understanding might have improved as a result of the course.

Prenomics

Overall the biggest improvements were seen in our Prenomics course, which is what we might expect. Most participants started with little to no knowledge of most of the topics discussed, and we saw their confidence soar over the two-day course. Prenomics is designed to give participants a gentle introduction to the command line and provide a confidence boost for those who might feel intimidated, so we’re really happy to see that this seems to be working.

In particular participants saw a big leap in their confidence using the command line with a mean score increase of almost 3.5 points. On average learners finished the course feeling somewhere between “comfortable most of the time” (5) and “confident” (6), which is excellent news.

Genomics

We also saw big improvements in confidence in Genomics, spread approximately equally across all themes. Most participants started the course with a middling level of confidence using the command line, likely as a result of taking part in Prenomics beforehand. The average increase in score was between 2 and 2.5 points depending on the topic, with the average score after taking the course hovering around 5 - “comfortable most of the time”.

Metagenomics

The Metagenomics course feedback was of particular interest as it was the first time we had run the course, so we were keen to see how people got on. Overall, participants finished the course with an average understanding score between 4 (“Fairly comfortable in some aspects”) and 5 (“Comfortable most of the time”) for all of the topics. Most learners started the course with a higher initial understanding for core knowledge such as command line and sequencing, so the scope for improvement was more limited than in, for example, the Prenomics course. Overall we’re pleased with the increase in understanding but are looking at how we can make things even clearer during the course, to boost confidence even further.

Upcoming courses

We will be running our Metagenomics course again from Tuesday 11 April until Friday 21 April 2023. This course is aimed at environmental scientists looking to use high performance computing for metagenomics analysis, but would be suitable for anyone with an interest in metagenomics. ➡️ Register here! ⬅️

We will also be running a workshop on Statistically Useful Experimental Design on Friday 14 April 2023. This course will be held in person at the University of York campus and is primarily discussion-based. ➡️ Register here! ⬅️

Cloud-SPAN Code Retreat

16/12/22 Sarah Dowsland

We are excited to announce that our next Code Retreat will take place on Monday 9th January 2023 at the University of York. It provides a chance for course alumni to come together and work on their own data problems with the support of Cloud-SPAN instructors.

What happens at a Code Retreat?

At previous retreats some people took the opportunity to revisit course materials and ask questions about the topics they didn’t understand. There are plenty of helpers on hand to answer questions and test understanding.

You could also apply the workflows and analyses taught in the Genomics course to your own datasets. Helpers are also on hand to discuss topics such as:

as well as many others. 

Some participants already know what help they will need and have specific questions to address during the day. Those with a less clear understanding of their problem could benefit from talking through their data and getting guidance from our experienced instructors. You could even try out new software tools.

What do previous participants have to say?

"I had a great time at the Cloud-Span retreat. I went from Lancaster just for it, and it was worth it. I got support in analysing the data I had in hand. That was really satisfying applying what we preciously learned (using our data this time). Having the Cloud-Span Team around to support us was essential. It’s been great learning with them, and I would 100% recommend them!"

Join us for our next event!

Our next Code Retreat for Cloud-SPAN Alumni will take place from 10:30 to 15:30 on Monday 9th January 2023 at the University of York. Feel free to stay for the whole event or drop in when you are free. Lunch will be provided and we offer support to cover travel expenses.

Sign up using the registration form.

Training courses

If you are interested in attending our free training courses please visit our website for further information.

💸Scholarships to attend the Northern Bioinformatics User Group meeting 13th Jan

16/12/22 Sarah Dowsland

We have scholarships to cover expenses to attend the Northern Bioinformatics User Group (Northern BUG) meeting to be held on Friday 13th January 2023 at the University of Huddersfield.

NorthernBUG is a network of bioinformaticians and users of bioinformatics services in the north of England which holds quarterly meetings to build a community of researchers and others using big data in biology. NorthernBUG meetings are open to anyone interested in bioinformatics or its application in life science research and beyond. Meetings are free, and early career researchers are especially encouraged to attend and present their work. It’s a great forum to practise your talks, float new ideas and approaches, and present early work.

We have scholarships to cover travel expenses for five of our Cloud-SPAN 'graduates' to attend NBUG. Register for NBUG and complete a scholarship application with us here by Thursday 5th January 2023.

If you are interested in attending our free training courses please visit our website for further information.

How do we support our learners?

07/11/22 Evelyn Greeves

Learning new skills can be difficult, and it’s easy to come away from a course feeling like you’re even more confused than when you started! Here's what we do at Cloud-SPAN to make sure all our learners complete our courses feeling confident in their newly-learned skills.

Prenomics

Are you completely new to concepts like cloud computing and the command line? Our Prenomics course is designed to give you the best possible start to your command line journey. It requires absolutely zero prior knowledge or experience! We take it slow, starting at the beginning with file paths and directories before building up to logging onto the cloud for the first time. Once you’re logged on, we take you through all the core commands you need to work in the command line effectively.

All of this is done using the same files and environment as the Genomics course, meaning that you’ll feel extra prepared to continue your learning.

Live Coding

We use a technique called live coding to teach all our courses. This means that you code along with us, copying what we type and seeing the output in real time on your own computer. Live coding keeps things interesting and gives you lots of practice at the new skills we’re teaching you. In addition, it means that we can help solve any errors you come across as we go through the course.

Instructors and Helpers

Each of our courses is led by one or two experienced instructors, who will do most of the teaching and live coding. There’ll also be several helpers, who are there to help you with any problems you encounter. No problem is too small! We want you to succeed, and we’ll try our best to get you there.

Code Retreats

After finishing one of our courses, your next thought should be “what next”? It’s likely that you’ll want to apply the techniques you just learned to your own research data. If so, you can get help with this by attending a Code Retreat, where our instructors will be happy to help you work on your own analyses. If you have any questions, we’ll do our best to get them answered. Code Retreats are also a great networking opportunity where you can meet others working on similar projects to your own and share problems, achievements and tips for success.

Drop-ins

Some of our courses have regular online drop-in sessions where you can get help and troubleshoot your problems. How this works varies depending on the course.

Slack Channel

Participating in a Cloud-SPAN course also gives you access to our Slack channel, where you can ask more questions and keep up with the latest news. We also share links to other training and workshops which we think might be useful for our learners. It’s a friendly, supportive environment and the best place to get your questions answered outside of the course itself.

How to Get Involved

If you like the sound of all this, why not sign up for one of our free courses today? Our full catalogue of courses is listed on our website.

Upcoming Courses

Prenomics ➡ 22-23 November 2022 (online workshop or complete via self-study)

Genomics ➡ 6-7 December 2022 (in-person workshop or complete via self-study)

Genomics self-study ➡ registration now open - start at any time


Prenomics: Lowering barriers to participation

18/10/22 Evelyn Greeves

For most of our new learners, the idea of doing things like using the command line and accessing the cloud is completely new and a bit overwhelming. But these are skills which, once learned, can totally transform how you do your analysis. That’s why Cloud-SPAN offers a core ‘Prenomics’ course to help you gain confidence in these key skills and prepare you for the Genomics course.

How Prenomics was born

When we ran our first Genomics course in autumn 2021, we discovered that a major sticking point for many participants was directory structure and navigation. This was making it really difficult to help people log into the cloud, where we could then teach them core command line skills. We spent much more time troubleshooting and going over these basic skills than we’d expected, meaning we had to rush through more complex content later on about analysis tools and pipelines. 

Overall, this meant our learners began their command line journey from a place of frustration (which has a huge impact on confidence) and we were forced to minimise the amount of time spent on the ‘genomics’ part of the Genomics course.

As a result, we split the original Genomics curriculum across two courses: Prenomics and Genomics. In Prenomics, all the focus is on the command line. We teach about files and directory structure, and spend lots of time guiding learners through creating a folder and navigating to it without using the command line. We also teach those crucial core skills - commonly used commands like `ls`, `cd` and `mv`. Later, as learners gain confidence, we move on to more complex commands like `grep` and redirection with `>` and `>>`.
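To give a flavour of the level Prenomics is pitched at, here is a minimal sketch of the kinds of commands covered. The file and directory names are invented for illustration and are not the actual course data.

```bash
# Illustrative only - the file and directory names are made up
cd ~/cs_course                                  # move into a project directory
ls data                                         # list the files inside it
mv data/old_reads.fastq backup/                 # move a file into another directory
grep "@SRR" data/reads.fastq > headers.txt      # redirect matching lines into a new file
grep "@SRR" data/reads_2.fastq >> headers.txt   # append matches from a second file
```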

Why is Prenomics important?

Separating the curriculum into two courses has several benefits. Most importantly, it gives us lots more time to dedicate to core skills and troubleshooting. It also gives us more time to spend on Genomics, as learners start that course already confident about their command line abilities. Plus, those already familiar with the command line can skip Prenomics and go straight to Genomics.

And it works! We know this because we collect evaluation feedback from participants both before and after the course, asking them to evaluate their understanding of various topics. After we ran the newly designed Prenomics/Genomics combo course in Spring 2022, we compared the understanding scores of learners across three general themes of Command Line, Genomics and Project Organisation (a higher number means a higher level of understanding, with no understanding being 1 and most confident understanding being 6).


Scholarships, funding and so much more!

11/10/22 Sarah Dowsland

Here at Cloud-SPAN HQ our official tagline is that we are here to upskill researchers and enable them to perform complex analyses on cloud-based platforms, which we are achieving through the provision of training courses and online modules. However, as the member of the team who works behind the scenes as project manager, I feel that we are offering much more than that.

Welcoming learning environment

Up until now we have delivered courses online and held code retreats in person. The course materials and delivery are excellent thanks to our talented instructors, but I find the real added value is that the learning environment is very welcoming, encouraging and inclusive. A lot of our participants are complete beginners or new to this world of big data, the command line and cloud computing, and learning such new skills can be very daunting. At our courses there aren’t any big egos; there’s just someone there to work through questions and find solutions to your problems.

Community

We can see that these events have created valuable networks between learners. We know that on a day-to-day basis they may be the only person in their lab or team doing this type of work, and that they lack a colleague with whom they can chat through ideas or get advice. We offer a Slack channel, weekly online help sessions and code retreats to help support learners. We are also pleased to be working closely with the SSI, NorthernBUG, EBnet and the University of York Bioinformatics Group, which gives learners wider access to learning resources and opportunities to network.

Equality and diversity

When organising courses and online sessions we strive to accommodate individuals’ specific needs to ensure that all learners have equal access and opportunities to use our learning resources. We recognise that, unfortunately, women, members of the LGBTQ community, people with disabilities and those from ethnic minority or socially disadvantaged groups are still consistently underrepresented in research and HPC. To try and combat this, we offer Equality and Diversity Scholarships to support those from underrepresented groups with costs related to completing the training courses, such as travel and childcare.

Funding

We understand that financial hardship and a lack of available funding are key factors preventing people from accessing and attending training activities, a problem which has been exacerbated by the recent increases in the cost of living. We have therefore recently introduced Hardship Scholarships, which are available to anyone who requires financial assistance to attend Cloud-SPAN courses and events. Scholarships can be used to cover travel, childcare or general costs associated with completing the training.

How to apply for funding

Check your eligibility on the scholarships page on our website. To apply for funding please complete the course registration form and include the relevant information in the scholarship section. All applications received before the Scholarship deadline will be reviewed by the Scholarship Panel. 

Upcoming scholarship deadlines

🔷 Scholarship deadline 13 March Metagenomics with High Performance Computing 11-21 April (Online workshop)

🔷 Scholarship deadline 31 March Statistically useful experimental design 14 April 2023 (University of York)

 

Any questions? Get in touch at cloud-span-project@york.ac.uk or follow us @SpanCloud.


📢 Online workshop on FAIR data hosted by the UK Reproducibility Network

11/10/22 Sarah Dowsland

Just to let you know about an event happening on Wednesday 19th October which might be of interest: the UK Reproducibility Network are running an online workshop on FAIR data in the life sciences which should be really informative and useful. You'll learn about what FAIR data is, why it's important and where you can go for further training on this issue. Plus, our very own Evelyn Greeves will be talking about Cloud-SPAN and how we've made our courses FAIR!

If you're:

👩‍🔬👩🏿‍🔬👨🏽‍🔬 someone who produces and analyses data
📊 a bioinformatician
👩🏻‍💻👨🏾‍💻  a research software engineer
🔢 a statistician
 

then this workshop is for you. You can read more about the workshop here and register to attend here. We hope to see some of you there!

Upcoming Cloud-SPAN Courses 

Prenomics: 22-23 November 2022 (online workshop or complete via self-study)

Genomics: 6-7 December 2022 (in-person workshop or complete via self-study)

Genomics self-study: registration now open - start at any time

Cloud-SPAN awarded NERC funding to expand training courses

05/09/22 Sarah Dowsland

The Cloud-SPAN team has been awarded a grant to expand our training into the Natural Environment Research Council (NERC) remit under the Advanced training for early-career environmental scientists call. We will be developing an accessible online course, "Getting started with High Performance Computing: FAIR training for environmental scientists", using the same cloud-based infrastructure used in our other courses. We will supply AWS services to up to 60 participants undertaking the course either in tutor-led workshops or by self-study.

Those needing funds for childcare or accessibility support to enable their participation in the training will be able to apply for a Scholarship, and up to 30 participants will receive headsets and second monitors to facilitate online engagement. The course will be running in April 2023 - please let your ecology colleagues know to look out for it!

Key info

🔷 Metagenomics with High Performance Computing 11-21 April (Online workshop)

🔷 Scholarship deadline 13 March

🔷 Register for the online workshop

🔷 Register for the self-study mode


Our Solutions to Challenges in Environmental 'Omics

31/08/22 Evelyn Greeves

Here’s what we’re doing at Cloud-SPAN to address the challenges we discussed previously.

Just to recap, our four main challenges are hardware, software, skills and time.

At Cloud-SPAN we have two main weapons in our arsenal against these challenges: cloud computing and training. Let’s look at each of those in detail.

Cloud HPC

We use commercially-available cloud computing resources to address the challenges of hardware, software and time.

Here’s how it works: we teach our courses inside a containerised instance, which is a virtual environment containing pre-loaded software and files. The instance runs on hardware borrowed from a commercial provider (in our case, Amazon Web Services). All our learners have to do is log into the cloud instance we provide, and they will have access to all of the software and data used in the course. The instances all start out identical, so the file structure always looks the same, and they run on enough borrowed resources to make analysis relatively quick and easy.
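For anyone who has never used a remote machine before, the log-in step is typically a single command run from your own terminal. The sketch below uses placeholder names throughout - the key file, user name and instance address are invented, and the real details are supplied as part of the course.

```bash
# Placeholder names - the real key file, user name and address are provided during the course
chmod 400 login-key.pem                              # the key file must be readable only by you
ssh -i login-key.pem csuser@your-instance-address    # open a shell on the cloud instance
```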

Using this kind of pre-loaded instance benefits our learners by removing complications around different HPC setups and the installation of software onto a cluster. It also makes it easier for us to teach - we know that everyone is starting with the same setup, so we can offer our courses to anyone regardless of institution. It allows us to model best practices for directory structure and project organisation, and save time when troubleshooting issues.

Another major advantage of running analyses on cloud resources is that we don’t have to wait in a queue for resources to be available, as we’ve already loaned out the resources we need. This can save a lot of time for large analyses, and is especially useful for teaching.

After completing one of our courses, learners can go away and apply their new skills to their own data by setting up their own cloud instance, which is identical to the one used in the course. It contains all the same software and basic file structures, so provides a familiar environment for further learning and analysis. Using these kinds of services outside of the course does incur a cost, but we also support learners in applying for research credits.

Training

Our other main solution is providing high quality, free-of-charge training courses with a low barrier to entry and no assumptions of prior knowledge. These courses allow us to address challenges around skills and the skills gap.

All of our courses are underpinned by a core ‘Prenomics’ module, which provides a supportive and carefully paced introduction to navigating directories and using the UNIX command line. Accessing the cloud HPC services described above is impossible without these skills. Using our purpose-built environment we guide new learners through the basics with a strong focus on contextualising new skills within the field of environmental ‘omics.

We also provide core modules in using the programming language R, again with a focus on analysis and visualisation of ‘omics data, and on creating cloud instances.

Our specialised courses follow on from Prenomics and cover topics such as genomics, metagenomics, automating analyses and designing statistically useful experiments. Again, we keep the barrier to participation low by contextualising learning and keeping new ideas to the bare minimum - no frills! 

Summary

In summary, the Cloud-SPAN project combines high-quality training with cloud computing expertise to address four major challenges in the field of environmental 'omics. Using our unique containerised instances we can provide adequate resources for efficient analysis with software pre-installed, saving time and effort. Our courses are designed to equip learners with the skills they need to perform analyses on these cloud instances and beyond, with the overall aim of lightening the learning load on researchers and helping them improve analysis times and efficiency.

Our containerised instances solve the challenges of hardware, software and time.

Our training approach solves the challenge of skills.

Find out for yourself how our solutions to these challenges could benefit you. Read more about the courses we offer or take a look at our introductory 'Prenomics' course materials or specialised Genomics course to see how our training can equip you better!

💸Scholarships to attend the Northern Bioinformatics User Group meeting 9th September

18/07/22 Emma Rand


Hello Cloud-SPAN community!

We have scholarships to cover expenses to attend the Northern Bioinformatics User Group (Northern BUG) meeting to be held on September 9th 2022 at the University of Bradford.

NorthernBUG is a network of bioinformaticians and users of bioinformatics services in the north of England which holds quarterly meetings to build a community of researchers and others using big data in biology. NorthernBUG meetings are open to anyone interested in bioinformatics or its application in life science research and beyond. Meetings are free, and early career researchers are especially encouraged to attend and present their work. It’s a great forum to practise your talks, float new ideas and approaches, and present early work.

One of my project students, Chloe Brook, presented her work there.

She is now at the Edinburgh Parallel Computing Centre.

We have scholarships to cover travel expenses for five of our Cloud-SPAN 'graduates' to attend NBUG. Register for NBUG here and complete a scholarship application with us here by Monday 22nd August.

Challenges in Environmental 'Omics

13/07/22 Evelyn Greeves

Hardware

The size and nature of ‘omics data means it is often necessary to employ high performance computing (HPC) resources for analysis. This presents an inherent challenge as use of such resources requires a specific skill set that not all researchers will have (see ‘skills’ below for more details).

A secondary challenge relating to hardware is the rapidly changing HPC landscape. Between institutions HPC architectures can vary wildly. Although the basic skills needed to access them remain the same, the setup and execution of jobs may look quite different. Even within institutions, HPC setups mature and are replaced regularly as new technologies develop and the demand for resources grows. For example, the Biology department at the University of York has had access to three different setups (c2d2, YARCC and Viking) in the last nine years, with a new iteration (Viking2) currently in the works. This frequent turnover requires users to continually adjust and adapt their workflows to the new system.

Software

There are several issues surrounding the software involved in the analysis of environmental 'omics datasets. Firstly, software tends to have a steep learning curve, requiring a substantial time investment from researchers. This investment will not necessarily pay off if the end result is not what is required.

Secondly, even if a piece of software does do what is needed, it is not guaranteed that it will be usable on the HPC architecture available. Installation of software is not always straightforward, if it is allowed in the first place. The rapid turnover and replacement of HPC architectures only serves to compound this problem, and the heterogeneity of HPC setups between institutions makes it difficult to find bespoke instructions for software installation.

The final, broader issue is around access to learning resources and tutorials. Some popular, non-field-specific tools such as R or Python have countless online tutorials and instructions dedicated to their use, aimed at all different levels of understanding. Others, especially more niche software programs, have very few resources. Those that do exist may be out of date, or assume a level of knowledge beyond that of most novices (for example, many documentation pages are entirely inaccessible to a newcomer). As new software emerges and supersedes previously popular programs, the lack of help available only worsens.

Skills

As previously mentioned, environmental 'omics analysis has a steep learning curve. A major challenge for many new researchers is grappling with previously unencountered skills such as using the UNIX command line, navigating file systems, writing shell scripts, managing dependencies and specifying resources for HPC. This is all before any specific pieces of software are involved, each of which will require its own set of skills and understanding.

These skills are required on top of the experimental design and data collection skills needed to generate datasets in the first place. Often those collecting data are the ones best placed to know how to interrogate it, as all experiments are different and bespoke analysis is crucial. This requires researchers to learn and juggle a large collection of skills, not all of which are immediately relevant to their chosen area of study.

Time

Finally, there are time investments involved in all of the above challenges. There is the ‘brain time’ involved in learning new skills, problem-solving and working with new software. Then, once an analysis is ready to run, it will take time to run. HPC resources are usually shared across many users, with jobs being added to a queue to run when resources are available - analyses requiring large amounts of compute may be queued for days or weeks waiting for the required resources to become available. In addition, some analyses simply take a long time to run given the size of the datasets involved and the complexity of the analysis.

Once analysis is completed time must be invested in interpreting and visualising the results. If parameters need to be adjusted following this, then the whole process must begin again. This makes optimisation of analysis difficult and time-consuming to the point that it may not even happen at all.

At Cloud-SPAN our goal is to help you overcome these challenges. Read more about the courses we offer or take a look at our introductory 'Prenomics' course materials or specialised Genomics course to see how our training can equip you better!

Making the Prenomics summary poster

06/07/22


I recently designed a poster that acts as a “cheat sheet” for Cloud-SPAN’s Prenomics course. Here I’ll give a quick overview of my thought process while making the poster and how I chose what to include.

Command cheatsheet

I started with the command cheatsheet, which is a bit back-to-front given that the command line comes chronologically last in the course. However, it seemed like an easy entry point. I had previously produced a glossary for Prenomics, so I took this and organised the commands into logical groupings based on their function. 

I ended up with four groups: navigating files, viewing files, editing files and searching files. There were two commands which didn’t fit into these groups: `history` and `man`. I tried to find a way to include them, but ultimately left them out as I decided they were not crucial knowledge.

I tried to format the commands in a way that made it clear how to use them while still keeping it general and easily readable. I used colour to clarify which parts of the command corresponded to which parts of the explanation. For example:

Example from poster showing the command `mv file directory` and the explanation 'move file to directory'.

In this case ‘file’ (the file to be moved) is pale blue while ‘directory’ (the location to which the file should be moved) is dark blue. I also used underlining to indicate where the command came from - in this case, mv comes from the m and the v of ‘move’. My aim here was to aid recall of the command in future, as hopefully it will help the reader make a stronger mental connection between the command mv and its function (‘move a file’).

Files, paths and file types

The command line makes up a significant portion of the Prenomics course, but it could be summarised in a relatively small amount of space. To work out what else should be included, I looked at the course more holistically. The first lesson of the course is about files and directories, so it was clear to me that there needed to be a section about this. This lesson also covers the file types used in the course, including .FASTQ and .PEM files which are likely to be new to most learners, so I wanted to include this too.

I found it difficult to summarise the information about file paths and directories into a mostly graphical format. In particular, I found that I couldn’t rely on giving examples as much as we do in the course itself, due to the need to reduce text as much as possible. I ended up just including definitions of absolute and relative paths, along with a diagram of the file system inside the Cloud-SPAN AWS instance.

If I’d had space, I would have included more information on working directories and examples of how the file system diagram can be represented with file paths.

Why use…?

The next sections I designed were the ‘Why learn command line?’ and ‘Why use the cloud?’ sections.

The question of why the command line is useful is covered in episode three of the first day of Prenomics, as part of the introduction to the shell. I distilled the reasons given here down into four main themes: automation of repetitive tasks, reducing human error (as a result of automation), improving reproducibility and the ability to access new tools (either because the command line offers more functionality or because it opens up use of high performance computing systems (HPC) such as the cloud).

Extract from poster section 'Why learn command line?'. Icons and text give four reasons: improve reproducibility, reduce human error, access new tools, and automate repetitive tasks.

The other question, why cloud computing is useful, is not technically covered in the Prenomics material. However, it is discussed in our Genomics course, where the three reasons given for using HPC are a lack of resources needed to run analyses, analyses taking a long time to run, and problems installing software. These reasons align closely with three of the challenges identified by Cloud-SPAN project lead James Chong as facing the field of metagenomics (hardware, time and software).

However, these reasons could apply to any kind of HPC resource, not just the cloud, and I wanted to include reasons specific to the cloud. The two major reasons I came across were the ability to share software or data containers across different institutions, and use of the cloud when other HPC is inaccessible.

In an earlier version of the poster, the ‘why use cloud?’ section was framed in terms of challenges that users might face, such as long analysis times or issues installing software. I reworded this section to match the framing of the ‘why learn command line?’ section; that is, a solutions-oriented summary, with ‘shorten analysis times’ replacing ‘long analysis times’ and ‘use pre-installed software’ replacing ‘issues installing software’.

Extract from poster section 'Why use the cloud?'. Icons and text give five reasons: access more hardware resources, use pre-installed software, shorten analysis time, share software or data containers, and overcome barriers to accessing high performance computing.

File types

Lastly I wanted to include a brief summary of the file types used in the Prenomics course, as it is likely that two out of three of these will be new to learners. This part was quite easy - I just wrote a short sentence to describe each file type and paired it with the relevant file extension.

In the course a significant amount of time is dedicated to introducing the .fastq file structure, which codes sequencing data into a text format with four lines per read. I considered including this information but I didn’t have room.
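For readers who have not met the format before, a hypothetical record illustrates the idea - the identifier, bases and quality characters below are made up, and `head` is just a quick way to peek at the first read in a file.

```bash
head -n 4 sample.fastq   # show the first read of a (made-up) FASTQ file
# @SRR000001.1                  <- line 1: read identifier, starting with @
# GATTTGGGGTTCAAAGCAGTATCGATC   <- line 2: the base calls
# +                             <- line 3: separator (may repeat the identifier)
# IIIIIHHHHGGGFFFEEDDCCBBBAA!   <- line 4: one quality character per base
```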

The big reveal...

And finally, here's the full poster! You can download a high-res version of the poster for your own enjoyment here.

An image of the finished poster with sections: why learn command line?, why use the cloud?, file types, files and paths, and command cheatsheet

FAIR at Cloud-SPAN

29/06/22 Evelyn Greeves


Previously we shared some information about FAIR data, and explained why it’s important to make sure your data is as reusable as possible. 

The FAIR principles aren't just for data. Our aim is to apply the principles to all our training resources to ensure that they can be reused and remixed by others for their own teaching purposes. Here's a look at how we're doing it:

Findable

Remember, findability is about making it easy to find your data or resource. We’ve added metadata to our resources, which enables us to register our courses with TeSS, a life sciences training repository. The metadata means people can search and filter to find our course based on what they need. You can see the metadata for our Prenomics course at the top of the source page here.

In addition, we have also registered our training resources on Zenodo, another repository which assigns a DOI to each stored item. This persistent identifier will give our resources a permanent home, even after other links become deprecated.

Accessible

To be accessible, a resource needs to be easy to retrieve without any special tools, and it should be clear how to do so. We’ve made this really easy for ourselves by hosting our courses online for free on a dedicated set of webpages via GitHub Pages.

Interoperable

Interoperability means ensuring that computers can understand and open a resource. We do this by providing data for analysis in de facto file standards such as FASTQ and using Markdown (a widely-used and platform-independent text formatting language for writing resources) for course material.

We also help computers to understand how our resources fit into a bigger picture by using an ‘ontology’ to describe the topics of our courses. This forms part of the metadata and helps people to filter and understand what our resources are about. For example, we use the EDAM ontology of bioscientific data analysis and data management. We've labelled our Prenomics course as falling under topics 3372 (software engineering) and 0622 (genomics).

Reusable

All of the things just described help promote reusability. In particular, we promote reusability by tagging our resources with rich metadata - we use the Bioschemas Training Material protocol which suggests a list of metadata properties for biosciences training materials.

We also help people reuse our materials by applying a Creative Commons Attribution (CC-BY) licence, which means anyone can distribute, remix, adapt or build on our work as long as they credit us. We include details of this licence in our metadata, in our GitHub repositories and at the bottom of our course pages so it’s clear to everyone what the rules are.

Over to you!

What steps are you taking to make your data and other digital resources as FAIR as possible? There are some great resources available online to help you if you're not sure where to start - try howtofair.dk or the FAIR Cookbook for helpful articles, videos and step-by-step guides!


Bioinformatics Meeting on ‘Career pathways into bioinformatics’

22/06/22 Sarah Dowsland


Join the University of York's Bioinformatics Meeting on ‘Career pathways into bioinformatics’.

Hosted by Sarah Forrester, it will be a jam-packed hour and a half.

Evelyn Greeves from the Cloud-SPAN team will be leading a short session on "Introduction to FAIR and metadata" with an opportunity for you to ask any questions that you may have.  

Following this, have you pondered what direction to take your career in, or how to use your data skills in future projects? We have three speakers who will explain how data analysis has been incorporated into their work.

🔸 Emma Rand highlights the different paths into academia and using big data skills. 

🔸 James Chong explains how learning bioinformatics was the only way to get past the bottleneck of being able to analyse data he was generating.

🔸 Sarah Forrester explores how bioinformatics opens doors to moving between different research niches. 

We will then have a discussion with all speakers for the remainder of the session, which will include signposting resources. 

Session slides will be available following the event. 

Event details: Wednesday 6th July, 15:00-16:30, in room B/T/019, University of York.

Contact us on cloud-span-project@york.ac.uk to be added to the Bioinformatics regular mailing list!


Cloud-SPAN April Code Retreat

14/06/22 Evelyn Greeves


In April we held our first code retreat event: a chance for course alumni to come together and work on their own data problems with the support of Cloud-SPAN instructors. The event took place at the University of York, with several participants travelling from other institutions for the day.

Some people took the opportunity to revisit course materials and ask questions about the topics they didn’t understand. There were plenty of helpers on hand to answer questions and test understanding.

Others chose to apply the workflows and analyses taught in the Genomics course to their own datasets. Again, helpers were on hand to discuss topics such as:

as well as many others. 

Some participants knew what help they needed and had specific questions to address during the day. Those with a less clear understanding of their problem benefited from talking through their data and getting guidance from our experienced instructors. Some even tried out new software tools not discussed during the course.

Finally, some participants decided to trial our new self-study course on creating your own Amazon Web Services cloud instance. They were able to ask questions about the content and provided valuable feedback on which parts of the course needed improvement.

Everyone enjoyed the chance to meet new people and find out about each others’ research. It was a great chance to network and build some community amongst course alumni.

Our next code retreat for Cloud-SPAN course alumni will be on July 6th at the University of York. We cover travel expenses and lunch is provided. We’re looking forward to meeting more of our community members and providing valuable one-to-one support!

Contact us at cloud-span-project@york.ac.uk to sign up.

Image: Instructors and course alumni at April's code retreat.

More online computational training for life scientists!

08/06/22 Emma Rand


The Cloud-SPAN team would like to let you know about more opportunities for online training! Ed-DaSH, from the University of Edinburgh, is a Data Science training programme for Health and Biosciences funded under the same scheme as Cloud-SPAN (the UKRI Innovation Scholars award). Like Cloud-SPAN, Ed-DaSH is partnered with the Software Sustainability Institute and you will find similarities in our approach to teaching computational topics to life scientists. Their upcoming workshops are:

You can register via the University of Edinburgh ePay system - the courses are free but have a refundable deposit. Contact them at ed-dash@ed.ac.uk or on Twitter @EdDaSH_Training with any questions.

What is FAIR data?

31/05/22 Evelyn Greeves


At Cloud-SPAN we care deeply about making science as open as possible. A lot of this comes down to project management and data organisation - which we teach as part of our Genomics course. Today we want to introduce you to the FAIR data principles, which are a framework for thinking about how to ensure the scientific community gets the most out of the data we produce. In this case, this means making it easier for people to find and reuse our hard-earned data!

The FAIR framework aims to encourage data reuse by both humans and computers by improving the findability, accessibility, interoperability and reusability of data and other resources. So what are the steps involved in FAIR-ifying data?

F is for Findable

Before data can be reused, we need to make sure it can be found. One way to do this is by tagging it with metadata (information about the data), such as what type of data it is, who collected it, the conditions used, and so on. This allows it to be indexed in a searchable registry so more people will see it. Metadata is important both for helping people find your data and for understanding the context in which it was generated.

Another key way to make sure data is findable is by assigning it a persistent identifier. This is a long-lasting digital reference which ensures a resource can always be found, no matter where it’s stored. DOIs (digital object identifiers) are one type of persistent identifier that you have probably heard of before - they can be applied to things like journal articles, data sets and other publications.

A is for Accessible

Once we’ve made sure people can find our data, we need to make sure they can access it if they have permission. This means making it retrievable using some kind of standardised protocol, without any need for specialised or proprietary tools. We also need to tell people how they can get access, so we should include this as one of our metadata fields.

A common misconception is that all FAIR data is ‘open’ or ‘free’. Heavily protected or private data can still be FAIR as long as it is clear under which conditions the data is accessible.

I is for Interoperable

So now we’ve made it possible for someone to find and access our data. How can we make sure they can actually use it? There are two aspects to interoperability. The first is using standardised and open formats so that data can be exchanged and used across multiple different applications and systems. This means avoiding proprietary formats and conforming with field-specific standards about what format data should come in.

The second relates to how computers understand our data in comparison with other data. This is possible using a ‘controlled vocabulary’ or ‘ontology’ which ensures that everyone is using the same words for the same thing. Again, we should try and conform with field-specific standards around ontologies.

R is for Reusable

This final principle emphasises the idea that by following the previous three principles (findable, accessible, interoperable) we should be aiming to make our data as reusable as possible. This means using accurate and richly described metadata that gives a full overview of our experimental process and data analysis workflow.

We should also make it clear what rights the discoverer has when reusing our data. This is achieved by applying a licence, and clearly specifying this in the metadata. For example, a Creative Commons Attribution 4.0 International licence (or CC-BY for short) lets anyone reuse, remix and adapt material as long as credit is given to the original creator. 

Summary

The FAIR framework guides us through ensuring that our data is easy to find, easy to understand and easy to reuse. This ensures that our data is used to the fullest extent possible.

The FAIR principles apply to digital objects beyond just data. At Cloud-SPAN we are working hard to make sure our learning resources are as FAIR as possible. Find out what we’re doing to achieve this by visiting our handbook, or look out for our next blog post!

Further reading:

Genomics Self-Study now LIVE! 🧬

22/05/22 Evelyn Greeves

We are pleased to announce that the registration for Genomics self-study is now live! 

Here at Cloud-SPAN we recognise that all learners are individuals with specific needs and different levels of ability. This is why we have developed a Genomics ‘self-study’ option, where you can create your own schedule and learn at your own pace. These modules teach data management and analytical skills for genomic research. 

Here's how to get involved:

Step 1 - Register! 📝

The educational materials are available free of charge, however we ask that you complete the registration form so we are able to send you updates and useful information. 

Step 2 - Create your own instance ☁️

Start with the ‘Create your own instance’ module, where you receive a step-by-step guide to creating your own Amazon Web Services instance. You will go on to use this instance to complete the subsequent Prenomics and Genomics modules.

Step 3 - Prenomics 💻

If you are new to the realm of navigating file systems and using the command line we recommend that you complete the Prenomics module. We have designed this module to allow more time for those with less experience to cover some foundation concepts. If you aren’t sure how to gauge your skills take the self-assessment quiz to help you decide. 

Step 4 - Genomics🧬

The Genomics module allows you to move on to the more fun stuff as you develop your skills in managing data. You will tackle tasks such as assessing read quality, trimming and filtering, and variant calling.

Step 5 - Community 🤸‍♂️

After completing the modules we hope that you will be able to attend one of our regular code retreats, where our course instructors will be on hand to help solve any issues you encounter while applying your new skills to your own datasets. We also strongly encourage you to take advantage of our welcoming Cloud-SPAN community, so don’t be afraid to lean on your peers for help or discussions on our forum.

Need support?

Join the Cloud-SPAN Slack workspace (you will receive a link upon registering), post on the forum or follow us on social media. All three are good options for you to continue on your mission to master the art of genomics! 

View the website for further information or drop us an email at cloud-span-project@york.ac.uk if you have any questions.


Course Update

14/12/21 Evelyn Greeves


Well, it's been a couple of weeks since we ran our first course and we've certainly learned a lot from it! Most of the problems we stumbled into were minor teething issues, but we're also going to be making some fairly major changes based on what we learned. Here's a bit of an update:

First things first... positive feedback!

We had some amazing feedback from our first run-through! We asked our participants to rate their level of comfort with a number of topics before and after completing the course.

As you can see in the graph below, on average our participants felt that their level of comfort had improved after taking the course for all topics. This is great news!

In particular, participants ended the course feeling really comfortable with using the command line to navigate file directories, create and modify files and search for keywords.

The BIG Problem: Time flies

The biggest problem we ran into was time. We just didn't have enough of it...

...and that means we need to seriously rethink how we structure the content of our course. You can see from the graph above that the topics people felt they had improved their confidence the least in were those relating to genomics - assessing read quality, trimming and filtering, and variant calling.

That's because before we could teach those topics, we had to make sure people felt really comfortable using the command line and organising their files. Unfortunately, as we got more and more behind schedule, this meant the genomics topics got less and less time allocated to them.

This is a problem, because the title of the course was Foundational Genomics. If participants are finishing our course feeling like they are not comfortable with variant calling (which is what a score of 3 means) then we need to make some changes. Plus, the genomics bit is meant to be the fun bit!

The written feedback we had from participants after the course, while mostly positive, generally reflected this sentiment too.

The BIG Solution: Course structure

In light of this feedback, we've decided to split our 4 x half-day Foundational Genomics course into two shorter courses: one 2 x half-day "Prenomics" course and one 4 x half-day "Genomics" course.

The Prenomics course will be aimed at complete beginners and will be a gentle introduction to file directories, working paths and basic command line commands. By giving ourselves two half-days to cover this content we should be able to ensure all participants are fully on board before we move onto anything more complex.

This course will also be optional for those who already have some basic experience using the command line. We plan to screen potential attendees using a self-test to help them determine whether they would benefit from attending the Prenomics course, or skipping straight to Genomics.

The Genomics course will follow directly on from Prenomics, using many of the same commands and building on them further. We should have much more time to cover essential topics such as read quality and variant calling which were rushed last time, allowing us to offer a much better educational experience.

Some smaller problems which this change also solves...

Other updates in the pipeline

These two new courses, Prenomics and Genomics, will make up the 'Foundational' element of our content. We still have plans to develop 'Advanced' modules on topics such as automation and setting up cloud instances. We asked our participants to let us know what they'd like to see, and we'll be taking these suggestions into account as we go.

We also have plans to run some (hopefully) in-person 'hack days' for alumni of our Foundational courses, where we can help participants apply the skills learned in the course to their own problems and assist with troubleshooting. We hope this will help develop our community of practice further by providing opportunities for networking and building relationships.

And that's it. Thanks for reading this far - we hope it explains some of our rationale for why the course is structured the way it is. We're really proud of our achievements with our first ever course, and we're looking forward to trying some new stuff out next time.

Till next time! 👋