In this session you are going to use a variety of bioinformatic databases and applications to identify a disease from a DNA sequence and characterise a protein encoded by that sequence. Today you will gain an introduction to the contents and layout of the major web-accessible databases holding information on genes, proteins and enzymes. You will navigate between these databases to discover the source, identity and attributes of a disease given its DNA sequence. All the resources are available free online.
The session should take you about 1 hour but you can stop and come back to it later if you need to.
You will use this session to learn many of the steps involved in bioinformatics and the different kinds of data that databases record about genes and proteins.
To start, watch this video for an overview of what bioinformatics is.
Below you will find some instructions and two documents you will need to download to complete the activity. The activity has 5 steps, so begin by downloading the documents below and then scroll down for the steps.
Document containing the DNA sequence
Activity worksheet for your answers
Activity worksheet for your answers
Accessing the worksheet
Open the Worksheet above by clicking the top-right icon to 'Pop out' the document.
When it's open you may need to download these files by clicking on the top-right icon with a half box and arrow pointing down.
Use another electronic document or a pen and paper to complete the worksheet by following the instructions.
If you have any questions about accessing the documents, please email us at discover@brookes.ac.uk.
Document containing the DNA sequence
Downloading the document
Click here to open the folder where the document above is saved.
Once you're in the folder, you will need to download this file as a .txt by right clicking and selecting 'Download'.
Save this file somewhere easy to find as you will need to upload it in a later step.
If you have any questions about accessing the documents, please email us at discover@brookes.ac.uk.
Now you've downloaded the documents you can start the activity! You will be using the steps below to navigate between the databases and identify the source, identity and attributes of a disease given its DNA sequence.
Follow the instructions below to explore your DNA sequence and the proteins it encodes. Complete the worksheet you have just downloaded as you go along and submit it via the form at the end.
Note about these instructions
The web pages change frequently, so search the pages for the most similar options to those specified!
The way data is recorded in the databases differs between genes and proteins and changes over time, so go with the nearest answer in the MCQs.
Workshop Tips:
You might like to save pages and files for future reference, and to have a Word document open so that you can paste in useful results from your searches.
To make the instructions easier to follow, the names of software and important features are highlighted in bold.
1. The first step is to open the DNA sequence file you downloaded earlier. When you open it on your device, it should open in Notepad by default (do not open it in Word, as this will add problematic spaces and tabs).
2. Your DNA sequence is in what bioinformaticians call ‘FASTA format’: a title line starting with ‘>’, followed by the sequence. Select all of the sequence without the title line and copy it to the clipboard (right click and choose copy, or press the CONTROL and C keys).
Example DNA sequence, known as the FASTA format.
3. You now need to access an application that will search the nucleotide databases for known DNA sequences that closely match yours.
You will be using an application from the National Center for Biotechnology Information (NCBI). We will use nucleotide BLAST (Blastn) from NCBI. Click on this link and then click ‘Nucleotide blast’.
4. Click on the ‘Enter Query Sequence’ box at the top and paste your sequence from the Notepad file you downloaded (right click and paste, or CONTROL and V).
In the ‘Choose Search Set’ section, under ‘Database’, select ‘nucleotide collection nr/nt’, which is a combined database of all ‘non-redundant’ sequences.
As we want to search all species, do not type anything under ‘Organism’.
In the ‘Program Selection’ box, select the option for searching for highly similar sequences.
Below the BLAST button, open the ‘Algorithm parameters’. Under General Parameters, set ‘Max target sequences’ to 5000. We do this because we want to view the full number of matching sequences.
Click ‘show results in a new window’ on the right of the big BLAST button.
Press ‘BLAST’ to start the search. It may take up to a few minutes for the results to load.
When the process is complete you should have a page containing the BLAST output. Scroll down to the ‘Descriptions’ table. You will have multiple hits for your sequence.
Hint: we think you will have heard of this one!
5. You’ll notice that many of these results are variants of the virus, collected from different patients. There are also some matches with synthetic clones of the virus, which scientists have made and published to aid Coronavirus research. Record these initial results in the table of the results.
6. ‘Percentage identity’ tells you how similar your sequence is to published viral sequencing data. Your first hit (or more) will have a 100% identity with your sequence. Scroll down your list until you find a match with a 99.98% percentage identity. Click on the full name link for this entry and it will show you how your DNA sequence aligns with the virus sequence isolated from a patient.
7. You will see a similar box to the one above on top of your aligned sequence. The ‘Identities’ value shows how many positions match out of the total number aligned; the difference between the two numbers is how many mismatches (mutations) there are between your sequence and the patient virus sequence.
Scrolling through the sequence alignment, can you spot the mutation? Which letter (base) has changed, and what type of mutation is this known as?
Hint: Look out for no joining line between the sequences at the mutation. They can be hard to spot, so move on if you can’t find it!
8. Go back to your full list of identified hits. Above your list of hits, click on the ‘Taxonomy’ tab. This will give you a lineage of related viruses which have caused previous pandemics.
9. Go back to the ‘Descriptions’ tab with your list of hits and go to the first hit by clicking on the blue coloured ‘Accession number’ or ‘sequence ID’ on the right (good to right click these and open in another tab so you can come back to other pages). This will take you to the detailed information on that nucleotide. Have a quick look around, especially at the features section. Note these features on your worksheet.
10. Under the ‘Related information’ menu on the right hand side of the page, right click on ‘Protein’ and open in a new tab: this will take you to a page very similar to the nucleotide page but detailing the information of 10 different proteins which your DNA sequence codes for.
11. Scroll down through the proteins and select the surface glycoprotein. Note its name and amino acid length on your worksheet.
12. Click on FASTA at the top left of the page, copy the protein sequence to Notepad and save it for later. You do not need to copy the >title line. This is the protein polypeptide sequence (amino acids rather than nucleotide bases). A FASTA file is plain text: a title line followed by the raw sequence, without any text formatting.
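If you would like to see what happens behind steps 2, 6 and 7, here is a minimal Python sketch. It separates the title line from the sequence in FASTA text, then compares two aligned sequences position by position to report the identities and any mismatches. The function names `read_fasta` and `compare_aligned` and the toy sequences are made up for illustration; this is not how BLAST itself works, which also has to find the best alignment first.

```python
def read_fasta(text):
    """Return (title, sequence) from FASTA text holding a single record."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' title line")
    title = lines[0][1:]                   # drop the leading '>'
    sequence = "".join(lines[1:]).upper()  # join the wrapped sequence lines
    return title, sequence


def compare_aligned(query, subject):
    """Compare two equal-length, already aligned sequences."""
    if not query or len(query) != len(subject):
        raise ValueError("sequences must be non-empty and the same length")
    # Record every position (counting from 1) where the letters differ.
    mismatches = [
        (pos, q, s)
        for pos, (q, s) in enumerate(zip(query, subject), start=1)
        if q != s
    ]
    identities = len(query) - len(mismatches)
    percent = 100.0 * identities / len(query)
    return identities, percent, mismatches


# Made-up toy record, NOT the workshop sequence.
example = """>toy_sequence illustrative only
ATGTTTGTTT
TTCTTGTTTT
"""
title, query = read_fasta(example)
subject = "ATGTTTGTTTTTCTTGTCTT"  # same toy sequence with one base changed

identities, percent, mismatches = compare_aligned(query, subject)
print(title)
print(f"Identities = {identities}/{len(query)} ({percent:.0f}%)")
for pos, q, s in mismatches:
    print(f"Position {pos}: {q} in query, {s} in subject")
```

In the toy example, 19 of the 20 positions match, so the identity is 95% and the single mismatch is reported at position 18. A single changed letter like this is what you are looking for in the alignment in step 7.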
We are now in a position to investigate the structure and properties of the protein coded by this gene and determine more about its structure and function. The ExPASy site in Switzerland is a very useful site, which gives us access to the UniProtKB (formerly SwissProt) database and many analytical programmes.
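To see how a DNA sequence relates to the protein sequence you have just saved, the sketch below translates DNA into amino acids three bases (one codon) at a time using the standard genetic code. The function name `translate` and the short input sequence are made up for illustration, and the sketch assumes the sequence starts at the first base of a coding region; real genomes need the correct start position and reading frame, which the database annotation provides for you.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence into one-letter amino acid codes."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unrecognised codon
        if aa == "*":                            # a stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


# Made-up toy sequence: four codons followed by a stop codon (TAA).
print(translate("ATGTTTGTTTTTTAA"))  # prints MFVF
```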
1. Go to the ExPASy page. Click on ‘proteomics’ on the left menu to open up the databases and software available to us.
2. Click on ‘UniProtKB’ under ‘Databases’ (should be the first one)
3. When the UniProt page has loaded, click on ‘BLAST’ in the top left hand corner. Enter your copied PROTEIN sequence into the box. You do not need to change any of the settings. Select ‘Run BLAST’.
4. As when we ran the DNA sequence, this protein search will bring up a number of hits which are similar to your sequence. The information for the 100% match at the top of your list is not yet complete, so we are going to study the most closely related protein. Click the blue entry number on the second hit on the list.
5. This will take you through to the annotated information on that protein. This will show you all kinds of information about the protein and would be one of your first places to come if you wanted to investigate a protein. Note your protein's UniProt number and function on your worksheet.
We can obtain some useful information about the likely conformation of the protein if we know the extent of secondary structure such as α-helix and β-sheet in the sequence. Knowing the structure of proteins is what allows us to design drugs to target their binding sites. There are a number of applications, some of them listed on the Proteomics page, which attempt to predict the location of secondary structure. Today we will just continue to use UniProt.
Scroll down your protein details on UniProt to the structure section. You will see a number of PDB entries in a blue box on the right hand side, with a protein structure on the left. Click through the different entries and take a look at their protein structures. You can drag the protein structure around to view it at different angles.
Beneath the protein structure there is a linear map depicting the α-helices, β-strands and turns of the secondary structure. If you click ‘show more details’ it will give you a list of these different positions in the protein. Note how many α-helices, β-strands and turns there are in your protein, and draw your favourite secondary structure picture (roughly!).
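If the list of positions is long, counting by hand gets tedious. As a rough sketch, the Python below tallies how many elements of each type there are and how many residues they cover. The feature list here is entirely made up for illustration (it is not taken from any UniProt entry), and it assumes each feature is written as a type with inclusive start and end positions.

```python
from collections import Counter

# Made-up feature list for illustration: (type, start, end) in residue numbers.
features = [
    ("Beta strand", 27, 31),
    ("Helix", 40, 52),
    ("Turn", 53, 55),
    ("Beta strand", 60, 66),
    ("Helix", 80, 95),
]

# How many separate elements of each type there are.
counts = Counter(kind for kind, _, _ in features)

# How many residues each type covers in total.
residues = Counter()
for kind, start, end in features:
    residues[kind] += end - start + 1  # start and end are both included

for kind in sorted(counts):
    print(f"{kind}: {counts[kind]} element(s), {residues[kind]} residues")
```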
So far we have looked at the primary structure (peptide sequence) and secondary structure (α-helix and β-strand) of our chosen protein. Lastly, the bonds which form between different parts of the polypeptide chain go on to create the overall shape of a protein, its tertiary structure. Within tertiary structures, protein domains are conserved, functional regions that tell us a lot about the function of a protein. We can compare the sequence to other well-known proteins to find domains and hence infer what the protein might do.
1. On the UniProt page, click ‘Family & domains’ in the blue table on the left hand side of the page.
2. Scroll down to Family and domain databases. Here you will see many different links to software. Select ‘View protein in Pfam’.
3. This will load the Pfam entry for your protein. Click on one of the domains shown. Click through the ‘Wiki article, Pfam and Interpro’ tabs and see what info you can find. If stated, how many domains does your protein have?
4. Use the blue tabs on the left hand side to explore your protein domains. In particular, look at the ‘Species’ tab. You can view the different species in a ‘Sunburst’ or ‘Tree’ format, whichever makes more sense to you. Note down the different genera and species that these protein domains are conserved in. Can you see why it is so difficult to identify the original host of the virus?
We have now used a couple of different programmes to look at various aspects of your protein. Our final programme - RCSB Protein Data Bank - also shows the 3D structure of your protein and protein domains.
1. Open the link and search with your UniProt number from the initial sequence.
2. Have a look through the different protein domains and their structures.
3. Do any of your protein domains have unique ligands? We can use this information to design ligands which mimic these natural ones for treatment purposes.
4. You can view the alignment of the protein peptide sequence with the α-helix and β-strands by clicking the ‘sequence’ tab along the top once you have selected a domain.
If you complete this practical for one of the virus proteins, go back to the original DNA sequence and select another protein to research! Understanding each of the virus proteins is key to learning how the virus works and how we can treat it.
We hope we have shown you how important and useful bioinformatics can be in researching new diseases! If you want to learn more about Bioinformatics or the tools then try the EBI training online modules that are for all levels and take you through some of the major software.
Submit your work to us via email: brookesengage@brookes.ac.uk