Genome assemblies are our best representation of what the actual genome sequence looks like. Genome assemblies are not biologically true (yet) because of the limits of sequencing technologies.
Assemblies are made of sequence fragments.
The smallest fragments are made by layering (aka aligning) overlapping sequences together until no more overlapping sequences remain. These fragments are called contigs.
Next, contigs are ordered and oriented by a scaffolding technique such as optical mapping. These larger fragments are called scaffolds.
Scaffolds are then ordered and oriented to form chromosome representations using genetic marker maps or other genome assemblies.
Availability of genome assemblies varies by organism. Some species don't have genome assemblies available at all, while other species have dozens. Model species and economically important species usually have many genome assemblies.
The quality of the genome assemblies also varies. If your species has more than one assembly, then you'll need to choose which one to use. You can choose to use the assembly that is most closely related to your experimental material or the assembly with the best quality.
TAIR10.1 is assembled to the chromosome level. Alternatives are pseudochromosome level, scaffold level and contig level. Assemblies above the pseudochromosome level are preferred.
The genome assembly size of TAIR10.1 is 119.1 megabases (Mb). Compare this to the estimated genome size of Arabidopsis at 135 Mb. The assembly captures about 88% of the genome.
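As a rough illustration of where numbers like these come from, the sketch below counts the sequences and total bases in a FASTA file and repeats the percentage calculation from the text. The toy FASTA contents and the variable names are made up for the example; a real assembly file would take the place of the temporary file.

```shell
# Tiny made-up FASTA standing in for a real assembly file.
toy_fasta=$(mktemp)
printf '>chr1\nACGTACGTAC\nGGCC\n>scaffold_2\nTTTTAA\n' > "$toy_fasta"

# Number of sequences = number of header lines (lines starting with ">").
n_seqs=$(grep -c '^>' "$toy_fasta")

# Assembly size = summed length of all non-header lines.
assembly_bp=$(awk '!/^>/ { total += length($0) } END { print total }' "$toy_fasta")

# Percent of the estimated genome captured, using the numbers from the text
# (119.1 Mb assembled, 135 Mb estimated).
pct=$(awk 'BEGIN { printf "%.1f", 119.1 / 135 * 100 }')

echo "sequences: $n_seqs"
echo "assembly size (bp): $assembly_bp"
echo "percent of estimated genome: $pct"

rm -f "$toy_fasta"
```

The same two commands (grep -c and the awk sum) can be pointed at any uncompressed .fna file to get a quick feel for how fragmented and how large an assembly is.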
Genome assemblies are scattered across many large and small databases. Even for a single species, there are often multiple independent assembly databases and multiple versions of the same assembly. This is because different research groups and consortiums have made their own assemblies.
Check the literature. Many genome assemblies have accompanying publications. Those publications will tell you where to download the assemblies from in the Data Availability section.
Check large databases, such as NCBI
Check consortium databases, such as SoyBase
Check the publications, especially in the Data Availability section
Ask others also working on your species about where genome assemblies for your species are generally found
We have already downloaded the assemblies for this class. Please do not download them again. Here is what we did and the code we used.
Navigated to the genome assembly database, for example https://v1.legumefederation.org/data/v2/
Found the assembly we wanted, right clicked the hyperlink and copied the link address
In the Hazel directory that we wanted the assembly to be in, we ran this line of code on the command line (no need for a job script).
We waited for the download to finish. It should only take a minute. We didn't type or click in the terminal because that would have cancelled the download, though it is okay to minimize the terminal window.
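The download command itself is not recorded on this page. A minimal sketch of what such a step could look like is below, assuming wget was the tool and that the .gz file was then unpacked with gunzip. Both are assumptions, based only on the uncompressed file names listed further down this page, and the variable names are hypothetical. Because the class assemblies are already downloaded, the sketch only prints the commands instead of running them.

```shell
# Link address copied from the database page (the Lee v2 genome link from this page).
url="https://v1.legumefederation.org/data/v2/Glycine/max/genomes/Lee.gnm2.K7BV/glyma.Lee.gnm2.K7BV.genome_main.fna.gz"

# File name the download would be saved under: everything after the last slash.
gz_file="${url##*/}"

# File name after decompression: the same name without the .gz suffix.
fasta_file="${gz_file%.gz}"

# Dry run: print the commands instead of executing them.
# (Assumed tools: wget for the download, gunzip to unpack.)
echo "wget $url"
echo "gunzip $gz_file"
echo "expected genome file: $fasta_file"
```

Removing the echo in front of the wget and gunzip lines would turn the dry run into a real download, which is not needed for this class.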
Downloads from NCBI work differently from downloads from other databases.
You can use curl in the terminal
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_000004515.6/download?include_annotation_type=GENOME_FASTA,GENOME_GFF&filename=GCF_000004515.6.zip" -H "Accept: application/zip"
Or you can download to your computer and use Globus to upload to Hazel
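The curl command above differs between assemblies only in the accession and in the list of file types requested. The sketch below shows how those parts fit into the URL. The variable names are hypothetical, and the final command is printed rather than run, so nothing is downloaded.

```shell
# Parts of the request that change from one assembly to another.
accession="GCF_000004515.6"           # NCBI assembly accession
include="GENOME_FASTA,GENOME_GFF"     # file types to include in the zip

# Fixed part of the URL, copied from the curl command on this page.
api="https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession"

# Assemble the full download URL.
url="${api}/${accession}/download?include_annotation_type=${include}&filename=${accession}.zip"

# Dry run: print the curl command instead of executing it.
echo "curl -OJX GET \"$url\" -H \"Accept: application/zip\""
```

Changing the accession to GCF_004193775.1 and adding SEQUENCE_REPORT to the list reproduces the Glycine soja command further down this page.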
The whole learning community will use the Glycine max cv. Lee, assembly version 2
Species: Glycine max
Germplasm/cultivar/variety/accession: Lee
Version: v2
Why: Lee is the same soybean variety that we did RNA-seq on. Version 2 of the Lee assembly is the most up to date.
Publication/website: Garg V, et al., Chromosome-length genome assemblies of six legume species provide insights into genome organization, evolution, and agronomic traits for crop improvement. J Adv Res. 2022 Dec;42:315-329. doi: 10.1016/j.jare.2021.10.009. Epub 2021 Nov 3. PMID: 36513421; PMCID: PMC9788938.
Genome assembly download link: https://v1.legumefederation.org/data/v2/Glycine/max/genomes/Lee.gnm2.K7BV/glyma.Lee.gnm2.K7BV.genome_main.fna.gz
Gene annotation: https://v1.legumefederation.org/data/v2/Glycine/max/annotations/Lee.gnm2.ann1.1FNT/glyma.Lee.gnm2.ann1.1FNT.gene_models_main.gff3.gz
Functional annotation:
The gff3 file has the following functional annotations condensed in the 9th column. The original KEGG, GO and InterProScan files are located at:
https://cegresources.icrisat.org/data_public/legumepedia_data/Glycine_max/Annotation/gma.iprscan.txt
Directory: /share/bitcpt/S23/referenceGenomes/Glycine_max_Lee_v2
Genome file: glyma.Lee.gnm2.K7BV.genome_main.fna
Gene annotation file: glyma.Lee.gnm2.ann1.1FNT.gene_models_main.gff3
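To look at what is stored in the 9th column, a gff3 file can be filtered with awk. The sketch below runs on a two-line toy gff3 written to a temporary file. The feature IDs and the Note text are invented for illustration and are not taken from the Lee annotation.

```shell
# Two-line toy gff3 (tab-separated, 9 columns) standing in for the real file.
toy_gff=$(mktemp)
printf 'chr1\ttoy\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=gene1;Note=hypothetical protein\n' > "$toy_gff"
printf 'chr1\ttoy\tmRNA\t100\t900\t.\t+\t.\tID=gene1.1;Parent=gene1\n' >> "$toy_gff"

# Keep only gene features (column 3) and print their attributes (column 9).
gene_attrs=$(awk -F'\t' '$3 == "gene" { print $9 }' "$toy_gff")
echo "$gene_attrs"

# Split the attributes into one key=value pair per line and count them.
echo "$gene_attrs" | tr ';' '\n'
n_pairs=$(echo "$gene_attrs" | tr ';' '\n' | grep -c '=')
echo "attribute pairs: $n_pairs"

rm -f "$toy_gff"
```

On the real annotation, the path to the gff3 file listed above would replace "$toy_gff", and piping the output through head keeps it readable.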
Soybean reference genome (2021)
Lee version 1 (pre-HiFi in 2017)
Williams 82 ISU01 v2 (with HiFi 2022)
Glycine soja (wild ancestor of soybean)
Species: Glycine max
Germplasm/cultivar/variety/accession: Williams 82
Version: v4
Why: This is the primary reference assembly available on NCBI and has the NCBI annotation.
Publication/website: https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000004515.6/
How to download from NCBI
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_000004515.6/download?include_annotation_type=GENOME_FASTA,GENOME_GFF&filename=GCF_000004515.6.zip" -H "Accept: application/zip"
Directory: /share/bitcpt/S23/referenceGenomes/Portfolios/Glycine_max_THEreference
Genome file: GCF_000004515.6_Glycine_max_v4.0_genomic.fna
Gene annotation file: genomic.gtf
This annotation file is throwing an error at the make transcriptome stage. Dr. Delorean tried downloading the gtf and gff3 again.
Species: Glycine max
Germplasm/cultivar/variety/accession: Lee
Version: v1
Why: Lee is the same soybean variety that we did RNA-seq on. Version 1 of the Lee assembly was replaced with version 2.
Publication/website: Valliyodan B, et al., Construction and comparison of three reference-quality genome assemblies for soybean. Plant J. 2019 Dec;100(5):1066-1082. doi: 10.1111/tpj.14500. Epub 2019 Oct 28. PMID: 31433882.
Genome assembly download link: https://v1.legumefederation.org/data/v2/Glycine/max/genomes/Lee.gnm1.BXNC/glyma.Lee.gnm1.BXNC.genome_main.fna.gz
Gene annotation: https://v1.legumefederation.org/data/v2/Glycine/max/annotations/Lee.gnm1.ann1.6NZV/glyma.Lee.gnm1.ann1.6NZV.gene_models_main.gff3.gz
Functional annotation: Functional annotations are included in the 9th column of the gff3 file.
Directory: /share/bitcpt/S23/referenceGenomes/Portfolios/Glycine_max_Lee_v1
Genome file: glyma.Lee.gnm1.BXNC.genome_main.fna
Gene annotation file: glyma.Lee.gnm1.ann1.6NZV.gene_models_main.gff3
Species: Glycine max
Germplasm/cultivar/variety/accession: Williams 82
Version: v2
Why: This assembly is haplotype phased (by sequencing a highly inbred individual of Williams 82), sequenced with PacBio HiFi reads and scaffolded with HiC. It is the most complete soybean assembly.
Publication/website: Haun, W., et al., The Composition and Origins of Genomic Variation among Individuals of the Soybean Reference Cultivar Williams 82, Plant Physiology, Volume 155, Issue 2, February 2011, Pages 645–655, https://doi.org/10.1104/pp.110.166736
Note that the publication was for v1 of Williams 82 ISU01. The assembly results and methods described in the publication are for v1. Consult the assembly download page and the JGI release page for information about v2 of this assembly.
Genome assembly download link: https://soybase.org/data/v2/Glycine/max/genomes/Wm82_ISU01.gnm2.JFPQ/glyma.Wm82_ISU01.gnm2.JFPQ.genome_main.fna.gz
Gene annotation: https://soybase.org/data/v2/Glycine/max/annotations/Wm82_ISU01.gnm2.ann1.FGFB/glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3.gz
Functional annotation: The gff3 file doesn't include the functional annotations.
Directory: /share/bitcpt/S23/referenceGenomes/Portfolios/Glycine_max_Wm82-ISU01_v2
Genome file: glyma.Wm82_ISU01.gnm2.JFPQ.genome_main.fna
Gene annotation file: glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3
Functional annotation file: glyma.Wm82_ISU01.gnm2.ann1.FGFB.annotation_info.txt.gz
Species: Glycine soja
Germplasm/cultivar/variety/accession: W05
Version: v2
Why: This is the primary reference assembly available for the wild ancestor of soybean on NCBI and has the NCBI annotation.
Publication/website: https://www.ncbi.nlm.nih.gov/labs/data-hub/genome/GCF_004193775.1/
How to download from NCBI
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_004193775.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,SEQUENCE_REPORT&filename=GCF_004193775.1.zip" -H "Accept: application/zip"
Directory: /share/bitcpt/S23/referenceGenomes/Portfolios/Glycine_soja_W05
Genome file: GCF_004193775.1_ASM419377v2_genomic.fna
Gene annotation file: genomic.gff