Genome Assemblies

Introduction - What are genome assemblies really?

Genome assemblies are our best representation of what the actual genome sequence looks like. Genome assemblies are not biologically true (yet) because of the limits of sequencing technologies.

Assemblies are made of sequence fragments.

The smallest fragments are made by layering (aka aligning) overlapping sequences together until there are no more overlapping sequences. These fragments are called contigs.
Next, contigs are ordered and oriented by some kind of scaffolding technique, such as optical mapping. These larger fragments are called scaffolds.
Scaffolds are then ordered and oriented to form chromosome representations using genetic marker maps or other genome assemblies.

Availability and Quality

Availability of genome assemblies varies by organism. Some species don't have genome assemblies available at all, while other species have dozens. Model species and economically important species usually have many genome assemblies.

The quality of the genome assemblies also varies. If your species has more than one assembly, then you'll need to choose which one to use. You can choose to use the assembly that is most closely related to your experimental material or the assembly with the best quality.

How to assess assembly quality

Assembly level

As an example, TAIR10.1 (the Arabidopsis reference assembly) is assembled to the chromosome level. Alternatives are pseudochromosome level, scaffold level and contig level. Assemblies above the pseudochromosome level are preferred.
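One rough way to check the assembly level yourself is to count the sequences in the genome FASTA file. A chromosome-level assembly should have a small number of sequences (about one per chromosome, plus any unplaced scaffolds), while a contig-level assembly can have thousands. This is a minimal sketch; the helper name count_sequences is made up for illustration.

```shell
# Hypothetical helper: count the sequences in a FASTA file.
count_sequences() {
    # Every sequence starts with a header line beginning with ">";
    # grep -c counts the lines that match.
    grep -c '^>' "$1"
}

# Example call, assuming the uncompressed Lee v2 genome file from this page
# is in the current directory:
# count_sequences glyma.Lee.gnm2.K7BV.genome_main.fna
```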

Size

The genome assembly size of TAIR10.1 is 119.1 megabases (Mb). Compare this to the estimated genome size of Arabidopsis at 135 Mb. The assembly captures about 88% of the genome (119.1 / 135 ≈ 0.88).
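The size comparison above can be reproduced on the command line. The sketch below assumes an uncompressed FASTA file and uses two made-up helper names: assembly_size adds up the bases in a file, and percent_captured does the division. Both are for illustration only, not the method the assembly authors used.

```shell
# Hypothetical helper: total number of bases in a FASTA file.
assembly_size() {
    # Skip header lines (starting with ">") and add up the length
    # of every sequence line.
    awk '!/^>/ { total += length($0) } END { printf "%.0f\n", total }' "$1"
}

# Hypothetical helper: percent of the estimated genome size that the
# assembly captures. $1 = assembly size, $2 = estimated genome size
# (both in the same units).
percent_captured() {
    awk -v a="$1" -v g="$2" 'BEGIN { printf "%.1f\n", a / g * 100 }'
}

# TAIR10.1 assembly size versus the estimated Arabidopsis genome size, in Mb.
percent_captured 119.1 135   # prints 88.2
```

The same two helpers work for any assembly on this page, as long as the assembly size and the genome size estimate are in the same units.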

Methods 1 - How to find assemblies

Genome assemblies are scattered across many large and small databases. Even within one species, there are often multiple independent assembly databases and multiple versions of the same assembly. This is because different research groups and consortiums have made their own assemblies.
- Check the literature. Many genome assemblies have accompanying publications. Those publications will tell you where to download the assemblies from, usually in the Data Availability section.
- Check large databases, such as NCBI
- Check consortium databases, such as Soybase
- Ask others also working on your species about where genome assemblies for your species are generally found

Methods 2 - How to download assemblies

We have already downloaded the assemblies for this class. Please do not download them again. Here is what we did and the code we used.
- Navigated to the genome assembly database, for example https://v1.legumefederation.org/data/v2/
- Found the assembly we wanted, right clicked the hyperlink and copied the link address
- In the Hazel directory where we wanted the assembly to be, we ran this line of code on the command line (no need for a job script).
- wget https://v1.legumefederation.org/data/v2/Glycine/max/genomes/Lee.gnm2.K7BV/glyma.Lee.gnm2.K7BV.genome_main.fna.gz
- We waited for the download to finish. It should only take a minute. We didn't type or click in the terminal because that would have cancelled the download, though it is okay to minimize the terminal window.
- Downloading from NCBI works differently than downloading from other databases.
  - You can use curl in the terminal
    - curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_000004515.6/download?include_annotation_type=GENOME_FASTA,GENOME_GFF&filename=GCF_000004515.6.zip" -H "Accept: application/zip"
  - Or you can download to your computer and use Globus to upload to Hazel
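After a download like the wget step above finishes, it is worth checking that the compressed file arrived intact before unpacking it. The sketch below is one way to do that with gzip. The helper name check_and_unpack is made up for illustration, and the class files are already downloaded and unpacked, so there is no need to run this on them.

```shell
# Hypothetical helper: test a downloaded .gz file, then unpack it
# next to the original while keeping the .gz file.
check_and_unpack() {
    gz_file="$1"
    # gzip -t tests the integrity of the compressed file without writing output.
    gzip -t "$gz_file" || { echo "corrupt or incomplete: $gz_file" >&2; return 1; }
    # gzip -dc writes the decompressed data to stdout, so the .gz file is kept.
    # ${gz_file%.gz} is the same file name with the .gz ending removed.
    gzip -dc "$gz_file" > "${gz_file%.gz}"
}

# Example call, assuming the Lee v2 download from this page is in the
# current directory:
# check_and_unpack glyma.Lee.gnm2.K7BV.genome_main.fna.gz
```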

Results

Genome assemblies for learning community group work

The whole learning community will use the Glycine max cv. Lee assembly, version 2.

Lee Assembly

Species: Glycine max

Germplasm/cultivar/variety/accession: Lee

Version: v2

Why: Lee is the same soybean variety that we did RNA-seq on. Version 2 of the Lee assembly is the most up to date.

Publication/website: Garg V, et al., Chromosome-length genome assemblies of six legume species provide insights into genome organization, evolution, and agronomic traits for crop improvement. J Adv Res. 2022 Dec;42:315-329. doi: 10.1016/j.jare.2021.10.009. Epub 2021 Nov 3. PMID: 36513421; PMCID: PMC9788938.

Genome assembly download link: https://v1.legumefederation.org/data/v2/Glycine/max/genomes/Lee.gnm2.K7BV/glyma.Lee.gnm2.K7BV.genome_main.fna.gz
Gene annotation: https://v1.legumefederation.org/data/v2/Glycine/max/annotations/Lee.gnm2.ann1.1FNT/glyma.Lee.gnm2.ann1.1FNT.gene_models_main.gff3.gz
Functional annotation:
- The gff3 file has the following functional annotations condensed in the 9th column. The original KEGG, GO and InterProScan files are located at:
  - https://cegresources.icrisat.org/data_public/legumepedia_data/Glycine_max/Annotation/gma.longest.annotation.description.kegg.txt
  - https://cegresources.icrisat.org/data_public/legumepedia_data/Glycine_max/Annotation/gma.longest.annotation.GO.txt
  - https://cegresources.icrisat.org/data_public/legumepedia_data/Glycine_max/Annotation/gma.iprscan.txt
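To see what the condensed functional annotations look like, you can print the 9th column of the gff3 file for gene features. This is a minimal sketch that assumes a standard tab-separated GFF3 file; the helper name gene_attributes is made up for illustration.

```shell
# Hypothetical helper: print the attribute column (column 9) of every
# gene feature in a GFF3 file.
gene_attributes() {
    # GFF3 is tab-separated: column 3 is the feature type and column 9
    # holds the attributes. Lines starting with "#" are comments or
    # directives and are skipped.
    awk -F'\t' '!/^#/ && $3 == "gene" { print $9 }' "$1"
}

# Example call, assuming the uncompressed Lee v2 annotation file from this
# page is in the current directory (head keeps the output short):
# gene_attributes glyma.Lee.gnm2.ann1.1FNT.gene_models_main.gff3 | head -n 5
```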

Directory: /share/bitcpt/S23/referenceGenomes/Glycine_max_Lee_v2

Genome file: glyma.Lee.gnm2.K7BV.genome_main.fna
Gene annotation file: glyma.Lee.gnm2.ann1.1FNT.gene_models_main.gff3

Genome assemblies for individual final portfolios

Soybean reference genome (2021)
Lee version 1 (pre-HiFi in 2017)
Williams 82 ISU01 v2 (with HiFi 2022)
Glycine soja (wild ancestor of soybean)

THE soybean reference genome v4

Species: Glycine max

Germplasm/cultivar/variety/accession: Williams 82

Version: v4

Why: This is the primary reference assembly available on NCBI and has the NCBI annotation.

Publication/website: https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000004515.6/

How to download from NCBI
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_000004515.6/download?include_annotation_type=GENOME_FASTA,GENOME_GFF&filename=GCF_000004515.6.zip" -H "Accept: application/zip"

Directory: /share/bitcpt/S23/referenceGenomes/Portfolios/Glycine_max_THEreference

Genome file: GCF_000004515.6_Glycine_max_v4.0_genomic.fna
Gene annotation file: genomic.gtf
- this annotation file is throwing an error at the make transcriptome stage. Dr. Delorean tried downloading the gtf and gff3 again.

Lee Assembly v1

Species: Glycine max

Germplasm/cultivar/variety/accession: Lee

Version: v1

Why: Lee is the same soybean variety that we did RNA-seq on. Version 1 of the Lee assembly was replaced with version 2.

Publication/website: Valliyodan B, et al., Construction and comparison of three reference-quality genome assemblies for soybean. Plant J. 2019 Dec;100(5):1066-1082. doi: 10.1111/tpj.14500. Epub 2019 Oct 28. PMID: 31433882.

Genome assembly download link: https://v1.legumefederation.org/data/v2/Glycine/max/genomes/Lee.gnm1.BXNC/glyma.Lee.gnm1.BXNC.genome_main.fna.gz
Gene annotation: https://v1.legumefederation.org/data/v2/Glycine/max/annotations/Lee.gnm1.ann1.6NZV/glyma.Lee.gnm1.ann1.6NZV.gene_models_main.gff3.gz
Functional annotation: Functional annotations are included in the 9th column of the gff3 file.

Directory: /share/bitcpt/S23/referenceGenomes/Portfolios/Glycine_max_Lee_v1

Genome file: glyma.Lee.gnm1.BXNC.genome_main.fna
Gene annotation file: glyma.Lee.gnm1.ann1.6NZV.gene_models_main.gff3

Williams 82 ISU01 Assembly v2

Species: Glycine max

Germplasm/cultivar/variety/accession: Williams 82

Version: v2

Why: This assembly is haplotype phased (by sequencing a highly inbred individual of Williams 82), sequenced with PacBio HiFi reads and scaffolded with Hi-C. It is the most complete soybean assembly.

Publication/website: Haun, W., et al., The Composition and Origins of Genomic Variation among Individuals of the Soybean Reference Cultivar Williams 82, Plant Physiology, Volume 155, Issue 2, February 2011, Pages 645–655, https://doi.org/10.1104/pp.110.166736

Note that the publication was for v1 of Williams 82 ISU01. The assembly results and methods described in the publication are for v1. Consult the assembly download page and the JGI release page for information about v2 of this assembly.

Genome assembly download link: https://soybase.org/data/v2/Glycine/max/genomes/Wm82_ISU01.gnm2.JFPQ/glyma.Wm82_ISU01.gnm2.JFPQ.genome_main.fna.gz
Gene annotation: https://soybase.org/data/v2/Glycine/max/annotations/Wm82_ISU01.gnm2.ann1.FGFB/glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3.gz
Functional annotation: gff3 doesn't have the functional annotations

Directory: /share/bitcpt/S23/referenceGenomes/Portfolios/Glycine_max_Wm82-ISU01_v2

Genome file: glyma.Wm82_ISU01.gnm2.JFPQ.genome_main.fna
Gene annotation file: glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3
Functional annotation file: glyma.Wm82_ISU01.gnm2.ann1.FGFB.annotation_info.txt.gz

Glycine soja

Species: Glycine soja

Germplasm/cultivar/variety/accession: W05

Version: v2

Why: This is the primary reference assembly available for the wild ancestor of soybean on NCBI and has the NCBI annotation.

Publication/website: https://www.ncbi.nlm.nih.gov/labs/data-hub/genome/GCF_004193775.1/

How to download from NCBI
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_004193775.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,SEQUENCE_REPORT&filename=GCF_004193775.1.zip" -H "Accept: application/zip"

Directory: /share/bitcpt/S23/referenceGenomes/Portfolios/Glycine_soja_W05

Genome file: GCF_004193775.1_ASM419377v2_genomic.fna
Gene annotation file: genomic.gff

S23 Final Portfolio Reference Genome Sign Up

Final portfolio genome sheet signup
