Genome assemblies are our best representation of what the actual genome sequence looks like. They are not (yet) biologically exact because of the limits of sequencing technologies.
Assemblies are made of sequence fragments.
The smallest fragments are contigs. A contig is made by layering (aka aligning) overlapping sequences together until no more overlapping sequences remain.
Next, contigs are ordered and oriented using a scaffolding technique such as optical mapping. These larger fragments are called scaffolds.
Scaffolds are then ordered and oriented to form chromosome representations using genetic marker maps or other genome assemblies.
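To make the contig/scaffold relationship concrete, here is a minimal sketch. It assumes the common convention that unknown gaps between contigs inside a scaffold are written as runs of N in the FASTA sequence. The toy sequence is made up for illustration and is not from any real assembly.

```shell
# A made-up scaffold: three contigs joined by gaps of unknown sequence (runs of N).
scaffold="ACGTACNNNNNGGCATNNNTTGA"

# Turn every run of N into a single line break, so each contig lands on its own line,
# then count the non-empty lines.
n_contigs=$(echo "$scaffold" | tr -s 'N' '\n' | grep -c .)

echo "Contigs in this scaffold: $n_contigs"
```

Real assemblies may mark gaps differently, so treat this only as a way to picture how scaffolds are built from contigs.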
Availability of genome assemblies varies by organism. Some species don't have genome assemblies available at all, while other species have dozens. Model species and economically important species usually have many genome assemblies.
The quality of the genome assemblies also varies. If your species has more than one assembly, then you'll need to choose which one to use. You can choose to use the assembly that is most closely related to your experimental material or the assembly with the best quality.
TAIR10.1 is assembled to the chromosome level. Alternatives are pseudochromosome level, scaffold level and contig level. Assemblies above the pseudochromosome level are preferred.
The genome assembly size of TAIR10.1 is 119.1 megabases (Mb). Compare this to the estimated genome size of Arabidopsis, 135 Mb. The assembly captures about 88% of the genome.
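The percentage is simply the assembly size divided by the estimated genome size. A minimal sketch of that arithmetic on the command line, using the two numbers quoted in the text (the variable names are illustrative):

```shell
# Values taken from the text; variable names are illustrative.
assembly_mb=119.1   # TAIR10.1 assembly size in Mb
estimate_mb=135     # estimated Arabidopsis genome size in Mb

# awk handles the decimal division that plain shell arithmetic cannot.
pct=$(awk -v a="$assembly_mb" -v g="$estimate_mb" 'BEGIN { printf "%.1f", 100 * a / g }')

echo "The assembly captures ${pct}% of the estimated genome size"
```

The same calculation works for any assembly once you know its total size and a genome size estimate for the species.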
Genome assemblies are scattered across many large and small databases. A single species often has multiple independent assembly databases, and the same assembly may exist in several versions. This is because different research groups and consortia have made their own assemblies.
Check the literature. Many genome assemblies have accompanying publications. Those publications will tell you where to download the assemblies from in the Data Availability section.
Check large databases, such as NCBI.
Check consortium databases, such as Sol Genomics Network.
Check the introduction sections of newer genome assembly publications to see if the authors have cited the other genome assemblies for their organism.
Ask others who work on your species where its genome assemblies are generally found.
We have already downloaded the assemblies for this class. Please do not download them again. Here is what we did and the code we used.
Navigated to the genome assembly database, for example http://solomics.agis.org.cn/tomato/ftp/genome/.
Found the assembly we wanted, right-clicked the hyperlink and copied the link address.
In the Henry II directory that we wanted the assembly to be in, we ran this line of code on the command line (no need for job script).
wget http://solomics.agis.org.cn/tomato/ftp/genome/TS3.fasta.gz
We waited for the download to finish. It should only take a minute. We didn't type or click in the terminal because that would have cancelled the download, though it is okay to minimize the terminal window.
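After a download like this finishes, it can be worth confirming that the file arrived intact. The sketch below is not part of what we ran for the class. It builds a tiny stand-in file (the hypothetical `toy_assembly.fasta.gz`) so it can run anywhere without downloading anything, then checks the gzip archive, counts the sequences and totals the bases.

```shell
# Build a tiny stand-in for a downloaded assembly so this sketch runs anywhere.
# (toy_assembly.fasta.gz is a hypothetical file, not the real TS3 assembly.)
workdir=$(mktemp -d)
printf '>chr1\nACGTACGTAC\n>chr2\nGGCCTTAA\n' | gzip > "$workdir/toy_assembly.fasta.gz"

# 1. Test that the gzip file is intact (an interrupted download usually fails here).
if gzip -t "$workdir/toy_assembly.fasta.gz"; then
    gz_status="ok"
else
    gz_status="corrupt"
fi

# 2. Count sequences: every FASTA record starts with a ">" header line.
n_seqs=$(gzip -dc "$workdir/toy_assembly.fasta.gz" | grep -c '^>')

# 3. Total assembly size: count the bases on all non-header lines.
total_bp=$(gzip -dc "$workdir/toy_assembly.fasta.gz" | grep -v '^>' | tr -d '\n' | wc -c | tr -d ' ')

echo "gzip check: $gz_status, sequences: $n_seqs, total bases: $total_bp"

# Clean up the temporary directory.
rm -rf "$workdir"
```

On a real assembly, the sequence count and total size can be compared with the numbers reported in the accompanying publication.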
The whole learning community will use the newest M82 tomato genome assembly. This is an assembly of the M82 variety of tomato, sequenced using PacBio HiFi. M82 is the same variety that our RNA-seq data is from. The assembly was published at https://www.nature.com/articles/s41586-022-04808-9
Species: Solanum lycopersicum
Germplasm/cultivar/variety/accession: M82
Why: This version of the M82 assembly is brand new, high quality and was made with the newest technology, PacBio HiFi sequencing. M82 is the same tomato variety that we did RNA-seq on in this class.
Directory: /share/bitcpt/Fall2022/referenceGenomes/Solanum_lycopersicum/M82_vTS3
Publication/website: Graph pangenome captures missing heritability and empowers tomato breeding | Nature
Download link: Index of /tomato/ftp/ (agis.org.cn)
GID: TS3
_________________________________________________________________________
M82 v 2019 Tomato (pre-HiFi)
Heinz 1706 v 5.0 Tomato (HiFi)
Cherry Tomato (HiFi)
Wild Ancestor of Tomato (HiFi)
Diploid Potato (HiFi)
Species: Solanum lycopersicum
Germplasm/cultivar/accession: M82
Why: This version of the M82 assembly is from 2019 and uses pre-HiFi assembly methods. It's a very good quality assembly. M82 is the same tomato variety that we did RNA-seq on in this class. How do you think our RNA-seq data will align to an older assembly of the same variety as our samples?
Directory: /share/bitcpt/Fall2022/referenceGenomes/Solanum_lycopersicum/Portfolio
Publication/website: RaGOO: fast and accurate reference-guided scaffolding of draft genomes | Genome Biology | Full Text (biomedcentral.com)
Download link: https://solgenomics.net/ftp/genomes/Solanum_lycopersicum/M82/
A secondary annotation: M82 Genome Annotation (UGA-v1)
Species: Solanum lycopersicum
Germplasm/cultivar/accession: Heinz 1706
Why: Heinz 1706 was the first tomato genome assembled. Any guesses why this variety was chosen to be the first? Version 5 is another HiFi assembly. How do you think our RNA-seq data will align to a different tomato variety?
Directory: TBA
Publication/website: Graph pangenome captures missing heritability and empowers tomato breeding | Nature
Download link: Index of /tomato/ftp/ (agis.org.cn)
GID: SL5.0
Species: Solanum lycopersicum var. cerasiforme (cherry tomato)
Germplasm/cultivar/accession: Peacevine
Why: Cherry tomatoes are a different subspecies than M82. How do you think our RNA-seq data will align to a different subspecies?
Directory: TBA
Publication/website: Graph pangenome captures missing heritability and empowers tomato breeding | Nature
Download link: Index of /tomato/ftp/ (agis.org.cn)
Sample GID: TS545
Species: Solanum pimpinellifolium (PIM)
Germplasm/cultivar/accession: Hacienda Buenos Aires
Why: This is a new HiFi assembly for the wild ancestor of tomato. If your species of interest is a polyploid, then the only reference assembly available might be for a diploid ancestor or relative. Polyploid genomes can be very technically challenging and expensive to assemble. Also, if your species is not well studied, you may have a reference genome only for a close relative. How do you think our RNA-seq data will align to the wild ancestor of tomato?
Directory: TBA
Publication/website: Graph pangenome captures missing heritability and empowers tomato breeding | Nature
Download link: Index of /tomato/ftp/ (agis.org.cn)
Sample GID: TS265
Species: Solanum tuberosum
Germplasm/cultivar/accession: RH10-15
Why: This is a new HiFi assembly for a diploid potato landrace. Potatoes are in the same family as tomatoes, Solanaceae, also known as the nightshades. Sometimes, understudied species have genome assemblies only for distant relatives. How do you think our RNA-seq data will align to a distant relative?
Directory: TBA
Publication/website: Genome evolution and diversity of wild and cultivated potatoes | Nature