Multi-FASTA format

FASTA format

FASTA is a text-based file format for representing nucleotide sequences or peptide (amino acid) sequences.

A FASTA file begins with a description line, which starts with ">" and contains the sequence identifier followed by a description. The lines that follow contain the sequence data.

Multi-FASTA format

A multi-FASTA file contains multiple FASTA-formatted sequences, each introduced by its own ">" description line.

Example:

>sequenceID-001 description

AAGTAGGAATAATATCTTATCATTATAGATAAAAACCTTCTGAATTTGCTTAGTGTGTAT

ACGACTAGACATATATCAGCTCGCCGATTATTTGGATTATTCCCTG

>sequenceID-002 description

CAGTAAAGAGTGGATGTAAGAACCGTCCGATCTACCAGATGTGATAGAGGTTGCCAGTAC

AAAAATTGCATAATAATTGATTAATCCTTTAATATTGTTTAGAATATATCCGTCAGATAA

TCCTAAAAATAACGATATGATGGCGGAAATCGTC

>sequenceID-003 description

CTTCAATTACCCTGCTGACGCGAGATACCTTATGCATCGAAGGTAAAGCGATGAATTTAT

CCAAGGTTTTAATTTG
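Because every record is introduced by exactly one ">" line, counting those lines gives the number of sequences in a multi-FASTA file. The following is a minimal sketch of that check; the file name example.fasta is hypothetical, and the sequences are an abbreviated copy of the example above.

```shell
# Write a small multi-FASTA file (hypothetical name: example.fasta).
cat > example.fasta <<'EOF'
>sequenceID-001 description
AAGTAGGAATAATATCTTATCATTATAGATAAAAACCTTCTGAATTTGCTTAGTGTGTAT
ACGACTAGACATATATCAGCTCGCCGATTATTTGGATTATTCCCTG
>sequenceID-002 description
CAGTAAAGAGTGGATGTAAGAACCGTCCGATCTACCAGATGTGATAGAGGTTGCCAGTAC
>sequenceID-003 description
CTTCAATTACCCTGCTGACGCGAGATACCTTATGCATCGAAGGTAAAGCGATGAATTTAT
CCAAGGTTTTAATTTG
EOF

# Each record has exactly one header line, so counting lines
# that start with ">" counts the records.
record_count=$(grep -c '^>' example.fasta)
echo "records: $record_count"
```

A count greater than one indicates a multi-FASTA file; a count of one indicates a single-sequence FASTA file.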

How to extract all FASTA header IDs without the leading >

# Get only the ID and description

grep '^>' sequences.fasta | sed 's/^>//'

sequenceID-001 description

sequenceID-002 description

sequenceID-003 description

# Get only the ID (no description)

grep '^>' sequences.fasta | sed 's/^>//' | cut -d' ' -f1

sequenceID-001

sequenceID-002

sequenceID-003

grep '^>' # selects only FASTA headers

sed 's/^>//' # removes the leading >

cut -d' ' -f1 # keeps only the first field (the ID)
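The same result can probably be obtained in a single pass with awk instead of the grep, sed and cut pipeline. This is a sketch of one alternative, not a replacement for the commands above. The file name sample.fasta and its contents are made up for illustration. One practical difference is that awk splits fields on any whitespace by default, so an ID followed by a tab is still isolated, whereas cut -d' ' splits on spaces only.

```shell
# Sample input (hypothetical file name). The second header uses a tab
# between the ID and the description.
printf '>sequenceID-001 description\nAAGT\n>sequenceID-002\tdescription\nCAGT\n' > sample.fasta

# On header lines only: strip the leading ">" and print the first
# whitespace-delimited field (the ID).
ids=$(awk '/^>/ { sub(/^>/, ""); print $1 }' sample.fasta)
echo "$ids"
```

To keep the description as well, printing $0 instead of $1 in the awk action should give the full header line without the leading ">".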
