The relationship between isolates, sequences, and FASTA headers depends on the assembly level of the genome:
For a single complete bacterial genome (one isolate):
- The FASTA file contains one or more sequences (one per replicon)
- Each header represents a different replicon (the chromosome or a plasmid; bacteria have no organelles)
Example from one Klebsiella isolate:
>chromosome_1 Klebsiella pneumoniae strain ABC123
ATCGATCG...
>plasmid_pKP100 Klebsiella pneumoniae strain ABC123
GCTAGCTA...
>plasmid_pKP200 Klebsiella pneumoniae strain ABC123
TTAACCGG...
This single FASTA file represents one isolate with 3 replicons: one chromosome and two plasmids.
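A minimal Python sketch of how this could be checked, assuming plain-text FASTA in which every record starts with a ">" line. The function name read_fasta_headers is hypothetical, and the sequences are the truncated placeholders from the example above, not real data.

```python
import io

def read_fasta_headers(handle):
    """Return every header line (without the leading '>') from a FASTA stream."""
    return [line[1:].strip() for line in handle if line.startswith(">")]

# Placeholder records mirroring the Klebsiella example (sequences truncated).
complete_genome = io.StringIO(
    ">chromosome_1 Klebsiella pneumoniae strain ABC123\nATCGATCG\n"
    ">plasmid_pKP100 Klebsiella pneumoniae strain ABC123\nGCTAGCTA\n"
    ">plasmid_pKP200 Klebsiella pneumoniae strain ABC123\nTTAACCGG\n"
)

headers = read_fasta_headers(complete_genome)
print(len(headers), "sequences in one file, all from one isolate")
for header in headers:
    print(header.split()[0])  # the first token of a header is the sequence ID
```

The count of headers is the number of sequences, not the number of isolates: here it is three sequences for a single isolate.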
For a draft genome (one isolate):
- The FASTA file contains dozens to hundreds of sequences
- Each header represents a different contig (assembled fragment)
Example from one draft bacterial genome:
>contig_1 Klebsiella pneumoniae isolate XYZ789
ATCGATCG...
>contig_2 Klebsiella pneumoniae isolate XYZ789
GCTAGCTA...
>contig_3 Klebsiella pneumoniae isolate XYZ789
TTAACCGG...
[... 50-500 more contigs ...]
This file represents one isolate whose genome was only partially assembled, leaving it in many pieces.
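As a rough illustration, a draft assembly can be summarised by counting its contigs and adding up their lengths. The sketch below assumes plain-text FASTA with non-empty sequence IDs; contig_lengths is a hypothetical helper, and the three short records are placeholders standing in for a real draft genome.

```python
import io

def contig_lengths(handle):
    """Map each sequence ID to its length, summing over wrapped sequence lines."""
    lengths = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumes every header has an ID token right after '>'.
            current_id = line[1:].split()[0]
            lengths[current_id] = 0
        elif current_id is not None:
            lengths[current_id] += len(line)
    return lengths

# Placeholder contigs mirroring the draft-genome example (sequences truncated).
draft_genome = io.StringIO(
    ">contig_1 Klebsiella pneumoniae isolate XYZ789\nATCGATCG\n"
    ">contig_2 Klebsiella pneumoniae isolate XYZ789\nGCTAGCTA\n"
    ">contig_3 Klebsiella pneumoniae isolate XYZ789\nTTAACCGG\n"
)

lengths = contig_lengths(draft_genome)
print(f"{len(lengths)} contigs, {sum(lengths.values())} bases in total")
```

However many contigs the file holds, they all describe the same isolate; a higher contig count reflects a more fragmented assembly, not more organisms.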
FASTA headers from public databases such as NCBI typically contain the sequence accession number followed by organism and strain information, for example ">NZ_CP019962.1 Eubacterium limosum strain ATCC 8486 chromosome, complete genome".
Common header elements:
- Accession number (unique database ID)
- Organism name and strain/isolate identifier
- Sequence type (chromosome, plasmid, contig, etc.)
- Sometimes: assembly information
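To make these elements concrete, here is a hedged Python sketch that splits a header into its ID and description. The function parse_header and its keyword-based guess at the sequence type are illustrative assumptions: header descriptions are free text, so real files may not follow this pattern.

```python
def parse_header(header):
    """Split a FASTA header into its sequence ID and free-text description."""
    header = header.lstrip(">").strip()
    # Everything before the first space is the ID; the rest is the description.
    seq_id, _, description = header.partition(" ")
    # Keyword guess only: the description is free text with no fixed grammar.
    lowered = description.lower()
    seq_type = next(
        (kind for kind in ("plasmid", "chromosome", "contig") if kind in lowered),
        "unknown",
    )
    return {"id": seq_id, "description": description, "type": seq_type}

record = parse_header(
    ">NZ_CP019962.1 Eubacterium limosum strain ATCC 8486 chromosome, complete genome"
)
print(record["id"])
print(record["type"])
```

Only the ID (the text up to the first space) is reliably machine-readable; the organism, strain, and sequence type are conventions that vary between databases and assembly tools.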
When researchers work with multiple isolates:
Each isolate's genome should be in its own file, including any plasmid sequences, rather than all genomes combined in one file.
So if you're studying 100 isolates:
- 100 separate FASTA files (standard practice)
- Each file has 1-500+ sequences depending on assembly quality
- The isolate/strain identifier appears in the headers within each file
A typical research scenario:
Study: 50 Klebsiella pneumoniae isolates from hospital outbreak
File structure:
- isolate_001.fasta (contains 87 contigs, all from patient #1)
- isolate_002.fasta (contains 104 contigs, all from patient #2)
- isolate_003.fasta (contains 92 contigs, all from patient #3)
...
- isolate_050.fasta (contains 78 contigs, all from patient #50)
Each file represents one bacterial isolate, but contains multiple sequence headers for the different assembled pieces of that isolate's genome.
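A one-file-per-isolate layout like this can be tallied with a short Python sketch. It assumes the file name (without extension) identifies the isolate and that files end in ".fasta"; count_sequences_per_isolate is a hypothetical helper, and the two tiny files are made up for the demo.

```python
import tempfile
from pathlib import Path

def count_sequences_per_isolate(directory, pattern="*.fasta"):
    """Map each file's stem (treated as the isolate name) to its record count."""
    counts = {}
    for path in sorted(Path(directory).glob(pattern)):
        with path.open() as handle:
            counts[path.stem] = sum(1 for line in handle if line.startswith(">"))
    return counts

# Demo with two tiny made-up files standing in for real assemblies.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "isolate_001.fasta").write_text(
        ">contig_1\nATCG\n>contig_2\nGGCC\n"
    )
    Path(tmp, "isolate_002.fasta").write_text(
        ">contig_1\nTTAA\n>contig_2\nCCGG\n>contig_3\nATAT\n"
    )
    counts = count_sequences_per_isolate(tmp)

print(len(counts), "isolates")
for isolate, n_sequences in counts.items():
    print(f"{isolate}: {n_sequences} sequences")
```

The number of files gives the number of isolates, while the per-file counts give the number of sequences in each assembly.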
Isolate ≠ Sequence
- 1 isolate = 1 biological sample = 1 genome
- 1 genome = 1 FASTA file with one or more sequences (contigs/chromosomes/plasmids)
- Multiple sequences per file is normal and expected: they all come from the same isolate