@jbothma
Last active November 1, 2025 17:59

Claude, what's the relationship between isolate, fasta file, sequence, and fasta file header when sharing genetic data for research?

The relationship between isolates, sequences, and FASTA headers depends on the assembly level of the genome:

Complete Genomes

For a complete bacterial genome (one isolate):

  • The FASTA file contains one sequence per replicon, so one or more sequences in total
  • Each header represents a different replicon (chromosome or plasmid; bacteria have no organelles)

Example from one Klebsiella isolate:

>chromosome_1 Klebsiella pneumoniae strain ABC123
ATCGATCG...
>plasmid_pKP100 Klebsiella pneumoniae strain ABC123
GCTAGCTA...
>plasmid_pKP200 Klebsiella pneumoniae strain ABC123
TTAACCGG...

This single FASTA file represents one isolate with three replicons: one chromosome and two plasmids.

Draft Assemblies (Most Common)

For a draft genome (one isolate):

  • The FASTA file contains dozens to hundreds of sequences
  • Each header represents a different contig (assembled fragment)

Example from one draft bacterial genome:

>contig_1 Klebsiella pneumoniae isolate XYZ789
ATCGATCG...
>contig_2 Klebsiella pneumoniae isolate XYZ789
GCTAGCTA...
>contig_3 Klebsiella pneumoniae isolate XYZ789
TTAACCGG...
[... 50-500 more contigs ...]

This represents one isolate whose genome is incompletely assembled into many pieces.
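
To make the "one file, many headers" idea concrete, here is a minimal sketch of a FASTA reader in plain Python. The function name read_fasta is hypothetical, and the sequences are the truncated placeholders from the example above rather than real data. It assumes simple, well-formed FASTA text; a maintained parser is usually a better choice for real files.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text.

    Hypothetical helper for illustration; assumes well-formed input.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Placeholder data modelled on the draft-assembly example above.
example = """>contig_1 Klebsiella pneumoniae isolate XYZ789
ATCGATCG
>contig_2 Klebsiella pneumoniae isolate XYZ789
GCTAGCTA
>contig_3 Klebsiella pneumoniae isolate XYZ789
TTAACCGG
"""

records = list(read_fasta(io.StringIO(example)))
print(f"{len(records)} sequences, all from one isolate")
for header, sequence in records:
    print(header, len(sequence))
```

Every record in the file shares the same isolate identifier in its header; only the contig name changes.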

What Goes in the Header?

FASTA headers typically start with the sequence accession number, followed by organism and strain/isolate information, for example ">NZ_CP019962.1 Eubacterium limosum strain ATCC 8486 chromosome, complete genome".

Common header elements:

  • Accession number (unique database ID)
  • Organism name and strain/isolate identifier
  • Sequence type (chromosome, plasmid, contig, etc.)
  • Sometimes: assembly information
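
As a rough illustration of how these elements are usually read, the sketch below splits a header into an identifier and a free-text description. It assumes the widely followed convention that the first whitespace-delimited token after ">" is the sequence ID; that is a convention rather than a strict standard, and the names FastaHeader and parse_header are hypothetical.

```python
from typing import NamedTuple


class FastaHeader(NamedTuple):
    seq_id: str        # first token, e.g. an accession number
    description: str   # everything after the first whitespace (free text)


def parse_header(line: str) -> FastaHeader:
    """Split a FASTA header line into ID and description.

    Assumes the ID is the first whitespace-delimited token.
    """
    text = line.lstrip(">").strip()
    parts = text.split(maxsplit=1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return FastaHeader(seq_id, description)


header = parse_header(
    ">NZ_CP019962.1 Eubacterium limosum strain ATCC 8486 chromosome, complete genome"
)
print(header.seq_id)
print(header.description)
```

The description is not structured, so pulling out the strain or the sequence type reliably depends on how the particular database or submitter formats it.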

Multiple Isolates in Research

When researchers work with multiple isolates:

Each isolate's genome should go in its own file, which also holds any plasmid sequences (or organelle sequences, for eukaryotes). Do not combine several genomes in one file.

So if you're studying 100 isolates:

  • 100 separate FASTA files (standard practice)
  • Each file has 1-500+ sequences depending on assembly quality
  • The isolate/strain identifier appears in the headers within each file

Practical Example

A typical research scenario:

Study: 50 Klebsiella pneumoniae isolates from hospital outbreak

File structure:
- isolate_001.fasta (contains 87 contigs, all from patient #1)
- isolate_002.fasta (contains 104 contigs, all from patient #2)
- isolate_003.fasta (contains 92 contigs, all from patient #3)
...
- isolate_050.fasta (contains 78 contigs, all from patient #50)

Each file represents one bacterial isolate, but contains multiple sequence headers for the different assembled pieces of that isolate's genome.
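
A file layout like this can be summarised with a short script. The sketch below assumes one .fasta file per isolate with the isolate ID as the file name; the function names and the tiny temporary files are made up for illustration and are not data from a real study.

```python
import tempfile
from pathlib import Path
from typing import Dict


def count_sequences(fasta_path: Path) -> int:
    """Count the header lines (lines starting with '>') in one FASTA file."""
    with fasta_path.open() as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summarise_isolates(directory: Path) -> Dict[str, int]:
    """Map each isolate (file name without extension) to its sequence count."""
    return {
        path.stem: count_sequences(path)
        for path in sorted(directory.glob("*.fasta"))
    }


# Demo with two tiny placeholder files standing in for real assemblies.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "isolate_001.fasta").write_text(">contig_1\nATCG\n>contig_2\nGCTA\n")
    (tmp_path / "isolate_002.fasta").write_text(">contig_1\nTTAA\n")
    summary = summarise_isolates(tmp_path)

for isolate, n_sequences in summary.items():
    print(f"{isolate}: {n_sequences} sequences")
```

The number of files gives the number of isolates, while the number of headers inside each file only reflects how fragmented that isolate's assembly is.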

Key Takeaway

Isolate ≠ Sequence

  • 1 isolate = 1 pure culture derived from a biological sample = 1 genome
  • 1 genome = 1 FASTA file with one or more sequences (contigs/chromosomes/plasmids)
  • Multiple sequences per file is normal and expected: they all come from the same isolate