@roelven
Last active January 24, 2026 13:17
Blood Lab PDF Parser - AI-powered extraction to structured JSON

Claude Code Blood Lab PDF Parser

Parse blood lab PDFs into structured JSON. Use this as a custom command in Claude Code, or run the Python script if Claude Code is not installed; the script calls the Anthropic API directly.

Setup

# Install dependencies
pip install anthropic

# Set API key
export ANTHROPIC_API_KEY="your-key-here"

Usage

Command Line

# Parse a single PDF
python parse_blood_pdf.py befund-21012026.pdf

# Specify output directory
python parse_blood_pdf.py befund.pdf -o ./data/blood-results/

# Output JSON to stdout (for piping)
python parse_blood_pdf.py befund.pdf --json-only

As a Module

from parse_blood_pdf import parse_blood_pdf

results = parse_blood_pdf("befund-21012026.pdf")
print(results["summary"])

File Structure

health-api/
├── commands/
│   └── parse-blood-results.md    # Claude command prompt
├── parse_blood_pdf.py            # Python script
├── health-data/
│   └── raw/
│       └── blood-results/
│           ├── 2026-01-15.json   # Parsed results
│           └── 2025-07-20.json
└── README.md

Output Format

See commands/parse-blood-results.md for the full JSON schema.

Key features:

  • Normalized marker names for cross-lab comparison
  • Dual unit storage (conventional + SI) for cross-lab trending
  • Panel classification (CBC, lipid panel, etc.)
  • Status flags (normal/low/high)
  • Clinical notes for interpretation
  • Attention items summary
  • Schema version 1.1

Adding to Claude Code

To use as a Claude Code custom command:

  1. Copy commands/parse-blood-results.md to your Claude Code commands directory
  2. Reference it with /parse-blood-results in Claude Code
  3. Attach the PDF when invoking

Parse Blood Lab Results

Parse a blood lab PDF and extract all test results into a structured JSON format.

Input

A PDF file containing blood lab results from any lab (German, English, or other languages).

Process

  1. Read the PDF and identify all test results
  2. Extract metadata: date, lab name/location, sample IDs, collection time
  3. For each test result, extract:
    • Original marker name (as printed)
    • Value (numeric or qualitative)
    • Unit (as printed)
    • Reference range (min/max or expected value)
    • Method (if listed)
  4. Normalize markers to canonical English identifiers
  5. Normalize units to conventional system (see unit conversion below)
  6. Determine status: normal, low, high based on reference range
  7. Identify panels included and missing
  8. Generate summary with flags for attention items

Critical: Output Encoding

Always output valid UTF-8 JSON. Preserve special characters correctly:

  • German: ä ö ü ß Ä Ö Ü
  • French: é è ê ë à â ç
  • Do NOT output mojibake like "HÃ¤matologie"; output "Hämatologie"
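
For the Python side, the encoding rule can be sketched as below. The helper names (`looks_like_mojibake`, `repair_mojibake`, `to_json`) are hypothetical and not part of the script in this gist. The check is only a heuristic: it assumes the damage came from UTF-8 bytes decoded as Latin-1, and it will also flag legitimate text that contains "Ã" or "Â".

```python
import json

# Lead characters that typically appear when UTF-8 bytes are decoded as Latin-1
MOJIBAKE_MARKERS = ("\u00c3", "\u00c2")  # "Ã", "Â"


def looks_like_mojibake(text: str) -> bool:
    """Heuristic check for UTF-8 text that was mis-decoded as Latin-1."""
    return any(marker in text for marker in MOJIBAKE_MARKERS)


def repair_mojibake(text: str) -> str:
    """Reverse a UTF-8-read-as-Latin-1 mix-up; return the input unchanged if that fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def to_json(data: dict) -> str:
    """Serialize with real UTF-8 characters instead of \\uXXXX escapes."""
    return json.dumps(data, ensure_ascii=False, indent=2)
```

Correctly encoded text passes through `repair_mojibake` unchanged, because re-decoding its Latin-1 bytes as UTF-8 fails and the original string is returned.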

Marker Normalization

Map original marker names to normalized identifiers. Use snake_case, English terms.

| Original (German/English variants) | Normalized marker |
| --- | --- |
| Hämoglobin, Hemoglobin, Hb | hemoglobin |
| Leukozyten, Leukocytes, WBC | leukocytes |
| Erythrozyten, Erythrocytes, RBC | erythrocytes |
| Thrombozyten, Platelets, PLT | platelets |
| Hämatokrit, Hematocrit, HCT | hematocrit |
| GPT/ALAT, ALT, SGPT | alt |
| GOT/ASAT, AST, SGOT | ast |
| Gamma-GT, GGT, γ-GT | ggt |
| Alkalische Phosphatase, ALP, AP | alp |
| Kreatinin, Creatinine, Crea | creatinine |
| Harnstoff, Urea, BUN | urea |
| Harnsäure, Uric Acid | uric_acid |
| Cholesterin, Cholesterol | cholesterol_total |
| Triglyceride, Triglycerides | triglycerides |
| HDL-Cholesterin, HDL Cholesterol, HDL | hdl |
| LDL-Cholesterin, LDL Cholesterol, LDL | ldl |
| Non-HDL-Cholesterin | non_hdl |
| Ferritin | ferritin |
| Eisen, Iron | iron |
| Transferrin | transferrin |
| Transferrinsättigung, Transferrin Saturation | transferrin_saturation |
| Vitamin D (25-OH), 25-Hydroxyvitamin D, Vitamin D3 | vitamin_d_25oh |
| Vitamin B12, Cobalamin | vitamin_b12 |
| Folsäure, Folate, Folic Acid | folate |
| TSH, TSH basal | tsh |
| fT3, Free T3, freies T3 | ft3 |
| fT4, Free T4, freies T4 | ft4 |
| HbA1c | hba1c |
| HbA1C (n. IFCC), HbA1c IFCC | hba1c_ifcc |
| Glucose, Glukose (nüchtern/fasting) | glucose_fasting |
| Glucose, Glukose (random/nicht nüchtern) | glucose |
| CRP, C-reaktives Protein, C-Reactive Protein | crp |
| hs-CRP, hochsensitives CRP | crp_hs |
| Bilirubin (gesamt), Total Bilirubin | bilirubin_total |
| Bilirubin (direkt), Direct Bilirubin | bilirubin_direct |
| Natrium, Sodium, Na | sodium |
| Kalium, Potassium, K | potassium |
| Calcium, Ca | calcium |
| Magnesium, Mg | magnesium |
| Phosphat, Phosphate, P | phosphate |
| Chlorid, Chloride, Cl | chloride |
| GFR (MDRD), GFR (CKD-EPI), eGFR | gfr |
| Gesamteiweiß, Total Protein | total_protein |
| Albumin | albumin |
| HBs-Antigen, HBsAg | hbs_antigen |
| Anti-HCV, HCV-Ak | anti_hcv |
| Anti-HBs, HBs-Ak | anti_hbs |
| Anti-HBc, HBc-Ak | anti_hbc |

For differential counts, use:

  • Percentage: neutrophils_percent, lymphocytes_percent, monocytes_percent, eosinophils_percent, basophils_percent
  • Absolute: neutrophils_absolute, lymphocytes_absolute, monocytes_absolute, eosinophils_absolute, basophils_absolute

For markers not in this list, create a reasonable snake_case identifier.
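
A minimal sketch of this lookup in Python, assuming a small excerpt of the alias table. The names `ALIASES`, `normalize_marker`, and `fallback_identifier` are hypothetical. The fallback only transliterates the printed name to ASCII snake_case; it does not translate German terms to English, which would still need a model or a larger table.

```python
import re
import unicodedata

# Excerpt of the alias table; extend it with the remaining rows as needed
ALIASES = {
    "hemoglobin": ["Hämoglobin", "Hemoglobin", "Hb"],
    "leukocytes": ["Leukozyten", "Leukocytes", "WBC"],
    "alt": ["GPT/ALAT", "ALT", "SGPT"],
    "uric_acid": ["Harnsäure", "Uric Acid"],
}


def _key(name: str) -> str:
    """Case-insensitive lookup key with surrounding whitespace removed."""
    return name.strip().casefold()


LOOKUP = {_key(alias): marker for marker, names in ALIASES.items() for alias in names}


def fallback_identifier(name: str) -> str:
    """Build an ASCII snake_case identifier for a marker that is not in the table."""
    text = name.replace("ß", "ss")
    # Split accented letters into base letter + combining mark, then drop the marks
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def normalize_marker(name: str) -> str:
    """Return the normalized marker id, falling back to a generated identifier."""
    return LOOKUP.get(_key(name)) or fallback_identifier(name)
```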

Unit Normalization (Critical for Cross-Lab Comparison)

Different labs use different unit systems. Always normalize to conventional units for consistency, while preserving the original values.

Conversion Table

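| Marker | SI unit | Conventional unit | Conversion (SI → conventional) |
| --- | --- | --- | --- |
| Hemoglobin | mmol/l | g/dl | × 1.611 |
| Hematocrit | l/l | % | × 100 |
| Glucose | mmol/l | mg/dl | × 18.02 |
| Cholesterol (total, HDL, LDL) | mmol/l | mg/dl | × 38.67 |
| Triglycerides | mmol/l | mg/dl | × 88.57 |
| Creatinine | µmol/l | mg/dl | ÷ 88.4 |
| Urea | mmol/l | mg/dl | × 6.006 |
| Uric Acid | µmol/l | mg/dl | ÷ 59.48 |
| Bilirubin | µmol/l | mg/dl | ÷ 17.1 |
| Iron | µmol/l | µg/dl | × 5.587 |
| Calcium | mmol/l | mg/dl | × 4.008 |
| Magnesium | mmol/l | mg/dl | × 2.431 |
| Phosphate | mmol/l | mg/dl | × 3.097 |
| Total Protein | g/l | g/dl | ÷ 10 |
| Albumin | g/l | g/dl | ÷ 10 |
| MCH | fmol | pg | × 16.11 |
| MCHC | mmol/l | g/dl | × 1.611 |

To convert from conventional to SI, apply the inverse operation.

The urea factor yields mg/dl of urea. A value reported as BUN (blood urea nitrogen) covers only the nitrogen fraction (1 mmol/l urea = 2.801 mg/dl BUN), so do not store a BUN value as urea in mg/dl without converting it.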
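
As an illustration, the conversion and the dual-unit storage could look like the following sketch. The table is an excerpt, and `SI_TO_CONVENTIONAL` and `with_both_units` are hypothetical names chosen for this example; rounding to two decimals is also an assumption.

```python
# marker -> (SI unit, conventional unit, factor), where conventional = SI * factor.
# A divisor in the conversion table becomes 1 / divisor here.
SI_TO_CONVENTIONAL = {
    "hemoglobin": ("mmol/l", "g/dl", 1.611),
    "glucose": ("mmol/l", "mg/dl", 18.02),
    "cholesterol_total": ("mmol/l", "mg/dl", 38.67),
    "creatinine": ("\u00b5mol/l", "mg/dl", 1 / 88.4),  # µmol/l
}


def _unit_key(unit: str) -> str:
    """Compare units case-insensitively and treat Greek mu like the micro sign."""
    return unit.strip().lower().replace("\u03bc", "\u00b5")


def with_both_units(marker: str, value: float, unit: str, digits: int = 2) -> dict:
    """Return the value in conventional and SI units, whichever one the lab printed."""
    si_unit, conv_unit, factor = SI_TO_CONVENTIONAL[marker]
    if _unit_key(unit) == _unit_key(si_unit):
        conventional, si = value * factor, value
    elif _unit_key(unit) == _unit_key(conv_unit):
        conventional, si = value, value / factor
    else:
        raise ValueError(f"unexpected unit {unit!r} for {marker}")
    return {
        "value": round(conventional, digits),
        "unit": conv_unit,
        "value_si": round(si, digits),
        "unit_si": si_unit,
    }
```

For example, `with_both_units("hemoglobin", 9.2, "mmol/l")` stores 14.82 g/dl as the primary value and keeps 9.2 mmol/l in the SI fields.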

Leukocyte/Cell Count Equivalents

These are the same value, just different notation:

  • 10³/µl = Gpt/l = ×10⁹/l = thousand/µl
  • 10⁶/µl = Tpt/l = ×10¹²/l = million/µl
  • /µl = cells/µl (absolute count, no multiplier)

Normalize to: 10³/µl for WBC/platelets, 10⁶/µl for RBC, /µl for absolute differentials.
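
Since these notations differ only in spelling, no arithmetic is needed, only a lookup. A minimal sketch follows; `normalize_count_unit` and the constants are hypothetical names, and the ASCII spellings such as `x10^9/l` are assumed variants that do not appear in the list above.

```python
# Canonical notations, written with escapes to avoid look-alike characters
THOUSAND_PER_UL = "10\u00b3/\u00b5l"  # 10³/µl, used for WBC and platelets
MILLION_PER_UL = "10\u2076/\u00b5l"   # 10⁶/µl, used for RBC
PER_UL = "/\u00b5l"                   # /µl, used for absolute differentials

_SUPERSCRIPTS = str.maketrans(
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079", "0123456789"
)


def _key(unit: str) -> str:
    """Reduce a printed unit to a comparison key (case, spacing, superscripts, mu)."""
    text = unit.strip().lower().replace(" ", "")
    text = text.replace("\u03bc", "\u00b5").replace("\u00d7", "x").replace("^", "")
    text = text.translate(_SUPERSCRIPTS)
    # Drop the leading multiplication sign, e.g. "x109/l" -> "109/l"
    return text[1:] if text.startswith("x10") else text


_ALIASES = {
    THOUSAND_PER_UL: [THOUSAND_PER_UL, "Gpt/l", "x10^9/l", "thousand/\u00b5l"],
    MILLION_PER_UL: [MILLION_PER_UL, "Tpt/l", "x10^12/l", "million/\u00b5l"],
    PER_UL: [PER_UL, "cells/\u00b5l"],
}
_LOOKUP = {_key(alias): canon for canon, aliases in _ALIASES.items() for alias in aliases}


def normalize_count_unit(unit: str) -> str:
    """Map an equivalent cell-count notation to its canonical form."""
    try:
        return _LOOKUP[_key(unit)]
    except KeyError:
        raise ValueError(f"not a known cell-count unit: {unit!r}") from None
```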

Enzyme Activity

  • µkat/l to IU/l (U/l): × 60

Panel Classification

Assign each marker to a panel:

| Panel | Markers |
| --- | --- |
| complete_blood_count | leukocytes, erythrocytes, hemoglobin, hematocrit, platelets, mcv, mch, mchc, rdw, mpv, neutrophils_*, lymphocytes_*, monocytes_*, eosinophils_*, basophils_* |
| lipid_panel | cholesterol_total, triglycerides, hdl, ldl, non_hdl, ldl_hdl_ratio, vldl |
| liver_function | alt, ast, ggt, alp, bilirubin_total, bilirubin_direct, albumin |
| kidney_function | creatinine, urea, gfr, cystatin_c, uric_acid |
| thyroid | tsh, ft3, ft4, t3, t4, anti_tpo, anti_tg |
| iron_studies | ferritin, iron, transferrin, transferrin_saturation, tibc |
| glycemic | glucose, glucose_fasting, hba1c, hba1c_ifcc, insulin, c_peptide |
| vitamin_d | vitamin_d_25oh, vitamin_d_1_25oh |
| vitamin_b12 | vitamin_b12, holotranscobalamin |
| folate | folate |
| inflammation_markers | crp, crp_hs, esr, il6 |
| electrolytes | sodium, potassium, chloride, calcium, magnesium, phosphate |
| hepatitis_screening | hbs_antigen, anti_hbs, anti_hbc, anti_hcv |
| metabolic_basic | glucose_fasting, total_protein, albumin, uric_acid |
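
A sketch of the classification as a Python lookup, using an excerpt of the table. `PANELS`, `panels_for`, and `primary_panel` are hypothetical names. A few markers (albumin, uric_acid, glucose_fasting) are listed under two panels; picking the first match in table order is an assumption of this sketch, not a rule stated above.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Excerpt of the panel table; "*" matches any suffix (percent, absolute)
PANELS = {
    "complete_blood_count": ["leukocytes", "hemoglobin", "platelets", "neutrophils_*", "lymphocytes_*"],
    "liver_function": ["alt", "ast", "ggt", "alp", "bilirubin_total", "bilirubin_direct", "albumin"],
    "kidney_function": ["creatinine", "urea", "gfr", "cystatin_c", "uric_acid"],
    "metabolic_basic": ["glucose_fasting", "total_protein", "albumin", "uric_acid"],
}


def panels_for(marker: str) -> list[str]:
    """Return every panel that lists the marker, in table order."""
    return [
        panel
        for panel, patterns in PANELS.items()
        if any(fnmatchcase(marker, pattern) for pattern in patterns)
    ]


def primary_panel(marker: str) -> Optional[str]:
    """Pick the first matching panel, or None for an unlisted marker."""
    matches = panels_for(marker)
    return matches[0] if matches else None
```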

Output Schema

{
  "date": "YYYY-MM-DD",
  "lab": {
    "id": "lab_identifier_snake_case",
    "name": "Full Lab Name",
    "address": "Address if available",
    "phone": "Phone if available"
  },
  "sample_ids": ["id1", "id2"],
  "collection_time": "HH:MM",
  "panels_included": ["panel1", "panel2"],
  "panels_missing": ["panel3", "panel4"],
  "results": [
    {
      "category": "Category as printed (preserved)",
      "panel": "panel_name",
      "marker": "normalized_marker_id",
      "marker_original": "Original Name As Printed",
      "value": 15.1,
      "unit": "g/dl",
      "value_si": 9.37,
      "unit_si": "mmol/l",
      "reference_min": 13.5,
      "reference_max": 17.2,
      "reference_min_si": 8.38,
      "reference_max_si": 10.67,
      "status": "normal",
      "method": "Method if listed",
      "clinical_note": "Optional note for interpretation"
    },
    {
      "category": "Category",
      "panel": "panel_name",
      "marker": "qualitative_marker",
      "marker_original": "Original Name",
      "value_text": "negative",
      "reference_expected": "negative",
      "status": "normal",
      "method": "Method"
    }
  ],
  "flags": ["marker_status"],
  "summary": {
    "total_markers": 38,
    "normal": 37,
    "low": 1,
    "high": 0,
    "attention_items": [
      "Human readable note about items needing attention"
    ]
  },
  "source_file": "original_filename.pdf",
  "parsed_at": "ISO8601 timestamp",
  "schema_version": "1.1"
}

Field Details

Primary values (value, unit, reference_min, reference_max):

  • Always in conventional units (g/dl, mg/dl, %, etc.)
  • If the lab reports in SI units, convert to conventional and store converted values here

SI values (value_si, unit_si, reference_min_si, reference_max_si):

  • Always in SI units (mmol/l, µmol/l, l/l, etc.)
  • If the lab reports in conventional units, convert to SI and store here
  • Omit these fields if no conversion is needed (e.g., percentages, counts)

Qualitative tests:

  • Use value_text instead of numeric value
  • Use reference_expected instead of min/max
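
These field rules can be checked mechanically after parsing. The sketch below assumes which fields are mandatory (`panel`, `marker`, `marker_original`, `status`); the schema above does not say so explicitly, and `validate_result` is a hypothetical helper.

```python
# Fields assumed to be mandatory for every result entry (an assumption of this sketch)
REQUIRED_FIELDS = ("panel", "marker", "marker_original", "status")


def validate_result(result: dict) -> list[str]:
    """Return a list of problems found in one entry of "results" (empty if none)."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not result.get(field)]

    has_numeric = "value" in result
    has_text = "value_text" in result
    if has_numeric == has_text:
        problems.append("expected exactly one of value / value_text")
    if has_numeric and "unit" not in result:
        problems.append("numeric result without unit")
    if has_text and "reference_expected" not in result:
        problems.append("qualitative result without reference_expected")
    # SI fields are optional, but value and unit only make sense together
    if ("value_si" in result) != ("unit_si" in result):
        problems.append("value_si and unit_si must appear together")
    return problems
```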

Status Determination

  • normal: value within reference range
  • low: value below reference_min
  • high: value above reference_max
  • critical_low: value <50% of reference_min (takes precedence over low)
  • critical_high: value >200% of reference_max (takes precedence over high)

For qualitative tests, compare value_text to reference_expected.
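
For numeric results, the rules above can be sketched as a small function. `determine_status` is a hypothetical name; the critical thresholds are checked before the plain low/high outcome because a critical value also satisfies the weaker condition. A missing bound (`None`) is treated as "no limit on that side", which matches the cholesterol example with `reference_min: null`.

```python
from typing import Optional


def determine_status(
    value: float,
    reference_min: Optional[float] = None,
    reference_max: Optional[float] = None,
) -> str:
    """Classify a numeric value against its reference range."""
    if reference_min is not None and value < reference_min:
        # Below 50% of the lower bound counts as critical
        return "critical_low" if value < 0.5 * reference_min else "low"
    if reference_max is not None and value > reference_max:
        # Above 200% of the upper bound counts as critical
        return "critical_high" if value > 2 * reference_max else "high"
    return "normal"
```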

Clinical Notes

Add clinical_note for:

  • Values technically out of range but clinically favorable (e.g., low HbA1c is good)
  • Values at the edge of normal that may warrant monitoring
  • Important context about the marker
  • Conversion notes if the original used unusual units

Output Instructions

  1. Verify extraction: Print a summary table for the user to review
  2. Highlight flags: Show any values outside reference range
  3. Save JSON: Output filename should be YYYY-MM-DD.json based on the test date
  4. Use UTF-8: Ensure proper encoding of all special characters
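
The flags and the numeric part of the summary can be derived from the results list rather than written by hand. A minimal sketch, assuming the `marker_status` flag format from the schema; `build_flags_and_summary` is a hypothetical name, and counting critical_low and critical_high under low and high is an assumption, since the summary only has those three counters. The human-readable attention_items are left out.

```python
def build_flags_and_summary(results: list[dict]) -> tuple[list[str], dict]:
    """Derive "flags" and the summary counters from the parsed result entries."""
    flags = []
    counts = {"normal": 0, "low": 0, "high": 0}
    for result in results:
        status = result["status"]
        if status != "normal":
            flags.append(f"{result['marker']}_{status}")
        # critical_low / critical_high are counted as low / high (Python 3.9+)
        bucket = status.removeprefix("critical_")
        if bucket in counts:
            counts[bucket] += 1
    return flags, {"total_markers": len(results), **counts}
```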

Example Output Summary

Parsed: Example Lab Name (2025-07-15)
Sample IDs: 12345678
Collection: 09:30

Markers extracted: 42
Panels: complete_blood_count, lipid_panel, liver_function, kidney_function, thyroid, vitamin_d

⚠️  Attention:
  - Vitamin D: 18.5 ng/ml [LOW] (ref: 20-50) - consider supplementation
  - Ferritin: 28 ng/ml [NORMAL but low-end] (ref: 22-275) - monitor if vegetarian/vegan

✓ 41 markers within normal range

Unit conversions applied:
  - Hemoglobin: 9.2 mmol/l → 14.8 g/dl
  - Cholesterol: 4.04 mmol/l → 156 mg/dl
  - Glucose: 4.83 mmol/l → 87 mg/dl

Saved to: 2025-07-15.json

Example JSON (Partial)

{
  "date": "2025-07-15",
  "lab": {
    "id": "example_lab_berlin",
    "name": "Example Lab Berlin GmbH",
    "address": "Musterstraße 123, 10115 Berlin"
  },
  "sample_ids": ["12345678"],
  "collection_time": "09:30",
  "panels_included": ["complete_blood_count", "lipid_panel", "thyroid", "vitamin_d", "hepatitis_screening"],
  "panels_missing": ["inflammation_markers"],
  "results": [
    {
      "category": "Hämatologie",
      "panel": "complete_blood_count",
      "marker": "hemoglobin",
      "marker_original": "Hämoglobin",
      "value": 14.8,
      "unit": "g/dl",
      "value_si": 9.2,
      "unit_si": "mmol/l",
      "reference_min": 13.5,
      "reference_max": 17.2,
      "reference_min_si": 8.38,
      "reference_max_si": 10.67,
      "status": "normal",
      "method": "Photometry"
    },
    {
      "category": "Lipide",
      "panel": "lipid_panel",
      "marker": "cholesterol_total",
      "marker_original": "Cholesterin",
      "value": 156,
      "unit": "mg/dl",
      "value_si": 4.04,
      "unit_si": "mmol/l",
      "reference_min": null,
      "reference_max": 200,
      "reference_min_si": null,
      "reference_max_si": 5.17,
      "status": "normal",
      "method": "Enzymatic"
    },
    {
      "category": "Vitamine",
      "panel": "vitamin_d",
      "marker": "vitamin_d_25oh",
      "marker_original": "Vitamin D3 (25-OH)",
      "value": 18.5,
      "unit": "ng/ml",
      "reference_min": 20,
      "reference_max": 50,
      "status": "low",
      "method": "ECLIA",
      "clinical_note": "Below optimal range - consider supplementation, especially in winter months"
    },
    {
      "category": "Serologie",
      "panel": "hepatitis_screening",
      "marker": "hbs_antigen",
      "marker_original": "HBs-Antigen",
      "value_text": "negative",
      "reference_expected": "negative",
      "status": "normal",
      "method": "CMIA",
      "clinical_note": "No evidence of Hepatitis B infection"
    }
  ],
  "flags": ["vitamin_d_25oh_low"],
  "summary": {
    "total_markers": 42,
    "normal": 41,
    "low": 1,
    "high": 0,
    "attention_items": [
      "Vitamin D: 18.5 ng/ml is below reference range (20-50) - consider supplementation"
    ]
  },
  "source_file": "lab_results_2025-07-15.pdf",
  "parsed_at": "2025-07-15T14:30:00Z",
  "schema_version": "1.1"
}
#!/usr/bin/env python3
"""
Parse a blood lab PDF with Claude via the Anthropic API.

Usage:
    python parse_blood_pdf.py <pdf_file>
    python parse_blood_pdf.py befund-21012026.pdf

Requirements:
    - anthropic Python package (pip install anthropic)
    - ANTHROPIC_API_KEY environment variable set
    - PDF file accessible at the given path
"""
import argparse
import base64
import json
import os
import sys
from datetime import datetime
from pathlib import Path
from typing import Optional

try:
    import anthropic
except ImportError:
    print("Error: anthropic package not installed. Run: pip install anthropic", file=sys.stderr)
    sys.exit(1)


def load_command_prompt() -> str:
    """Load the parse-blood-results command prompt."""
    command_path = Path(__file__).parent / "commands" / "parse-blood-results.md"
    if not command_path.exists():
        # Fall back to a path relative to the current working directory
        command_path = Path("commands/parse-blood-results.md")
    if not command_path.exists():
        raise FileNotFoundError(
            f"Command file not found at {command_path}. "
            "Ensure parse-blood-results.md is in the commands/ directory."
        )
    # The prompt contains non-ASCII characters, so read it as UTF-8 explicitly
    return command_path.read_text(encoding="utf-8")


def load_pdf_as_base64(pdf_path: str) -> tuple[str, str]:
    """Load PDF file and return base64 encoded content and filename."""
    path = Path(pdf_path)
    if not path.exists():
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"File must be a PDF: {pdf_path}")
    with open(path, "rb") as f:
        content = base64.standard_b64encode(f.read()).decode("utf-8")
    return content, path.name


def parse_blood_pdf(pdf_path: str, output_dir: Optional[str] = None, quiet: bool = False) -> dict:
    """
    Parse a blood lab PDF using Claude.

    Args:
        pdf_path: Path to the PDF file
        output_dir: Optional output directory for JSON (default: ./health-data/raw/blood-results/)
        quiet: Suppress progress and summary output on stdout

    Returns:
        Parsed results as a dictionary
    """
    # Load the command prompt
    command_prompt = load_command_prompt()

    # Load the PDF
    pdf_base64, pdf_filename = load_pdf_as_base64(pdf_path)

    # Initialize Anthropic client (reads ANTHROPIC_API_KEY from the environment)
    client = anthropic.Anthropic()

    prompt_text = (
        "Parse this blood lab PDF according to the following instructions.\n\n"
        f"{command_prompt}\n\n"
        "After parsing, output ONLY the JSON (no markdown code blocks, no explanation before or after).\n"
        "The JSON should be valid and complete.\n\n"
        f"Source filename: {pdf_filename}\n"
        f"Parsed at: {datetime.now().isoformat()}\n"
    )

    # Create the message with PDF attachment
    if not quiet:
        print(f"Sending {pdf_filename} to Claude for parsing...")
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=8192,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "document",
                        "source": {
                            "type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_base64,
                        },
                    },
                    {"type": "text", "text": prompt_text},
                ],
            }
        ],
    )

    # Extract the response
    response_text = message.content[0].text.strip()

    # Try to parse as JSON
    try:
        # Handle case where Claude wraps the JSON in a code block
        if response_text.startswith("```"):
            json_lines = []
            in_block = False
            for line in response_text.split("\n"):
                if line.startswith("```") and not in_block:
                    in_block = True
                    continue
                elif line.startswith("```") and in_block:
                    break
                elif in_block:
                    json_lines.append(line)
            response_text = "\n".join(json_lines)
        results = json.loads(response_text)
    except json.JSONDecodeError as e:
        print(f"Warning: Could not parse response as JSON: {e}", file=sys.stderr)
        print("Raw response:", file=sys.stderr)
        print(response_text[:2000], file=sys.stderr)
        return {"error": "Failed to parse JSON", "raw_response": response_text}

    # Determine output path
    if output_dir is None:
        output_dir = Path("./health-data/raw/blood-results")
    else:
        output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Generate output filename from date
    test_date = results.get("date", datetime.now().strftime("%Y-%m-%d"))
    output_file = output_dir / f"{test_date}.json"

    # Handle existing file: add the lab id instead of overwriting
    if output_file.exists():
        lab_id = results.get("lab", {}).get("id", "unknown")
        output_file = output_dir / f"{test_date}_{lab_id}.json"

    # Save the results
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    if not quiet:
        print(f"\n✓ Saved to: {output_file}")
        # Print summary
        print_summary(results)
    return results


def print_summary(results: dict):
    """Print a human-readable summary of the parsed results."""
    print("\n" + "=" * 60)
    print(f"Parsed: {results.get('lab', {}).get('name', 'Unknown Lab')} ({results.get('date', 'Unknown date')})")
    summary = results.get("summary", {})
    print(f"Markers extracted: {summary.get('total_markers', len(results.get('results', [])))}")
    panels = results.get("panels_included", [])
    if panels:
        print(f"Panels: {', '.join(panels)}")

    # Show attention items; fall back to the raw flags if no notes were generated
    attention = summary.get("attention_items", [])
    flags = results.get("flags", [])
    if attention or flags:
        print("\n⚠️ Attention:")
        for item in attention or flags:
            print(f" - {item}")

    # Count by status
    low = summary.get("low", 0)
    high = summary.get("high", 0)
    if low == 0 and high == 0:
        print("\n✓ All markers within normal range")
    else:
        if low > 0:
            print(f"\n⬇️ {low} marker(s) below range")
        if high > 0:
            print(f"⬆️ {high} marker(s) above range")
    print("=" * 60)


def main():
    parser = argparse.ArgumentParser(
        description="Parse blood lab PDF into structured JSON using Claude"
    )
    parser.add_argument(
        "pdf_file",
        help="Path to the blood lab PDF file"
    )
    parser.add_argument(
        "-o", "--output-dir",
        help="Output directory for JSON files (default: ./health-data/raw/blood-results/)",
        default=None
    )
    parser.add_argument(
        "--json-only",
        action="store_true",
        help="Output only the JSON to stdout (for piping)"
    )
    args = parser.parse_args()

    # Check for API key
    if not os.environ.get("ANTHROPIC_API_KEY"):
        print("Error: ANTHROPIC_API_KEY environment variable not set", file=sys.stderr)
        sys.exit(1)

    try:
        # With --json-only, suppress progress output so stdout carries only the JSON
        results = parse_blood_pdf(args.pdf_file, args.output_dir, quiet=args.json_only)
        if args.json_only:
            print(json.dumps(results, indent=2, ensure_ascii=False))
        if "error" in results:
            sys.exit(1)
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()