LVM Thin Pool Disaster: Lessons Learned from a Proxmox Media Server

TL;DR: Don't Set Up Thin Pools Like This

I spent hours fighting filesystem corruption on my Proxmox media server. The root cause? An LVM thin pool that left only 20MB of physical free space on the volume group. Here's what went wrong and how to avoid it.

The Setup

  • Hardware: Proxmox VE 8.4.1 with Seagate SlimBUP 931GB external drive
  • Storage: LVM thin pool (SlimBUP) with containers for Plex and media services
  • Problem: Thin pool allocated 931.5GB out of 931.51GB physical space (leaving only 20MB free)

The Symptoms

Initial Signs

  • Filesystem corruption errors: EXT4-fs error: Detected aborted journal
  • I/O errors: Buffer I/O error on dev dm-13, logical block 9258
  • Thin pool metadata failures: device-mapper: thin: dm_thin_find_block() failed: error = -5
  • Filesystem remounted read-only
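
If you suspect the same failure mode, the kernel log is the quickest way to confirm it. A minimal check (the grep patterns match the messages above):

# Scan the kernel ring buffer for thin pool and ext4 errors
dmesg -T | grep -Ei 'dm_thin|ext4-fs error|buffer i/o error'

# Or search the persistent kernel journal
journalctl -k | grep -i 'device-mapper: thin'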

What Triggered It

  1. Downloading media files (~14GB) pushed thin pool from 70% → 80%
  2. Plex transcoding - writing temp files triggered metadata operations
  3. Running fsck - filesystem repair operations write to physical volume
  4. Any write-heavy operation at high capacity

The Root Cause

Thin Pool Design Flaw

# Physical volume status
PV         VG      Fmt  Attr PSize   PFree
/dev/sdd1  SlimBUP lvm2 a--  931.51g 20.00m

# Thin pool allocation
[SlimBUP_tdata]   - 912.86g (data storage)
[SlimBUP_tmeta]   - 9.32g   (metadata)
[lvol0_pmspare]   - 9.32g   (spare metadata)
Total: 931.5g (leaving only 20MB PFree)

The Problem: Thin pools need physical free space for:

  • Metadata updates during write operations
  • Copy-on-write operations
  • Journal operations
  • Filesystem repairs

With only 20MB free, any operation that writes to the physical volume causes metadata allocation failures, triggering cascading corruption.
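
You can see how close to the edge a pool is with standard LVM reporting (these are stock pvs/lvs output fields):

# Physical free space left on the volume group
pvs -o pv_name,vg_name,pv_size,pv_free

# Data and metadata fill levels for the pool and its internal volumes
lvs -a -o lv_name,lv_size,data_percent,metadata_percent SlimBUP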

What Didn't Work

Attempt 1: Deleting Media

  • Deleted 100GB of media files
  • Thin pool usage dropped from 89% → 64%
  • BUT: Filesystem metadata was corrupt, so free space wasn't registered
  • Required: e2fsck to fix the metadata, then fstrim to tell the thin pool about the freed blocks (full commands in the Recovery Procedure below)

Attempt 2: Filesystem Repair

  • Ran e2fsck -f -y /dev/SlimBUP/vm-103-disk-0
  • Fixed corrupt metadata
  • BUT: fsck itself writes to the physical volume, which had only 20MB free
  • Result: Created MORE corruption while trying to fix corruption

Attempt 3: Just Using the System Normally

  • Thought we were safe at 73% thin pool usage (242GB free internally)
  • User played a video in Plex
  • Plex transcoder wrote temp files
  • Triggered the same corruption cascade

The Real Fix

Immediate Solution

Move Plex transcoding off the thin pool:

# Create transcoder directory on Proxmox host (not on thin pool)
mkdir -p /var/lib/plex-transcoder
chown 100000:100000 /var/lib/plex-transcoder

# Add mount point to Plex container
pct set 101 -mp2 /var/lib/plex-transcoder,mp=/transcode

# Restart Plex
pct stop 101 && pct start 101

Then configure Plex:

  1. Settings → Transcoder
  2. Set "Transcoder temporary directory" to /transcode
  3. Save

Why this works: Transcoding temp files now write to the Proxmox host's local storage (49GB free) instead of the thin pool, whose physical volume has only 20MB free.
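
A quick sanity check that the transcoder really lands on host storage (container ID and paths from the setup above):

# From the Proxmox host: confirm the bind mount inside the container
pct exec 101 -- df -h /transcode

# Watch temp files appear on the host during playback
watch -n 5 'du -sh /var/lib/plex-transcoder'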

Long-term Solutions

Option 1: Rebuild Thin Pool Correctly

WRONG WAY (what was done):

# Allocates entire physical volume
lvcreate -T -L 912.86g SlimBUP/SlimBUP

RIGHT WAY:

# Leave 10-15% physical space free
lvcreate -T -L 800g SlimBUP/SlimBUP
# Leaves roughly 110-130GB physical free (metadata and spare also allocate from the VG)

Option 2: Don't Use Thin Pools

Use regular LVM volumes or direct filesystem on partitions:

# Create regular logical volume instead
lvcreate -L 800g -n media SlimBUP
mkfs.ext4 /dev/SlimBUP/media

Pros: No thin pool metadata issues, simpler management
Cons: No snapshots, no overprovisioning

Option 3: Use ZFS Instead

Proxmox supports ZFS which handles thin provisioning better:

zpool create -o ashift=12 media /dev/sdd
zfs create -o compression=lz4 media/storage
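
For the pool to hold container and VM disks it also has to be registered as Proxmox storage; a sketch, with media-zfs as an assumed storage ID:

# Register the dataset as a Proxmox storage backend
pvesm add zfspool media-zfs --pool media/storage --content rootdir,images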

Recovery Procedure

When corruption happens:

# 1. Stop all containers using the thin pool
pct stop 101
pct stop 103

# 2. Deactivate and reactivate the volume
lvchange -an SlimBUP/vm-103-disk-0
lvchange -ay SlimBUP/vm-103-disk-0

# 3. Run filesystem check
e2fsck -f -y /dev/SlimBUP/vm-103-disk-0

# 4. Mount and trim to reclaim space
mount /dev/SlimBUP/vm-103-disk-0 /mnt
fstrim -v /mnt
umount /mnt

# 5. Check thin pool status
lvs SlimBUP

# 6. Restart containers
pct start 103
pct start 101

Key Lessons

1. Always Leave Physical Free Space

  • Minimum 10-15% of physical volume should remain unallocated
  • Thin pools need room for metadata operations
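
An easy way to enforce this at creation time is to size the pool as a percentage of the volume group rather than an absolute figure:

# Allocate at most 85% of the VG to the thin pool; the rest stays unallocated
lvcreate -T -l 85%VG SlimBUP/SlimBUP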

2. Separate Write-Heavy Operations

  • Transcoding, temp files, downloads → separate storage
  • Media library → can live on thin pool
  • Don't mix temp/volatile with permanent storage

3. Monitor Thin Pool Usage

# Check regularly
lvs -o lv_name,data_percent,metadata_percent SlimBUP

# Set up alerts at:
# - 80% data usage (warning)
# - 85% data usage (critical)
# - 50% metadata usage (warning)
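
The alert comments above can be made executable. A minimal cron-friendly sketch (pool name and thresholds from this setup; cron mails any output to root by default):

#!/bin/bash
# Warn when the thin pool crosses the thresholds noted above
POOL="SlimBUP/SlimBUP"
DATA=$(lvs --noheadings -o data_percent "$POOL" | tr -d ' ')
META=$(lvs --noheadings -o metadata_percent "$POOL" | tr -d ' ')

# Strip the decimal part for integer comparison
if [ "${DATA%.*}" -ge 80 ] || [ "${META%.*}" -ge 50 ]; then
    echo "WARNING: $POOL at ${DATA}% data / ${META}% metadata on $(hostname)"
fi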

4. The 20MB Physical Free is Normal... Until It's Not

  • Thin pools can run for weeks with 20MB PFree
  • Normal operations (downloads, streaming) write INSIDE the thin pool
  • Problems occur when operations touch the physical volume:
    • Filesystem checks (fsck)
    • LVM metadata updates
    • Heavy write operations at high capacity

5. Corruption Cascades

Once corruption starts:

  1. Filesystem goes read-only
  2. Can't fix filesystem without stopping containers
  3. Running fsck writes to physical volume (triggers more corruption)
  4. Each repair attempt can make it worse
  5. Need to stop EVERYTHING to break the cycle

Best Practices for Proxmox Media Servers

Storage Layout

Physical Drive (931GB)
├── LVM Physical Volume (931GB)
│   ├── Thin Pool (800GB max) ← Leave headroom!
│   │   ├── vm-101-disk-0 (16GB) - Plex OS
│   │   └── vm-103-disk-0 (820GB) - Media storage
│   └── Unallocated (131GB) ← Breathing room
│
Host Local Storage (for temp files)
├── /var/lib/plex-transcoder (transcoding temp)
└── /var/tmp/downloads (incomplete downloads)
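
The commands to build that layout, roughly (sizes are the ones shown; 16GB + 820GB of virtual size on an 800GB pool is deliberate overprovisioning, which thin pools allow):

# Thin pool capped below the PV size
lvcreate -T -L 800g SlimBUP/SlimBUP

# Thin volumes inside the pool (virtual sizes, allocated on demand)
lvcreate -V 16g -T SlimBUP/SlimBUP -n vm-101-disk-0
lvcreate -V 820g -T SlimBUP/SlimBUP -n vm-103-disk-0

In practice Proxmox creates the vm-*-disk volumes itself when you provision containers; the part that matters is capping the pool size.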

Container Configuration

Plex Container (/etc/pve/lxc/101.conf):

mp0: SlimBUP:vm-103-disk-0,mp=/shared_root  # Media access
mp1: /mnt/plex-usb,mp=/mnt/usb-media        # USB archive
mp2: /var/lib/plex-transcoder,mp=/transcode # Temp transcoding

Media Services Container (/etc/pve/lxc/103.conf):

  • Radarr, Sonarr, SABnzbd should run with PUID=1000 PGID=1000
  • Prevents permission issues with media files

Radarr Docker Setup

docker run -d \
  --name radarr \
  --restart=unless-stopped \
  -e PUID=1000 \
  -e PGID=1000 \
  -e TZ=America/New_York \
  -v /mnt/media/config/radarr:/config \
  -v /mnt/media/downloads:/data/downloads \
  -v /mnt/media/movies-slimbup:/data/movies \
  --network=host \
  linuxserver/radarr

Signs You're About to Have a Bad Time

Watch for these warnings:

# Check thin pool usage
lvs SlimBUP

# If you see:
Data% > 80%  # Warning zone
Data% > 85%  # Danger zone
Meta% > 50%  # Metadata issues incoming

# Check physical free space
pvs

# If you see:
PFree < 100GB  # Concerning
PFree < 50GB   # Dangerous
PFree < 1GB    # You're gonna have a bad time

Related Issues to Watch For

Permission Problems

Files created by containers may have wrong ownership:

# Fix ownership (from host)
pct exec 103 -- chown -R 1000:1000 /mnt/media/movies-slimbup/

Bind Mount Visibility

If Container A bind-mounts Container B's filesystem, and Container B mounts additional drives after the bind mount is created, Container A won't see them. Solution: mount drives before starting containers; see the check below.
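
A simple guard is to refuse to start the container unless the host mount is actually present (paths and ID from the config above):

# Only start Plex once the USB drive is really mounted
if mountpoint -q /mnt/plex-usb; then
    pct start 101
else
    echo "/mnt/plex-usb is not mounted; not starting container 101" >&2
fi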

Conclusion

Don't create thin pools that use 100% of physical volume space. The 20MB physical free is a ticking time bomb. Leave 10-15% unallocated, move write-heavy operations off the thin pool, and monitor usage closely.

Your future self will thank you when you're not spending hours fighting filesystem corruption at 2 AM.


Lessons learned the hard way on 2025-12-07. Total downtime: ~4 hours. Data lost: 0 (got lucky).
