Incident Report: Server M2DT — Duplicate Telegraf + EMQX Unhealthy

Date: 15 Mar 2026, 15:40-18:00 ICT
Reported by: User report
Resolved by: AI Operations Team + Guardian (DustBoy PhD Oracle)
Severity: HIGH (server outage; this must not happen)

Summary

Server M2DT (mqtt.laris.co) had two layered problems:

RAM at 94%: two overlapping Telegraf stacks (old from Oct 2025 + new from Mar 2026)
EMQX unhealthy for 12 days: 6,044 conn_congestion events → pipeline stall → 9-hour data gap

Timeline

Time	Event
3 Mar	EMQX Erlang node unhealthy; conn_congestion begins
3-14 Mar	Data still partially flowing, with repeated congestion warnings
12-13 Mar	First OOM fix: removed dead InfluxDB config, restarted pipeline, added swap; EMQX was not fixed
15 Mar ~08:25	Pipeline stall: Telegraf outputs.mqtt blocked → data stops flowing to InfluxDB
15:40	Notified that the server has a problem
15:41	SSH check finds RAM 94%, swap 0B, 39 telegraf containers
15:47	Cause identified: 7 old telegraf processes from 2025, each using 570-600MB (~4GB total)
15:49	sudo kill old processes → RAM 94% → 53%
15:50	docker stop old stack 12 containers → RAM 53% → 42%
15:52	เพิ่ม swap 2GB (permanent)
16:xx	Guardian clears disk, freeing 10.2G → Disk 89% → 72%
17:20	Guardian finds EMQX unhealthy for 12 days: 6,044 conn_congestion events
17:32	Restart EMQX + pipeline → 20 telegraf reconnected, data flowing
17:35	InfluxDB last write confirmed — data gap closed

Root Cause

1. RAM: Two overlapping Telegraf stacks

On 13 Mar 2026, during the first OOM fix:

docker compose down && up was run in the new directory
but the old stack (from Oct 2025) was not brought down
The two stacks use different Docker Compose prefixes → Docker does not treat them as the same stack
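
One way to catch this kind of overlap is to group running containers by the labels Docker Compose attaches to them. The sketch below is an illustration, not the script used during the incident. It assumes container metadata has already been collected (for example from docker inspect) into plain dictionaries. The function name find_duplicate_services and the sample project names are hypothetical. The label keys com.docker.compose.project and com.docker.compose.service are the ones Compose sets on containers it creates.

```python
from collections import defaultdict

PROJECT_LABEL = "com.docker.compose.project"
SERVICE_LABEL = "com.docker.compose.service"


def find_duplicate_services(containers):
    """Return {service: sorted project names} for services that run
    under more than one Compose project at the same time."""
    projects_by_service = defaultdict(set)
    for container in containers:
        labels = container.get("labels", {})
        project = labels.get(PROJECT_LABEL)
        service = labels.get(SERVICE_LABEL)
        if project and service:
            projects_by_service[service].add(project)
    return {
        service: sorted(projects)
        for service, projects in projects_by_service.items()
        if len(projects) > 1
    }


# Hypothetical sample: the same service started from two directories.
sample = [
    {"name": "old_telegraf_1", "labels": {PROJECT_LABEL: "old", SERVICE_LABEL: "telegraf"}},
    {"name": "new-telegraf-1", "labels": {PROJECT_LABEL: "new", SERVICE_LABEL: "telegraf"}},
    {"name": "new-mosquitto-1", "labels": {PROJECT_LABEL: "new", SERVICE_LABEL: "mosquitto"}},
]
print(find_duplicate_services(sample))
```

A check like this covers only containers. The 7 old telegraf instances that were stopped with kill would need a separate process-level check.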

Stack	Instances	RAM each	Total	Age
Old	7 processes + 12 containers	570-600MB	~4GB	Oct 2025 (5 months)
New	20	67-77MB	~1.5GB	Mar 13 (2 days)
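
The totals in the table can be checked with quick arithmetic. The sketch below multiplies the per-instance ranges by the instance counts from the timeline (7 old processes, 20 new instances). It only restates the report's figures. Decimal units (1GB = 1000MB) are an assumption.

```python
def total_range_gb(count, low_mb, high_mb):
    """Return (low, high) total memory in GB for `count` instances."""
    return (count * low_mb / 1000, count * high_mb / 1000)


old_low, old_high = total_range_gb(7, 570, 600)
new_low, new_high = total_range_gb(20, 67, 77)
print(f"old: {old_low:.2f}-{old_high:.2f} GB")
print(f"new: {new_low:.2f}-{new_high:.2f} GB")
```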

Telegraf keeps a metric buffer: the longer it runs, the more the buffer accumulates, and because the old stack still had a dead output configured, its buffer grew even larger.

2. EMQX unhealthy for 12 days

EMQX (the main MQTT broker, receiving data from 920+ sensors) had an Erlang node that had not answered ping for 12 days, while its TCP listener kept working:

Sensors (920+) → EMQX (:1883 host) → bridge → Mosquitto (Docker)
                  ↑ UNHEALTHY                        │
                  │ Erlang: no ping reply            ▼
                  │ conn_congestion 6,044x       21x Telegraf
                  │                                   │
                  ↓                         InfluxDB (CCDC)
           Telegraf outputs.mqtt
           BLOCKED HERE!
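
The diagram shows why a port check alone was misleading: the listener on :1883 accepted connections while the Erlang node did not answer ping. A health check can combine both signals. The following is a minimal sketch under assumptions. The host, port, and ping command are parameters to adapt. `emqx ping` is used as the default on the assumption that the EMQX install provides it and that it exits with status 0 when the node answers. The function names are hypothetical.

```python
import socket
import subprocess


def tcp_listener_up(host, port, timeout=3.0):
    """True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def node_answers_ping(command=("emqx", "ping"), timeout=10.0):
    """True if the node ping command exits with status 0 (assumed convention)."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def classify_broker(listener_up, node_up):
    """Combine the two probes into one status string."""
    if not listener_up:
        return "down"
    if node_up:
        return "healthy"
    return "degraded: listener up, node not answering"


if __name__ == "__main__":
    print(classify_broker(tcp_listener_up("127.0.0.1", 1883), node_answers_ping()))
```

The third state is the one this incident was in for 12 days: a plain status or port check would report the broker as up.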

3. Disk: Logs without rotation

The mosquitto-ws-bridge log file reached 4.4G with no rotation → Disk 89%
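
For container logs, Docker's json-file logging driver supports size-based rotation through the max-size and max-file options in /etc/docker/daemon.json. The sketch below builds such a fragment. The values 50m and 3 are illustrative assumptions, not settings taken from this server. The report does not say whether the 4.4G file was a Docker-managed log or a file written by the bridge itself, so this may cover only part of the fix.

```python
import json

# Illustrative values only; choose limits that fit the disk budget.
daemon_config = {
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "50m",  # rotate each container log at about 50 MB
        "max-file": "3",    # keep at most 3 log files per container
    },
}

print(json.dumps(daemon_config, indent=2))
```

Defaults set in daemon.json take effect after the Docker daemon restarts and apply to newly created containers, so existing containers would need to be recreated to pick them up.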

Impact

Severity: HIGH (server outage; this must not happen)
Data gap: ~9 hours (08:25-17:35 ICT on 15 Mar)
EMQX degraded: 12 days (3-15 Mar)
Affected: 920+ sensors, 80+ DustBoy (30 Model-N + 53 Model-T)
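
The data gap figure follows directly from the two timeline entries. A short check, treating the approximate 08:25 stall time as exact for the purpose of the arithmetic:

```python
from datetime import datetime

stall_start = datetime(2026, 3, 15, 8, 25)    # pipeline stall (approximate, ICT)
write_confirmed = datetime(2026, 3, 15, 17, 35)  # InfluxDB last write confirmed (ICT)

gap = write_confirmed - stall_start
print(gap)
```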

Resolution

Action	Result
kill 7 old telegraf processes	RAM 94% → 53% (3.2GB reclaimed)
docker stop 12 old containers	RAM 53% → 42%
add 2GB swap (permanent)	safety net against OOM
Guardian clears disk	Disk 89% → 72% (freed 10.2G)
restart EMQX	Erlang node back up, congestion cleared
restart pipeline	20 telegraf reconnected stable

Final State

Metric	Before	After
RAM	94% (7.3G)	42% (3.3G)
RAM free	166MB	3.6GB
Swap	0B	2.0GB
Disk	89%	72%
Containers	53	35
Telegraf	39 (2 stacks)	20 (1 stack)
EMQX	unhealthy 12d	healthy
InfluxDB	9hr gap	data flowing

Prevention

Done

Kill duplicate telegraf stack
Add 2GB swap (permanent)
Restart EMQX + pipeline
Disk cleanup freed 10.2G
Appointed Guardian (DustBoy PhD Oracle) as Pipeline Guardian
Created mqtt-health skill + health check script
Added data freshness check (InfluxDB last write delta)
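
A data freshness check of this kind compares the timestamp of the most recent InfluxDB write against the current time. The sketch below shows only the comparison logic. How the last-write timestamp is obtained (query, bucket, measurement) is left out because the report does not describe it. The 30-minute threshold is borrowed from the write-health item in the list that follows, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=30)  # assumed threshold


def is_stale(last_write, now=None, max_age=MAX_AGE):
    """True if the newest write is older than max_age, or if there is none."""
    if last_write is None:
        return True
    now = now or datetime.now(timezone.utc)
    return now - last_write > max_age


# Times from the incident, converted from ICT (UTC+7) to UTC.
now = datetime(2026, 3, 15, 10, 35, tzinfo=timezone.utc)     # 17:35 ICT
stalled = datetime(2026, 3, 15, 1, 25, tzinfo=timezone.utc)  # 08:25 ICT

print(is_stale(stalled, now))
print(is_stale(now - timedelta(minutes=5), now))
```

Treating a missing timestamp as stale is a deliberate choice here: a query that returns nothing should raise an alert rather than pass silently.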

Not yet done

Docker log rotation (requires a Docker restart)
mosquitto-ws-bridge log rotation
Disk monitoring alert (80% warn, 90% critical)
InfluxDB write health check (alert if no writes 30min)
Hourly cron health check
EMQX → Mosquitto migration (reduces complexity)
Topic restriction / ACL (donaus/floodboy should not see DustBoy data)
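
The disk monitoring item in the list above (80% warn, 90% critical) could be sketched as follows. This is a minimal illustration with hypothetical function names. It uses shutil.disk_usage from the standard library, whose percentage can differ slightly from what df reports because of reserved blocks. Alert delivery is left out.

```python
import shutil


def disk_percent_used(path="/"):
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_level(percent_used, warn=80.0, critical=90.0):
    """Map a usage percentage to 'ok', 'warn', or 'critical'."""
    if percent_used >= critical:
        return "critical"
    if percent_used >= warn:
        return "warn"
    return "ok"


print(disk_level(89.0))  # usage before cleanup
print(disk_level(72.0))  # usage after cleanup
print(disk_level(disk_percent_used("/")))
```

With these thresholds the 89% reading from before the cleanup would already have raised a warning.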

Lessons Learned

Docker Compose prefix matters: docker compose down in one directory does not stop a stack started from another directory if the project names differ
Long-running Telegraf accumulates RAM: buffers build up behind dead outputs and grow the longer it runs
Swap is a necessary safety net: a server running 20+ containers should always have swap
There must be an owner: a pipeline with no owner goes down and nobody notices
Check data freshness, not just service status: every container running does not mean data is flowing
EMQX looked fine while broken: the TCP listener worked but the Erlang node was dead → checks must go deeper than docker ps

AI Operations Team + Guardian (DustBoy PhD Oracle AI) | 15 March 2026

nazt/mqtt-incident-report-redacted.md
