@andrewleech
Created February 23, 2026 19:42
MicroPython CI flaky test analysis — 575 master push runs over 14 months

MicroPython CI Flaky Test Report

Analysis of intermittent test failures in the unix port GitHub Actions workflow on the master branch of micropython/micropython.

Methodology

Job-level pass/fail data was collected via the GitHub API for all ports_unix.yml workflow runs triggered by pushes to master. For the subset of runs where GitHub still retains logs (roughly the last 90 days), the specific failing test name was extracted from the run-tests.py --print-failures output.

Only the unix port workflow has test failures on master. All other workflows (ports_qemu.yml, ports_stm32.yml, ports_esp32.yml, ports_rp2.yml, etc.) pass consistently.

Data collected:

  • 575 non-cancelled push-triggered runs from 2024-12-19 to 2026-02-12
  • Job-level pass/fail status for all 20 jobs in each of 103 failed runs
  • Test-level failure details from 20 runs with available log data
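
As a rough illustration of the job-level tally, the sketch below counts failed runs and failed jobs from records that have already been fetched. The record shape (a run dict with `conclusion` and a nested `jobs` list) is an assumption made for this example, loosely modelled on the `conclusion` and `name` fields the GitHub REST API reports for workflow runs and jobs. The fetching step is omitted, and `summarize_runs` and `sample_runs` are hypothetical.

```python
from collections import Counter


def summarize_runs(runs):
    """Return (non-cancelled run count, failed run count, failures per job name)."""
    # Cancelled runs are excluded from the denominator, as in the report.
    kept = [run for run in runs if run["conclusion"] != "cancelled"]
    failed = [run for run in kept if run["conclusion"] == "failure"]
    job_failures = Counter(
        job["name"]
        for run in failed
        for job in run["jobs"]
        if job["conclusion"] == "failure"
    )
    return len(kept), len(failed), job_failures


# Hypothetical records, not real CI data.
sample_runs = [
    {"conclusion": "success", "jobs": [{"name": "standard", "conclusion": "success"}]},
    {"conclusion": "failure", "jobs": [
        {"name": "macos", "conclusion": "failure"},
        {"name": "standard", "conclusion": "success"},
    ]},
    {"conclusion": "cancelled", "jobs": []},
]

total, failed, per_job = summarize_runs(sample_runs)
print(total, failed, dict(per_job))
```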

Overall Failure Rate

103 of 575 non-cancelled master push runs failed: 17.9% per-run failure rate.

87% of failed runs had exactly one job fail. The failures are distributed across 15 of the 20 jobs in the workflow, with no single job dominating.

| Failed jobs per run | Occurrences |
|---|---|
| 1 | 90 |
| 2 | 9 |
| 3 | 2 |
| 5 | 2 |
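
The headline figures can be re-derived from the distribution above. This is a small arithmetic check, using only numbers quoted in this report; the variable names are illustrative.

```python
# Failed jobs per run -> number of failed runs with that many failed jobs.
failed_jobs_per_run = {1: 90, 2: 9, 3: 2, 5: 2}
total_runs = 575

failed_runs = sum(failed_jobs_per_run.values())
job_failures = sum(jobs * runs for jobs, runs in failed_jobs_per_run.items())
run_failure_rate = failed_runs / total_runs
single_job_share = failed_jobs_per_run[1] / failed_runs

print(f"failed runs: {failed_runs}, individual job failures: {job_failures}")
print(f"per-run failure rate: {run_failure_rate:.1%}")
print(f"failed runs with exactly one failed job: {single_job_share:.0%}")
```

The 124 individual job failures computed here equal the sum of the per-job failure counts reported later in this document.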

Monthly Breakdown

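| Month | Runs | Failures | Rate |
|---|---|---|---|
| 2024-12 | 10 | 3 | 30.0% |
| 2025-01 | 32 | 5 | 15.6% |
| 2025-02 | 41 | 8 | 19.5% |
| 2025-03 | 37 | 3 | 8.1% |
| 2025-04 | 40 | 4 | 10.0% |
| 2025-05 | 50 | 7 | 14.0% |
| 2025-06 | 48 | 5 | 10.4% |
| 2025-07 | 55 | 15 | 27.3% |
| 2025-08 | 48 | 15 | 31.2% |
| 2025-09 | 46 | 2 | 4.3% |
| 2025-10 | 49 | 8 | 16.3% |
| 2025-11 | 37 | 8 | 21.6% |
| 2025-12 | 28 | 5 | 17.9% |
| 2026-01 | 32 | 8 | 25.0% |
| 2026-02 | 22 | 7 | 31.8% |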

The July/August 2025 spike (27-31%) correlates with the GitHub Actions macOS runner migration to macOS 15 (announced August 4, 2025), which produced 11 macOS job failures in August alone. The September 2025 dip (4.3%) has no obvious explanation beyond normal variance.

Per-Job Failure Rates

Measured directly from 575 push runs. Each job executes once per workflow run.

| Job | Failures | Rate | Runner |
|---|---|---|---|
| settrace_stackless | 25 | 4.3% | ubuntu-latest |
| macos | 25 | 4.3% | macos-26 |
| qemu_mips | 11 | 1.9% | ubuntu-latest (QEMU) |
| qemu_arm | 9 | 1.6% | ubuntu-latest (QEMU) |
| qemu_riscv64 | 8 | 1.4% | ubuntu-latest (QEMU) |
| standard_v2 | 8 | 1.4% | ubuntu-latest |
| settrace | 7 | 1.2% | ubuntu-latest (removed from current workflow) |
| coverage | 7 | 1.2% | ubuntu-latest |
| sanitize_undefined | 6 | 1.0% | ubuntu-latest |
| float | 5 | 0.9% | ubuntu-latest |
| standard | 4 | 0.7% | ubuntu-latest |
| coverage_32bit | 3 | 0.5% | ubuntu-latest |
| nanbox | 2 | 0.3% | ubuntu-latest |
| float_clang | 2 | 0.3% | ubuntu-latest |
| longlong | 2 | 0.3% | ubuntu-latest |
| minimal | 0 | 0% | ubuntu-latest |
| reproducible | 0 | 0% | ubuntu-latest |
| gil_enabled | 0 | 0% | ubuntu-latest |
| stackless_clang | 0 | 0% | ubuntu-latest |
| repr_b | 0 | 0% | ubuntu-latest |
| sanitize_address | 0 | 0% | ubuntu-latest |

The product of the per-job pass rates predicts an aggregate pass rate of 80.4%, close to the observed 82.1%. This is consistent with the individual job failures being approximately independent events.
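
The independence check can be reproduced from the per-job failure counts. The sketch below multiplies the per-job pass rates and compares the result with the observed run-level pass rate; it uses only the counts reported in this section, and the variable names are illustrative.

```python
import math

TOTAL_RUNS = 575
FAILED_RUNS = 103

# Failure counts for the 15 jobs with at least one failure (the rest contribute a factor of 1).
failures_by_job = {
    "settrace_stackless": 25, "macos": 25, "qemu_mips": 11, "qemu_arm": 9,
    "qemu_riscv64": 8, "standard_v2": 8, "settrace": 7, "coverage": 7,
    "sanitize_undefined": 6, "float": 5, "standard": 4, "coverage_32bit": 3,
    "nanbox": 2, "float_clang": 2, "longlong": 2,
}

# If job failures were independent, a run passes only when every job passes.
predicted_pass = math.prod(1 - count / TOTAL_RUNS for count in failures_by_job.values())
observed_pass = 1 - FAILED_RUNS / TOTAL_RUNS

print(f"predicted pass rate: {predicted_pass:.1%}")
print(f"observed pass rate:  {observed_pass:.1%}")
```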

Confirmed Flaky Tests

The following tests were directly observed failing on master in runs where log data was available (20 runs, covering 2026-01-05 to 2026-02-13). Every failure in every available log was caused by one of these six tests.

thread/thread_gc1.py

Observed failures: 9 (in 20 runs with logs)
Jobs affected: settrace_stackless (6), coverage (3)
Failure output: Expected True, got False
Dates observed: 2026-01-13, 2026-01-13, 2026-01-24, 2026-01-27, 2026-01-30, 2026-02-04, 2026-02-05, 2026-02-06, 2026-02-12
Already excluded from: macos, qemu_mips, qemu_arm, qemu_riscv64 (in tools/ci.sh)
Not excluded from: settrace_stackless, coverage, standard, standard_v2, coverage_32bit, nanbox, longlong, float, float_clang, stackless_clang, gil_enabled, sanitize_address, sanitize_undefined, repr_b

The test spawns threads that perform garbage collection and checks a boolean result. The ci.sh file already contains comments acknowledging this test is flaky and excludes it from 4 of 20 jobs.

thread/stress_aes.py

Observed failures: 7 (in 20 runs with logs)
Jobs affected: qemu_riscv64 (5), qemu_arm (2)
Failure output: Expected done, got TIMEOUT
Dates observed: 2026-01-14, 2026-01-20, 2026-01-23, 2026-01-26, 2026-01-30, 2026-01-31, 2026-02-03
Already excluded from: none
Notes: ci.sh comments note this test "takes around 70/90/180 seconds" on QEMU ARM/MIPS/RISC-V but does not exclude it; timeouts are set to 90/180/200s respectively

The test performs AES encryption across threads. Under QEMU emulation the execution time approaches or exceeds the configured timeout.
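
To make the timeout margins concrete, the sketch below computes how much headroom each QEMU job has between the typical runtime and the configured timeout. The figures are the ones quoted from ci.sh in the notes above; the `headroom` helper and the dictionary name are illustrative.

```python
# (typical runtime, configured timeout) in seconds, as quoted from ci.sh above.
STRESS_AES_QEMU = {
    "qemu_arm": (70, 90),
    "qemu_mips": (90, 180),
    "qemu_riscv64": (180, 200),
}


def headroom(typical_s, timeout_s):
    """Fraction by which the timeout exceeds the typical runtime."""
    return timeout_s / typical_s - 1


margins = {job: headroom(*times) for job, times in STRESS_AES_QEMU.items()}

# Print the jobs from least to most headroom.
for job, margin in sorted(margins.items(), key=lambda item: item[1]):
    print(f"{job}: {margin:.0%} headroom")
```

On these figures qemu_riscv64 has the least slack, which matches it accounting for 5 of the 7 observed timeouts.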

cmdline/repl_lock.py

Observed failures: 3 (in 20 runs with logs)
Jobs affected: qemu_arm (2), qemu_riscv64 (1)
Failure output: Missing >>> prompt prefix on micropython.heap_lock() line
Dates observed: 2026-01-13, 2026-02-03, 2026-02-13
Already excluded from: none

The expected output shows >>> micropython.heap_lock() but the actual output drops the >>> prefix. This is a REPL prompt timing issue under QEMU emulation.

extmod/time_time_ns.py

Observed failures: 2 (in 20 runs with logs)
Jobs affected: float (1), longlong (1)
Failure output: One timing assertion returns False instead of True
Dates observed: 2026-01-05, 2026-02-04
Already excluded from: none

The test makes assertions about time.time_ns() precision. On shared CI runners the wall clock can have insufficient precision or the process can be descheduled between measurements.
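
To illustrate why this kind of assertion is sensitive to scheduling, here is a minimal sketch of a bounded elapsed-time check. It is not the contents of extmod/time_time_ns.py; the function names and the bounds are hypothetical, chosen only to show how a delay between two clock reads can push a measurement outside a fixed window.

```python
import time


def elapsed_ns(sleep_s):
    """Measure wall-clock nanoseconds elapsed across a sleep of sleep_s seconds."""
    start = time.time_ns()
    time.sleep(sleep_s)
    return time.time_ns() - start


def in_window(delta_ns, low_ns, high_ns):
    """True when the measured delta falls strictly inside the window."""
    return low_ns < delta_ns < high_ns


# Hypothetical bounds: sleep 5 ms and accept anything between 2 ms and 50 ms.
delta = elapsed_ns(0.005)
print(delta, in_window(delta, 2_000_000, 50_000_000))

# A measurement stretched to 65 ms by descheduling falls outside the same window.
print(in_window(65_000_000, 2_000_000, 50_000_000))
```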

cmdline/repl_cont.py

Observed failures: 1 (in 20 runs with logs)
Jobs affected: macos (1)
Failure output: Differences in quote escaping in REPL continuation prompts (e.g. "'" vs '\'')
Dates observed: 2026-01-27
Already excluded from: none

The expected REPL output differs from what macOS produces, with differences in how escaped quotes and continuation lines are rendered. The macOS job already excludes several other tests due to platform differences.

thread/stress_schedule.py

Observed failures: 1 (in 20 runs with logs)
Jobs affected: qemu_riscv64 (1)
Failure output: Expected PASS, got CRASH
Dates observed: 2026-02-05
Already excluded from: none (but skipped on qemu_arm per ci.sh)

The test exercises micropython.schedule() under thread stress. Under QEMU RISC-V emulation it intermittently crashes.

Existing Exclusions in tools/ci.sh

The following tests are already excluded from specific jobs with comments marking them as flaky:

| Test | Excluded from | Exclusion reason (from ci.sh comments) |
|---|---|---|
| thread/thread_gc1.py | macos, qemu_mips, qemu_arm, qemu_riscv64 | "is flaky" |
| thread/stress_recurse.py | qemu_mips, qemu_arm, qemu_riscv64 | "is flaky" |
| thread/stress_heap.py | macos | "is flaky" |
| float_parse.py | macos | "parse/print floats out by a few mantissa bits" |
| float_parse_doubleprec.py | macos | "parse/print floats out by a few mantissa bits" |
| ffi_callback | macos | "crashes for an unknown reason" |
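
The existing exclusions suggest how a further exclusion could be expressed. The sketch below builds a single regular expression from a list of test paths and checks which tests it would skip. It assumes run-tests.py accepts an exclusion regex through an `--exclude` option; that flag name is an assumption here, and the helper `build_exclude_regex` is hypothetical.

```python
import re


def build_exclude_regex(test_paths):
    """Join literal test paths into one alternation regex."""
    return "(" + "|".join(re.escape(path) for path in test_paths) + ")"


# Two tests the macos job already excludes as flaky, per the table above.
flaky = ["thread/thread_gc1.py", "thread/stress_heap.py"]
pattern = build_exclude_regex(flaky)

for test in ["thread/thread_gc1.py", "thread/stress_aes.py"]:
    skipped = re.search(pattern, test) is not None
    print(test, "excluded" if skipped else "kept")

# Assumed invocation shape; the flag name is an assumption, not verified here.
print(["./run-tests.py", "--exclude", pattern])
```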

Estimated Failure Attribution

Note: This section combines the directly observed data above with inference to attribute the 83 failed runs whose logs have expired (older than ~90 days). The reasoning is described for each estimate.

For runs without log data, the failing job is known but the specific test is not. The estimates below attribute job failures to likely tests based on:

  1. 100% consistency in the 20 runs where both job and test are known
  2. The test exclusion patterns in ci.sh which restrict what can fail in each job
  3. Each job runs largely the same test suite, differing only in build configuration and platform

Estimated per-test failure rates

| Test | Attributed failures | Executions per run | Total opportunities | Est. rate per execution |
|---|---|---|---|---|
| thread/thread_gc1.py | 62 | 8 jobs that don't exclude it | 4,600 | ~1.3% |
| thread/stress_aes.py | 28 | 3 QEMU jobs | 1,725 | ~1.6% |
| cmdline/repl_*.py | 25 | 1 (macOS) | 575 | ~4.3% |
| extmod/time_time_ns.py | 7 | 2 jobs (float, longlong) | 1,150 | ~0.6% |
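
The per-execution rates follow from dividing the attributed failures by the number of opportunities (jobs per run times 575 runs). A short sketch of that arithmetic, using the figures in the table; the names are illustrative:

```python
TOTAL_RUNS = 575

# test -> (attributed failures, jobs per run that execute the test)
attributions = {
    "thread/thread_gc1.py": (62, 8),
    "thread/stress_aes.py": (28, 3),
    "cmdline/repl_*.py": (25, 1),
    "extmod/time_time_ns.py": (7, 2),
}


def per_execution_rate(attributed, jobs_per_run, runs=TOTAL_RUNS):
    """Attributed failures divided by total executions of the test."""
    return attributed / (jobs_per_run * runs)


rates = {test: per_execution_rate(*vals) for test, vals in attributions.items()}
for test, (attributed, jobs) in attributions.items():
    print(f"{test}: {attributed} / {jobs * TOTAL_RUNS} = {rates[test]:.1%}")
```

These rates inherit every assumption in the attribution, so they are best read as order-of-magnitude estimates.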

Reasoning for thread/thread_gc1.py estimate (62 failures): The 25 settrace_stackless failures, 8 standard_v2 failures, 7 coverage failures, 7 settrace failures, 6 sanitize_undefined failures, 4 standard failures, 3 coverage_32bit failures, and 2 nanbox failures are attributed to this test. All of these jobs run test_full or test_full_no_native without excluding thread_gc1.py. In the 9 failures with log data from these jobs, 100% (9/9) were thread_gc1.py and nothing else.

Reasoning for thread/stress_aes.py estimate (28 failures): The 11 qemu_mips, 9 qemu_arm, and 8 qemu_riscv64 failures are attributed primarily to this test. These jobs exclude thread/thread_gc1.py and thread/stress_recurse.py, leaving stress_aes.py as the dominant remaining flaky test. In 11 runs with log data from QEMU jobs, 7 were stress_aes.py, 3 were cmdline/repl_lock.py, and 1 was thread/stress_schedule.py. The QEMU MIPS logs have all expired, so the exact split for that job is unknown.

Reasoning for cmdline/repl_*.py estimate (25 failures): All 25 macOS job failures are attributed to REPL-related tests. The macOS job already excludes thread_gc1.py, stress_heap.py, float_parse*.py, and ffi_callback. In the 1 run with log data from the macOS job, the failure was cmdline/repl_cont.py. The 11 macOS failures in August 2025 coincide with the GitHub Actions macOS 15 runner migration.

Reasoning for extmod/time_time_ns.py estimate (7 failures): The 5 float failures and 2 longlong failures are attributed to this test. In 2 runs with log data from these jobs, both were time_time_ns.py. The float job runs a reduced test set (basic run-tests.py without test_full) making timing tests the most likely flaky candidate; the 2 float_clang failures may also be this test but could be a different root cause.

Unattributed failures

The float_clang job has 2 failures across 575 runs, with no log data available. They may share the extmod/time_time_ns.py cause noted above, but the root cause is unknown.
