MicroPython CI Flaky Test Report

Analysis of intermittent test failures in the unix port GitHub Actions workflow on the master branch of micropython/micropython.

Methodology

Job-level pass/fail data was collected via the GitHub API for all ports_unix.yml workflow runs triggered by pushes to master. For the subset of runs where GitHub still retains logs (roughly the last 90 days), the specific failing test name was extracted from the run-tests.py --print-failures output.

Only the unix port workflow has test failures on master. All other workflows (ports_qemu.yml, ports_stm32.yml, ports_esp32.yml, ports_rp2.yml, etc.) pass consistently.

Data collected:

575 non-cancelled push-triggered runs from 2024-12-19 to 2026-02-12
Job-level pass/fail status for all 20 jobs in each of 103 failed runs
Test-level failure details from 20 runs with available log data
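The collection step can be sketched roughly as below. This is a minimal sketch, not the script used for this report: it assumes the GitHub REST API endpoints for listing the runs of a workflow and the jobs of a run, the helper names (`runs_url`, `jobs_url`, `failed_jobs`) are hypothetical, and the sample payload is made up purely to show the response shape.

```python
from urllib.parse import urlencode

API = "https://api.github.com/repos/micropython/micropython"

def runs_url(workflow="ports_unix.yml", branch="master", page=1):
    """URL listing push-triggered runs of one workflow on one branch."""
    query = urlencode({"branch": branch, "event": "push", "per_page": 100, "page": page})
    return f"{API}/actions/workflows/{workflow}/runs?{query}"

def jobs_url(run_id):
    """URL listing the jobs that belong to a single workflow run."""
    return f"{API}/actions/runs/{run_id}/jobs?per_page=100"

def failed_jobs(jobs_payload):
    """Names of the jobs whose conclusion is 'failure' in a jobs response."""
    return [job["name"] for job in jobs_payload["jobs"] if job["conclusion"] == "failure"]

# Made-up payload (assumption): only the fields used above are shown.
sample = {
    "jobs": [
        {"name": "coverage", "conclusion": "failure"},
        {"name": "standard", "conclusion": "success"},
    ]
}
print(runs_url())
print(failed_jobs(sample))
```

Fetching each URL and paging through the results is left out, since authentication and rate limiting depend on the environment.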

Overall Failure Rate

103 of 575 non-cancelled master push runs failed: 17.9% per-run failure rate.

87% of failed runs had exactly one job fail. The failures are distributed across 15 of the 20 jobs in the workflow, with no single job dominating.

Failed jobs per run	Occurrences
1	90
2	9
3	2
5	2
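
As a quick cross-check of the table above, the following sketch recomputes the number of failed runs, the total number of job failures, and the single-job share. The variable names are illustrative only.

```python
# Failed-jobs-per-run distribution from the table: {jobs failed in the run: number of runs}.
distribution = {1: 90, 2: 9, 3: 2, 5: 2}

failed_runs = sum(distribution.values())
job_failures = sum(jobs * runs for jobs, runs in distribution.items())
single_job_share = distribution[1] / failed_runs

print(f"failed runs: {failed_runs}")
print(f"job failures: {job_failures}")
print(f"single-job share: {single_job_share:.0%}")
```

The 124 job failures computed here equal the sum of the Failures column in the Per-Job Failure Rates table.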

Monthly Breakdown

Month	Runs	Failures	Rate
2024-12	10	3	30.0%
2025-01	32	5	15.6%
2025-02	41	8	19.5%
2025-03	37	3	8.1%
2025-04	40	4	10.0%
2025-05	50	7	14.0%
2025-06	48	5	10.4%
2025-07	55	15	27.3%
2025-08	48	15	31.2%
2025-09	46	2	4.3%
2025-10	49	8	16.3%
2025-11	37	8	21.6%
2025-12	28	5	17.9%
2026-01	32	8	25.0%
2026-02	22	7	31.8%

The July/August 2025 spike (27-31%) correlates with the GitHub Actions macOS runner migration to macOS 15 (announced August 4, 2025), which produced 11 macOS job failures in August alone. The September 2025 dip (4.3%) has no obvious explanation beyond normal variance.

Per-Job Failure Rates

Measured directly from 575 push runs. Each job executes once per workflow run.

Job	Failures	Rate	Runner
settrace_stackless	25	4.3%	ubuntu-latest
macos	25	4.3%	macos-26
qemu_mips	11	1.9%	ubuntu-latest (QEMU)
qemu_arm	9	1.6%	ubuntu-latest (QEMU)
qemu_riscv64	8	1.4%	ubuntu-latest (QEMU)
standard_v2	8	1.4%	ubuntu-latest
settrace	7	1.2%	ubuntu-latest (removed from current workflow)
coverage	7	1.2%	ubuntu-latest
sanitize_undefined	6	1.0%	ubuntu-latest
float	5	0.9%	ubuntu-latest
standard	4	0.7%	ubuntu-latest
coverage_32bit	3	0.5%	ubuntu-latest
nanbox	2	0.3%	ubuntu-latest
float_clang	2	0.3%	ubuntu-latest
longlong	2	0.3%	ubuntu-latest
minimal	0	0%	ubuntu-latest
reproducible	0	0%	ubuntu-latest
gil_enabled	0	0%	ubuntu-latest
stackless_clang	0	0%	ubuntu-latest
repr_b	0	0%	ubuntu-latest
sanitize_address	0	0%	ubuntu-latest

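The product of per-job pass rates gives a predicted aggregate pass rate of 80.4%, close to the observed 82.1%. This is consistent with the individual job failures being approximately independent events, though it does not prove independence: the small gap reflects the 13 runs in which more than one job failed.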
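
The independence check can be reproduced from the table with a few lines. This is a sketch under the assumption that every job ran in all 575 runs, as stated above; jobs with zero failures contribute a factor of 1 and are omitted.

```python
import math

RUNS = 575
FAILED_RUNS = 103

# Per-job failure counts from the table (jobs with 0 failures omitted).
failures = {
    "settrace_stackless": 25, "macos": 25, "qemu_mips": 11, "qemu_arm": 9,
    "qemu_riscv64": 8, "standard_v2": 8, "settrace": 7, "coverage": 7,
    "sanitize_undefined": 6, "float": 5, "standard": 4, "coverage_32bit": 3,
    "nanbox": 2, "float_clang": 2, "longlong": 2,
}

# If job failures were independent, a run passes only when every job passes.
predicted = math.prod(1 - count / RUNS for count in failures.values())
observed = (RUNS - FAILED_RUNS) / RUNS

print(f"predicted pass rate: {predicted:.1%}")  # 80.4%
print(f"observed pass rate:  {observed:.1%}")   # 82.1%
```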

Confirmed Flaky Tests

The following tests were directly observed failing on master in runs where log data was available (20 runs, covering 2026-01-05 to 2026-02-13). Every failure in every available log was caused by one of these six tests.

`thread/thread_gc1.py`


Observed failures	9 (in 20 runs with logs)
Jobs affected	settrace_stackless (6), coverage (3)
Failure output	Expected `True`, got `False`
Dates observed	2026-01-13, 2026-01-13, 2026-01-24, 2026-01-27, 2026-01-30, 2026-02-04, 2026-02-05, 2026-02-06, 2026-02-12
Already excluded from	macos, qemu_mips, qemu_arm, qemu_riscv64 (in `tools/ci.sh`)
Not excluded from	settrace_stackless, coverage, standard, standard_v2, coverage_32bit, nanbox, longlong, float, float_clang, stackless_clang, gil_enabled, sanitize_address, sanitize_undefined, repr_b

The test spawns threads that perform garbage collection and checks a boolean result. The ci.sh file already contains comments acknowledging this test is flaky and excludes it from 4 of 20 jobs.

`thread/stress_aes.py`


Observed failures	7 (in 20 runs with logs)
Jobs affected	qemu_riscv64 (5), qemu_arm (2)
Failure output	Expected `done`, got `TIMEOUT`
Dates observed	2026-01-14, 2026-01-20, 2026-01-23, 2026-01-26, 2026-01-30, 2026-01-31, 2026-02-03
Already excluded from	none
Notes	`ci.sh` comments note this test "takes around 70/90/180 seconds" on QEMU ARM/MIPS/RISC-V but does not exclude it; timeouts are set to 90/180/200s respectively

The test performs AES encryption across threads. Under QEMU emulation the execution time approaches or exceeds the configured timeout.
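
The headroom between typical run time and timeout can be worked out from the figures quoted in the notes above. The sketch below uses only those figures, assuming the durations and timeouts map to ARM/MIPS/RISC-V in the order listed.

```python
# Approximate run times and configured timeouts in seconds, as quoted from ci.sh above.
typical = {"qemu_arm": 70, "qemu_mips": 90, "qemu_riscv64": 180}
timeout = {"qemu_arm": 90, "qemu_mips": 180, "qemu_riscv64": 200}

# Absolute headroom, and headroom relative to the typical run time.
headroom = {job: timeout[job] - typical[job] for job in typical}
relative = {job: headroom[job] / typical[job] for job in typical}

for job in typical:
    print(f"{job}: {headroom[job]}s headroom ({relative[job]:.0%} of typical run time)")
```

On these numbers qemu_riscv64 has the smallest relative margin (about 11%), which is in line with it accounting for 5 of the 7 observed timeouts.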

`cmdline/repl_lock.py`


Observed failures	3 (in 20 runs with logs)
Jobs affected	qemu_arm (2), qemu_riscv64 (1)
Failure output	Missing `>>>` prompt prefix on `micropython.heap_lock()` line
Dates observed	2026-01-13, 2026-02-03, 2026-02-13
Already excluded from	none

The expected output shows >>> micropython.heap_lock() but the actual output drops the >>> prefix. This is a REPL prompt timing issue under QEMU emulation.

`extmod/time_time_ns.py`


Observed failures	2 (in 20 runs with logs)
Jobs affected	float (1), longlong (1)
Failure output	One timing assertion returns `False` instead of `True`
Dates observed	2026-01-05, 2026-02-04
Already excluded from	none

The test makes assertions about time.time_ns() precision. On shared CI runners the wall clock can have insufficient precision or the process can be descheduled between measurements.
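
To illustrate the failure mode, here is a minimal sketch of a bounded-window timing check. It is not the code of `extmod/time_time_ns.py`; the function name and the 2 ms to 50 ms bounds are assumptions chosen for the example, and fixed timestamps stand in for real clock reads so the behaviour is deterministic.

```python
def delta_in_window(t0_ns, t1_ns, lo_ns=2_000_000, hi_ns=50_000_000):
    """True if the elapsed time t1 - t0 falls strictly inside (lo_ns, hi_ns)."""
    return lo_ns < t1_ns - t0_ns < hi_ns

# Nominal case: a 10 ms sleep measured as 10.3 ms.
nominal = delta_in_window(0, 10_300_000)

# Descheduled case: the process is paused for an extra 60 ms between the two reads.
descheduled = delta_in_window(0, 70_300_000)

# Coarse-clock case: both reads land on the same clock tick.
coarse = delta_in_window(5_000_000, 5_000_000)

print(nominal, descheduled, coarse)  # True False False
```

Any assertion of this shape passes on an idle machine and fails occasionally on a loaded shared runner, which matches the intermittent `False` seen in the logs.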

`cmdline/repl_cont.py`


Observed failures	1 (in 20 runs with logs)
Jobs affected	macos (1)
Failure output	Differences in quote escaping in REPL continuation prompts (e.g. `"'"` vs `'\''`)
Dates observed	2026-01-27
Already excluded from	none

The expected REPL output differs from what macOS produces, with differences in how escaped quotes and continuation lines are rendered. The macOS job already excludes several other tests due to platform differences.

`thread/stress_schedule.py`


Observed failures	1 (in 20 runs with logs)
Jobs affected	qemu_riscv64 (1)
Failure output	Expected `PASS`, got `CRASH`
Dates observed	2026-02-05
Already excluded from	none (but skipped on qemu_arm per ci.sh)

The test exercises micropython.schedule() under thread stress. Under QEMU RISC-V emulation it intermittently crashes.

Existing Exclusions in `tools/ci.sh`

The following tests are already excluded from specific jobs with comments marking them as flaky:

Test	Excluded from	Exclusion reason (from ci.sh comments)
`thread/thread_gc1.py`	macos, qemu_mips, qemu_arm, qemu_riscv64	"is flaky"
`thread/stress_recurse.py`	qemu_mips, qemu_arm, qemu_riscv64	"is flaky"
`thread/stress_heap.py`	macos	"is flaky"
`float_parse.py`	macos	"parse/print floats out by a few mantissa bits"
`float_parse_doubleprec.py`	macos	"parse/print floats out by a few mantissa bits"
`ffi_callback`	macos	"crashes for an unknown reason"

Estimated Failure Attribution

Note: This section combines the directly observed data above with inference to attribute the 83 failed runs (103 failed runs minus the 20 with logs) whose logs have expired (older than ~90 days). The reasoning is described for each estimate.

For runs without log data, the failing job is known but the specific test is not. The estimates below attribute job failures to likely tests based on:

100% consistency in the 20 runs where both job and test are known
The test exclusion patterns in ci.sh which restrict what can fail in each job
Each job runs largely the same test suite, differing only in build configuration and platform

Estimated per-test failure rates

Test	Attributed failures	Executions per run	Total opportunities	Est. rate per execution
`thread/thread_gc1.py`	62	8 jobs with attributed failures (all run the test)	4,600	~1.3%
`thread/stress_aes.py`	28	3 QEMU jobs	1,725	~1.6%
`cmdline/repl_*.py`	25	1 (macOS)	575	~4.3%
`extmod/time_time_ns.py`	7	2 jobs (float, longlong)	1,150	~0.6%
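
The rate column is simply attributed failures divided by opportunities, where opportunities are the number of jobs counted times the 575 runs. A short sketch of that arithmetic, using the figures from the table:

```python
RUNS = 575

# test: (attributed failures, number of jobs counted as executing it)
estimates = {
    "thread/thread_gc1.py": (62, 8),
    "thread/stress_aes.py": (28, 3),
    "cmdline/repl_*.py": (25, 1),
    "extmod/time_time_ns.py": (7, 2),
}

rates = {}
for test, (attributed, jobs) in estimates.items():
    opportunities = jobs * RUNS
    rates[test] = attributed / opportunities
    print(f"{test}: {attributed}/{opportunities} = {rates[test]:.1%}")

total_attributed = sum(attributed for attributed, _ in estimates.values())
print(f"total attributed job failures: {total_attributed}")
```

The four attributions add up to 122 job failures. Because the denominator counts only the jobs listed for each test, the per-execution rates are upper-leaning estimates rather than measurements.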

Reasoning for thread/thread_gc1.py estimate (62 failures): The 25 settrace_stackless failures, 8 standard_v2 failures, 7 coverage failures, 7 settrace failures, 6 sanitize_undefined failures, 4 standard failures, 3 coverage_32bit failures, and 2 nanbox failures are attributed to this test. All of these jobs run test_full or test_full_no_native without excluding thread_gc1.py. In the 10 runs with log data from these jobs, 100% (10/10) failed on thread_gc1.py and nothing else.

Reasoning for thread/stress_aes.py estimate (28 failures): The 11 qemu_mips, 9 qemu_arm, and 8 qemu_riscv64 failures are attributed primarily to this test. These jobs exclude thread/thread_gc1.py and thread/stress_recurse.py, leaving stress_aes.py as the dominant remaining flaky test. In 11 runs with log data from QEMU jobs, 7 were stress_aes.py, 3 were cmdline/repl_lock.py, and 1 was thread/stress_schedule.py. The QEMU MIPS logs are all expired, so the exact split for that job is unknown.

Reasoning for cmdline/repl_*.py estimate (25 failures): All 25 macOS job failures are attributed to REPL-related tests. The macOS job already excludes thread_gc1.py, stress_heap.py, float_parse*.py, and ffi_callback. In the 1 run with log data from the macOS job, the failure was cmdline/repl_cont.py. The 11 macOS failures in August 2025 coincide with the GitHub Actions macOS 15 runner migration.

Reasoning for extmod/time_time_ns.py estimate (7 failures): The 5 float failures and 2 longlong failures are attributed to this test. In 2 runs with log data from these jobs, both were time_time_ns.py. The float job runs a reduced test set (basic run-tests.py without test_full) making timing tests the most likely flaky candidate; the 2 float_clang failures may also be this test but could be a different root cause.

Unattributed failures

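The float_clang job has 2 failures across 575 runs, with no log data available. They may share the extmod/time_time_ns.py cause seen in the float job, but the root cause is unknown. These are the only job failures in the per-job table not covered by the attributions above.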

andrewleech/flaky-tests-report.md
