@bjacob
Created January 26, 2026 14:45

Hello,

1. The preexisting potential undefined behavior in FP conversions.

There is an edge case in the specification potentially allowing for value-dependent undefined behavior (henceforth UB) in floating-point conversions:

7.3.10 [conv.double] #2 says:

If the source value can be exactly represented in the destination type, the result of the conversion is that exact representation. If the source value is between two adjacent destination values, the result of the conversion is an implementation-defined choice of either of those values. Otherwise, the behavior is undefined.

Is the final "Otherwise, the behavior is undefined" clause ever reached? If I understand correctly, it can be reached when the destination type lacks NaN or Inf values:

  • If the destination type lacks a NaN value, then converting a NaN source value is undefined behavior.
  • If the destination type lacks a +Inf value, then converting any source value greater than the largest destination value is undefined behavior. This already happens for sufficiently large finite source values.
  • Likewise, if the destination type lacks a -Inf value, then converting any source value less than the lowest destination value is undefined behavior.
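
The three cases above can be sketched as a small classifier. This is only an illustration of my reading of the current wording, not anything the standard provides: `DestTraits`, `Outcome` and `classify` are hypothetical names, the destination range is assumed to be symmetric around zero, and a NaN source is assumed to be well-defined whenever the destination has a NaN.

```cpp
#include <cmath>

// Hypothetical description of a destination floating-point type.
// None of these names come from the standard; they are for illustration only.
struct DestTraits {
    double max_finite;  // largest finite value; the lowest is assumed to be -max_finite
    bool has_nan;
    bool has_inf;
};

enum class Outcome { Defined, Undefined };

// Classifies a source value under the current [conv.double] #2 wording,
// following the reading given in the list above.
inline Outcome classify(double src, const DestTraits& dst) {
    if (std::isnan(src)) {
        // A NaN is neither exactly representable nor between two adjacent
        // destination values unless the destination has a NaN of its own.
        return dst.has_nan ? Outcome::Defined : Outcome::Undefined;
    }
    if (std::fabs(src) <= dst.max_finite) {
        // Exactly representable, or between two adjacent finite values.
        return Outcome::Defined;
    }
    // Beyond the finite range (including +/-Inf sources): only covered by the
    // first two sentences if the destination has an Inf to land on or sit next to.
    return dst.has_inf ? Outcome::Defined : Outcome::Undefined;
}
```

For example, with a destination described by `{448.0, true, false}` (NaN but no Inf), `classify(1000.0, ...)` yields `Outcome::Undefined` even though 1000.0 is an ordinary finite value.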

2. Why this potential issue is becoming a concrete issue now

These possibilities remained theoretical as long as all floating-point types had the full set of NaN/+Inf/-Inf values (as IEEE-754 types do, along with some newer types like bfloat16).

The new development that, in my understanding, makes this issue more concrete is the advent of new low-bit-depth floating-point types that lack NaN and/or Inf values. I am thinking in particular of the types described in the OCP Microscaling Formats (MX) Specification, of which:

  • The 8-bit E5M2 type still has full NaN/Inf so is not a concern here.
  • The 8-bit E4M3 type still has NaN, but lacks +/- Inf values, implying that the conversion of large (even finite) source values is UB under the current C++ spec.
  • The 6-bit and 4-bit types lack both NaN and Inf values, implying UB in all of the cases considered above.
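
To make the list above concrete, here is a sketch of the same information as a compile-time table. The struct and helper names are hypothetical. The maximum finite magnitudes are my recollection of the MX specification's element format tables and should be checked against the specification itself; the 6-bit and 4-bit types are listed by their E3M2, E2M3 and E2M1 variants.

```cpp
#include <array>
#include <string_view>

// Hypothetical summary record for one MX element format.
struct ElementFormat {
    std::string_view name;
    int bits;
    bool has_nan;
    bool has_inf;
    double max_finite;  // largest finite magnitude (assumed; verify against the spec)
};

inline constexpr std::array<ElementFormat, 5> kMxElementFormats{{
    {"FP8 E5M2", 8, true,  true,  57344.0},
    {"FP8 E4M3", 8, true,  false, 448.0},
    {"FP6 E3M2", 6, false, false, 28.0},
    {"FP6 E2M3", 6, false, false, 7.5},
    {"FP4 E2M1", 4, false, false, 6.0},
}};

// Under the current wording, converting a source value beyond max_finite
// is UB exactly when the format has no Inf to absorb it.
constexpr bool overflow_is_undefined(const ElementFormat& f) { return !f.has_inf; }

// Likewise, converting a NaN source is UB when the format has no NaN.
constexpr bool nan_is_undefined(const ElementFormat& f) { return !f.has_nan; }
```

With these numbers, a binary32 accumulator holding a value as small as 7.0 is already out of range for the 4-bit type.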

It is worth pointing out that these types are implemented in GPUs from multiple vendors, by now in multiple hardware generations, and are the most common data types for LHS/RHS matrices in matrix multiplications in major AI workloads. Moreover, conversion from a wider floating point type into these narrow types is part of the main usage pattern: each matrix multiplication accumulates into a wider floating point type, typically IEEE binary32 (aka float32), requiring a conversion down to the narrow type before the next matrix multiplication.

3. Proposed resolutions

I can think of two possible resolutions:

3.a. Minimalist resolution:

Simply replace undefined behavior with implementation-defined behavior: rewrite "Otherwise, the behavior is undefined" as "Otherwise, the result value is implementation-defined".

3.b. Maximalist resolution:

In addition to the above minimalist resolution as the final stop-gap (still needed for NaN conversions), insert new clauses before it to specify what the result value should be for source values that lie beyond all destination values, in either direction. The full paragraph would become something like this (with the changes emphasized in bold):

If the source value can be exactly represented in the destination type, the result of the conversion is that exact representation. Otherwise, if the source value is between two adjacent destination values, the result of the conversion is an implementation-defined choice of either of those values. Otherwise, if the source value is greater than all destination values, the result is the maximum destination value. Otherwise, if the source value is less than all destination values, the result is the minimum destination value (not the minimum positive value). Otherwise, the result value is implementation-defined.
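
The two new clauses amount to saturation. A minimal sketch of that behavior, assuming the comparison is made against the destination's finite (non-NaN) values only; `FiniteRange` and `proposed_out_of_range_result` are hypothetical names, not proposed library or language features:

```cpp
#include <cmath>
#include <optional>

// Hypothetical: the lowest and highest finite values of a destination type.
struct FiniteRange {
    double lowest;
    double highest;
};

// Returns the result the two new clauses would mandate for a source value
// outside the destination range, or std::nullopt when another clause applies
// (in-range values: the first two clauses; NaN: the final clause).
inline std::optional<double> proposed_out_of_range_result(double src, FiniteRange dst) {
    if (src > dst.highest) return dst.highest;  // includes a +Inf source
    if (src < dst.lowest) return dst.lowest;    // includes a -Inf source
    // Every comparison with NaN is false, so a NaN source falls through here.
    return std::nullopt;
}
```

With a range of `{-6.0, 6.0}`, a source of 100.0 yields 6.0, a source of -Inf yields -6.0, and a NaN source yields no mandated value, which is why the implementation-defined fallback is still needed.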

Cheers, Benoit

PS: My employer, AMD, is one of the GPU vendors implementing those types. This email is only my personal view and does not represent the views of my employer.

@kuhar
kuhar commented Jan 26, 2026

(with the changes emphasized in bold).

I think this got lost during copy + paste

Otherwise, if the source value is greater than all destination values

is greater than all possible destination values? (same for minimum)

Another issue with this wording is that +Inf is technically not greater than NaN (if supported by destination). Maybe rewrite this to a negative: 'Otherwise, if no possible destination value is less than or equal to the source value'?
