RISC-V Vector Extension for Integer Workloads: An Informal Gap Analysis


Note: To verify my RVI membership and identity on this otherwise semi-anonymous account: I'm Olaf Bernstein; you should be able to view my sig-vector profile if you are a member of the vector SIG.

The goal of this document is to explore gaps in the current RISC-V Vector extensions (standard V, Zvbb, Zvbc, Zvkg, Zvkn, Zvks), and suggest instructions to fill these gaps. My focus lies on application class processors, with the expectation that suggested instructions would be suitable to become mandatory or optional instructions in future profiles.

I'll assume you are already familiar with RVV, if not, here is a great introduction and here the latest RISC-V ISA manual.

To determine which applications aren't currently suitably covered by the RISC-V Vector extensions (RVV), I collected usage examples of the SIMD instructions present in AVX512 and SVE that don't have an RVV equivalent. You can find my notes at the bottom of this article. I'll use them as a source of truth when arguing for or against the inclusion of certain instructions.

Please be aware that I'm not a hardware designer, so take anything I say with a grain of salt and feel free to correct my mistakes and wrong assumptions.

Proposal TLDR

  • vmslide1up.m, vmslide1down.m: slide mask up/down by one
  • vtrn1.vv, vtrn2.vv: 2x2 transpose instructions
  • vwzip.vv: zip/interleave two vectors
  • unzip.vv or vnsrl defined for e64
  • vbcompress, vbexpand: element-wise bit-compress/expand
  • vbmatxor: 8x8 bit-matrix multiply, x86 calls it vgf2p8affineqb
  • vscansum.v vd, vs1 / vscanmaxu.v vd, vs1: +-scan / max-scan
  • vscansum.v vd, vs1, vm / vscanmaxu.v vd, vs1, vm: segmented +-scan / unsigned-max-scan
  • viotar.m: segmented viota ("r" for resetting)
  • vmsxff.m: xor-scan on mask bits ("set-xor-from-first mask bit")

Should be added to RVV

I believe the following instructions should be added to RVV, as they can solve problems that currently don't have a good solution.

Slides for mask registers

Sliding mask registers is a very common operation that is currently not well addressed by RVV, because dedicated instructions are missing.

In many cases, it's possible to use element slides instead of sliding just the mask register. This is however a fundamentally more complex operation and performs a lot worse in practice, especially when the relevant vectors use LMUL>1. It also often requires you to duplicate comparisons or other operations done before the element slide.

Mask slides can be emulated directly, but this is expensive to do portably:

# slide up mask in v0 by 1-bit
vsetvli t1, zero, e8, MF8, ta, ma # MF8 needs to be MX/8
vslide1up.vx v1, v0, a0 # slide in next
vsrl.vi v1, v1, 7  # v1 = v1 >> 7
vadd.vv v2, v0, v0 # v2 = v0 << 1
vor.vv  v0, v1, v2 # v0 = (v1 >> 7) | (v0 << 1)
vsetvli zero, t0, e8, MX, ta, ma

When measured, these 6 instructions still perform better than sliding the elements and repeating the comparison: SpacemiT X60, XuanTie C908, Saturn

If you write vector length specific (VLS) code and know you are working with 64 elements or fewer, then you only need a single vsll.vi including a vtype change.
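
For instance, a minimal VLS sketch, assuming the whole mask fits into a single 64-bit element (vl ≤ 64) and with t0 holding the application vl as in the sequence above:

# slide up mask in v0 by 1-bit, VLS path for vl <= 64 (sketch)
vsetivli zero, 1, e64, m1, ta, ma # view the mask register as one 64-bit element
vsll.vi  v0, v0, 1                # shift all mask bits up by one, shifting in a zero
vsetvli  zero, t0, e8, MX, ta, ma # restore the application vtype/vl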

The current situation is a mess with no good solutions, especially for vector length agnostic (VLA) code.

Most use-cases of mask slides could be addressed using just two one-source, one-destination instructions: vmslide1up.m and vmslide1down.m, where vmslide1up.m slides the mask register bits up by one and vmslide1down.m slides them down by one.

I don't think a GPR-based slide variant is particularly needed, and slides further than one or two bits are also quite rare. A dedicated slide-by-2-bit or 3-bit immediate instruction would, however, have the added benefit of allowing you to slide an entire LMUL=1 vector register by an immediate, which is sometimes useful for slides smaller than 8 bits. If this doesn't add much additional complexity, a full register-slide-bits-by-small-immediate instruction would also be an option.
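
To illustrate, with the proposed instruction, finding the first element of every run of set bits (a common pattern in parsing) would become a two-instruction mask-only sequence:

# starts of runs of set bits in mask v0 (sketch; vmslide1up.m is the proposed instruction)
vmslide1up.m v1, v0     # mask slid up by one bit, zero shifted in
vmandn.mm    v2, v0, v1 # set where v0[i]=1 and v0[i-1]=0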

Element transpose (vtrn1/vtrn2) and zip (vwzip) instructions

RVV already has very powerful permutation instructions in vrgather and vcompress; however, some permutations are so common that dedicated instructions are required, as they can have lower latency and higher throughput.

The need for dedicated instructions for these very common permutations has already been discussed at length by RISE in the context of video codecs, see: "RISCV64 new vector instructions requirements for video/multimedia" and "Vector Operations Proposals - Draft Document".

Here I'd just like to highlight a few additional aspects:

  • The implementation complexity of a vtrn1/vtrn2 is similar to that of bitwise operations, so huge performance gains are possible
  • vwzip can already be implemented with vwmaccu.vx(vwaddu.vv(a, b), -1, b), as suggested by Craig Topper (see the sketch after this list). These are two EMUL=2 instructions vs. two EMUL=1 instructions if we had dedicated instructions, which would give us performance parity with existing ISAs.
  • I prefer a vwzip over the vzip2a/vzip2b proposed by RISE, because it works better with vl<VLMAX and fractional LMUL. Using vzip2a/vzip2b would require an additional vslideup when vl<VLMAX or for fractional LMUL, unless you want the result to be split up between two vector registers.
  • unzip1/unzip2 is vnsrl.wi/wx, which already works for everything but 64-bit elements, which currently require a vcompress or vrgather to unzip. Instead of a dedicated unzip instruction, it may be preferable to define vnsrl at SEW=64 (EEW=128), which would result in a more versatile instruction.
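
A sketch of the emulation mentioned above, with a and b in v4 and v5 and the EMUL=2 result in v8 (t0 holds an all-ones scalar):

# emulate vwzip v8, v4, v5 with base V (sketch)
li         t0, -1            # all-ones, i.e. 2^SEW - 1
vwaddu.vv  v8, v4, v5        # v8[i] = a[i] + b[i], widened to 2*SEW
vwmaccu.vx v8, t0, v5        # v8[i] += (2^SEW - 1) * b[i] = a[i] + 2^SEW * b[i]
                             # low half = a[i], high half = b[i], i.e. the zipped pair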

Missing bit permutations

With the Zvbb ("Vector Basic Bit-manipulation") extension, RVV gained a set of basic bit-manipulation instructions, as the name suggests. This notably includes byte-reverse, bit-reverse, count leading/trailing zeros, popcount and bit-rotates.

RVV is however still missing some advanced bit permutation instructions.

Bit compress and expand (vbcompress / vbexpand)

Some of the most powerful bit manipulation instructions present in today's ISAs are the bit-compress and bit-expand instructions.

  • bit-compress takes two n-bit arguments and moves the bit in the first argument corresponding to the nth set bit in the second argument to the nth position of the destination (bcompress(0bHGFEDCBA, 0b11110010) = 0b000HGFEB)
  • bit-expand takes two n-bit arguments and moves the nth bit in the first argument to the nth set bit in the second one (bdecompress(0b000HGFEB, 0b11110010) = 0bHGFE00B0)

Implementing bit-compress/expand with existing RVV instructions is really expensive. You either have to use a large series of shifts and bit operations to shift the bits into place, or convert each bit into a byte and do the vcompress.vm/vrgather on 8-bit elements with an 8 times larger LMUL.

x86 implements these instructions as pext/pdep under BMI2, but they only operate on GPRs, while ARM SVE only defines the bext/bdep instructions which operate on elements of vector registers.

Among others, they can be used for: bit interleaving/packing; base64 de/encoding; LEB128 de/encoding, which is used in DWARF, Dalvik, protobuf and WebAssembly; regex matching; genomics; various board games; Space-filling curves; Quantum circuit simulation; ...

Here is an example to illustrate how these instructions could be used to de- and encode base64:

base64 decode:                                    | base64 encode:
vbrev8     [00dddddd|00cccccc|00bbbbbb|00aaaaaa]  | vrgather [........|ccdddddd|bbbbcccc|aaaaaabb]
vbcompress [00aaaaaa|00bbbbbb|00cccccc|00dddddd]  | vbexpand [........|aaaaaabb|bbbbcccc|ccdddddd]
vrgather   [........|aaaaaabb|bbbbcccc|ccdddddd]  | vbrev8   [00aaaaaa|00bbbbbb|00cccccc|00dddddd]
           [........|ccdddddd|bbbbcccc|aaaaaabb]  |          [00dddddd|00cccccc|00bbbbbb|00aaaaaa]

Bit compress and expand were the two instructions with the most real-world usage of the advanced SIMD instructions surveyed. There is just one problem: historically, some implementations really struggled to implement these instructions well:

|  | all intel, haswell to alderlake | Zen1/Zen2 | Zen3/Zen4 | Zen5 | Cortex A510/A520 | Neoverse N2/V2/V3, Cortex A710/X2/X3/X4 | Cortex A715/A720 |
|---|---|---|---|---|---|---|---|
| Latency | 3 | 8 μops per set bit | 3 | 3 | 14-70 | 6 | 4 |
| RThroughput | 1 | 8 μops per set bit | 1 | 0.33 | 14-70 | 2 | 2 |

The first two generations of Zen used a microcoded implementation with 8 μops per set bit, and the Cortex-A510/A520 implementations are similarly bad. On the other hand, all intel processors since haswell managed to implement it with a three-cycle latency and a throughput of 1 instruction per cycle. The other ARM cores also have a solid implementation, with a 4-6 cycle latency and a throughput of one instruction every two cycles.

Fortunately, we have multiple papers that give more details on the implementation challenges:

This seems to suggest that the implementation complexity is reasonable, although the decode of the mask to control signals is disproportionately expensive.

These instructions should be added to RVV, since they are very widely applicable, can help accelerate working with very common data formats and the costs seem reasonable.

Several ways to reduce the implementation cost can be explored if feedback from implementors suggests that the implementation cost is too high:

  • Since the mask operand is often a fixed constant, we could decide to only add vbcompress.vx/vbexpand.vx variants, such that only a single mask needs to be decoded.
  • There could be separate extensions for SEW≤32 and SEW≤64 support, as a lot of the interleaving use cases can operate on fewer than 64 bits.

Aside: No need for Sheep-and-goats

The Sheep-and-goats (SAG) operation, or bgrp as it's called in SVE, groups bits to the right or left as selected by a bitmask ((bcompress(x, ~m) << popc(m)) | bcompress(x, m)). It can perform an arbitrary bit permutation in ceil(log2(nbits)) chained SAG operations.

While this is quite powerful, it's very rarely used and can already be done by combining two vbcompress instructions with a single vslideup instruction.

I would only consider this worth adding if a hardware implementation of vbcompress and vbexpand gets this operation basically for free. This doesn't seem to be the case: here, adding SAG increases the area by 50% over just implementing vbcompress/vbexpand.

Going with this 50% area increase, I'd rather have hardware designers use these resources to make vbcompress/vbexpand faster than add a dedicated SAG instruction.

Bit-matrix operations (vbmatxor / vgf2p8affineqb / MXOR)

This instruction is present in Knuth's MMIX as MXOR and in modern x86 processors with GFNI support as VGF2P8AFFINEQB. It was also considered for the RISC-V scalar bit-manipulation extensions but didn't end up in the final specification.

The instruction interprets a 64-bit number as an 8x8 bit matrix and multiplies two such 8x8 bit-matrices together, replacing addition with XOR and multiplication with AND:

Assume that the 64 bits in $Y are numbered as follows (and similar for the bits in register $Z and $X): Y₀₀,Y₀₁...Y₀₇,Y₁₀,Y₁₁...Y₁₇,...,Y₇₀,Y₇₁...Y₇₇

Now bit Xᵢⱼ in register $X is computed as follows: MXOR: Xᵢⱼ = Y₀ⱼ & Zᵢ₀ ^ Y₁ⱼ & Zᵢ₁ ^ ... ^ Y₇ⱼ & Zᵢ₇

This might not immediately sound useful, but it has a lot of unexpected uses.

Notably, it can permute bits within each byte, or, speaking more generally, replace each bit with an arbitrary XOR of bits from the source byte (Why Ice Lake is Important (a bit-basher's perspective)).

Although creating the constants needed for your permutation is non-trivial, the x86 vgf2p8affineqb instruction still sees a lot of real-world use.

Prominent examples include: Arbitrary Modular GF(2ʷ) Multiplication, RAID error correction, PAR2, SM4, emulating missing 8-bit AVX instructions, sub-byte arithmetic, very fast SIMD byte histogram, 8x8 and 16x16 bit matrix transpose, indices to set bits

It can be used to emulate a bunch of 8-bit instructions that already exist in RVV, which may seem redundant. However, if this instruction were added to RVV, it should operate at EEW=64, EMUL=LMUL, which would allow doing the 8-bit operations without a vtype change, removing up to two vsetvlis from certain code paths.

The hardware implementation cost seems manageable.

Claire Wolf shared the following on the isa-dev mailing list:

The HW cost for bmat[x]or is not as big as one might expect. My reference single-cycle implementation of bmat[x]or mapped to a generic 4-LUT FPGA architecture is 706 LUTs in size. That’s less than the size of five 64-bit adders.

"Advanced Bit Manipulation Instructions: Architecture, Implementation and Applications" also covers this instruction and their single cycle implementation takes 2.4K NAND gates.

Existing vgf2p8affineqb implementations corroborate this:

| vgf2p8affineqb ymm, ymm, ymm, i8 | icelake | tigerlake | rocketlake | alderlake-E | alderlake-P | Zen4 |
|---|---|---|---|---|---|---|
| Latency | 5 | 5 | 5 | ? | 5 | 3 |
| RThroughput | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |

Due to the versatility and reasonable hardware cost, I suggest adding the vbmatxor.vv and vbmatxor.vx instructions. The .vx variant is included, because most uses of vgf2p8affineqb broadcast a 64-bit constant to perform the same 8-bit permutation on all bytes.

Scan operations with segmented variants

Scan operations, also called prefix operations, are reductions that write back the intermediate results, effectively computing all partial reductions.

RVV already has a few "hidden" scan instructions that operate on mask registers:

  • vmsbf(~m) is an and-scan on mask-bits
  • vmsif(m) is an or-scan on mask-bits
  • viota(m) is a +-scan on mask-bits that is widened to SEW

The case for segmented scans

Scan operations on vector elements and their segmented variants are very important missing instructions, because they allow you to efficiently parallelize an entirely new paradigm of problems. Segmented scans scan multiple segments of elements separately, as encoded by a mask: a set bit in the mask indicates that a new segment starts at this position. Implementing this using existing RVV instructions is very expensive.

Segmented scans allow you to efficiently process multiple variable-length sequences of input in parallel. This is best illustrated by an example.

Consider vectorizing run-length encoding, that is, turning an input of bytes into pairs of bytes encoding a repetition count and a byte value to repeat. With the classical SIMD/vector model, you can only process one "run" at a time, even though a vector register might contain multiple runs. While this isn't a big problem for smaller vector lengths, it can be a huge source of inefficiency when dealing with longer vectors. Using a max-scan to emulate a segmented viota allows you to process all the runs in a vector register independently:

RLE using max-scan:
0: 8 8 7 7 7 8 8 7 8 8 8 8 8 7 8 8 | v0 as input
1: 8 7 7 7 8 8 7 8 8 8 8 8 7 8 8 9 | v1 = slide1up(v0, v0[0]+1)
2: 0 1 0 0 1 0 1 1 0 0 0 0 1 1 0 1 | m0 = v0 != v1     __
3: 0 e 0 0 b 0 9 8 0 0 0 0 3 2 0 0 | v2 = m0 ? vid : 0   \
4: e e b b b 9 9 8 3 3 3 3 3 2 0 0 | v3 = max-scan(v2, 0)| does seg-viota(m0)
5: 1 0 2 1 0 1 0 0 4 3 2 1 0 0 1 0 | v4 = vid - v3     __/
6: 0 0 1 0 0 1 0 1 1 0 0 0 0 1 1 0 | m1 = m0 >> 1
7: 7 8 7 8 7 8 | val     = compress(v0, m1)
8: 2 1 0 4 0 1 | repeats = compress(v4, m1)
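
A sketch of the first half of this in RVV assembly, assuming the proposed vscanmaxu.v, vl ≤ 256 so the e8 indices don't overflow, a1 pointing at the input chunk, a2 holding its length, and a0 holding the value slid in at element 0 (v0[0]+1 in the diagram):

vsetvli      t0, a2, e8, m1, ta, mu  # mu: masked-off elements stay at 0
vle8.v       v8, (a1)                # row 0: input bytes
vslide1up.vx v9, v8, a0              # row 1: previous byte per element
vmsne.vv     v0, v8, v9              # row 2: m0, set at the start of every run
vmv.v.i      v10, 0
vid.v        v10, v0.t               # row 3: m0 ? vid : 0
vscanmaxu.v  v11, v10                # row 4: max-scan (proposed instruction)
vid.v        v12
vsub.vv      v12, v12, v11           # row 5: run-relative index, i.e. seg-viota(m0)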

Notice how the above code uses a max-scan to implement a segmented viota. This works for both seg-viota and seg-+-scan:

seg-+-scan:
 6  5  4  3  2  1 | v0
 1  0  1  0  0  1 | m0                      |   seg-viota:
 0  1  0  1  0  0 | m1 = m0 >> 1            |   0 0 1 0 0 1 1 0 0 0 0 1 | m
21 15 10  6  3  1 | v1 = +-scan(v0)         |   0 0 9 0 0 6 5 0 0 0 0 0 | v1 = m ? vid : 0
 0  0  0  0 15  6 | v2 = vcompress(m0,v1)   |   9 9 9 6 6 6 5 0 0 0 0 0 | v2 = max-scan(v1,0)
 0  0  0 15  6  0 | v3 = vslide1up(v2,0)    |   2 1 0 2 1 0 0 4 3 2 1 0 | seg-viota = vid - v2
 2  1  1  0  0  0 | v4 = viota(m1)
15  6  6  0  0  0 | v5 = vrgather(v3,v4)
 6  9  4  6  3  1 | seg-+-scan(v0,m0) = v1 - v5

AVX512 can also decode multiple runs in parallel using the more complex vpconflict instruction from the AVX512CD extension:

0: v0 as input
1: cfss    = _m512_conflict_epi32(v0)
2: lzcnt   = _m512_lzcnt_epi32(cfss)
3: m       = _mm512_movepi32_mask(cfss)
4: scfss   = _m512_sllv_epi32(cfss, lzcnt)
5: nscfss  = _m512_andnot_epi32(scfss, ~0)
6: rl      = _m512_lzcnt_epi32(nscfss)
7: repeats = _mm512_maskz_compress_epi32(m, rl)
8: val     = _mm512_maskz_compress_epi32(m, v0)

This took the same instruction count as the max-scan variant but only works for 32 and 64-bit elements.

Falvuy noted that a slightly different storage format can be vectorized to handle multiple runs at once using the element compress instruction, as described in this paper.

An example that AFAIK can't be efficiently implemented in AVX512 is parsing a series of integer numbers, in this case up to 16-bit, in parallel. The best x86 implementation I know of uses a permutation lookup table with 2¹⁶ entries to distribute elements of a 128-bit vector into power-of-two-sized chunks, such that they can be summed using multiple horizontal adjacent pair addition instructions.

Using seg-+-scan this becomes possible for arbitrary VLEN, here is a simplified implementation:

   _    5    6    7    8    _    5    6    _    5    6    7    _  | v0 = vsub(input, '0')
   _    7    6    5    _    6    5    _    8    7    6    5    _  | v1 = vrgather(v0, reverse)
   1    0    0    0    1    0    0    1    0    0    0    0    1  | m0 = vmsgt(v1, 9)
   0    3    2    1    0    2    1    0    4    3    2    1    0  | v2 = seg-viota(m0)
   0  100   10    1    0   10    1    0 1000  100   10    1    0  | v3 = vrgather(pow10tbl, v2)
   0  700   60    5    0   60    5    0 8000  700   60    5    0  | v4 = vwmul.wv(v3, v1)
   0  765   65    5    0   65    5    0 8765  765   65    5    0  | v5 = seg-+-scan(m0, v4)
   0    1    0    0    0    1    0    0    1    0    0    0    0  | m1 = m0 >> 1
   765 65 8765 | v6 = vcompress(v5, m1)

This implementation directly widens the elements to 16 bits. An optimized implementation should instead combine and widen adjacent element pairs (A*10+B) up front, using a seg2 load and vwmul.

There are a lot of problems that benefit from scans and segmented scans: Line-of-Sight Problem, prefix sum, average word length, sparse matrix-vector multiplication (SpMV), sum of values from neighbors in a graph, Minimum Spanning Tree, Maximum Flow, ...

Note that seg-copy-scan or segment-broadcast used in some of the linked problems is just trivially compress + viota + vrgather.

"Vector Models for Data-Parallel Computing" covers even more use cases, if one were to include segmented split (element-wise Sheep-and-goats), which is probably too expensive to implement in practice. Segmented split can be implemented in <15 instructions using segmented viota and a lot more than that without scan instructions.

To reiterate, segmented scans allow you to efficiently vectorize new kinds of problems and are especially important to get good utilization on larger vector registers.

New instructions are only useful if they are implementable at reasonable speed and implementation cost. I'd like to argue that implementing scan counterparts to existing reduction instructions would not provide a significant hardware challenge.

The first half of the work-efficient parallel scan implementation already matches the reduction tree used in reductions. The second half is just a mirrored version of the up-sweep with offset inputs and outputs.

[figure: work-efficient parallel scan]

Segmented variants are conceptually a bit more complicated, but don't require much additional hardware. You need to propagate a single additional bit, from the segment starts, through the scan tree, which pins the value, such that subsequent steps don't change it. (see above)

I was not able to find any reports on hardware implementations; reports on how difficult it is to reuse reduction circuitry to implement scans would be especially insightful.

One software implementation reported that implementing segmented variants on top of existing scan instructions is quite straightforward:

On the CRAY Y-MP, the asymptotic running time of the plus-scan is about 2.25 times that of a vector add, and is within 20% of optimal. An important aspect of our implementation is that a set of segmented versions of these scans are only marginally more expensive than the un-segmented versions. (source)

Assuming additional implementation complexity of segmented scan versions of existing reductions is reasonable, such instructions should be added to RVV, because they allow vectorization of new kinds of problems, while efficiently utilizing even very long vector registers.

Possible instruction proposals

While the reduction instructions take a second register source for the initial value, I'd recommend dropping this for the vector scans to conserve encoding space. Similarly, I wouldn't add separate instructions for segmented scans but rather define the masked variants of scans to work as segmented scans. AFAIK, masked scans and masked segmented scans are rare and can be implemented with additional instructions if needed.

As a minimal set of scan instructions, I would propose the following:

  • +-scan/unsigned-max-scan: vscansum.v vd, vs1, vscanmaxu.v vd, vs1
  • segmented +-scan/unsigned-max-scan: vscansum.v vd, vs1, vm, vscanmaxu.v vd, vs1, vm
  • segmented viota: viotar.m, the "r" standing for resetting, which describes the behavior: it counts up until a mask bit is set, at which point it resets to zero and continues
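
To illustrate the proposed semantics on a small example (viotar.m is hypothetical):

# viotar.m v8, v0
#   v0 (mask, element 0 first): 0 0 1 0 0 1 1 0
#   v8:                         0 1 0 1 2 0 0 1   <- resets to 0 at every set bit
vsetvli  t0, a0, e16, m2, ta, ma
viotar.m v8, v0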

A full set of scan instructions corresponding to existing reductions would include the following additions:

  • unsigned min-scan: vscanminu.v vd, vs1
  • signed min/max-scan: vscanmin.v vd, vs1, vscanmax.v vd, vs1
  • floating-point +-scan: vfscanusum.v vd, vs1
  • including segmented versions: vscanminu.v vd, vs1, vm, vscanmin.v vd, vs1, vm, vscanmax.v vd, vs1, vm, vfscanusum.v vd, vs1, vm

It could also include element-wise and/or/xor scans, but I don't see a need for these bitwise scans.

and-scan and or-scan already exist on mask registers, where they are actually useful:

  • vmsbf(~m) is an and-scan on mask-bits
  • vmsif(m) is an or-scan on mask-bits

Coincidentally, the next section will argue for a xor-scan on mask bits.

xor-scan/prefix-xor for masks

This operation is often used in SIMD parsing to match delimiters and is equivalent to carryless multiply with -1.

Usage: simdjson, simdzone, simdcsv

It can be emulated using the following instruction sequence:

# (xor_scan(00100100)>>1) == 00011100, it's offset by one
vsetvli t0, x0, e8, m1, ta, mu # requires `EMUL=LMUL*(8/SEW)`
viota.m  v1, v0
vand.vi  v1, v1, 1
vmseq.vi v0, v1, 1

The above is three LMUL=1 instructions when LMUL*(8/SEW)≤1, but you may need to change the surrounding code to accommodate the offset when porting existing code.

The following variant will likely be faster when LMUL*(8/SEW)>2:

# xor_scan(00100100) == 00011100
li a1, -1
vsetvli t0, x0, e64, m1, ta, mu
vclmul.vx v1, v0, a1 # 64-bit parallel-prefix XOR
vmsle.vi  v0, v1, -1 # top bit set
viota.m   v2, v0 # parallel-prefix XOR of top bits
vand.vi   v2, v2, 1
vmseq.vi  v0, v2, 1
vnot.v    v1, v1, v0.t # apply flip

Expected RThroughput for SEW=8 from SiFive P670 llvm-mca model:

| method | m1 | m2 | m4 | m8 |
|---|---|---|---|---|
| dedicated instruction | 1 | 1 | 1 | 1 |
| viota | 3 | 4 | 8 | 16 |
| vclmul+viota | 6 | 6 | 6 | 6 |

Vector length-specific code could have a specialized VLEN≤512 path that uses a single vclmul.vx to do the prefix-xor at LMUL=1 in a single instruction.
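
A sketch of that VLS path, assuming Zvbc and an e8/m1 mask of at most 64 bits (VLEN ≤ 512), so the whole mask fits into one 64-bit element:

li        a1, -1
vsetivli  zero, 1, e64, m1, ta, ma
vclmul.vx v0, v0, a1     # v0 = inclusive prefix-xor of the original mask bits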

The current situation isn't bad; however, the implementation comes almost for free if your processor already implements vclmul.

Because it's cheap to implement, uses minimal opcode space, and would encourage VLA code, it would be advantageous to add this as a dedicated instruction.

Notes on naming:

While vmxorscan would be a possible name, it doesn't quite fit with the existing mask instructions:

  • or-scan is vmsif(m) "set-including-first mask bit"
  • and-scan is vmsbf(~m) "set-before-first mask bit"

To harmonize the names, vmsxff for "set-xor-from-first mask bit" might be a better alternative.

I'm not sure if they are needed, but they would be nice to have

Mask and element reversal

Reversing elements and mask-bits isn't a common operation, and there are already performance-portable ways of implementing them for application processors:

element-reverse:
vrgather.vv v2, v0, v1 # v1 previously initialized to vl-1-vid
# To reverse at LMUL>1, vrgather each LMUL=1 register separately and use moves
# to swap the LMUL=1 vector registers

mask-reverse:
vbrev.v v0, v0
vrgather.vv v2, v0, v1 # v1 previously initialized to vl-1-vid

However, there are a few caveats that make dedicated instructions attractive.

  • Firstly, both of these reversals assume vl=vlmax and require an additional vslidedown when vl<vlmax.
  • Secondly, both approaches don't work for SEW=8 with VL>256 and require vrgatherei16 with 16-bit indices instead, which often performs worse.
  • Thirdly, using vrgather to reverse elements will perform well on sane application class processors, but on other targets SEW=8 vrgather performs significantly worse than SEW=64 vrgather. In that case, you should probably use a SEW=64 vrgather with an additional vbrev8 when reversing a SEW=8 vector.

While I don't think this is a very common operation, a dedicated reverse instruction would make code using it easier to write and potentially faster.

Both element and mask-reversal instructions would not take a lot of encoding space, as they have only one source and one destination register.

Dedicated within 128-bit lane shuffle / 16x8b LUT

For string algorithms, a fast 16x8b LUT is non-negotiable, and a lot of other integer workloads also require it, or more generally fast shuffles within 128-bit lanes. This is currently addressed by an LMUL=1 vrgather.vv; however, scaling it to larger vector lengths seems impractical, as complexity grows quadratically. If you have the silicon budget for a 256-bit lane vrgather.vv primitive, but a vector length of 1024, then a naive hardware implementation would need to apply the primitive (1024/256)^2=16 times.
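
For reference, the current approach looks roughly like this (a sketch; a1 points to a 16-byte table, v8 holds the input bytes):

vsetivli    zero, 16, e8, m1, ta, ma
vle8.v      v16, (a1)          # table lives in the first 16 bytes of v16
vsetvli     t0, a0, e8, m1, ta, ma
vand.vi     v9, v8, 15         # keep only the low 4 bits as LUT index
vrgather.vv v10, v16, v9       # v10[i] = table[v9[i]], an LMUL=1 gather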

We see this impacting existing implementations. The SiFive X280 has VLEN=512 and a native 256-bit data path, but has vl-cycle occupancy for vrgather.vv at LMUL ≥ 1, and a special-case ~1-cycle occupancy for LMUL ≤ 1/2, which matches the data path width (see riscvarchive/riscv-v-spec#929).

There is a silver lining, however: an implementation could detect when all indices in a lane point to the same source lane and only apply the primitive once for that lane. This would improve the above example to 4 applications of the primitive for a lot of common cases, including the 16x8b LUT, any in-lane permutation, element reversals, ...

I'd imagine that high-performance out-of-order cores would have trouble implementing this optimization, but I don't think it's as relevant for those cores. They are unlikely to use very long vectors and will have the budget to implement a full LMUL=1 vrgather.vv primitive. See existing x86 and ARM cores.

On the other hand, the high-performance out-of-order P470 and P670 cores from SiFive seem to implement this optimization for LMUL>1 vrgathers, where this doesn't even matter as much because software can just unroll into multiple LMUL=1 vrgathers if there is no lane crossing.

Nonetheless, we see in existing AVX512 implementations that the latency of an in-lane shuffle is smaller than that of a full lane-crossing permute:

| latency for zmm | icelake | tigerlake | rocketlake | alderlake-P | Zen4 |
|---|---|---|---|---|---|
| vpshufb | 1 | 1 | 1 | 1 | 2 |
| vpermb | 3 | 3 | 3 | 3 | 6 |

Additionally, while SVE originally didn't include 128-bit lane permute, SVE2.1 recently added it as TBLQ.

A separate instruction may simplify implementations, but it would also be an incentive to neglect the LMUL=1 vrgather.vv implementation, which is also an important operation. This should be mostly mitigated because the base V extension only has the full lane-crossing-gather, hence a lot of software will use lane-crossing operations and force vendors to implement it competitively.

So I'm not sure if a separate instruction is needed, but it seems like it would help simplify implementations and enable more performance portable RVV code.

Possible instruction proposals

A) gather within 128-bit lanes, indices are in the corresponding lane of the index vector register

  • pros
    • allows for different shuffles per lane, this seems to be a rare use case however
    • index lanes are next to data lanes, so there may be less data movement in certain architectures.
  • cons
    • requires an additional 128-bit element broadcast instruction to be more easily useful; such an instruction does have general utility, though
    • increases register pressure, as a LMUL>1 gather would require an LMUL>1 index register

B) gather within 128-bit lanes, indices are in the lowest 128-bit of the index register, potentially always as 8-bit

  • pros:
    • no 128-bit broadcast needed, thus easier programming model
    • only uses a LMUL=1 index register
    • with 8-bit indices, shuffles for e16 and above can be loaded from a single 64-bit GPR (.vx variant possible, although the semantics would be weird for e8)
  • cons:
    • Use case of different shuffles per lane not supported
    • may require more data movement and increase latency compared to indices in the corresponding lanes

I don't think shuffles from immediate (like vpshufd) are needed to reduce vector register pressure since most other instructions have immediate and GPR variants. It would also require an 8-bit immediate, which would take a lot of encoding space. As such, this would only make sense as a destructive operation using the 5-bit immediate and additional 5-bits from the rs1 encoding. This would use two more bits than necessary, which could either be repurposed as additional opcode space or directly encode the EEW.

Ternary logic

AVX512 introduced the very powerful vpternlogd/vpternlogq instructions, which use an 8-bit immediate to select an arbitrary ternary logic operation.

They can express the following bitwise operations in a single instruction:

  • C xor A xor B
  • (A orn B) and C
  • C ? A : B bitwise merge
  • C ? A nor B : A xor B exactly one bit set
  • C ? A xor B : A and B exactly two bits set
  • Used in MD5: C ? A : B, A xor B xor C, A xor (B orn C)
  • Used in SHA-1: C ? A : B, A xor B xor C, C ? A or B : A and B
  • Used in SHA-2: C ? A : B, C ? A or B : A and B

RVV defines all possible binary operations for mask registers (and, andn, nand, nor, or, orn, xnor, xor), but only and, or, xor, andn for vector elements. The binary operations on vector elements do have a GPR and immediate variant, which can be quite useful.

While a fully generic ternary logic instruction would be a great design point, I don't think it would fit into the current ISA very well, as the existing logic instructions could've been defined on top of it.

Using the existing RVV binary mask operations it takes up to four instructions to simulate what vpternlog can express in a single instruction. That isn't too bad, but adding a ternary conditional selection instruction (C ? A : B) would bring it down to three or fewer instructions with a maximum dependency chain length of two instructions.
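
For example, a bitwise merge (C ? A : B) currently takes three dependent instructions (a sketch with A in v8, B in v9, C in v10):

vsetvli t0, a0, e64, m2, ta, ma   # any SEW works, this is pure bit logic
vxor.vv v12, v8, v9               # A ^ B
vand.vv v12, v12, v10             # (A ^ B) & C
vxor.vv v12, v12, v9              # B ^ ((A ^ B) & C)  ==  C ? A : B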

I think it would be reasonable to add the missing binary operations for vector elements, as well as a ternary conditional selection (C ? A : B), which should also have a variant for mask operations. The missing binary operations don't really need an immediate variant, and I'd be happy with just defining vector-vector variants to preserve opcode space.

I've placed this under "not sure if needed", because I'm not sure how much real-world difference additional instructions would make. I'm also unsure how to quantify the impact of adding a three-source instruction to the mask unit, which previously only had two-source-operand instructions.

Probably not needed

Obviously, this won't cover things for which instructions are directly available. Instead, it will cover instructions present in other ISAs that I don't see as needed.

all-to-all comparisons

Both AVX512 and SVE provide instructions that do all-to-all element comparisons for finding and counting equal elements.

AVX512 has vp2intersect and vpconflict, which both do an all-to-all comparison, with 32 and 64-bit variants. SVE has MATCH and HISTSEG for 8 and 16-bit elements, which do an all-to-all comparison within every 128-bit lane, and HISTCNT which does a lane-crossing all-to-all comparison for 32 and 64-bit elements.

Historically vpconflictd and vp2intersect were famously quite slow on Intel:

| Latency/RThroughput | all intel, icelake to alderlake | Zen4 | Zen5 |
|---|---|---|---|
| vpconflictd ymm, ymm | 17/9 | 6/0.67 | 6.9/1 |
| vpconflictd zmm, zmm | 26/19.5 | 7/1.33 | 6.9/1 |
| vp2intersectd ymm, ymm | 27/11 (tiger/alderlake only) | 6/0.67 | ?/1 |
| vp2intersectd zmm, zmm | 42/23 (tiger/alderlake only) | --- | ?/1 |

The performance of vp2intersect on Intel was so bad that a faster-than-native alternative software implementation was found. Intel deprecated VP2Intersect from 2023 onward, while AMD re-introduced it on Zen5 with a very performant implementation.

The SVE instructions don't have this problem, presumably because the vector lengths are all just 128-bits.

| Latency/RThroughput | Cortex A510 | Cortex A520 | Neoverse N2 | Neoverse V2/V3 | Cortex A710 | Cortex X2/X3/X4 | Cortex A715/A720 |
|---|---|---|---|---|---|---|---|
| MATCH | 7/4 | 7/4 | 2/1 | 2/1 | 2/1 | 2/1 | 2/0.5 |
| HISTSEG | 8/5 | 8/2.5 | 2/0.5 | 2/0.25 | 2/0.5 | 2/0.25 | 2/0.5 |

I could find almost no usage of the SVE instructions, while vpconflict and vp2intersect were slightly more common.

If RVV were to add such instructions, it would need to follow SVE in restricting them to within lanes; otherwise, implementations above a certain VLEN are just not reasonably possible, due to the quadratic growth of the needed comparisons. This restriction makes the instructions a lot less nice to use and, as I'll argue in the following, they are not needed to suitably address the most common use cases of these instructions.

Histogram

By far the most common application of these instructions is histogram computation/data binning. This involves filling a vector with indices of buckets that need to be incremented, detecting conflicts, resolving them, and finally gathering, incrementing, and scattering the values to the buckets.

Intel advertised vpconflict explicitly for allowing their compiler to auto-vectorize histogram computation. Currently, only vendor compilers are capable of this and only when the arguments are marked as restrict.

The biggest benefit of such vectorizations isn't the histogram increment itself, but rather the vectorization of the rest of the calculation. Auto-vectorizing compilers could just vectorize the loop up until the histogram increment, at which point they could iterate over the vector register entries and increment the histogram using scalar code.

Furthermore, when measured (Zen5 results, benchmark code), using vpconflict doesn't seem beneficial for histograms and is even outperformed by scalar code, extremely so on AMD. This was not the case on current RVV implementations, presumably because they have very weak in-order scalar execution compared to the tested x86 processors. The out-of-order execution resources of today's processors seem to already do a great job at executing loads/stores in parallel while avoiding conflicts.

There are many ways to efficiently vectorize histogram creation without a dedicated conflict detection instruction, which cover almost all use cases:

  • Small histograms can benefit greatly by hand-tuning for the specific size: hand-tuned byte histogram without vpconflict, see the "aptroot" line in the graph below
  • On X86 and presumably, other highly aggressive out-of-order cores, using scalar code to increment the histogram values is faster than using memory gather/scatter.
  • The histogram is small enough to replicate it VL times, forgoing the need for conflict detection entirely.
  • The histogram is too large to replicate, consequently, conflicts are rare. In this case, you can quickly check if a conflict happened by doing an additional gather after the scatter write-back of incremented values to check if the difference between the original values is exactly equal to vl. If a conflict occurs, then conflicting buckets will only be incremented once, and you'll be able to tell that there weren't vl increments as expected.
  • Finally, conflict detection can always be implemented with N/2 iterations of existing instructions: https://riscv.godbolt.org/z/rEY6oM6ne

[figure: byte histogram benchmark, "aptroot" line]

Due to the above alternative implementation strategies covering the histogram use case almost entirely, I don't see it as a good motivation for a dedicated instruction.

Set intersection

The other slightly bigger use case for these all-to-all comparison instructions is computing set intersections and other set operations that follow from that.

I could find two libraries that used these instructions to implement set intersection: tetzank/SIMDSetOperations and ashvardanian/SimSIMD. However, I couldn't find any external code using these set intersection functions.

Let's look at some benchmarks comparing implementations using the dedicated instructions with ones that aren't using them:

  • SimSIMD set intersection:
    • The SVE HISTCNT and MATCH implementations were slightly slower than the equivalent NEON variant on Neoverse V2, which has a 128-bit SVE implementation.
    • The AVX512 implementation with vp2intersect was on average only 5% faster than the AVX512 shuffle variant on Zen5, although the results varied a lot. The biggest speedup of 1.60x occurred when intersecting 128 with 8292 16-bit elements, while the biggest regression was 1.25x slower when intersecting 1024 with 1024 16-bit elements.
  • SIMDSetOperations set intersection:
    • On Skylake-X: The vpconflict implementation was 1.6x slower than the AVX2 shuffle implementation.
    • Zen5: The vp2intersect implementation was 1.5x faster than the AVX2 shuffle implementation, while the vpconflict one was 1.2x slower.
    • Pretending we have a fast vp2intersect on other processors: Since I don't have access to a CPU with a fast vp2intersect, I modified the vp2intersect implementation to use AVX2 and replaced the vp2intersect instruction with a sequence of instructions that has the same latency as vp2intersect on Zen5, so 7 cycles. Since AVX2 doesn't have a compressed store, I replaced it with a regular store. This will produce the wrong result, but should give an indication of the expected performance if a fast vp2intersect instruction were available.
      • Skylake-X: fake-vp2intersect was ~1.35x faster
      • Zen1: fake-vp2intersect was ~1.1x faster
      • Coffee Lake: fake-vp2intersect was ~1.3x faster

That's not a large improvement for an instruction of quadratic implementation complexity.

The SIMDSetOperations benchmark intersected two large, equal-sized sets with ~1/3 intersecting elements. Often this is not the case and one set is a lot larger than the other, e.g. in keyword searches and graph analytics. (source)

For intersecting very large sets with smaller ones, there are alternative algorithms that don't benefit much from such instructions. This paper on fast SIMD set intersection proposes precomputing a bloom filter-like structure to represent the large set. It then uses specialized NxM intersection kernels to resolve matches more efficiently:

[figure: FESIA set intersection approach]

This approach could be expanded by batching multiple NxM intersections with the same dimensions and resolving them using a VL/K MxN intersection kernel. For instance, the following code could resolve VLEN/64 2x4 intersections of 32-bit elements at once:

# inputs: LMUL=1 v2 # ...BAbaBAba
#         LMUL=2 v4 # ...DCBAdcbaDCBAdcba
# output: v0 # intersecting elements
vsetvli x0, a0, e32, m1
vtrn1.vv v6, v2, v2 # ...AAaaAAaa
vtrn2.vv v8, v2, v2 # ...BBbbBBbb
vwzip.vv v10, v6, v6  # ...AAAAaaaaAAAAaaaa
vwzip.vv v12, v8, v8  # ...BBBBbbbbBBBBbbbb
vsetvli x0, a1, e32, m2
vmseq.vv v0, v4, v10
vmseq.vv v1, v4, v12
vmor.mm v2, v0, v1
vcompress.vm v0, v4, v2
# Note: I'm using vtrn/vwzip as proposed earlier.
#       It could also be implemented with two vrgathers instead,
#       but for that you'd want to create the indices outside of the main loop.

This approach allows you to get almost full utilization of your vector length.

A dedicated full-register intersection instruction wouldn't be that beneficial, because the sub-intersections are usually small. What could be beneficial, is a lot of dedicated instructions for some of the larger or harder-to-implement kernels, e.g. for multiple 3x4, 3x5, 4x5, ... intersections.

other uses

The other uses I could identify were:

  • sparse vector dot product, which is functionally equivalent to set intersection,
  • run-length encoding, which can be solved more efficiently with segmented scans, see above,
  • a handful of uses where vpconflict was directly followed by a memory gather/scatter, but I wasn't able to clearly tell whether it's a histogram in disguise. Still, in these cases the memory accesses are likely going to be the bottleneck.

The implementation cost of dedicated intersection/conflict-detection instructions is very high, and the benefits aren't clear. Since the most common applications can be efficiently vectorized using other methods, I don't consider these good candidate instructions for general-purpose application class processors.

If vendors see a need for dedicated instructions similar to the ones discussed above for specific applications, e.g. database applications, then it's certainly worth considering. Such applications would probably benefit from more specialized instructions that match the exact usage of the application. In the intersection algorithm described above, that may mean many instructions that do multiple small intersections of different sizes in parallel.

Please share feedback and contribute suggestions

Feel free to share feedback and suggest additions and alternative implementations.

Advanced SIMD instructions usage survey notes

@camel-cdr
Author

@nibrunie Yes it should've used the .vv variant, thanks.

For bit-compress/expand I was thinking about vmerge.vim + vrgather.vv + vmseq.vi. For bit-compress you could use a vcompress.vm instead of the vrgather.vv, but bit-expand would still need vrgather.vv

With "a large series of shifts and bit operations" I meant that you may be able to replace a specific compression/expansion mask with a series of bit operations.

I've adjusted the text a bit.

@drom

drom commented Nov 11, 2024

@camel-cdr coming from a DSP workload (SDR-like) perspective. Would you like to discuss fixed-point implementations with rounding and saturation?

@camel-cdr
Author

Hi @drom,
In theory yes; however, I don't have any experience in that regard.
Somewhat related, Courmisch mentioned that "signed to unsigned narrowing clip" is currently a pain point.

@ubc-guy

ubc-guy commented Dec 7, 2024

the mask slides are an interesting idea.

(1) left/right shifting doesn't make sense, because these are ordered by element number not powers of 2; instead, I recommend using the terms up/down to match the direction of regular vector element slides.

(2) wouldn't the most common usage be to align the mask shift with a regular vector element slide? maybe the regular slides need a variant that also shifts the mask at the same time by the same amount. the question remains: what bit value (0 or 1) should be used for the newly introduced bits? this would depend upon how that mask was originally generated, and likely would need to align with the new values being shifted in. [vslide1up/vslide1down introduce a new value from rs1, either from X or F registers; vslideup leaves elements at the bottom unchanged; vslidedn copies values from the source vector as if it was a longer vector until it hits VLMAX, then it copies 0s.] for vslideup/vslidedown, the mask bits would likely follow the full element policy (and introduce 0s on slidedown when going past VLMAX). for vslide1up/1down, a new value is being introduced, so it matters how the mask itself was originally generated -- it is not clear that the originally proposed constant 0/1 is the right answer here either.

@camel-cdr
Author

camel-cdr commented Dec 7, 2024

@ubc-guy I've added a line at the top to verify my identity.

(1) left/right shifting doesn't make sense, because these are ordered by element number not powers of 2; instead, I recommend using the terms up/down to match the direction of regular vector element slides.

Sounds good, I've adjusted that.

(2) wouldn't the most common usage be to align the mask shift with a regular vector element slide?

I haven't seen that use case yet.

The most common case I've seen is, if you need to check if one or more values follow each other.
This comes up all the time in parsing, e.g. for handling escape characters, or to identify the starts of contiguous set bits (andn(m, m<<1)).
It could also be used to propagate carry within a vector register, for bigint calculations within a single vector register.

I don't think there is enough value in an additional source register to determine the bit to shift in, to justify the huge increase in used opcode space.
A fast way to insert and extract the first and last bits to and from GPRs might be a better alternative.

This currently requires a four instruction sequence + vtype change (from my mastodon post).


As a side note: RVV needs good support for mask operations to be competitive with contemporary fixed length ISAs like AVX512 in VLA code.

Fixed length ISAs can move their mask registers to GPRs and benefit from the immensely optimized scalar instruction execution. On modern processors this means you effectively have 4-6 issue mask operations.
Contrast that with prominent RVV implementations (going with public information):

  • SiFive P470, P670 and P870 all have only a single mask unit, so presumably single issue mask instructions (confirmed for P470/P670 through llvm scheduler model), at most two issue on the P870 as it has two vector execution units
  • Tenstorrent Ascalon: has two vector execution units, so at most dual issue mask instructions
  • Ventana Veyron V2: one mask unit, so presumably single issue mask instructions

1-2 issue mask instructions on RVV implementations vs 4-6 on other ISAs... Why is RVV still competitive? Because of LMUL!

LMUL amortizes scalar and mask instructions, because masks are stored in a single LMUL=1 vector register and usually execute in constant time regardless of LMUL.
Most problems can run at LMUL=4 or LMUL=2, so with just two issue for mask instructions, you get the equivalent of 4-8 issue on most workloads.

This has one big requirement, though: Namely that mask operations need to stay within mask registers.

This is why dedicated mask instructions for shifts are so important.
If it was just LMUL=1, then sliding the source register (or overlapping loads) and doing an additional comparison would work and be fast for most use cases. But then your mask operations wouldn't be competitive.

@dead-claudia

@camel-cdr I've been looking into GLSL compilation and what it'd take to re-batch while loops after too many exit. To re-batch those loops (to keep the cores busy), I'd need to do this sequence:

  1. Queue tasks for the continuation of loop-exiting threads
  2. Compress the loop-continuing variables
  3. Grab new threads from other tasks for this specific loop where not enough continued to make it worth looping the task back
  4. Move the new lanes into place
  5. Loop back to the beginning

If that while is itself filtered somehow (say, a partial update or it's in an if), the filtermask would also have to be compressed the same way in 2. And 4, which already needs to do vslideup.vx to merge in the grabbed lanes' loop-continuing variables, would also have to do a vmslideup with the same register offset to merge in the grabbed lanes' own filter mask.

So there is a use case for it, although admittedly a niche one. And adding it won't really require that much more circuitry. (In fact, you can reuse the circuitry for vslideup and vslidedown and just implement the last 0-7 bits after.)

@dead-claudia

dead-claudia commented Oct 13, 2025

Separately, a (possibly 48-bit) vperm4.vi vd, vs1, imm12, vm that does a 4-lane-group permute from 2 sources should be enough for the fast permute. I did some investigation into WebAssembly's (rather inefficient) shuffle instruction, and found almost all permutes in it, including each column of a 2x2 matrix transpose, could be done with just this kind of permute with 32-bit lanes: WebAssembly/design#1559 (comment). To put this into context, they represented over 2% of that binary by size and about 0.4% by WebAssembly opcode count. In current RISC-V, the savings would be even more significant: about 25% of the opcodes are just loads and stores of locals 0-15 (almost always in registers already), and the needed tables are a minimum of 8 bytes (with V) + an extra 8-12 bytes just to load it for LMUL=1, for a total of 16-20 bytes. Raise the LMUL to 2, it's up to 24-28 bytes. A 6-byte opcode with no memory requirements obviously would benefit things a lot.

@dead-claudia

dead-claudia commented Oct 13, 2025

Edit: I'm abandoning vrid.v per https://gist.github.com/camel-cdr/99a41367d6529f390d25e36ca3e4b626?permalink_comment_id=5885264#gistcomment-5885264.

Old suggestion

Also, have you considered a vrid.v for vd[i] = vl - 1 - i analogous to vid.v's vd[i] = i? Then, vrid.v vtemp, vm; vrgather.vv vd, vs1, vtemp, vm could be fused for vector reversal, with the vrid.v being entirely skippable when vd=vtemp.

I've found myself doing this a lot when finding the first index from list end:

    vsetvli xvl, x0, e32, m8, ta, ma
    vid.v offsets
    addi xt, xvl, -1
    vrsub.vx vo, offsets, xt
loop:
    vsetvli xt, len, e32, m8, ta, ma
    sub base, base, xt
    sub len, len, xt
    vle32.v vinput, (base)
    vrgather.vv vt, vinput, vo
    vmseq.vx vt, vt, item_to_find
    vfirst.m xt, vt
    bgez xt, found
    bnez len, loop
    j not_found
found:
    vslidedown.vx vo, vo, xt
    vmv.x.s xt, vo
    add result, xt, base

vrev only replaces the vrgather. It obviously helps some:

    vsetvli xt, x0, e32, m8, ta, ma
    vid.v vo
    vrev.v vo, vo
loop:
    vsetvli xt, len, e32, m8, ta, ma
    sub base, base, xt
    sub len, len, xt
    vle32.v vinput, (base)
    vrev.v vt, vinput
    vmseq.vx vt, vt, item_to_find
    vfirst.m xt, vt
    bgez xt, found
    bnez len, loop
    j not_found
found:
    vslidedown.vx vo, vo, xt
    vmv.x.s xt, vo
    add result, xt, base

What'd be the most compact is a vrid. This could even fuse with a vrgather.vv. Yes, it's an extra op in the loop, but the unfused vrid can be done in the same cycle.

loop:
    vsetvli xt, len, e32, m8, ta, ma
    sub base, base, xt
    sub len, len, xt
    vle32.v vinput, (base)
    vrid.v vo
    vrgather.vv vt, vinput, vo
    vmseq.vx vt, vt, item_to_find
    vfirst.m xt, vt
    bgez xt, found
    bnez len, loop
    j not_found
found:
    vslidedown.vx vo, vo, xt
    vmv.x.s xt, vo
    add result, xt, base

I've run into other situations where a reversed vid is what I needed. I feel just being able to have that fused would be ideal. And for processors that can't efficiently reverse vectors for some reason, the unfused version probably won't be any slower than what fusing them could accomplish.


Oh, and tail-assigning variants of vmv.x.s and friends (say, vmv.x.t and similar) could make a lot of loops easier to vectorize.

Those could also, when combined with vsetvl, be used to turn vectors into scalar table lookups, something that could be used to drastically accelerate DEFLATE decompression based on my reading of libdeflate's source code. Specifically, the inner loop is somewhat latency-bound, and if this code sequence can produce a latency lower than an L1 cache load hit, it'll provide a measurable perf improvement:

addi rd, off, 1
vsetvli x0, rd, x0
vmv.x.t rd, table

I suspect, if the above sequence gets fused to 1 two-cycle uop, a 50% perf boost for Zvl16384b (currently under research) and a 5% perf boost for Zvl2048b (already in production). And 1 cycle is realistic: increment then concurrently store to vl and index into vector. The circuit would require significant area, but the gate delay of indexing would be tiny: depth of 6 MUX4 + 1 MUX2 + 1 NOT for the worst case of Zvl65536b LMUL=1, and LMUL>1 can just add a second cycle to determine which register to load.


For complete mask parity, I could see RISC-V covering the gap with just these 5 operations combined with an extension to make "tail-agnostic"/"mask-agnostic" zero disabled lanes rather than leaving them completely implementation-defined.

  • vmmv.eqz.m.x vd, rs1, vm: Set vd to ones if rs1 == 0, zeroes otherwise.
  • vmmv.nez.m.x vd, rs1, vm: Set vd to ones if rs1 != 0, zeroes otherwise.
  • vmpmv.m.x vd, rs1, vm: Repeat bits of rs1 into vd, truncating to vl bits.
  • vmpmv.m.i vd, imm4, vm: Repeat bits of imm4 into vd, truncating to vl bits.
  • vmpmv.s.m rd, vs1, vm: Extract the first max(XLEN, vl) bits from mask vs1, zero-extend to XLEN, and return in rd.

WebAssembly's considering much of this for their flexible vectors extension. Their first tier (what they consider "uncontroversial" and plan to unconditionally include) includes the following permutation types, and the vmpmv.v.i opcode and my suggested opcodes from earlier completely fill out the remaining grid:

  • vec.i*.concat_lower_lower: do vslideup.vx vd=vs1, vs2, vl/2 (no new opcodes needed)
  • vec.i*.concat_lower_upper: do vslideup.vx vd, vs1, vl/2; vslidedown.vx vd, vs2, vl/2 with tail undisturbed (no new opcodes needed)
    • This in theory could be fused when vd≠vs2, because the previous lanes of vd below vl are unobservable after this sequence
  • vec.i*.concat_upper_lower: same as vec.i*.concat_lower_upper, but with operands reversed (no new opcodes needed)
  • vec.i*.concat_upper_upper: do vslidedown.vx vd=vs1, vs2, hvl with tail undisturbed (no new opcodes needed)
  • vec.i*.concat_even: do vmpmv.m.i vt, 0x5; vcompress.vm vd, vs1, vt; vcompress.vm vt, vs2, vt; vslideup.vx vd, vt, vl/2
  • vec.i*.concat_odd: do vmpmv.m.i vt, 0xA; vcompress.vm vd, vs1, vt; vcompress.vm vt, vs2, vt; vslideup.vx vd, vt, vl/2
  • vec.i*.oddeven: do vperm4.vi vd/vs1, vs2, 0o2301
  • vec.i*.reverse: do your proposed vrev.v vd, vs1, vm
  • vec.i*.dup_odd: do vperm4.vi vd/vs1, vs2, 0o3311

@camel-cdr
Author

camel-cdr commented Oct 21, 2025

@dead-claudia

I've been looking into GLSL compilation and what it'd take to re-batch while loops after too many exit

That's certainly a usecase for mask slides.

The problem kind of sounds similar to http://vectorizer.org/vpsplitsux.html.
I think for long-vector processors this works the best using vcompress and reducing vl, if your hardware implements Ovlt (vl short circuiting).
You could then have the loop loop until vl is smaller than some threshold, maybe VLEN/8 or VLEN/16, and only refill after.

Separately, a (possibly 48-bit) vperm4.vi vd, vs1, imm12, vm that does a 4-lane-group permute from 2 sources should be enough for the fast permute

I was also thinking about if in-lane permutes that can read from vertically adjacent lanes of two sources would be useful, but I think for small lanes dedicated instructions are better.

There is already a fast track proposal in progress for zip/unzip and trn1/trn2 equivalents: https://github.com/ved-rivos/riscv-isa-manual/blob/zvzip/src/zvzip.adoc

Especially vpaire/vpairo will be a lot cheaper to implement than vperm4.vi.

I think all of the WASM flexible vectors cases are already suitably covered.

  • vec.i*.concat_even: This is a simple vnsrl at double the element width (see the sketch after this list)
  • vec.i*.concat_odd: same as above but with a different shift amount
  • vec.i*.oddeven: Should be a vmerge with evenodd mask
  • vec.i*.dup_odd: vpairo v8, v9, v9 in the proposal, in base "V" it would be vmacc.vx v8, v8, t0 with t0 = 1<<(SEW/2)
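
A sketch of the concat_even/concat_odd cases for 32-bit elements, where the narrowing shift reads the adjacent register pair v8/v9 as one EEW=64 source group:

vsetvli  t0, zero, e32, m1, ta, ma
vnsrl.wi v12, v8, 0      # concat_even: low 32 bits of every 64-bit pair
li       t1, 32
vnsrl.wx v13, v8, t1     # concat_odd: high 32 bits of every 64-bit pair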

an extension to make "tail-agnostic"/"mask-agnostic" zero disabled lanes rather than leaving them completely implementation-defined

That would be incompatible because currently only retaining the destination or filling all 1s is allowed (an arbitrary choice of both for every element).

For complete mask parity, I could see RISC-V covering the gap with just these 5 operations combined with

I think we already have decent support for creating masks from GPRs, because you can just use vmv.v.x with a suitably adjusted vtype, but a basic vmv.m.x and maybe vmv.m.i would be good to avoid the vtype changes.
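For example (a sketch; the e64/m1 vtype, XLEN=64, and the register names are assumptions):

# repeat the bits of a0 into mask register v0 (vmpmv.m.x-style behaviour)
vsetvli t0, zero, e64, m1, ta, ma  # one e64/m1 splat covers all VLEN mask bits
vmv.v.x v0, a0                     # mask bit i of v0 = bit (i mod 64) of a0
# ...plus a vsetvli to restore the loop's vtype, which is the cost in question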

I don't think vmmv.eqz.m.x and vmmv.nez.m.x are needed. You can already do seqz t0, t0; neg t0, t0; vmv.v.x v0, t0 (snez for the nez variant).
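Spelled out (a sketch; t0 is assumed to hold the scalar rs1 operand):

# vmmv.eqz.m.x emulation: v0 = all ones if t0 == 0, all zeros otherwise
seqz t0, t0     # t0 = (t0 == 0) ? 1 : 0   (use snez for the nez variant)
neg  t0, t0     # t0 = -1 or 0
vmv.v.x v0, t0  # splat; covers every mask bit at any SEW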


I'm not sure how I feel about vrid.v. I like the idea of more vid/viota-like instructions, but there are already two good ways of doing element reversal:

  • -1 strided load/store: This would work in your example above, but doesn't work if you need to reverse between non-load/store operations
  • vadd.vv vdst, vrid, vl. If you have a register spare, you can simply precompute vrid for VL=VLMAX using vid+vsub and offset that using vl in the loop.

tail-assigning variants of vmv.x.s and friends

For the scalar LUT example, you can already do vsetivli x0, 1, ..., vrgather.vx v1, v2, t0, vmv.x.s t0, v1.
But I can see how a tail move is useful, as it could remove a vtype change.

@dead-claudia

@camel-cdr

I think for long-vector processors this works best using vcompress and reducing vl, if your hardware implements Ovlt (vl short-circuiting).
You could then have the loop run until vl is smaller than some threshold, maybe VLEN/8 or VLEN/16, and only refill after that.

That was exactly the same idea I had. I just didn't want to go too deep into the weeds right away with it.

I was also thinking about whether in-lane permutes that can read from vertically adjacent lanes of two sources would be useful, but I think for small lanes dedicated instructions are better.

There is already a fast track proposal in progress for zip/unzip and trn1/trn2 equivalents: https://github.com/ved-rivos/riscv-isa-manual/blob/zvzip/src/zvzip.adoc

vpaire/vpairo in particular will be a lot cheaper to implement than vperm4.vi.

I wouldn't say "a lot". My idea only requires 8-to-1 MUXes and a bunch of wire crossings. It's cheaper, yes, but not by a lot.

  • vec.i*.concat_even: This is a simple vnsrl at double the element width

Not if you're at EEW=64.

I don't think vmmv.eqz.m.x and vmmv.nez.m.x are needed. You can already do seqz t0, t0; neg t0, t0; vmv.v.x v0, t0 (snez for the nez variant).

You'd generally have to set and reset LMUL as well if it's greater than 1. So in those cases, you're looking at a 5-opcode sequence for vmmv.eqz.m.x: the 3 opcodes for what you listed above, plus 2 more vsetvli opcodes to adjust LMUL.
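i.e. something like the following when the surrounding loop runs at, say, e32/m4 (a sketch; the concrete vtypes and a2 holding the saved AVL are assumptions):

vsetvli t1, zero, e8, m1, ta, ma   # drop to m1 so the splat only writes one register
seqz t0, t0
neg  t0, t0
vmv.v.x v0, t0
vsetvli zero, a2, e32, m4, ta, ma  # restore the loop's vl/vtype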

  • -1 strided load/store: This would work in your example above, but doesn't work if you need to reverse between non-load/store operations

This is true, but I've never seen a CPU that optimizes for specific runtime strides in its strided loads and stores.

  • vadd.vv vdst, vrid, vl. If you have a register spare, you can simply precompute vrid for VL=VLMAX using vid+vsub and offset that using vl in the loop.

This doesn't work when vl < VLMAX. You have to use the vl itself.

tail-assigning variants of vmv.x.s and friends

For the scalar LUT example, you can already do vsetivli x0, 1, ..., vrgather.vx v1, v2, t0, vmv.x.s t0, v1. But I can see how a tail move is useful, as it could remove a vtype change.

vrgather.vv is extremely expensive for large vl because its area-delay product grows as O(n^2): you need n muxes, and each of them selects between n inputs. For VLEN=128, LMUL=1 (e8), that is 16 16-to-1 muxes; for VLEN=2048, LMUL=8, it is 2048 2048-to-1 muxes. Conversely, a simple reversal is cheap and fully parallel.

@vogma

vogma commented Nov 22, 2025

Hi Olaf,

thanks for putting this gap analysis together. I really like the proposed “missing” instructions, especially vbcompress / vbexpand and the base64 encode/decode example you use to motivate them. Since Arm SVE2 already has bext / bdep, I was particularly curious what kind of speedup a vbexpand-style instruction actually buys in practice for base64 encoding.

So I took my SVE implementation of base64 encoding and changed how I build the lookup indices. The original version used 16-bit multiplications for shifting the indices ref, and the new one integrates a bdep-style operation for constructing them ref.

Running both versions on an AWS Graviton 4 CPU, I get the following timings:

| Implementation | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 | 65536 | 524288 | 2097152 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| base64sve_bdep | 0.0508 | 0.0529 | 0.0617 | 0.0776 | 0.1118 | 0.1967 | 0.3387 | 0.6198 | 1.1893 | 2.3455 | 4.7261 | 9.3389 | 77.8463 | 333.9653 |
| base64sve_shift | 0.0510 | 0.0527 | 0.0615 | 0.0780 | 0.1139 | 0.2054 | 0.3618 | 0.6737 | 1.2968 | 2.5360 | 5.0386 | 10.2053 | 84.0573 | 353.5420 |

The column headers give the input size in bytes; the numbers are times in milliseconds. As you can see, there isn't a dramatic improvement over the shift/multiply-based implementation.
In the RISC-V Vector SIG you mentioned that you have pseudocode for a base64 encoder using vbexpand (ref: https://lists.riscv.org/g/sig-vector/message/520). Do you have any performance results for that code on real hardware or in simulation?

I implemented the vbexpand-style version in SVE mainly to get a feel for how much improvement such an instruction might yield in this kind of workload. I’d be very interested in how my measurements line up with what you’ve seen or expect.

Thanks again
~Marco

EDIT: With your code changes the results look better:

| Implementation | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 | 65536 | 524288 | 2097152 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| base64sve_bdep | 0.0514 | 0.0522 | 0.0613 | 0.0748 | 0.1064 | 0.1835 | 0.3326 | 0.5777 | 1.0741 | 2.1023 | 4.1814 | 8.2810 | 69.1002 | 281.1501 |
| base64sve_shift | 0.0518 | 0.0527 | 0.0627 | 0.0792 | 0.1153 | 0.2055 | 0.3642 | 0.6732 | 1.2840 | 2.5238 | 4.9137 | 9.8511 | 80.5594 | 329.4242 |

@camel-cdr

camel-cdr commented Nov 28, 2025

@vogma

I’d be very interested in how my measurements line up with what you’ve seen or expect.
EDIT: With your code changes the results look better:

Thanks for running the benchmark.
These numbers are rather interesting; I expect my proposed version of the instructions to perform a bit better.

The Neoverse V2 has a 6-cycle latency and a reciprocal throughput of 2 cycles for BDEP.
In my proposal, the vbexpand.vx variant should be able to improve latency and throughput, especially for LMUL>1, because the mask only needs to be decoded once.
The decoding of the mask is the most expensive part. Once the mask is decoded, applying it to the butterfly networks can be completed in a single cycle.
An implementation may even be able to cache the last decoded vbexpand.vx mask.

Do you have any performance results for that code on real hardware or in simulation?

I haven't done measurements yet, but my plan is to evaluate it with RVV by replacing the new instruction with an existing instruction with the expected latency/throughput characteristics. This will produce the wrong result, but since the control flow isn't impacted, the performance should be roughly equivalent.

I'll probably put the code in https://github.com/camel-cdr/rvv-playground/

base64 was the easiest example, but I expect the biggest gains to come from parsing the LEB128/varint formats.

@dead-claudia

dead-claudia commented Nov 29, 2025

Edit: fix some mistakes (vsub -> vrsub and some copy/paste errors)

I took a step back and am now re-thinking some of https://gist.github.com/camel-cdr/99a41367d6529f390d25e36ca3e4b626?permalink_comment_id=5800299#gistcomment-5800299 and https://gist.github.com/camel-cdr/99a41367d6529f390d25e36ca3e4b626?permalink_comment_id=5816109#gistcomment-5816109.

  • vadd.vv vdst, vrid, vl. If you have a register spare, you can simply precompute vrid for VL=VLMAX using vid+vsub and offset that using vl in the loop.

This doesn't work when vl < VLMAX. You have to use the vl itself.

Sorry, that was misleading on my part. You're also right that it can be simulated, but not quite in the way you described. vrid.v vd is equivalent to the following code sequence:

# Without caching, for `vle*ff.v` loops
csrr rvl, vl # Usually available for free
addi rvl, rvl, -1
vid.v vd
vrsub.vx vd, vd, rvl # vd[i] = (vl-1) - i

# With caching, for all other loops
# Pre-loop, vl=vlmax
vid.v vt
vadd.vi vt, vt, 1 # vt[i] = i + 1

# In-loop, vl<=vlmax
csrr rvl, vl # Usually available for free
vrsub.vx vd, vt, rvl # vd[i] = vl - (i + 1) = (vl-1) - i

So, given that there is an efficient sequence to simulate vrid.v in just 1-2 opcodes in most loops, and that only two relatively cheap vector opcodes are needed in the general case, vrid.v doesn't technically need to exist.

As for reverse unit-stride loads, they can be efficiently emulated by pre-decrementing the load pointer and then reversing what you load.

# Forward load
vle8.v vd, (rs)
add rs, rs, rvl

# Reverse load
# To match the exact semantics of a -1 strided load from `rs`, add 1 to `rs` first
sub rs, rs, rvl
vle8.v vd, (rs)
vrev.v vd, vd

# Forward store
vse8.v vd, (rs)
add rs, rs, rvl

# Reverse store
# To match the exact semantics of a -1 strided store to `rs`, add 1 to `rs` first
vrev.v vd, vd
sub rs, rs, rvl
vse8.v vd, (rs)

So a vrev.v is all that's really needed in practice. Technically, this can be done with a vrgather.vv and the above vrid simulation:

# Without caching, for `vle*ff.v` loops
csrr rvl, vl
addi rvl, rvl, -1
vid.v vidx
vrsub.vx vidx, vidx, rvl  # vidx[i] = (vl-1) - i
vrgather.vv vd, vs1, vidx # vrgather's destination may not overlap its index source

# With caching, for all other loops
# Pre-loop, vl=vlmax
vid.v vt
vadd.vi vt, vt, 1 # vt[i] = i + 1

# In-loop, vl<=vlmax
csrr rvl, vl # Usually available for free
vrsub.vx vidx, vt, rvl # vidx[i] = vl - (i + 1) = (vl-1) - i
vrgather.vv vd, vs1, vidx

This in theory could be fused away into an efficient sequence, but it'd need to (awkwardly) carry state in the register renaming unit to do so:

  • csrr rd, vl; addi rd, rd, -1; vid.v vd would be fused into a single uop. Its rd needs to be tagged with an "is vl-1" bit.
  • vset*vl* rd, ... and csr* rd, vl need their rd tagged with an "is vl" bit.
  • vid.v vt; vadd.vi vt, vt, 1 would be fused into a single uop, tagging its destination with "is vid.v+1" bits for both vl=vlmax and vl<=vlmax as applicable.
  • vrsub.vx vidx, vs2, rs1; vrgather.vv vd, vs1, vidx, when vs2 has either "is vid.v+1" bit set and rs1 has its "is vl" bit set, can be fused into an element reversal.
  • vid.v vidx; vrsub.vx vidx, vidx, rs1; vrgather.vv vd, vs1, vidx, when rs1 has its "is vl-1" bit set, can be fused into an element reversal.

And while this seems simple, this isn't just a decoder thing. Any interrupt (say, to page in data) could potentially ruin this and force it back into the slow, general case, and so you'd have to, on mret and sret, recalculate the vl/vl-1 bits for all changed integer registers and the vid.v+1 bits for all changed vector registers. You can amortize this cost with memory renaming, but that still doesn't prevent the general case. (Fortunately, mret and sret cost enough cycles that you can still do this in parallel anyway, but it's still a lot of extra hardware cost, and it's awkward to design.)

So, a vrev.v is probably the best way to go. (It also doesn't require fusion.)

@dead-claudia

dead-claudia commented Nov 30, 2025

vrgather.vv is extremely expensive for large vl because its area-delay product grows as O(n^2): you need n muxes, and each of them selects between n inputs. For VLEN=128, LMUL=1 (e8), that is 16 16-to-1 muxes; for VLEN=2048, LMUL=8, it is 2048 2048-to-1 muxes. Conversely, a simple reversal is cheap and fully parallel.

To give a real-world example for how the latency of this balloons with larger vl: per this page, the Tenstorrent Ascalon can do two e8m1 or one e8m2 vrgather.vv in a single cycle, but takes about 8 cycles before it can do a second e8m4 and 30 cycles before it can do a second e8m8.

But looking back, I misread what you originally wrote:

For the scalar LUT example, you can already do vsetivli x0, 1, ..., vrgather.vx v1, v2, t0, vmv.x.s t0, v1.
But I can see how a tail move is useful, as it could remove a vtype change.

Per the same above link, two LMUL=1 vrgather.vx ops can be done in a single cycle, but 4 cycles are required for LMUL=8. Conversely, vmv.x.s can be done in one, regardless of LMUL.

My hope is to match tail-undisturbed vmv.s.x performance with vmv.t.x and general vmv.x.s performance with vmv.x.t, so loops like this are reasonably efficient:

# v1 = atomicAdd(v0 mask, inout int *a0=base, int v1=data)
atomicAdd_ufm_dyn:
    csrr a1, vl                       # a1 = number of lanes
    beqz a1, 2f
    li a2, 1                          # a2 = 1-based lane index
1:  vsetvli a3, x0, e32, m1, ta, ma
    vmv.x.s a3, v0
    andi a3, a3, 1                    # a3 = mask bit of the current lane
    vslide1down.m v0                  # advance to the next mask bit every iteration
    beqz a3, 0f                       # skip inactive lanes
    vsetvli x0, a2, e32, m1, ta, ma
    vmv.x.t a3, v1                    # read lane a2-1 of the data
    amoadd.w a3, a3, (a0)
    vmv.t.x v2, a3                    # store the old value into lane a2-1 of the result
0:  addi a2, a2, 1
    bgeu a1, a2, 1b                   # loop while a2 <= number of lanes
    vmv1r.v v1, v2
2:  ret

This loop can actually be accelerated a bit with existing instructions, though (and raises questions on whether slides are even useful for masks):

# v0, v1 = atomicAdd(v0 mask, inout int *a0=base, int v1=data)
atomicAdd_ufm_dyn:
    csrr a1, vl
    beqz a1, 2f
    vfirst.m a2, v0                   # a2 = index of the first active lane
    bltz a2, 0f
    vmmv.m v3, v0                     # working copy of the mask; v2 is reserved for the results
1:  vmsif.m v4, v3                    # v4 = lanes up to and including the next active one
    addi a2, a2, 1
    vmandn.mm v3, v3, v4              # clear the lane about to be processed
    vsetvli x0, a2, e32, m1, ta, ma
    vmv.x.t a2, v1                    # read lane a2-1 of the data
    amoadd.w a2, a2, (a0)
    vmv.t.x v2, a2                    # store the old value into lane a2-1 of the result
    vsetvli x0, a1, e32, m1, ta, ma
    vfirst.m a2, v3                   # next active lane, if any
    bgez a2, 1b
0:  vmv1r.v v1, v2
2:  ret

This loop isn't just a toy. It's what you'd need to implement GLES 3.1's atomicAdd when its first parameter is uniform but its second parameter isn't.
