Current baseline was the temporal PARSeq-Small (parseq) setup (e.g., embed_dim=384, enc_depth=12, ~23.8M-parameter model family).
Two approaches are proposed:
- Approach 1: PARSeq-Tiny transfer learning
- Dimitri's temporal PARSeq modifications were reused from the existing pipeline; the model was switched from PARSeq-Small to PARSeq-Tiny and retrained from the PARSeq-Tiny checkpoint on Hugging Face.
- PARSeq-Tiny should be sufficient because the target vocabulary is only digits (0-9) plus control token(s), unlike broader OCR character sets.
- Unseen-number handling via digit-level recognition: supervision remained token-level (0-9 + EOS), not 100 jersey classes. The model learned digit identities and sequence order, so unseen combinations were compositional.
- PARSeq also addresses position in sequence via autoregressive decoding and permutation-based training, so digit understanding is not hard-coded to one fixed two-digit class.
- Example: if training contained 2, 3, and 23, inference could still produce 32 by predicting first token 3 and second token 2, even when 32 was absent from training labels.
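The digits-only label scheme above can be sketched as a minimal tokenizer; the names and the EOS token id are illustrative assumptions, not the actual pipeline code:

```python
# Minimal digits-only tokenizer sketch for PARSeq-style training labels.
# Assumption: vocabulary is digits 0-9 plus EOS; names are illustrative.

EOS = 10  # end-of-sequence token id; digits map to ids 0-9

def encode(label: str) -> list[int]:
    """Map a jersey-number string (e.g. '32') to digit token ids + EOS."""
    return [int(ch) for ch in label] + [EOS]

def decode(token_ids: list[int]) -> str:
    """Map predicted token ids back to a jersey-number string, stopping at EOS."""
    digits = []
    for t in token_ids:
        if t == EOS:
            break
        digits.append(str(t))
    return "".join(digits)

# Compositional behavior: '32' decodes even if it never appeared in training,
# because supervision is per digit token, not per two-digit class.
print(decode([3, 2, EOS]))  # -> "32"
print(encode("7"))          # -> [7, 10]
```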
- Approach 2: Shared query-based digit head + gatekeeper
- A Video ViT backbone was used for temporal/global context.
- A secondary Length Head (binary: single vs double digit) operated in parallel.
- Instead of separate tens/ones heads, a single shared 10-class digit head was queried with a position token (pos=0 for the first digit, pos=1 for the second), appended to the temporal feature embedding.
- Training used pure classification losses (digit + length).
- Unseen-number handling via digit-level recognition: shared digit-classifier weights transferred knowledge across positions.
- Inference first predicted length, then queried digit-1; digit-2 was queried only if length predicted double-digit, reusing temporal features where possible.
- This supported unseen combinations: if query-0 predicted 3 and query-1 predicted 2, output 32 was produced even when 32 was absent from training labels.
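A minimal PyTorch sketch of the Approach 2 heads, assuming a 192-dim temporal feature from the Video ViT backbone; all dimensions and names are illustrative, not the real model:

```python
import torch
import torch.nn as nn

class QueryDigitHead(nn.Module):
    """Binary length head plus a single shared 10-class digit head queried
    with a learned position embedding appended to the temporal feature."""

    def __init__(self, feat_dim: int = 192):
        super().__init__()
        self.pos_embed = nn.Embedding(2, feat_dim)     # pos=0 first digit, pos=1 second
        self.length_head = nn.Linear(feat_dim, 2)      # single vs double digit
        self.digit_head = nn.Linear(2 * feat_dim, 10)  # shared across positions

    def query_digit(self, feat: torch.Tensor, pos: int) -> torch.Tensor:
        # Append (concatenate) the position token to the temporal feature.
        pos_tok = self.pos_embed(torch.full((feat.size(0),), pos, dtype=torch.long))
        return self.digit_head(torch.cat([feat, pos_tok], dim=-1))

    @torch.no_grad()
    def infer(self, feat: torch.Tensor) -> list[str]:
        """Gatekeeper inference: length first, then digit queries. This batched
        sketch queries both positions for simplicity; a latency-optimized path
        would skip the pos=1 query when the length head predicts single-digit."""
        is_double = self.length_head(feat).argmax(-1)  # (B,)
        d1 = self.query_digit(feat, 0).argmax(-1)      # (B,)
        d2 = self.query_digit(feat, 1).argmax(-1)      # (B,)
        return [f"{a}{b}" if dbl else f"{a}"
                for dbl, a, b in zip(is_double.tolist(), d1.tolist(), d2.tolist())]

head = QueryDigitHead()
feats = torch.randn(4, 192)  # stand-in for Video ViT temporal features
print(head.infer(feats))     # digit strings; arbitrary under untrained weights
```

Because `digit_head` is shared between pos=0 and pos=1, digit knowledge learned at one position transfers to the other, which is what makes unseen combinations reachable at inference.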
Context:
- Current inference is not one-shot per tracklet: it creates multiple temporal sequences per tracklet, runs the model on each sequence, then aggregates with voting.
- Current PARSeq setup uses a large ViT-style encoder with autoregressive decoding/refinement, which is compute-heavy.
- PARSeq family reference point: ~23.8M params, ~3.255G FLOPs (single-sample compute reference).
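The per-tracklet aggregation described above can be sketched as a majority vote over per-sequence predictions (function name is illustrative; the real pipeline's voting rule may differ):

```python
from collections import Counter

def aggregate_tracklet(sequence_preds: list[str]) -> str:
    """Majority vote over the model's per-sequence jersey-number predictions
    for one tracklet. Counter.most_common breaks ties by first-seen order."""
    return Counter(sequence_preds).most_common(1)[0][0]

print(aggregate_tracklet(["23", "23", "32", "23"]))  # -> "23"
```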
| Model | Params (rough) | FLOPs (rough) | Typical T4 Speed (rough) | Relative vs Current PARSeq |
|---|---|---|---|---|
| Current PARSeq baseline | ~23.8M | ~3.26G | ~12-25 ms/sequence (~40-80 seq/s) | 1.0x |
| Approach 1: PARSeq-Tiny transfer | ~6.0M | ~0.9-1.3G | ~7-14 ms/sequence (~70-140 seq/s) | ~1.7x-2.5x faster |
| Approach 2: Video ViT + shared query digit head + length head | ~5.5-7.0M | ~0.8-1.4G | ~5-11 ms/sequence (~90-180 seq/s) | ~2.3x-3.8x faster |
Notes:
- Tracklet-level latency scales with number of temporal sequences evaluated per tracklet, so these per-sequence gains compound in full pipeline runtime.
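The compounding effect can be illustrated with back-of-envelope arithmetic using mid-range per-sequence latencies from the table above (rough T4 numbers, and an assumed 10 sequences per tracklet):

```python
def tracklet_latency_ms(per_seq_ms: float, n_sequences: int) -> float:
    """Rough tracklet-level latency: per-sequence latency compounds linearly
    with the number of temporal sequences evaluated (ignores batching overlap)."""
    return per_seq_ms * n_sequences

# Illustrative mid-range values from the comparison table:
baseline = tracklet_latency_ms(18.0, 10)  # current PARSeq, ~18 ms/seq
tiny     = tracklet_latency_ms(10.0, 10)  # Approach 1, ~10 ms/seq
print(baseline, tiny)  # -> 180.0 100.0
```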
For the next proof-of-concept, Approach 1 (PARSeq-Tiny transfer learning) should be prioritized.
Rationale:
- It is the lowest-risk path: it preserves the current OCR formulation and most of the existing pipeline while reducing model size/compute.
- It directly targets the current bottleneck (slow, compute-heavy inference) with a realistic speedup expectation on T4.
- It keeps the same compositional digit-sequence behavior, so unseen two-digit combinations remain supported via digit-level prediction.
Practical hedge:
- This recommendation assumes Approach 1 can deliver a meaningful latency reduction without unacceptable regression in full-number accuracy/unseen-combination behavior.
- If those conditions are not met, Approach 2 should become the immediate follow-up PoC, since it has higher expected efficiency headroom by replacing autoregressive decoding with a shared query-based digit classifier + length gatekeeper.