Current baseline was the temporal PARSeq-Small (parseq) setup (e.g., embed_dim=384, enc_depth=12, ~23.8M-parameter model family).
Two approaches are proposed:
- Approach 1: PARSeq-Tiny transfer learning
- Dimitri's temporal PARSeq modifications were reused from the existing pipeline; the model was switched from PARSeq-Small to PARSeq-Tiny and retrained from the PARSeq-Tiny checkpoint on Hugging Face.
- PARSeq-Tiny should be sufficient because the target vocabulary is only digits (0-9) plus control token(s), unlike broader OCR character sets.
- Unseen-number handling via digit-level recognition: supervision remained token-level (0-9 + EOS), not 100 jersey classes. The model learned digit identities and sequence order, so unseen combinations were compositional.
- PARSeq also addresses position in sequence via autoregressive decoding and permutation-based training, so digit understanding is not hard-coded to one fixed two-digit class.
- Example: if training contained 2, 3, and 23, inference could still produce 32 by predicting first token 3 and second token 2, even when 32 was absent from training labels.
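The digits-only label scheme above can be sketched as a minimal tokenizer; the names and the EOS token id are illustrative assumptions, not the actual pipeline code:

```python
# Minimal digits-only tokenizer sketch for PARSeq-style training labels.
# Assumption: vocabulary is digits 0-9 plus EOS; names are illustrative.

EOS = 10  # end-of-sequence token id; digits map to ids 0-9

def encode(label: str) -> list[int]:
    """Map a jersey-number string (e.g. '32') to digit token ids + EOS."""
    return [int(ch) for ch in label] + [EOS]

def decode(token_ids: list[int]) -> str:
    """Map predicted token ids back to a jersey-number string, stopping at EOS."""
    digits = []
    for t in token_ids:
        if t == EOS:
            break
        digits.append(str(t))
    return "".join(digits)

# Compositional behavior: '32' decodes even if it never appeared in training,
# because supervision is per digit token, not per two-digit class.
print(decode([3, 2, EOS]))  # -> "32"
print(encode("7"))          # -> [7, 10]
```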
- Approach 2: Shared query-based digit head + gatekeeper
- A Video ViT backbone was used for temporal/global context.
- A secondary Length Head (binary: single vs double digit) operated in parallel.
- Instead of separate tens/ones heads, a single shared 10-class digit head was queried with a position token (pos=0 for the first digit, pos=1 for the second), appended to the temporal feature embedding.
- Training used pure classification losses (digit + length).
- Unseen-number handling via digit-level recognition: shared digit-classifier weights transferred knowledge across positions.
- Inference first predicted length, then queried digit-1; digit-2 was queried only if length predicted double-digit, reusing temporal features where possible.
- This supported unseen combinations: if query-0 predicted 3 and query-1 predicted 2, output 32 was produced even when 32 was absent from training labels.
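A minimal PyTorch sketch of the Approach 2 heads, assuming a 192-dim temporal feature from the Video ViT backbone; all dimensions and names are illustrative, not the real model:

```python
import torch
import torch.nn as nn

class QueryDigitHead(nn.Module):
    """Binary length head plus a single shared 10-class digit head queried
    with a learned position embedding appended to the temporal feature."""

    def __init__(self, feat_dim: int = 192):
        super().__init__()
        self.pos_embed = nn.Embedding(2, feat_dim)     # pos=0 first digit, pos=1 second
        self.length_head = nn.Linear(feat_dim, 2)      # single vs double digit
        self.digit_head = nn.Linear(2 * feat_dim, 10)  # shared across positions

    def query_digit(self, feat: torch.Tensor, pos: int) -> torch.Tensor:
        # Append (concatenate) the position token to the temporal feature.
        pos_tok = self.pos_embed(torch.full((feat.size(0),), pos, dtype=torch.long))
        return self.digit_head(torch.cat([feat, pos_tok], dim=-1))

    @torch.no_grad()
    def infer(self, feat: torch.Tensor) -> list[str]:
        """Gatekeeper inference: length first, then digit queries. This batched
        sketch queries both positions for simplicity; a latency-optimized path
        would skip the pos=1 query when the length head predicts single-digit."""
        is_double = self.length_head(feat).argmax(-1)  # (B,)
        d1 = self.query_digit(feat, 0).argmax(-1)      # (B,)
        d2 = self.query_digit(feat, 1).argmax(-1)      # (B,)
        return [f"{a}{b}" if dbl else f"{a}"
                for dbl, a, b in zip(is_double.tolist(), d1.tolist(), d2.tolist())]

head = QueryDigitHead()
feats = torch.randn(4, 192)  # stand-in for Video ViT temporal features
print(head.infer(feats))     # digit strings; arbitrary under untrained weights
```

Because `digit_head` is shared between pos=0 and pos=1, digit knowledge learned at one position transfers to the other, which is what makes unseen combinations reachable at inference.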
Context:
- Current inference is not one-shot per tracklet: it creates multiple temporal sequences per tracklet, runs the model on each sequence, then aggregates with voting.
- Current PARSeq setup uses a large ViT-style encoder with autoregressive decoding/refinement, which is compute-heavy.
- PARSeq family reference point: ~23.8M params, ~3.255G FLOPs (single-sample compute reference).
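The per-tracklet aggregation described above can be sketched as a majority vote over per-sequence predictions (function name is illustrative; the real pipeline's voting rule may differ):

```python
from collections import Counter

def aggregate_tracklet(sequence_preds: list[str]) -> str:
    """Majority vote over the model's per-sequence jersey-number predictions
    for one tracklet. Counter.most_common breaks ties by first-seen order."""
    return Counter(sequence_preds).most_common(1)[0][0]

print(aggregate_tracklet(["23", "23", "32", "23"]))  # -> "23"
```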
| Model | Params (rough) | FLOPs (rough) | Typical T4 Speed (rough) | Relative vs Current PARSeq |
|---|---|---|---|---|
| Current PARSeq baseline | ~23.8M | ~3.26G | ~12-25 ms/sequence (~40-80 seq/s) | 1.0x |
| Approach 1: PARSeq-Tiny transfer | ~6.0M | ~0.9-1.3G | ~7-14 ms/sequence (~70-140 seq/s) | ~1.7x-2.5x faster |
| Approach 2: Video ViT + shared query digit head + length head | ~5.5-7.0M | ~0.8-1.4G | ~5-11 ms/sequence (~90-180 seq/s) | ~2.3x-3.8x faster |
Notes:
- Tracklet-level latency scales with number of temporal sequences evaluated per tracklet, so these per-sequence gains compound in full pipeline runtime.
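The compounding effect can be illustrated with back-of-envelope arithmetic using mid-range per-sequence latencies from the table above (rough T4 numbers, and an assumed 10 sequences per tracklet):

```python
def tracklet_latency_ms(per_seq_ms: float, n_sequences: int) -> float:
    """Rough tracklet-level latency: per-sequence latency compounds linearly
    with the number of temporal sequences evaluated (ignores batching overlap)."""
    return per_seq_ms * n_sequences

# Illustrative mid-range values from the comparison table:
baseline = tracklet_latency_ms(18.0, 10)  # current PARSeq, ~18 ms/seq
tiny     = tracklet_latency_ms(10.0, 10)  # Approach 1, ~10 ms/seq
print(baseline, tiny)  # -> 180.0 100.0
```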
For the next proof-of-concept, Approach 1 (PARSeq-Tiny transfer learning) should be prioritized.
Rationale:
- It is the lowest-risk path: it preserves the current OCR formulation and most of the existing pipeline while reducing model size/compute.
- It directly targets the current bottleneck (slow, compute-heavy inference) with a realistic speedup expectation on T4.
- It keeps the same compositional digit-sequence behavior, so unseen two-digit combinations remain supported via digit-level prediction.
Practical hedge:
- This recommendation assumes Approach 1 can deliver a meaningful latency reduction without unacceptable regression in full-number accuracy/unseen-combination behavior.
- If those conditions are not met, Approach 2 should become the immediate follow-up PoC, since it has higher expected efficiency headroom by replacing autoregressive decoding with a shared query-based digit classifier + length gatekeeper.