@ehzawad
Created February 27, 2026 19:41

Research Summary

The current baseline is the temporal PARSeq-Small setup (embed_dim=384, enc_depth=12; ~23.8M-parameter model family).

Two approaches are proposed:

  1. Approach 1: PARSeq-Tiny transfer learning
  • Reuse Dimitri's temporal PARSeq modifications from the existing pipeline, but switch the model from PARSeq-Small to PARSeq-Tiny and retrain from the Hugging Face PARSeq-Tiny checkpoint.
  • PARSeq-Tiny should be sufficient because the target vocabulary is only digits (0-9) plus control token(s), unlike broader OCR character sets.
  • Unseen-number handling via digit-level recognition: supervision stays token-level (0-9 + EOS) rather than 100 jersey classes. The model learns digit identities and sequence order, so unseen combinations are handled compositionally.
  • PARSeq's autoregressive decoding and permutation-based training also model position in the sequence, so digit understanding is not hard-coded to one fixed two-digit class.
  • Example: if training contains 2, 3, and 23, inference can still produce 32 by predicting first token 3 and second token 2, even though 32 was absent from the training labels.
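The compositional behavior above can be sketched as a greedy token-level decode over 11 classes (digits 0-9 plus EOS). The logit values below are made up for illustration; the point is that the decoder's label space never contains "32" as a class, only its digits:

```python
DIGITS = "0123456789"
EOS = 10  # index of the end-of-sequence token

def greedy_decode(step_logits):
    """Greedy token-level decode over 11 classes (digits 0-9 + EOS)."""
    out = []
    for logits in step_logits:
        tok = max(range(len(logits)), key=logits.__getitem__)
        if tok == EOS:
            break
        out.append(DIGITS[tok])
    return "".join(out)

# Toy per-step logits for a jersey reading "32": the model only has to rank
# digit 3 first and digit 2 second; "32" itself is never a training class.
step1 = [0.0] * 11; step1[3] = 5.0
step2 = [0.0] * 11; step2[2] = 4.0
step3 = [0.0] * 11; step3[EOS] = 6.0
print(greedy_decode([step1, step2, step3]))  # → 32
```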
  2. Approach 2: Shared query-based digit head + gatekeeper
  • A Video ViT backbone provides temporal/global context.
  • A secondary length head (binary: single vs double digit) runs in parallel.
  • Instead of separate tens/ones heads, a single shared 10-class digit head is queried with a position token (pos=0 for the first digit, pos=1 for the second), appended to the temporal feature embedding.
  • Training uses pure classification losses (digit + length).
  • Unseen-number handling via digit-level recognition: the shared digit-classifier weights transfer knowledge across positions.
  • Inference first predicts length, then queries digit 1; digit 2 is queried only if the length head predicts double-digit, reusing temporal features where possible.
  • This supports unseen combinations: if query 0 predicts 3 and query 1 predicts 2, the output 32 is produced even when 32 was absent from the training labels.
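A minimal sketch of the gatekeeper wiring, assuming a flat temporal feature vector from the backbone. Dimensions and weights here are illustrative and untrained (random initialisation); the real model would use ViT features and learned position tokens:

```python
import random
random.seed(0)

D, P = 8, 2  # feature / position-token dims (illustrative, not the real sizes)

def rand_mat(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def linear(x, w, b):
    """Plain dense layer: y_i = sum_j w[i][j] * x[j] + b[i]."""
    return [sum(wij * xj for wij, xj in zip(wi, x)) + bi for wi, bi in zip(w, b)]

def argmax(v):
    return max(range(len(v)), key=v.__getitem__)

# Untrained, randomly initialised heads -- this only shows the wiring.
W_len, b_len = rand_mat(2, D), [0.0, 0.0]       # length head: single vs double
W_dig, b_dig = rand_mat(10, D + P), [0.0] * 10  # shared 10-class digit head
pos = [rand_mat(1, P)[0], rand_mat(1, P)[0]]    # position tokens for pos=0, pos=1

def predict_number(feat):
    """Gatekeeper decode: predict length, then query the shared digit head."""
    double = argmax(linear(feat, W_len, b_len)) == 1
    d0 = argmax(linear(feat + pos[0], W_dig, b_dig))  # query pos=0
    if not double:
        return str(d0)
    d1 = argmax(linear(feat + pos[1], W_dig, b_dig))  # query pos=1 only if needed
    return str(d0) + str(d1)

feat = [random.uniform(-1, 1) for _ in range(D)]
print(predict_number(feat))
```

Note the design choice: both queries hit the same `W_dig`, so digit knowledge learned in one position transfers to the other, and the second matrix-vector product is skipped entirely for single-digit predictions.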

Rough Computational Cost Comparison on GCP T4 (vs Current PARSeq Setup)

Context:

  • Current inference is not one-shot per tracklet: it creates multiple temporal sequences per tracklet, runs the model on each sequence, then aggregates with voting.
  • Current PARSeq setup uses a large ViT-style encoder with autoregressive decoding/refinement, which is compute-heavy.
  • PARSeq family reference point: ~23.8M params, ~3.255G FLOPs (single-sample compute reference).
| Model | Params (rough) | FLOPs (rough) | Typical T4 Speed (rough) | Relative vs Current PARSeq |
| --- | --- | --- | --- | --- |
| Current PARSeq baseline | ~23.8M | ~3.26G | ~12-25 ms/sequence (~40-80 seq/s) | 1.0x |
| Approach 1: PARSeq-Tiny transfer | ~6.0M | ~0.9-1.3G | ~7-14 ms/sequence (~70-140 seq/s) | ~1.7x-2.5x faster |
| Approach 2: Video ViT + shared query digit head + length head | ~5.5-7.0M | ~0.8-1.4G | ~5-11 ms/sequence (~90-180 seq/s) | ~2.3x-3.8x faster |

Notes:

  • Tracklet-level latency scales with number of temporal sequences evaluated per tracklet, so these per-sequence gains compound in full pipeline runtime.
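To make the compounding concrete, here is the arithmetic under an assumed sequence count (8 sequences/tracklet is a placeholder; the real number depends on the pipeline), using midpoints of the rough per-sequence ranges above:

```python
# Assumed count of temporal sequences per tracklet (pipeline-dependent).
SEQS_PER_TRACKLET = 8

# Midpoints of the rough per-sequence T4 latency ranges from the table above.
per_seq_ms = {
    "Current PARSeq baseline": 18.5,   # midpoint of ~12-25 ms
    "Approach 1 (PARSeq-Tiny)": 10.5,  # midpoint of ~7-14 ms
    "Approach 2 (query head)": 8.0,    # midpoint of ~5-11 ms
}

baseline = per_seq_ms["Current PARSeq baseline"] * SEQS_PER_TRACKLET
for name, ms in per_seq_ms.items():
    total = ms * SEQS_PER_TRACKLET
    print(f"{name}: {total:.0f} ms/tracklet, {baseline / total:.1f}x vs baseline")
```

With these midpoints the per-tracklet totals land at roughly 148 ms vs 84 ms vs 64 ms, i.e. the per-sequence speedups carry through linearly to tracklet latency.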

Final Recommendation (Next PoC)

For the next proof-of-concept, Approach 1 (PARSeq-Tiny transfer learning) should be prioritized.

Rationale:

  • It is the lowest-risk path: it preserves the current OCR formulation and most of the existing pipeline while reducing model size/compute.
  • It directly targets the current bottleneck (slow, compute-heavy inference) with a realistic speedup expectation on T4.
  • It keeps the same compositional digit-sequence behavior, so unseen two-digit combinations remain supported via digit-level prediction.

Practical hedge:

  • This recommendation assumes Approach 1 can deliver a meaningful latency reduction without unacceptable regression in full-number accuracy/unseen-combination behavior.
  • If those conditions are not met, Approach 2 should become the immediate follow-up PoC, since it has higher expected efficiency headroom by replacing autoregressive decoding with a shared query-based digit classifier + length gatekeeper.