This directory contains the pipeline for converting MobileCLIP2-S4 from PyTorch to Core ML for semantic search in Screenshot Manager.
```bash
cd embeddings
./setup.sh                      # Create venv and install deps
source venv/bin/activate
python scripts/convert_all.py   # Run full conversion
```

```
models/
├── MobileCLIP2-S4-Image.mlpackage   # 614 MB - Image encoder
├── MobileCLIP2-S4-Text.mlpackage    # 236 MB - Text encoder
├── tokenizer.json                   # 1.9 MB - BPE tokenizer
├── metadata.json                    # Swift integration info
└── mobileclip2_s4.pt                # 1.7 GB - Original checkpoint (input)
```
| Image Encoder Property | Value |
|---|---|
| Input | 256×256 RGB image |
| Output | 768-dim L2-normalized embedding |
| Preprocessing | CLIP normalization baked in |
| Precision | FLOAT16 |
| Text Encoder Property | Value |
|---|---|
| Input | Token IDs (1, 77) int32 |
| Output | 768-dim L2-normalized embedding |
| Vocab Size | 49,408 tokens |
| Special Tokens | SOT=49406, EOT=49407 |
| Precision | FLOAT16 |
Both encoders output L2-normalized vectors, so similarity is a simple dot product:
```swift
let similarity = zip(imageEmbed, textEmbed).map(*).reduce(0, +)
// Range: -1 to 1, higher = more similar
```

Converting MobileCLIP2-S4 to Core ML required solving several non-trivial problems. This section documents what we encountered and how we fixed it.
Problem: The image encoder uses FastViT with an Attention class that
extracts tensor dimensions dynamically:
```python
# In timm/models/fastvit.py line 545
def forward(self, x):
    B, C, H, W = x.shape  # ❌ Dynamic - breaks coremltools
    N = H * W
    ...
```

coremltools traces the PyTorch graph statically and cannot handle operations where shapes are extracted at runtime.
Error:

```
TypeError: only 0-dimensional arrays can be converted to Python scalars
Location: visual/trunk/3/blocks/0/token_mixer
```
Solution: Monkeypatch Attention.forward() with hardcoded spatial
dimensions before tracing:
```python
def create_fixed_attention_forward(height: int, width: int):
    def fixed_forward(self, x):
        B = x.shape[0]
        C = x.shape[1]
        H = height  # Hardcoded: 8 for stage 3, 4 for stage 4
        W = width
        ...
    return fixed_forward
```

The spatial dimensions at each attention stage are deterministic based on the input size:
- Input: 256×256
- After stem + stages 0-2: 8×8 feature maps at stage 3
- After stage 3 downsample: 4×4 feature maps at stage 4
See scripts/convert_fixed.py for the full implementation.
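For illustration, here is a minimal sketch of how the patched forward could be bound to each attention module before tracing; the timm import and the stage/module layout are assumptions based on the error path shown above, not the script's exact code:

```python
import types
from timm.models.fastvit import Attention

# Assumed layout: module names under the trunk look like "3.blocks.0.token_mixer",
# where the leading component is the stage index (inferred from the error path).
STAGE_DIMS = {3: (8, 8), 4: (4, 4)}  # stage -> (H, W) for a 256x256 input

def patch_attention(model):
    for name, module in model.visual.trunk.named_modules():
        if isinstance(module, Attention):
            stage = int(name.split(".")[0])
            h, w = STAGE_DIMS[stage]
            # Bind a forward with hardcoded H, W to this specific module instance
            module.forward = types.MethodType(create_fixed_attention_forward(h, w), module)
    return model
```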
Problem: The text encoder uses nn.MultiheadAttention, which PyTorch lowers to the fused _native_multi_head_attention op, an optimized kernel that coremltools cannot convert.
Error:

```
RuntimeError: PyTorch convert function for op '_native_multi_head_attention' not implemented.
```
Solution: Replace all nn.MultiheadAttention modules with a manual
implementation using basic ops:
```python
class ManualMultiheadAttention(nn.Module):
    def forward(self, query, key, value, attn_mask=None):
        B, seq_len, _ = query.shape
        # QKV projection (self-attention: query == key == value)
        qkv = F.linear(query, self.in_proj_weight, self.in_proj_bias)
        q, k, v = qkv.chunk(3, dim=-1)
        # Reshape for multi-head: (B, seq_len, E) -> (B, num_heads, seq_len, head_dim)
        q = q.view(B, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention
        attn = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        if attn_mask is not None:
            attn = attn + attn_mask
        attn = F.softmax(attn, dim=-1)
        out = torch.matmul(attn, v)
        # Merge heads and apply the output projection
        out = out.transpose(1, 2).reshape(B, seq_len, self.num_heads * self.head_dim)
        return self.out_proj(out)
```

We copy weights from the original nn.MultiheadAttention to preserve model accuracy.
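A sketch of that weight copy, assuming ManualMultiheadAttention registers its parameters under the names used in the forward pass above (its constructor signature here is hypothetical):

```python
import torch
import torch.nn as nn

def replace_mha(module: nn.Module) -> None:
    """Recursively swap nn.MultiheadAttention for ManualMultiheadAttention,
    copying the fused QKV and output-projection weights verbatim."""
    for name, child in module.named_children():
        if isinstance(child, nn.MultiheadAttention):
            manual = ManualMultiheadAttention(child.embed_dim, child.num_heads)  # assumed constructor
            with torch.no_grad():
                manual.in_proj_weight.copy_(child.in_proj_weight)
                manual.in_proj_bias.copy_(child.in_proj_bias)
                manual.out_proj.weight.copy_(child.out_proj.weight)
                manual.out_proj.bias.copy_(child.out_proj.bias)
            setattr(module, name, manual)
        else:
            replace_mha(child)
```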
Problem: CLIP text encoders extract features at the EOT (end-of-text) token position:
```python
# Original code
eot_idx = text.argmax(dim=-1)        # Find EOT position
x = x[torch.arange(B), eot_idx]      # ❌ Dynamic indexing
```

The argmax followed by advanced indexing creates a dynamic graph that can't be traced.
Solution: Use mask multiplication instead of indexing:
```python
# Fixed code
eot_mask = (input_ids == EOT_TOKEN_ID).float()  # (1, 77)
eot_mask = eot_mask.unsqueeze(-1)               # (1, 77, 1)
x = (x * eot_mask).sum(dim=1)                   # Extract via masked sum
```

Since there's exactly one EOT token per sequence, the masked sum is equivalent to indexing.
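A quick, illustrative sanity check of that equivalence (not part of the conversion pipeline):

```python
import torch

# One EOT token per sequence: masked sum == dynamic indexing.
B, L, D, EOT_TOKEN_ID = 1, 77, 768, 49407
input_ids = torch.randint(0, 49406, (B, L))
input_ids[0, 10] = EOT_TOKEN_ID                 # single EOT somewhere in the sequence
x = torch.randn(B, L, D)

indexed = x[torch.arange(B), input_ids.argmax(dim=-1)]
masked = (x * (input_ids == EOT_TOKEN_ID).unsqueeze(-1).float()).sum(dim=1)
assert torch.allclose(indexed, masked)
```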
Problem: coremltools 8.0 installed from source distribution was missing native ARM64 libraries:
```
WARNING: Fail to import BlobWriter from libmilstoragepython
Core ML conversion failed: BlobWriter not loaded
```
Solution: Use coremltools 8.3.0+ which provides pre-built wheels with native libraries:
```bash
pip install "coremltools>=8.3.0"
```

The wheel filename should include the platform:

```
coremltools-8.3.0-cp312-none-macosx_11_0_arm64.whl
```
Problem: MobileCLIP uses MobileOne blocks with multi-branch architecture during training. Without reparameterization, the model is slow and may not run on ANE.
Training architecture:
Input → [3×3 Conv + 1×1 Conv + Identity] → Sum → Output
After reparameterization:
Input → Single Fused Conv → Output
Solution: Call reparameterize_model() before tracing:
```python
from mobileclip.modules.common.mobileone import reparameterize_model

model = reparameterize_model(model)
```

This algebraically fuses the parallel branches into a single convolution with equivalent weights.
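To see why the fusion is exact, here is an illustrative single-layer example; biases and the per-branch BatchNorm folding that reparameterize_model() also performs are omitted:

```python
import torch
import torch.nn.functional as F

C = 8
w3 = torch.randn(C, C, 3, 3)   # 3x3 branch
w1 = torch.randn(C, C, 1, 1)   # 1x1 branch

w1_padded = F.pad(w1, [1, 1, 1, 1])                # 1x1 kernel centered in a 3x3 kernel
w_id = torch.zeros(C, C, 3, 3)
w_id[torch.arange(C), torch.arange(C), 1, 1] = 1   # identity branch as a Dirac kernel

w_fused = w3 + w1_padded + w_id                    # convolution is linear in the kernel

x = torch.randn(1, C, 16, 16)
branched = F.conv2d(x, w3, padding=1) + F.conv2d(x, w1) + x
fused = F.conv2d(x, w_fused, padding=1)
assert torch.allclose(branched, fused, atol=1e-5)
```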
```
coremltools>=8.3.0       # Must be 8.3+ for ARM64 native libs
torch==2.3.1             # Pinned for compatibility
torchvision==0.18.1
huggingface_hub>=0.19.0
open-clip-torch          # Provides model loading
ml-mobileclip            # Apple's MobileCLIP (has reparameterize_model)
```
- macOS 13.0+ (Ventura) for ML Program format
- Apple Silicon recommended for ANE testing
- ~4GB disk space for models and checkpoint
Converts the image encoder with fixes for dynamic shapes.
Key steps:
- Load MobileCLIP2-S4 checkpoint
- Monkeypatch Attention modules with fixed H, W
- Reparameterize MobileOne blocks
- Wrap with L2 normalization
- Trace with `torch.jit.trace`
- Convert to Core ML with FLOAT16 precision (see the sketch below)
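Roughly, the trace-and-convert step boils down to a single ct.convert call. A minimal sketch, with placeholder names; `image_input` stands for the ct.ImageType shown in the next snippet, and the exact arguments live in the conversion script:

```python
import coremltools as ct

mlmodel = ct.convert(
    traced_image_encoder,                      # result of torch.jit.trace
    inputs=[image_input],                      # ct.ImageType with baked-in preprocessing
    outputs=[ct.TensorType(name="embedding")], # assumed output name
    compute_precision=ct.precision.FLOAT16,
    minimum_deployment_target=ct.target.macOS13,
    convert_to="mlprogram",
)
mlmodel.save("models/MobileCLIP2-S4-Image.mlpackage")
```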
Preprocessing baked into model:
```python
ct.ImageType(
    scale=1.0 / 255.0,
    bias=[-mean[i] / std[i] for i in range(3)],
    color_layout=ct.colorlayout.RGB,
)
```

Converts the text encoder and exports the BPE tokenizer.
Key steps:
- Load MobileCLIP2-S4 checkpoint
- Export tokenizer vocab and merges to JSON (sketched below)
- Replace nn.MultiheadAttention with manual implementation
- Use mask-based EOT extraction
- Trace and convert to Core ML
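A rough sketch of the tokenizer export, assuming open_clip's SimpleTokenizer exposes the same encoder, byte_encoder, and bpe_ranks attributes as the original CLIP tokenizer; the actual export lives in the text conversion script:

```python
import json
from open_clip.tokenizer import SimpleTokenizer

tok = SimpleTokenizer()
payload = {
    "vocab": tok.encoder,                                   # token -> id
    "merges": [" ".join(pair) for pair in tok.bpe_ranks],   # BPE merge rules in rank order
    "byte_encoder": tok.byte_encoder,                       # byte -> unicode mapping
    "special_tokens": {"sot": 49406, "eot": 49407},
    "context_length": 77,
}
with open("models/tokenizer.json", "w") as f:
    json.dump(payload, f)
```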
Tokenizer output (tokenizer.json):
```json
{
  "vocab": {"!": 0, "\"": 1, ...},
  "merges": ["i n", "t h", "a n", ...],
  "byte_encoder": {...},
  "special_tokens": {"sot": 49406, "eot": 49407},
  "context_length": 77
}
```

Orchestrates the complete conversion:

```bash
python scripts/convert_all.py
```

Runs download → image conversion → text conversion → metadata generation.
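For orientation, a rough sketch of what such an orchestrator can look like; the step script names and metadata keys below are placeholders, not the repository's actual file layout:

```python
import json
import subprocess
import sys

STEPS = [
    "scripts/download_checkpoint.py",   # fetch mobileclip2_s4.pt (placeholder name)
    "scripts/convert_image.py",         # image encoder (placeholder name)
    "scripts/convert_text.py",          # text encoder + tokenizer (placeholder name)
]

for step in STEPS:
    subprocess.run([sys.executable, step], check=True)

# Write Swift integration metadata last (keys are illustrative).
metadata = {"embedding_dim": 768, "context_length": 77, "image_size": 256}
with open("models/metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```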
```swift
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .all  // Prefer ANE

let imageEncoder = try MobileCLIP2_S4_Image(configuration: config)
let textEncoder = try MobileCLIP2_S4_Text(configuration: config)
```

```swift
import Vision

func encodeImage(_ cgImage: CGImage) throws -> [Float] {
    let model = try VNCoreMLModel(for: imageEncoder.model)
    let request = VNCoreMLRequest(model: model)
    let handler = VNImageRequestHandler(cgImage: cgImage)
    try handler.perform([request])

    guard let result = request.results?.first as? VNCoreMLFeatureValueObservation,
          let array = result.featureValue.multiArrayValue else {
        throw EncodingError.failed
    }

    // Convert MLMultiArray to [Float]
    return (0..<768).map { Float(truncating: array[$0]) }
}
```

Requires implementing BPE tokenization using tokenizer.json:
```swift
func encodeText(_ text: String) throws -> [Float] {
    // 1. Tokenize (implement BPE using tokenizer.json)
    let tokens = tokenize(text)  // -> [Int32] of length 77

    // 2. Create MLMultiArray input
    let input = try MLMultiArray(shape: [1, 77], dataType: .int32)
    for (i, token) in tokens.enumerated() {
        input[i] = NSNumber(value: token)
    }

    // 3. Run inference
    let output = try textEncoder.prediction(input_ids: input)

    // 4. Extract embedding
    return (0..<768).map { Float(truncating: output.embedding[$0]) }
}
```

```swift
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    // Vectors are already L2-normalized, so dot product = cosine similarity
    zip(a, b).map(*).reduce(0, +)
}

// Find top matches
func search(query: String, screenshots: [(id: Int, embedding: [Float])]) throws -> [Int] {
    let queryEmbed = try encodeText(query)
    return screenshots
        .map { (id: $0.id, score: cosineSimilarity(queryEmbed, $0.embedding)) }
        .sorted { $0.score > $1.score }
        .prefix(10)
        .map { $0.id }
}
```

If you hit the BlobWriter error described above, reinstall coremltools to pick up the pre-built wheel:

```bash
pip uninstall coremltools
pip install "coremltools>=8.3.0"   # Get pre-built wheel
```

For other conversion failures, check for dynamic operations in your model. Common culprits:

- `x.shape[i]` followed by arithmetic
- `torch.arange()` with tensor arguments
- `argmax()` used for indexing
Measured on Apple M1 Pro:
| Operation | Time |
|---|---|
| Image encoding | ~15ms |
| Text encoding | ~5ms |
| Similarity (768-dim) | <0.1ms |
Memory usage: ~850MB for both models loaded.