MobileCLIP2-S4 Core ML Conversion

This directory contains the pipeline for converting MobileCLIP2-S4 from PyTorch to Core ML for semantic search in Screenshot Manager.

Quick Start

cd embeddings
./setup.sh                    # Create venv and install deps
source venv/bin/activate
python scripts/convert_all.py # Run full conversion

Output Files

models/
├── MobileCLIP2-S4-Image.mlpackage   # 614 MB - Image encoder
├── MobileCLIP2-S4-Text.mlpackage    # 236 MB - Text encoder
├── tokenizer.json                   # 1.9 MB - BPE tokenizer
├── metadata.json                    # Swift integration info
└── mobileclip2_s4.pt                # 1.7 GB - Original checkpoint (input)

Model Specifications

Image Encoder

Property        Value
Input           256×256 RGB image
Output          768-dim L2-normalized embedding
Preprocessing   CLIP normalization baked in
Precision       FLOAT16

Text Encoder

Property         Value
Input            Token IDs (1, 77) int32
Output           768-dim L2-normalized embedding
Vocab Size       49,408 tokens
Special Tokens   SOT=49406, EOT=49407
Precision        FLOAT16

Similarity Computation

Both encoders output L2-normalized vectors, so similarity is a simple dot product:

let similarity = zip(imageEmbed, textEmbed).map(*).reduce(0, +)
// Range: -1 to 1, higher = more similar

Conversion Challenges & Solutions

Converting MobileCLIP2-S4 to Core ML required solving several non-trivial problems. This section documents what we encountered and how we fixed it.

Challenge 1: Dynamic Shapes in FastViT Attention

Problem: The image encoder uses FastViT with an Attention class that extracts tensor dimensions dynamically:

# In timm/models/fastvit.py line 545
def forward(self, x):
    B, C, H, W = x.shape  # ❌ Dynamic - breaks coremltools
    N = H * W
    ...

CoreMLTools traces the PyTorch graph statically and cannot handle operations where shapes are extracted at runtime.

Error:

TypeError: only 0-dimensional arrays can be converted to Python scalars
Location: visual/trunk/3/blocks/0/token_mixer

Solution: Monkeypatch Attention.forward() with hardcoded spatial dimensions before tracing:

def create_fixed_attention_forward(height: int, width: int):
    def fixed_forward(self, x):
        B = x.shape[0]
        C = x.shape[1]
        H = height  # Hardcoded: 8 for stage 3, 4 for stage 4
        W = width
        ...

The spatial dimensions at each attention stage are deterministic based on the input size:

  • Input: 256×256
  • After stem + stages 0-2: 8×8 feature maps at stage 3
  • After stage 3 downsample: 4×4 feature maps at stage 4

See scripts/convert_fixed.py for the full implementation.
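
For illustration, a minimal sketch of how the patch might be attached before tracing. The stage-name prefixes, the patch_attention helper, and the image_encoder variable are assumptions made for this sketch; scripts/convert_fixed.py is the authoritative version:

from timm.models.fastvit import Attention

def patch_attention(model, stage_dims):
    # stage_dims maps module-name prefixes to fixed (H, W), e.g.
    # {"trunk.3": (8, 8), "trunk.4": (4, 4)} for a 256×256 input (illustrative).
    for name, module in model.named_modules():
        if isinstance(module, Attention):
            for prefix, (h, w) in stage_dims.items():
                if name.startswith(prefix):
                    # Bind the static-shape forward to this instance only
                    module.forward = create_fixed_attention_forward(h, w).__get__(module)

patch_attention(image_encoder, {"trunk.3": (8, 8), "trunk.4": (4, 4)})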

Challenge 2: Unsupported Native Multi-Head Attention

Problem: The text encoder uses nn.MultiheadAttention, which PyTorch lowers to the fused _native_multi_head_attention operator during tracing, and coremltools has no conversion rule for that op.

Error:

RuntimeError: PyTorch convert function for op '_native_multi_head_attention' not implemented.

Solution: Replace all nn.MultiheadAttention modules with a manual implementation using basic ops:

class ManualMultiheadAttention(nn.Module):
    def forward(self, query, key, value, attn_mask=None):
        # QKV projection
        qkv = F.linear(query, self.in_proj_weight, self.in_proj_bias)
        q, k, v = qkv.chunk(3, dim=-1)

        # Reshape for multi-head
        q = q.view(B, seq_len, num_heads, head_dim).transpose(1, 2)
        ...

        # Scaled dot-product attention
        attn = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        attn = F.softmax(attn, dim=-1)
        out = torch.matmul(attn, v)
        ...

We copy weights from the original nn.MultiheadAttention to preserve model accuracy.
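
A sketch of that weight transfer, assuming ManualMultiheadAttention registers its parameters under the same names as nn.MultiheadAttention (the helper name is hypothetical; convert_text.py holds the real code):

import torch
import torch.nn as nn

def copy_mha_weights(src: nn.MultiheadAttention, dst: nn.Module):
    with torch.no_grad():
        dst.in_proj_weight.copy_(src.in_proj_weight)    # fused QKV weight
        dst.in_proj_bias.copy_(src.in_proj_bias)        # fused QKV bias
        dst.out_proj.weight.copy_(src.out_proj.weight)  # output projection
        dst.out_proj.bias.copy_(src.out_proj.bias)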

Challenge 3: Dynamic EOT Token Indexing

Problem: CLIP text encoders extract features at the EOT (end-of-text) token position:

# Original code
eot_idx = text.argmax(dim=-1)  # Find EOT position
x = x[torch.arange(B), eot_idx]  # ❌ Dynamic indexing

The argmax followed by advanced indexing creates a dynamic graph that can't be traced.

Solution: Use mask multiplication instead of indexing:

# Fixed code
eot_mask = (input_ids == EOT_TOKEN_ID).float()  # (1, 77)
eot_mask = eot_mask.unsqueeze(-1)                # (1, 77, 1)
x = (x * eot_mask).sum(dim=1)                    # Extract via masked sum

Since there's exactly one EOT token per sequence, the masked sum is equivalent to indexing.
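
A quick standalone check of that equivalence (not part of the conversion scripts; shapes match the text encoder's (1, 77, 768) output):

import torch

EOT_TOKEN_ID = 49407
x = torch.randn(1, 77, 768)                        # transformer output
input_ids = torch.zeros(1, 77, dtype=torch.long)
input_ids[0, 5] = EOT_TOKEN_ID                     # exactly one EOT token

indexed = x[torch.arange(1), input_ids.argmax(dim=-1)]    # original dynamic indexing
eot_mask = (input_ids == EOT_TOKEN_ID).float().unsqueeze(-1)
masked = (x * eot_mask).sum(dim=1)                        # trace-friendly masked sum
assert torch.allclose(indexed, masked)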

Challenge 4: Missing Native Libraries in coremltools

Problem: coremltools 8.0 installed from source distribution was missing native ARM64 libraries:

WARNING: Fail to import BlobWriter from libmilstoragepython
Core ML conversion failed: BlobWriter not loaded

Solution: Use coremltools 8.3.0+ which provides pre-built wheels with native libraries:

pip install "coremltools>=8.3.0"

The wheel filename should include the platform: coremltools-8.3.0-cp312-none-macosx_11_0_arm64.whl

Challenge 5: MobileOne Reparameterization

Problem: MobileCLIP uses MobileOne blocks with multi-branch architecture during training. Without reparameterization, the model is slow and may not run on ANE.

Training architecture:

Input → [3×3 Conv + 1×1 Conv + Identity] → Sum → Output

After reparameterization:

Input → Single Fused Conv → Output

Solution: Call reparameterize_model() before tracing:

from mobileclip.modules.common.mobileone import reparameterize_model
model = reparameterize_model(model)

This algebraically fuses the parallel branches into a single convolution with equivalent weights.
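
A sanity check worth running before tracing. This is a sketch that assumes model is the loaded MobileCLIP2-S4 with an open_clip-style encode_image, and that reparameterize_model returns a fused copy rather than mutating in place:

import torch
from mobileclip.modules.common.mobileone import reparameterize_model

model.eval()
x = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    before = model.encode_image(x)
    fused = reparameterize_model(model)
    after = fused.encode_image(x)
print((before - after).abs().max())   # should be numerical noise only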


Requirements

Python Dependencies

coremltools>=8.3.0      # Must be 8.3+ for ARM64 native libs
torch==2.3.1            # Pinned for compatibility
torchvision==0.18.1
huggingface_hub>=0.19.0
open-clip-torch         # Provides model loading
ml-mobileclip           # Apple's MobileCLIP (has reparameterize_model)

System Requirements

  • macOS 13.0+ (Ventura) for ML Program format
  • Apple Silicon recommended for ANE testing
  • ~4GB disk space for models and checkpoint

Script Reference

convert_fixed.py - Image Encoder

Converts the image encoder with fixes for dynamic shapes.

Key steps:

  1. Load MobileCLIP2-S4 checkpoint
  2. Monkeypatch Attention modules with fixed H, W
  3. Reparameterize MobileOne blocks
  4. Wrap with L2 normalization
  5. Trace with torch.jit.trace
  6. Convert to Core ML with FLOAT16 precision

Preprocessing baked into model:

ct.ImageType(
    scale=1.0 / 255.0,
    bias=[-mean[i]/std[i] for i in range(3)],
    color_layout=ct.colorlayout.RGB,
)
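
Putting the pieces together, a condensed sketch of the conversion call. Variable names (wrapped_model, mean, std) and the input/output names are assumptions; the script itself is the source of truth:

import coremltools as ct
import torch

example = torch.rand(1, 3, 256, 256)
traced = torch.jit.trace(wrapped_model, example)   # patched, reparameterized, L2-norm wrapper

mlmodel = ct.convert(
    traced,
    inputs=[ct.ImageType(
        name="image",
        shape=(1, 3, 256, 256),
        scale=1.0 / 255.0,
        bias=[-m / s for m, s in zip(mean, std)],
        color_layout=ct.colorlayout.RGB,
    )],
    outputs=[ct.TensorType(name="embedding")],
    compute_precision=ct.precision.FLOAT16,
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.macOS13,
)
mlmodel.save("models/MobileCLIP2-S4-Image.mlpackage")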

convert_text.py - Text Encoder + Tokenizer

Converts the text encoder and exports the BPE tokenizer.

Key steps:

  1. Load MobileCLIP2-S4 checkpoint
  2. Export tokenizer vocab and merges to JSON
  3. Replace nn.MultiheadAttention with manual implementation
  4. Use mask-based EOT extraction
  5. Trace and convert to Core ML
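
A condensed sketch of the text-side conversion, where text_wrapper stands in for the module combining the patched encoder, mask-based EOT pooling, and L2 normalization (names are assumptions; see the script for the real code):

import coremltools as ct
import numpy as np
import torch

example_ids = torch.zeros(1, 77, dtype=torch.int32)
traced = torch.jit.trace(text_wrapper, example_ids)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids", shape=(1, 77), dtype=np.int32)],
    outputs=[ct.TensorType(name="embedding")],
    compute_precision=ct.precision.FLOAT16,
    convert_to="mlprogram",
)
mlmodel.save("models/MobileCLIP2-S4-Text.mlpackage")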

Tokenizer output (tokenizer.json):

{
  "vocab": {"!": 0, "\"": 1, ...},
  "merges": ["i n", "t h", "a n", ...],
  "byte_encoder": {...},
  "special_tokens": {"sot": 49406, "eot": 49407},
  "context_length": 77
}

convert_all.py - Full Pipeline

Orchestrates the complete conversion:

python scripts/convert_all.py

Runs download → image conversion → text conversion → metadata generation.


Swift Integration

Loading Models

import CoreML

let config = MLModelConfiguration()
config.computeUnits = .all  // Prefer ANE

let imageEncoder = try MobileCLIP2_S4_Image(configuration: config)
let textEncoder = try MobileCLIP2_S4_Text(configuration: config)

Image Encoding

import Vision

enum EncodingError: Error { case failed }

func encodeImage(_ cgImage: CGImage) throws -> [Float] {
    let model = try VNCoreMLModel(for: imageEncoder.model)
    let request = VNCoreMLRequest(model: model)

    let handler = VNImageRequestHandler(cgImage: cgImage)
    try handler.perform([request])

    guard let result = request.results?.first as? VNCoreMLFeatureValueObservation,
          let array = result.featureValue.multiArrayValue else {
        throw EncodingError.failed
    }

    // Convert MLMultiArray to [Float]
    return (0..<768).map { Float(truncating: array[$0]) }
}

Text Encoding

Requires implementing BPE tokenization using tokenizer.json:

func encodeText(_ text: String) throws -> [Float] {
    // 1. Tokenize (implement BPE using tokenizer.json)
    let tokens = tokenize(text)  // -> [Int32] of length 77

    // 2. Create MLMultiArray input
    let input = try MLMultiArray(shape: [1, 77], dataType: .int32)
    for (i, token) in tokens.enumerated() {
        input[i] = NSNumber(value: token)
    }

    // 3. Run inference
    let output = try textEncoder.prediction(input_ids: input)

    // 4. Extract embedding
    return (0..<768).map { Float(truncating: output.embedding[$0]) }
}

Similarity Search

func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    // Vectors are already L2-normalized, so dot product = cosine similarity
    zip(a, b).map(*).reduce(0, +)
}

// Find top matches
func search(query: String, screenshots: [(id: Int, embedding: [Float])]) throws -> [Int] {
    let queryEmbed = try encodeText(query)

    return screenshots
        .map { (id: $0.id, score: cosineSimilarity(queryEmbed, $0.embedding)) }
        .sorted { $0.score > $1.score }
        .prefix(10)
        .map { $0.id }
}

Troubleshooting

"BlobWriter not loaded" Error

pip uninstall coremltools
pip install "coremltools>=8.3.0"  # Get pre-built wheel

Conversion Fails at MIL Ops Stage

Check for dynamic operations in your model. Common culprits:

  • x.shape[i] followed by arithmetic
  • torch.arange() with tensor arguments
  • argmax() used for indexing

Performance

Measured on Apple M1 Pro:

Operation              Time
Image encoding         ~15ms
Text encoding          ~5ms
Similarity (768-dim)   <0.1ms

Memory usage: ~850MB for both models loaded.

