MLX-VLM Quickstart: Latest Qwen3-VL Models (2025-10-14)

🚀 Reproduce Latest Vision-Language Model Inference on Apple Silicon

This guide shows how to use the latest Qwen3-VL models (released October 14, 2025) to describe images on macOS with Apple Silicon, using MLX.

Prerequisites

  • macOS with Apple Silicon (M1/M2/M3/M4)
  • Python 3.9+
  • uvx (from uv package manager)
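To sanity-check the environment before downloading any models, here is a minimal stdlib-only sketch; the Darwin/arm64 checks are assumptions drawn from the prerequisites above:

#!/usr/bin/env python3
# Verify the prerequisites above: Apple Silicon macOS and Python 3.9+.
import platform
import sys

assert platform.system() == "Darwin", "MLX runs on macOS"
assert platform.machine() == "arm64", "MLX requires Apple Silicon (arm64)"
assert sys.version_info >= (3, 9), "Python 3.9+ required"
print("Environment looks OK for MLX")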

Installation

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# uvx will automatically install mlx-vlm when needed

Latest Models (Released 2025-10-14)

All models are available from the mlx-community organization on Hugging Face:

Qwen3-VL-8B Series

  • mlx-community/Qwen3-VL-8B-Instruct-{bf16,8bit,6bit,5bit,4bit}
  • mlx-community/Qwen3-VL-8B-Thinking-{bf16,8bit,6bit,5bit,4bit}

Qwen3-VL-4B Series (Recommended for Quick Start)

  • mlx-community/Qwen3-VL-4B-Instruct-{bf16,8bit,6bit,5bit,4bit} ✨
  • mlx-community/Qwen3-VL-4B-Thinking-{bf16,8bit,6bit,5bit,4bit}
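The {bf16,8bit,...} notation above is shell-style shorthand for five separate repositories per model. A small sketch (the expansion helper is ours, not part of mlx-vlm) that turns it into full Hugging Face repo IDs:

from itertools import product

BASES = ["Qwen3-VL-8B-Instruct", "Qwen3-VL-8B-Thinking",
         "Qwen3-VL-4B-Instruct", "Qwen3-VL-4B-Thinking"]
QUANTS = ["bf16", "8bit", "6bit", "5bit", "4bit"]

# e.g. "mlx-community/Qwen3-VL-4B-Instruct-4bit"
repo_ids = [f"mlx-community/{b}-{q}" for b, q in product(BASES, QUANTS)]
print(len(repo_ids), "repos")  # 20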

Quick Start: Single Image

# Describe a single image using the latest 4-bit quantized model
uvx --from mlx-vlm mlx_vlm.generate \
  --model mlx-community/Qwen3-VL-4B-Instruct-4bit \
  --image /path/to/your/image.jpg \
  --prompt "Describe this image in detail." \
  --max-tokens 150 \
  --temperature 0.7
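The same generation is available from Python without shelling out. A minimal sketch following the load/generate pattern from the mlx-vlm README (argument names can shift between releases, so treat this as a template rather than a fixed API):

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen3-VL-4B-Instruct-4bit"
model, processor = load(model_path)  # downloads the weights on first use
config = load_config(model_path)

image = ["/path/to/your/image.jpg"]
formatted = apply_chat_template(processor, config,
                                "Describe this image in detail.",
                                num_images=len(image))
output = generate(model, processor, formatted, image,
                  max_tokens=150, verbose=False)
print(output)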

Batch Processing: 17 Random Images

Create a Python script to process multiple images:

#!/usr/bin/env python3
"""
Batch image description using Qwen3-VL-4B-Instruct-4bit
Seed: 1069 (balanced ternary: [+1, -1, -1, +1, +1, +1, +1])
"""
import subprocess
import json
from pathlib import Path
import random

MODEL = "mlx-community/Qwen3-VL-4B-Instruct-4bit"
SEED = 1069

def find_random_images(directory: str, count: int = 17) -> list:
    """Find random images under a directory (recursive)."""
    image_extensions = {'.jpg', '.jpeg', '.png', '.gif', '.webp', '.heic'}
    root = Path(directory).expanduser()  # expand "~" before globbing

    # Match extensions case-insensitively (e.g. IMG_0001.JPG)
    all_images = [p for p in root.rglob('*')
                  if p.suffix.lower() in image_extensions]

    random.seed(SEED)
    return random.sample(all_images, min(count, len(all_images)))

def describe_image(image_path: Path, index: int) -> dict:
    """Describe image using mlx-vlm"""
    print(f"\nβ–½ Processing {index}: {image_path.name}")

    try:
        result = subprocess.run(
            [
                "uvx", "--from", "mlx-vlm", "mlx_vlm.generate",
                "--model", MODEL,
                "--image", str(image_path),
                "--prompt", "Describe this image in detail.",
                "--max-tokens", "150",
                "--temperature", "0.7"
            ],
            capture_output=True,
            text=True,
            timeout=120
        )

        # Crude extraction: the CLI prints "=" separator lines around the
        # generated text, so grab everything after the last "=" character.
        output = result.stdout
        if "=" in output:
            description = output.split("=")[-1].strip()
        else:
            description = output.strip()

        return {
            "index": index,
            "filename": image_path.name,
            "path": str(image_path),
            "description": description,
            "model": MODEL,
            "seed": SEED
        }
    except Exception as e:
        return {
            "index": index,
            "filename": image_path.name,
            "path": str(image_path),
            "description": f"[ERROR: {e}]",
            "model": MODEL,
            "seed": SEED
        }

def main():
    # Find 17 random images from Desktop
    images = find_random_images("~/Desktop", 17)

    print(f"β—¬ Processing {len(images)} images with {MODEL}")
    print(f"β—¬ Seed: {SEED}")
    print("=" * 80)

    results = []
    for i, image_path in enumerate(images, 1):
        result = describe_image(image_path, i)
        results.append(result)
        print(f"  β†’ {result['description'][:100]}...")

    # Save results
    output_file = f"image_descriptions_{SEED}.json"
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)

    print("\n" + "=" * 80)
    print(f"βœ“ Saved descriptions to: {output_file}")
    print(f"βœ“ Model: {MODEL}")
    print(f"βœ“ Seed: {SEED}")

if __name__ == "__main__":
    main()
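Note that each uvx call spawns a fresh process and reloads the model, so the script above pays the model-load cost for every image. For larger batches, loading the model once via the Python API sketched in the Quick Start section and looping over images inside one process should be noticeably faster.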

Usage

# Save the script
chmod +x describe_images.py

# Run it
./describe_images.py

# Or with Python directly
python3 describe_images.py

Model Selection Guide

| Model                     | Size   | Speed   | Quality   | Use Case                        |
|---------------------------|--------|---------|-----------|---------------------------------|
| Qwen3-VL-4B-Instruct-4bit | ~2.5GB | Fastest | Good      | Quick testing, batch processing |
| Qwen3-VL-4B-Instruct-8bit | ~4GB   | Fast    | Better    | Balanced performance            |
| Qwen3-VL-8B-Instruct-4bit | ~5GB   | Medium  | Best      | High-quality descriptions       |
| Qwen3-VL-8B-Instruct-bf16 | ~16GB  | Slower  | Excellent | Maximum quality                 |
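A rough rule of thumb is that the weights should fit comfortably in unified memory alongside everything else you have open. A sketch that picks a model from total RAM (the thresholds are our assumptions, not official guidance; stdlib only):

import os

def total_ram_gb() -> float:
    """Total physical memory via sysconf (macOS and Linux)."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    except (ValueError, OSError):
        return 0.0  # unknown platform; fall through to the smallest model

def pick_model() -> str:
    ram = total_ram_gb()
    if ram >= 32:
        return "mlx-community/Qwen3-VL-8B-Instruct-bf16"
    if ram >= 16:
        return "mlx-community/Qwen3-VL-8B-Instruct-4bit"
    return "mlx-community/Qwen3-VL-4B-Instruct-4bit"

print(pick_model())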

Advanced Options

# Custom prompt
uvx --from mlx-vlm mlx_vlm.generate \
  --model mlx-community/Qwen3-VL-4B-Instruct-4bit \
  --image image.jpg \
  --prompt "What objects are visible? List them." \
  --max-tokens 100

# Lower temperature for more deterministic output
uvx --from mlx-vlm mlx_vlm.generate \
  --model mlx-community/Qwen3-VL-4B-Instruct-4bit \
  --image image.jpg \
  --prompt "Describe this image." \
  --temperature 0.3 \
  --max-tokens 200

# Multiple images
uvx --from mlx-vlm mlx_vlm.generate \
  --model mlx-community/Qwen3-VL-4B-Instruct-4bit \
  --image image1.jpg image2.jpg image3.jpg \
  --prompt "Describe each image." \
  --max-tokens 200
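For the multi-image case in Python, the chat template needs to know how many images accompany the prompt. A hedged sketch using the same interface as the Quick Start example:

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen3-VL-4B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

images = ["image1.jpg", "image2.jpg", "image3.jpg"]
formatted = apply_chat_template(processor, config, "Describe each image.",
                                num_images=len(images))
print(generate(model, processor, formatted, images,
               max_tokens=200, verbose=False))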

Troubleshooting

Model Download Issues

# First run will download ~4GB, be patient
# Check download progress in terminal output

Memory Issues

# Use 4-bit quantized models for lower memory usage
# Close other applications to free RAM

Timeout Issues

# Increase timeout in Python script
timeout=300  # 5 minutes

# Or run without batch processing

ServiceNow StarVector (Bonus)

While exploring the latest models, we also came across ServiceNow's StarVector:

  • Multimodal LLM for SVG generation from images/text
  • Accepted at CVPR 2025
  • Available on Hugging Face

Performance Notes

On Apple M-series chips:

  • First run: ~5-10 minutes (model download)
  • Subsequent runs: ~2-5 seconds per image (4-bit model)
  • Memory usage: ~3-4GB (4-bit model)
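To check these figures on your own machine, here is a minimal timing sketch around the CLI call; note that it measures the whole process, including per-invocation model load, so it will read higher than pure per-image inference time:

import subprocess
import time

cmd = [
    "uvx", "--from", "mlx-vlm", "mlx_vlm.generate",
    "--model", "mlx-community/Qwen3-VL-4B-Instruct-4bit",
    "--image", "/path/to/your/image.jpg",
    "--prompt", "Describe this image.",
    "--max-tokens", "150",
]

start = time.perf_counter()
subprocess.run(cmd, check=True)
print(f"wall time: {time.perf_counter() - start:.1f}s")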

Credits

Generated with seed 1069 (balanced ternary: [+1, -1, -1, +1, +1, +1, +1])


Last updated: 2025-10-14. Models released: 2025-10-14T18:13-18:29 UTC.
