ML Framework Performance Comparison on Mac M1: Tabular Data Benchmark

Objective

Compare training performance of popular machine learning frameworks on Mac Studio M1 for tabular regression tasks, focusing on CPU vs GPU acceleration capabilities.

Raw timing results:

10k samples

1. CatBoost         0.2053s
2. DIY torch        0.5782s
3. torch            0.8769s
4. LightGBM         1.3193s
5. XGBoost          2.5680s
6. Lightning       26.3053s
7. TabNet          29.1417s

Test Setup

  • Hardware: Mac Studio M1 with MPS (Metal Performance Shaders) GPU support
  • Dataset: 10,000 samples, 10 features, regression task (synthetic data)
  • Training iterations: 100 boosting rounds for tree methods, 50 epochs for neural networks
  • Frameworks tested:
    • Tree-based: XGBoost, LightGBM, CatBoost
    • Neural networks: PyTorch, TabNet, Custom TabularNet, PyTorch Lightning

Key Hardware Constraints

  • GPU Support: Only PyTorch-based models can use M1's MPS GPU acceleration
  • CPU-only: Tree-based models (XGBoost, LightGBM, CatBoost) cannot access MPS
  • No CUDA/OpenCL: CUDA is unavailable on Apple Silicon and OpenCL is deprecated on macOS, so traditional GPU-acceleration backends don't apply (a quick device check is sketched below)
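
A minimal device-check sketch (falls back to CPU when MPS is unavailable):

import torch

# Prefer Apple's Metal backend when present; otherwise fall back to CPU.
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
print(f'Using device: {device}')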

Results

Rank  Framework        Time (seconds)  Acceleration
1     CatBoost                  0.205  CPU
2     DIY torch                 0.578  🟢 MPS GPU
3     torch (simple)            0.877  🟢 MPS GPU
4     LightGBM                  1.319  CPU
5     XGBoost                   2.568  CPU
6     Lightning                26.305  🟢 MPS GPU
7     TabNet                   29.142  🟢 MPS GPU

Analysis

🏆 CatBoost Dominates

CatBoost emerged as the clear winner, proving that highly optimized CPU algorithms can outperform GPU-accelerated neural networks for tabular data at this scale. Its efficient gradient boosting implementation and superior memory management make it ideal for medium-sized datasets.

🚀 Simple PyTorch Shows Promise

Custom PyTorch models with MPS acceleration performed well, demonstrating that Apple's Metal Performance Shaders provide real acceleration benefits. Notably, the deeper custom TabularNet (578ms) beat the simpler Sequential model (877ms), which likely reflects one-time MPS warm-up costs absorbed by the first model trained rather than any penalty for architectural complexity.

💥 TabNet's Scaling Problem

TabNet, despite being designed for tabular data and having GPU acceleration, was dramatically slower than every alternative (29+ seconds, roughly 35x slower per epoch than plain PyTorch). That per-epoch overhead makes it unsuitable for datasets beyond toy examples.

⚡ Framework Overhead Matters

PyTorch Lightning's 26-second training time reveals significant framework overhead. While excellent for complex ML pipelines, it's overkill for simple tabular tasks where raw PyTorch suffices.

🌟 Tree Methods Hold Strong

Despite lacking GPU acceleration, traditional gradient boosting methods remained competitive. XGBoost and LightGBM's slower performance likely reflects less optimization for Apple Silicon compared to CatBoost.

Key Insights

  1. Algorithm efficiency trumps hardware acceleration for tabular data at 10K+ scale
  2. CatBoost's optimization for modern CPUs makes it the go-to choice for tabular ML
  3. Simple neural networks can compete when GPU-accelerated, but complex frameworks add prohibitive overhead
  4. Mac M1 MPS acceleration is real but doesn't overcome fundamental algorithmic limitations
  5. TabNet's poor scaling makes it unsuitable for production tabular workloads

Recommendations

For Mac M1 tabular ML:

  • First choice: CatBoost for reliability, speed, and excellent defaults
  • GPU experiments: Simple PyTorch models when you need neural networks
  • Avoid: TabNet for anything beyond small experiments
  • Production: Stick with gradient boosting (CatBoost > LightGBM > XGBoost on M1)

The bottom line: Despite GPU acceleration hype, well-optimized CPU algorithms still rule tabular machine learning on modern Apple Silicon.

import os
from datetime import datetime
import numpy as np
import catboost as cb # CPU - no Apple MPS support
import lightgbm as lgb # CPU - no Apple MPS support
import xgboost as xgb # CPU - no Apple MPS support
import torch
import torch.nn as nn
import torch.optim as optim
from pytorch_tabnet.tab_model import TabNetRegressor
import pytorch_lightning as lightning
# Note: OpenMP reads this at library load time; in practice set it before importing numpy and the boosting libraries.
os.environ['OMP_NUM_THREADS'] = str(os.cpu_count())
# Initial mps sanity check
assert torch.backends.mps.is_available()
print(f'torch mps is available: {torch.backends.mps.is_available()}')
device_name = 'mps'
device = torch.device(device_name)
# The data
n_samples = 10_000
X = np.random.random((n_samples, 10)).astype(np.float32)
n_features = X.shape[1]
y = np.random.random(n_samples).astype(np.float32)
X_tensor = torch.tensor(X, dtype=torch.float32).to(device)
y_tensor = torch.tensor(y, dtype=torch.float32).to(device)
timings = {}
# Gradient Boosting models ==================================================
# number of boosting rounds
# individual decision trees added sequentially to the ensemble model
n_boost_rounds = 100
# CatBoost ------------------------------------------------------------------
model = cb.CatBoostRegressor(
    task_type='CPU',
    thread_count=os.cpu_count(),
    iterations=n_boost_rounds,
    # verbose=1,
)
start = datetime.now()
model.fit(X, y, verbose=False)
timings['CatBoost'] = (datetime.now()-start).total_seconds()
print(f'CatBoost training: {timings["CatBoost"]}')
# LightGBM ------------------------------------------------------------------
model = lgb.LGBMRegressor(
    n_estimators=n_boost_rounds,
    device='cpu',
    n_jobs=os.cpu_count(),
)
lgb_device_name = model.get_params()['device']
print(f'LightGBM device: {lgb_device_name}')
start = datetime.now()
model.fit(X, y)
timings['LightGBM'] = (datetime.now()-start).total_seconds()
print(f'LightGBM training: {timings["LightGBM"]}')
# XGBoost (no mps support) --------------------------------------------------
xgb_params = {
    'tree_method': 'hist',  # or for very large datasets use 'approx'
    # 'sketch_eps': 0.1,  # used with 'approx'
    'nthread': os.cpu_count(),
    'max_bin': 512,
    'grow_policy': 'lossguide',
    'subsample': 0.8,
    'colsample_bytree': 0.8,
}
print(f'XGBoost version: {xgb.__version__}')
print(f'XGBoost built with: {xgb.build_info()}')
dtrain = xgb.DMatrix(X, label=y)
start = datetime.now()
print('training start')
model = xgb.train(xgb_params, dtrain, num_boost_round=n_boost_rounds)
timings['XGBoost'] = (datetime.now()-start).total_seconds()
print(f'XGBoost training: {timings["XGBoost"]}')
# PyTorch-based Neural Networks =============================================
n_epochs = n_boost_rounds // 2 # roughly a fair performance comparison
def torch_training(model):
    criterion = nn.MSELoss()
    optimizer = optim.Adam(
        model.parameters(),
        lr=0.001,
    )
    for epoch in range(n_epochs):
        optimizer.zero_grad()
        outputs = model(X_tensor).squeeze()
        loss = criterion(outputs, y_tensor)
        loss.backward()
        optimizer.step()
# torch ---------------------------------------------------------------------
model = nn.Sequential(
    nn.Linear(n_features, 128),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
).to(device)
start = datetime.now()
torch_training(model)
timings['torch'] = (datetime.now()-start).total_seconds()
print(f'torch training: {timings["torch"]}')
# TabNet --------------------------------------------------------------------
model = TabNetRegressor(device_name=device_name)
start = datetime.now()
model.fit(X, y.reshape(-1, 1), max_epochs=n_epochs, patience=10)
timings['TabNet'] = (datetime.now()-start).total_seconds()
print(f'TabNet training: {timings["TabNet"]}')
# DIY torch TabularNet ------------------------------------------------------
class TabularNet(nn.Module):
    def __init__(self, hidden_dims=(128, 64, 32)):
        super().__init__()
        layers = []
        prev_dim = n_features
        for dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, dim),
                nn.BatchNorm1d(dim),
                nn.ReLU(),
                nn.Dropout(0.2),
            ])
            prev_dim = dim
        layers.append(nn.Linear(prev_dim, 1))  # Output layer
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)
model = TabularNet().to(device)
start = datetime.now()
torch_training(model)
timings['DIY torch'] = (datetime.now()-start).total_seconds()
print(f'DIY torch training: {timings["DIY torch"]}')
# Lightning -----------------------------------------------------------------
class TabularLightning(lightning.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(n_features, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x).squeeze()
        loss = nn.MSELoss()(y_hat, y)
        return loss

    def configure_optimizers(self):
        return optim.Adam(self.parameters())
trainer = lightning.Trainer(
    accelerator=device_name,
    devices=1,
    max_epochs=n_epochs,
    enable_progress_bar=False,
    enable_checkpointing=False,
)
train_loader = torch.utils.data.DataLoader(
    list(zip(X_tensor, y_tensor)),
    batch_size=32,
)
model = TabularLightning()
start = datetime.now()
trainer.fit(model, train_loader)
timings['Lightning'] = (datetime.now()-start).total_seconds()
print(f'Lightning training: {timings["Lightning"]}')
# Timing report =============================================================
print('\n' + '='*50)
print('TRAINING TIME REPORT')
print('='*50)
sorted_times = sorted(timings.items(), key=lambda x: x[1])
for i, (name, time) in enumerate(sorted_times, 1):
    print(f'{i}. {name:<15} {time:.4f}s')

Lightning and TabNet suffer from a combination of implementation issues and MPS optimization problems:

Lightning's Issues:

Framework overhead is REAL:

  • Model inspection, parameter counting, logging setup
  • DataLoader warnings about num_workers (Lightning suggests 23 workers while the script uses 0); see the DataLoader discussion below
  • Checkpoint/logger initialization even when disabled
  • 26 seconds for simple training is inexcusable

TabNet's Issues:

Genuinely poor scaling:

  • Each epoch takes ~0.6 seconds (50 epochs × 0.6s ≈ 30s)
  • Compare: Simple PyTorch does 50 epochs in 0.87s total
  • Roughly 35x slower per epoch (0.6s vs ~0.017s) suggests fundamental inefficiency; the timing sketch below shows how to measure this directly
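
A minimal per-epoch timing sketch (assumes the `model`, `X_tensor`, and `y_tensor` from the script above; `torch.mps.synchronize()` makes sure queued Metal kernels finish before the clock is read, otherwise MPS timings come out misleadingly low):

import time
import torch
import torch.nn as nn
import torch.optim as optim

def mean_epoch_seconds(model, X, y, n_epochs=10):
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    per_epoch = []
    for _ in range(n_epochs):
        torch.mps.synchronize()  # flush pending GPU work before timing
        start = time.perf_counter()
        optimizer.zero_grad()
        loss = criterion(model(X).squeeze(), y)
        loss.backward()
        optimizer.step()
        torch.mps.synchronize()  # wait for this epoch's kernels to finish
        per_epoch.append(time.perf_counter() - start)
    return sum(per_epoch) / len(per_epoch)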

Potential MPS Optimization Problems:

TabNet likely issues:

# TabNet probably does inefficient things like:
# - Frequent CPU↔GPU transfers
# - Non-vectorized attention mechanisms  
# - Suboptimal batch processing for MPS
# - Memory fragmentation on Metal

Lightning overhead:

# Lightning does unnecessary work:
# - Model summary generation
# - Metric computation/logging
# - Progress bar updates (even disabled)
# - Distributed training setup (not needed)

Quick Test to Confirm:

# Test TabNet with fewer features/simpler config
model = TabNetRegressor(
    device_name='mps',
    n_d=8, n_a=8,  # Reduce complexity
    n_steps=3,     # Fewer steps
    gamma=1.3,
    n_independent=1,
    n_shared=1
)

# Test Lightning with minimal config
trainer = lightning.Trainer(
    accelerator='mps',
    devices=1, 
    max_epochs=10,  # Fewer epochs
    enable_progress_bar=False,
    enable_checkpointing=False,
    enable_model_summary=False,  # Disable summary
    logger=False  # No logging
)

Verdict: Both frameworks have real performance issues on MPS, a combination of poor optimization and excessive overhead. Simple PyTorch proves MPS works fine when used efficiently.
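
A simple way to confirm that last claim is to train the identical model on CPU and on MPS and compare wall-clock time (a minimal sketch reusing `n_features`, `n_epochs`, `X_tensor`, and `y_tensor` from the script above):

for name in ('cpu', 'mps'):
    dev = torch.device(name)
    model = nn.Sequential(
        nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, 1)
    ).to(dev)
    # Move the data to the device under test
    Xd, yd = X_tensor.to(dev), y_tensor.to(dev)
    criterion, optimizer = nn.MSELoss(), optim.Adam(model.parameters())
    start = datetime.now()
    for _ in range(n_epochs):
        optimizer.zero_grad()
        loss = criterion(model(Xd).squeeze(), yd)
        loss.backward()
        optimizer.step()
    print(f'{name}: {(datetime.now()-start).total_seconds():.3f}s')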

DataLoader num_workers Explanation:

# Current script (slow):
train_loader = torch.utils.data.DataLoader(
    list(zip(X_tensor, y_tensor)),
    batch_size=32,
    # num_workers defaults to 0 = single-threaded data loading
)

# Lightning suggests:
train_loader = torch.utils.data.DataLoader(
    list(zip(X_tensor, y_tensor)),
    batch_size=32,
    num_workers=23,  # Use 23 CPU cores for data loading
)

What num_workers Does:

  • num_workers=0: Main thread loads data (blocking GPU during data prep)
  • num_workers=23: 23 background processes load data in parallel (GPU can compute while data loads)

Is This Causing Lightning's Slowness?

Probably NOT the main culprit because:

  • Tiny dataset: 10K samples fit entirely in memory
  • Simple data: Already converted to tensors, minimal preprocessing
  • Small batches: 32 samples × ~300 batches = minimal data loading work

But it COULD help:

# Fix the DataLoader (note: worker processes need CPU tensors,
# so move X_tensor/y_tensor off MPS first):
train_loader = torch.utils.data.DataLoader(
    list(zip(X_tensor.cpu(), y_tensor.cpu())),
    batch_size=32,
    num_workers=4,    # Don't need all 23; 4-8 is plenty
    pin_memory=True,  # Speeds up CPU→GPU copies on CUDA; may be ignored on MPS
)

Mac M1 caveat: Sometimes multiprocessing DataLoaders are slower on Apple Silicon due to process spawn overhead. Try num_workers=0,2,4 and benchmark.
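
A small sweep makes that benchmark concrete (a sketch reusing `TabularLightning`, `n_epochs`, and the tensors from the script above; worker processes cannot share MPS tensors, so the dataset is built from CPU copies and Lightning moves each batch to the accelerator):

dataset = list(zip(X_tensor.cpu(), y_tensor.cpu()))  # workers require CPU tensors
for workers in (0, 2, 4):
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=workers)
    trainer = lightning.Trainer(
        accelerator='mps', devices=1, max_epochs=n_epochs,
        enable_progress_bar=False, enable_checkpointing=False,
        enable_model_summary=False, logger=False,
    )
    start = datetime.now()
    trainer.fit(TabularLightning(), loader)
    print(f'num_workers={workers}: {(datetime.now()-start).total_seconds():.2f}s')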

Bottom line: Worth trying, but Lightning's 26s is mostly framework overhead, not data loading.
