Compute obtained per dollar varies significantly by GPU and arithmetic intensity. According to Runpod's pricing, when pre-training LLMs with `batch_size=1024` (tokens), the L4 offers superior cost-performance for models under 0.5B parameters, while the H100 dominates for larger scales.
Just like time and money, you can never have enough "compute." This holds true whether you are pre-training models or running inference.
The internet is full of generic "GPU selection guides" (e.g., 1 2 3), but none of them answer the critical economic question: "Which GPU delivers the most FLOPs per dollar for my specific AI workload?"
In practice, many researchers and engineers select GPUs by convention, often defaulting to the A100. Consequently, they pay for hardware that is poorly matched to their arithmetic intensity, effectively burning potential FLOPs that the same budget could have bought. Can't turn a blind eye to that waste, can we?
To this end, I present an empirical method for estimating the best GPU for your deep learning workload. I developed this approach to financially optimize my own scaling law research. As a case study, we will scale a Transformer LLM across different GPUs and measure cost-effectiveness (FLOPs/$).
To ensure fair comparison, measurements were conducted under the following unified conditions:
- Architecture: Qwen3
- Optimization: AdamW optimizer, `autocast` to `bfloat16`
- Data: Dummy data generated via `torch.randint` (eliminating CPU/IO bottlenecks)
- Batch Size: 1024 tokens per update step (4 seq $\times$ 256 tokens/seq)
- Model Scaling: Width and depth increased according to the scaling law $\text{depth} \approx 0.77\ldots \cdot \log_2(\text{width}) - 2.5\ldots$, with width kept as a multiple of 256 for tiling optimization
- GPUs: L4, A100-80GB, H100
- Pricing: Based on Runpod's "Secure Cloud" rates 4 (which include vCPU costs)
This setup allows us to observe empirical performance and cost efficiency under a realistic training configuration.
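To make the setup concrete, here is a minimal sketch of such a timing loop. A plain `torch.nn` Transformer stack stands in for the actual Qwen3 implementation, and the vocabulary size, width, and step counts are illustrative assumptions rather than the exact benchmark settings.

```python
import math
import time

import torch
import torch.nn as nn

device = "cuda"
vocab_size = 32_000                       # assumed; any value works for dummy tokens
width = 1024                              # multiple of 256 for tiling efficiency
depth = round(0.77 * math.log2(width) - 2.5)  # depth from the scaling rule above
seqs, seq_len = 4, 256                    # 4 x 256 = 1024 tokens per update step

# Stand-in model: embedding -> Transformer blocks -> LM head
# (no causal mask; only matmul throughput matters for this measurement)
block = nn.TransformerEncoderLayer(
    d_model=width, nhead=width // 64, dim_feedforward=4 * width,
    batch_first=True, norm_first=True,
)
model = nn.Sequential(
    nn.Embedding(vocab_size, width),
    nn.TransformerEncoder(block, num_layers=depth),
    nn.Linear(width, vocab_size),
).to(device)

optimizer = torch.optim.AdamW(model.parameters())
criterion = nn.CrossEntropyLoss()

def train_step() -> None:
    # Dummy tokens generated on-device: no dataloader, no CPU/IO bottleneck
    tokens = torch.randint(0, vocab_size, (seqs, seq_len + 1), device=device)
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    with torch.autocast("cuda", dtype=torch.bfloat16):
        logits = model(inputs)
        loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

for _ in range(10):                       # warm-up (kernel selection, allocator, etc.)
    train_step()
torch.cuda.synchronize()

n_steps = 50
start = time.perf_counter()
for _ in range(n_steps):
    train_step()
torch.cuda.synchronize()
sec_per_step = (time.perf_counter() - start) / n_steps
print(f"width={width}, depth={depth}: {sec_per_step * 1e3:.1f} ms/step")
```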
First, consider the raw processing performance (FLOPS) for each GPU.
As expected, higher-end GPUs yield faster processing speeds. However, as we scale the model, the arithmetic intensity increases, leading to improved FLOPS utilization across the board.
Tip
Because the critical factor is increasing the arithmetic intensity of the matrix multiplications, you can expect a similar efficiency pattern by scaling `batch_size` instead of the model parameters.
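As a rough way to put numbers on "FLOPS utilization," you can convert a measured step time into achieved FLOPS using the common approximation of roughly `6 * params * tokens` FLOPs per training step, and compare it against the GPU's peak bf16 throughput. The peak figures below are approximate datasheet values (they vary by variant), and the example step time is hypothetical, not a result from this benchmark.

```python
# Approximate dense bf16 tensor-core peaks (datasheet values, not measured here)
PEAK_BF16_FLOPS = {
    "L4": 121e12,
    "A100-80GB": 312e12,
    "H100": 989e12,   # SXM variant; PCIe is lower
}

def achieved_flops(n_params: float, tokens_per_step: int, sec_per_step: float) -> float:
    """~6 * params * tokens FLOPs per training step, divided by step time."""
    return 6 * n_params * tokens_per_step / sec_per_step

# Hypothetical example: a 0.5B-parameter model, 1024-token steps, 95 ms/step on an A100
flops = achieved_flops(0.5e9, 1024, 0.095)
print(f"{flops / 1e12:.1f} TFLOPS achieved "
      f"({100 * flops / PEAK_BF16_FLOPS['A100-80GB']:.1f}% of peak bf16)")
```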
Our objective is to maximize the computational volume obtained per dollar invested (FLOPs / $). The data reveals that the "optimal" GPU changes distinctly depending on the model size.
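The metric itself is simple: sustained throughput multiplied by the number of seconds one dollar buys. A quick sketch, with a placeholder throughput and hourly price:

```python
def flops_per_dollar(achieved_flops: float, usd_per_hour: float) -> float:
    """Sustained FLOPS * (seconds per dollar) = FLOPs obtained per dollar."""
    return achieved_flops * 3600 / usd_per_hour

# Placeholder example: 30 TFLOPS sustained on a GPU billed at $0.43/hr
print(f"{flops_per_dollar(30e12, 0.43):.2e} FLOPs per dollar")  # ~2.5e17
```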
From the perspective of compute-per-cost, L4 and H100 appear as the dominant choices.
- Small Scale: For models under 500M parameters ($\sim 2^{29}$ FLOPs/step), the L4 is the most wallet-friendly choice for estimating scaling laws or conducting small-scale verification.
- Large Scale: Above that crossover point, the H100 yields the highest return on investment.
Interestingly, the A100 (often treated as the de facto standard in LLM research papers) offers little advantage in terms of pure cost efficiency in this setup. It remains a valid middle-ground option only if your workload requires more VRAM than an L4 provides but lacks the arithmetic intensity to justify an H100.
I hope this article helps the "GPU-poor" squeeze every possible FLOP out of their budget.
Running these benchmarks does not cost much, so I highly recommend verifying this for your specific deep learning workloads. Simply choosing the appropriate GPU could save you hundreds of dollars on a small project---or perhaps tens of thousands at scale.
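If you do run such a sweep yourself, the selection step reduces to ranking the candidates by FLOPs per dollar computed from your own measurements and your provider's current prices. The throughput and price figures below are placeholders to be replaced, not results from this benchmark.

```python
def rank_by_cost_efficiency(candidates: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """candidates: GPU name -> (sustained TFLOPS, USD per hour). Returns a FLOPs/$ ranking."""
    scored = {
        name: tflops * 1e12 * 3600 / usd_per_hour
        for name, (tflops, usd_per_hour) in candidates.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# Placeholder measurements and prices; substitute your own numbers
measurements = {
    "L4": (40.0, 0.43),
    "A100-80GB": (150.0, 1.64),
    "H100": (400.0, 2.99),
}
for name, fpd in rank_by_cost_efficiency(measurements):
    print(f"{name:>10}: {fpd:.2e} FLOPs/$")
```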
Note
The optimal GPU changes based on several conditions. E.g.:
- Pricing Structure: If using providers like Modal 5 instead of Runpod, pricing differences might create a specific zone where the A100 becomes optimal.
- Architecture: Non-Transformer architectures (e.g., MLP, CNN) will exhibit significantly different arithmetic intensities.
- Multi-GPU: In distributed workloads, memory bandwidth and interconnect speed can become the primary bottlenecks, altering the dynamics of the optimization problem.



Japanese version: https://gist.github.com/kyo-takano/c7b7ce2ca447a9e3d5b3e6782b1ee669