We're back on track again with quicker releases. This time, we have a few interesting changes:
- Better Async Cancellation: Request cancellation on the async engine (and the API server) is much more streamlined now. Mostly a dev experience improvement; see the cancellation sketch below the list.
- RSLoRA: We now support RSLoRA adapters. They load the same way as any regular LoRA adapter; see the LoRA sketch below.
- Pipeline Parallel for LoRA: You can now use pipeline parallelism with LoRA! This should have worked all along, but a bug was preventing it. The LoRA sketch below combines the two features.
- API server health check improvements: If you ping `/health` and the engine is dead, it'll now terminate the server too (see the polling sketch below).
- Remove `max_num_batched_tokens` limitation for LoRA: A leftover guard from before we switched to the Triton LoRA kernels.
- INT8 quantization for TPU: You can now load FP16 models in INT8 on the fly on TPUs. Just launch your model with `-q tpu_int8` (Python equivalent sketched below).
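
To show what the cancellation flow looks like from the caller's side, here's a minimal sketch against a vLLM-style async engine API; the module paths, class names, and model are assumptions, so swap in whatever your install exposes:

```python
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams


async def main():
    # Assumed vLLM-style async engine; adjust the model for your setup.
    engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))
    params = SamplingParams(max_tokens=512)

    async def consume():
        # generate() is an async generator that streams partial outputs.
        async for _ in engine.generate("Once upon a time,", params, request_id="req-1"):
            pass

    task = asyncio.create_task(consume())
    await asyncio.sleep(1.0)

    # Cancelling the consumer task is all it takes: the abort propagates
    # into the engine and frees the request instead of leaving it running.
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass


asyncio.run(main())
```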
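The two LoRA items combine naturally. Here's a minimal offline sketch, assuming a vLLM-style `LLM` entry point and `LoRARequest`; the model, adapter path, and parallelism degree are placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# LoRA now composes with pipeline parallelism, and RSLoRA adapters load
# through the exact same path as regular LoRA adapters.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder base model
    enable_lora=True,
    pipeline_parallel_size=2,  # split the layers across two devices
)

outputs = llm.generate(
    "Write a haiku about GPUs.",
    SamplingParams(max_tokens=64),
    # The directory may contain an RSLoRA adapter; no special flag needed.
    lora_request=LoRARequest("my-adapter", 1, "/path/to/adapter"),
)
print(outputs[0].outputs[0].text)
```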
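And a tiny watchdog sketch for the health check behavior; the host and port are assumptions for a default local deployment:

```python
import time

import requests

HEALTH_URL = "http://localhost:8000/health"  # assumed default host/port

while True:
    try:
        resp = requests.get(HEALTH_URL, timeout=5)
        print("healthy" if resp.ok else f"unhealthy: HTTP {resp.status_code}")
    except requests.ConnectionError:
        # With this release, a dead engine takes the server down with it,
        # so a refused connection is a reliable "restart me" signal.
        print("server is down")
        break
    time.sleep(10)
```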
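Finally, the offline equivalent of the `-q tpu_int8` launch flag, assuming the quantization name maps to the Python `quantization` argument the way other methods do:

```python
from vllm import LLM

# The FP16 checkpoint is quantized to INT8 on the fly at load time.
# quantization="tpu_int8" mirrors the `-q tpu_int8` CLI flag; the exact
# keyword name here is an assumption based on other quantization methods.
llm = LLM(model="meta-llama/Llama-2-7b-hf", quantization="tpu_int8")
```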