We're back on track again with quicker releases. This time, we have a few interesting changes:
- Better Async Cancellation: Request terminations on the async engine (and the API) are a lot more streamlined now. Mostly a dev experience improvement.
- RSLoRA: We now support RSLoRA adapters. Loaded the same way as any other regular LoRA.
- Pipeline Parallel for LoRA: You can now use pipeline parallelism with LoRA! You should've been able to, but there was a bug preventing it.
- API server health check improvements: If you ping `/health` and the engine is dead, it'll terminate the server too. (Example after the list.)
- Remove `max_num_batched_tokens` limitation for LoRA: A leftover guard from before we switched to Triton LoRA kernels.
- INT8 quantization for TPU: You can now load FP16 models in INT8 on-the-fly for TPUs. Just launch your model with `-q tpu_int8` (launch example below).
- Encoder-Decoder support in the API: You can now launch and serve Encoder-Decoder models (currently BART only) via the OpenAI API server.
- Optimization: Massive E2E overhead reduction: By caching some Python object allocations, we got some huge performance boosts: ~20% higher throughput for an 8B model on a single H100.
- FP8 KV Cache + Chunked Prefill: It's finally here, but it only works on Ada Lovelace (4090, L40, etc.) and Hopper (H100, etc.) GPUs. Triton limitation :( (Flag example below.)
- Temperature Last: It's finally here - you can apply temperature scaling last, after all the other samplers. Just pass `temperature_last=True` in your request (`"temperature_last": true` for the API; example below). This should help with min_p sampling quite a bit.
- Mamba Model Support: We now support loading Mamba models! RWKV5/6 soon to follow.
- Embeddings for the batched endpoint: You can now run embedding models via the batched endpoint. Please see `examples/openai_api/batched_inference.md` (and the sketch after this list).
- Fix: chunked prefill with v2 block manager: Chunked prefill now works with the v2 block manager, which we use for spec decoding.
- Image embeddings input for VLMs: You can now directly pass image embeddings for vision language models.
- Better progress bar: We now have a progress bar for loading individual modules within a model.
- Logit softcapping on FA2: Logit softcapping now works on FA2. It won't work with Gemma 2 yet because the current FA2 implementation doesn't support SWA.
- LoRA loading/unloading endpoints: We have API endpoints for loading (`/v1/lora/load`) and unloading (`/v1/lora/unload`) LoRA adapters on-the-fly (curl example below).
- Soft Prompt loading/unloading endpoints: Same as above, but at `/v1/soft_prompt/{load,unload}`.
- Tensor Parallel Chameleon: The Chameleon model had an issue when loading at TP>1. It's been fixed now.
- Initial support for Audio Modality: Only the framework is in place now, but we should be able to support models like Whisper pretty soon.
- Solar model support: There's a new model from Upstage with a new arch, `SolarForCausalLM`. That's supported now.
- Aphrodite Plugin System: We now have a plugin system! It's a bit rudimentary for now, but I'll add docs later.
- Multi-host TPU: You can now run Aphrodite on multi-host TPUs.
- Mistral Tokenizer Mode: When loading Mistral models, it's recommended to use `--tokenizer-mode mistral`. This uses their own native tokenizer implementation; you might get garbage output otherwise, since their newer models rely on Mistral-specific tokens. (Example below.)
- Quantization for draft models: You can now pass quantization configs to draft models in speculative decoding, to load them in FP8 for example. Just pass `--speculative-model-quantization=fp8` (example below).
- Prompt logprobs in OpenAI API: You can now request prompt logprobs in the API (example below).
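A few quick, non-authoritative examples for the items above. First, the health check: this assumes the server is on the default port (2242 in recent releases; adjust if you changed it).

```bash
# Poll the API server's health endpoint (default port assumed to be 2242).
# A 200 response means the engine is alive; if the engine has died, the
# server now shuts itself down instead of lingering.
curl -i http://localhost:2242/health
```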
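For on-the-fly INT8 on TPU, only the `-q tpu_int8` flag comes from the notes; the `aphrodite run` launcher and model name here are just placeholders for whatever you normally use.

```bash
# Load an FP16 checkpoint and quantize it to INT8 on the fly on a TPU.
# Only `-q tpu_int8` is from the release notes; launcher and model name
# are placeholders, adjust to your setup.
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct -q tpu_int8
```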
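A sketch for combining the FP8 KV cache with chunked prefill on an Ada/Hopper GPU. The flag names below are my assumption of the usual engine options, so check `--help` if they differ on your install.

```bash
# FP8 KV cache together with chunked prefill (Ada Lovelace / Hopper only).
# Flag names are assumed from the usual engine options; verify with --help.
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill
```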
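Here's what `"temperature_last": true` looks like in an OpenAI-compatible request; the model name and sampling values are placeholders.

```bash
# Apply temperature scaling after all other samplers (helps min_p).
# Model name and sampling values are placeholders.
curl http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "Once upon a time",
        "max_tokens": 64,
        "min_p": 0.1,
        "temperature": 1.2,
        "temperature_last": true
      }'
```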
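For embeddings via the batched endpoint, `examples/openai_api/batched_inference.md` is the reference; as a rough sketch, each input line follows the OpenAI batch format and targets `/v1/embeddings`. The model name is a placeholder, and the exact runner invocation is in the example doc.

```bash
# One request per line in the batch input file (OpenAI batch format).
# See examples/openai_api/batched_inference.md for the runner command.
cat > batch_input.jsonl << 'EOF'
{"custom_id": "embed-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "Hello world"}}
EOF
```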
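The LoRA and soft prompt management endpoints take a small JSON body; the field names below (`lora_name`, `lora_path`) are my guess at the schema, not confirmed, so treat them as placeholders.

```bash
# Load a LoRA adapter at runtime. Field names are assumed, not confirmed.
curl -X POST http://localhost:2242/v1/lora/load \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter", "lora_path": "/path/to/adapter"}'

# Unload it again.
curl -X POST http://localhost:2242/v1/lora/unload \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter"}'

# Soft prompts follow the same pattern at /v1/soft_prompt/{load,unload}.
```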
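Mistral tokenizer mode in practice; again, only `--tokenizer-mode mistral` is from the notes, the launcher and model name are illustrative.

```bash
# Use Mistral's native tokenizer implementation to avoid garbage output
# with their newer Mistral-specific tokens. Launcher/model are placeholders.
aphrodite run mistralai/Mistral-Nemo-Instruct-2407 --tokenizer-mode mistral
```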
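Draft model quantization for speculative decoding; `--speculative-model-quantization=fp8` is from the notes, while the other spec-decoding flags and model names are assumed placeholders.

```bash
# Quantize only the draft model to FP8 for speculative decoding.
# --speculative-model-quantization is from the release notes; the other
# spec-decoding flags and model names are assumed placeholders.
aphrodite run meta-llama/Meta-Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5 \
  --speculative-model-quantization=fp8
```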
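Finally, requesting prompt logprobs through the OpenAI-compatible API. I'm assuming the extra body parameter is called `prompt_logprobs`, mirroring the engine-side option; adjust if the actual name differs.

```bash
# Request logprobs for the prompt tokens as well as the completion.
# The "prompt_logprobs" field name is an assumption, mirroring the
# engine-side option; check the API docs if it differs.
curl http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "The quick brown fox",
        "max_tokens": 8,
        "prompt_logprobs": 1
      }'
```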
https://github.com/PygmalionAI/aphrodite-engine/releases/tag/v0.6.1