We're back on track again with quicker releases. This time, we have a few interesting changes:
- Better Async Cancellation: Request terminations on the async engine (and the API) are a lot more streamlined now. Mostly a dev experience improvement.
- RSLoRA: We now support RSLoRA adapters. Loaded the same way as any other regular LoRA.
- Pipeline Parallel for LoRA: You can now use pipeline parallelism with LoRA! You should've been able to, but there was a bug preventing it.
- API server health check improvements: If you ping `/health` and the engine is dead, it'll terminate the server too. (Example after the list.)
- Remove `max_num_batched_tokens` limitation for LoRA: A leftover guard from before we switched to Triton LoRA kernels.
- INT8 quantization for TPU: You can now load FP16 models in INT8 on-the-fly for TPUs. Just launch your model with `-q tpu_int8` (launch example below).
- Encoder-Decoder support in the API: You can now launch and serve Encoder-Decoder models (currently BART only) via the OpenAI API server.
- Optimization: Massive E2E overhead reduction: By caching some Python object allocations, we got some huge performance boosts: ~20% higher throughput for an 8B model on a single H100.
- FP8 KV Cache + Chunked Prefill: It's finally here, but it only works on Ada Lovelace (4090, L40, etc.) and Hopper (H100, etc.) GPUs. Triton limitation :( (Flag example below.)
- Temperature Last: It's finally here - you can apply temperature scaling last, after all the other samplers. Just pass `temperature_last=True` in your request (`"temperature_last": true` for the API; example below). This should help with min_p sampling quite a bit.
- Mamba Model Support: We now support loading Mamba models! RWKV5/6 soon to follow.
- Embeddings for the batched endpoint: You can now run embedding models via the batched endpoint. Please see `examples/openai_api/batched_inference.md` (and the sketch after this list).
- Fix: chunked prefill with v2 block manager: Chunked prefill now works with the v2 block manager, which we use for spec decoding.
- Image embeddings input for VLMs: You can now directly pass image embeddings for vision language models.
- Better progress bar: We now have a progress bar for loading individual modules within a model.
- Logit softcapping on FA2: Logit softcapping now works on FA2. It won't work with Gemma 2 yet because the current FA2 implementation doesn't support SWA.
- LoRA loading/unloading endpoints: We have API endpoints for loading (`/v1/lora/load`) and unloading (`/v1/lora/unload`) LoRA adapters on-the-fly (curl example below).
- Soft Prompt loading/unloading endpoints: Same as above, but at `/v1/soft_prompt/{load,unload}`.
- Tensor Parallel Chameleon: The Chameleon model had an issue when loading at TP>1. It's been fixed now.
- Initial support for Audio Modality: Only the framework is in place now, but we should be able to support models like Whisper pretty soon.
- Solar model support: There's a new model from Upstage with a new arch, `SolarForCausalLM`. That's supported now.
- Aphrodite Plugin System: We now have a plugin system! It's a bit rudimentary for now, but I'll add docs later.
- Multi-host TPU: You can now run Aphrodite on multi-host TPUs.
- Mistral Tokenizer Mode: When loading Mistral models, it's recommended to use `--tokenizer-mode mistral`. This uses their own native tokenizer implementation; you might get garbage output otherwise, since their newer models rely on Mistral-specific tokens. (Example below.)
- Quantization for draft models: You can now pass quantization configs to draft models in speculative decoding, to load them in FP8 for example. Just pass `--speculative-model-quantization=fp8` (example below).
- Prompt logprobs in OpenAI API: You can now request prompt logprobs in the API (example below).
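A few quick, non-authoritative examples for the items above. First, the health check: this assumes the server is on the default port (2242 in recent releases; adjust if you changed it).

```bash
# Poll the API server's health endpoint (default port assumed to be 2242).
# A 200 response means the engine is alive; if the engine has died, the
# server now shuts itself down instead of lingering.
curl -i http://localhost:2242/health
```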
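For on-the-fly INT8 on TPU, only the `-q tpu_int8` flag comes from the notes; the `aphrodite run` launcher and model name here are just placeholders for whatever you normally use.

```bash
# Load an FP16 checkpoint and quantize it to INT8 on the fly on a TPU.
# Only `-q tpu_int8` is from the release notes; launcher and model name
# are placeholders, adjust to your setup.
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct -q tpu_int8
```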
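A sketch for combining the FP8 KV cache with chunked prefill on an Ada/Hopper GPU. The flag names below are my assumption of the usual engine options, so check `--help` if they differ on your install.

```bash
# FP8 KV cache together with chunked prefill (Ada Lovelace / Hopper only).
# Flag names are assumed from the usual engine options; verify with --help.
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill
```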
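Here's what `"temperature_last": true` looks like in an OpenAI-compatible request; the model name and sampling values are placeholders.

```bash
# Apply temperature scaling after all other samplers (helps min_p).
# Model name and sampling values are placeholders.
curl http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "Once upon a time",
        "max_tokens": 64,
        "min_p": 0.1,
        "temperature": 1.2,
        "temperature_last": true
      }'
```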
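For embeddings via the batched endpoint, `examples/openai_api/batched_inference.md` is the reference; as a rough sketch, each input line follows the OpenAI batch format and targets `/v1/embeddings`. The model name is a placeholder, and the exact runner invocation is in the example doc.

```bash
# One request per line in the batch input file (OpenAI batch format).
# See examples/openai_api/batched_inference.md for the runner command.
cat > batch_input.jsonl << 'EOF'
{"custom_id": "embed-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "Hello world"}}
EOF
```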
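The LoRA and soft prompt management endpoints take a small JSON body; the field names below (`lora_name`, `lora_path`) are my guess at the schema, not confirmed, so treat them as placeholders.

```bash
# Load a LoRA adapter at runtime. Field names are assumed, not confirmed.
curl -X POST http://localhost:2242/v1/lora/load \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter", "lora_path": "/path/to/adapter"}'

# Unload it again.
curl -X POST http://localhost:2242/v1/lora/unload \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter"}'

# Soft prompts follow the same pattern at /v1/soft_prompt/{load,unload}.
```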
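Mistral tokenizer mode in practice; again, only `--tokenizer-mode mistral` is from the notes, the launcher and model name are illustrative.

```bash
# Use Mistral's native tokenizer implementation to avoid garbage output
# with their newer Mistral-specific tokens. Launcher/model are placeholders.
aphrodite run mistralai/Mistral-Nemo-Instruct-2407 --tokenizer-mode mistral
```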
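Draft model quantization for speculative decoding; `--speculative-model-quantization=fp8` is from the notes, while the other spec-decoding flags and model names are assumed placeholders.

```bash
# Quantize only the draft model to FP8 for speculative decoding.
# --speculative-model-quantization is from the release notes; the other
# spec-decoding flags and model names are assumed placeholders.
aphrodite run meta-llama/Meta-Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5 \
  --speculative-model-quantization=fp8
```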
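Finally, requesting prompt logprobs through the OpenAI-compatible API. I'm assuming the extra body parameter is called `prompt_logprobs`, mirroring the engine-side option; adjust if the actual name differs.

```bash
# Request logprobs for the prompt tokens as well as the completion.
# The "prompt_logprobs" field name is an assumption, mirroring the
# engine-side option; check the API docs if it differs.
curl http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "The quick brown fox",
        "max_tokens": 8,
        "prompt_logprobs": 1
      }'
```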
https://github.com/PygmalionAI/aphrodite-engine/releases/tag/v0.6.1