
Aphrodite Engine - v0.6.1

We're back on track again with quicker releases. This time, we have a few interesting changes:

  • Better Async Cancellation: Request terminations in the async engine (and the API) are much more streamlined now. Mostly a developer-experience improvement.
  • RSLoRA: We now support RSLoRA adapters. They load the same way as any regular LoRA.
  • Pipeline Parallel for LoRA: You can now use pipeline parallelism with LoRA! You should've been able to, but there was a bug preventing it.
  • API server health check improvements: If you ping /health and the engine is dead, it'll terminate the server too.
  • Remove max_num_batched_tokens limitation for LoRA: A leftover guard from before we switched to Triton LoRA kernels.
  • INT8 quantization for TPU: You can now load FP16 models in INT8 on the fly on TPUs. Just launch your model with -q tpu_int8 (see the sketch after this list).
  • Encoder-Decoder support in the API: You can now launch and serve Encoder-Decoder models (currently BART only) via the OpenAI API server.
  • Optimization: Massive E2E overhead reduction: By caching some Python object allocations, we've gotten some huge performance boosts: a ~20% throughput gain for an 8B model on a single H100.
  • FP8 KV Cache + Chunked Prefill: It's finally here, but only works for Ada Lovelace (4090, L40, etc) and Hopper (H100, etc) GPUs. Triton limitation :(
  • Temperature Last: It's finally here - you can apply temperature scaling last, after all the other samplers. Just pass temperature_last=True in your request, or "temperature_last": true via the API (example after this list). This should help quite a bit with min_p sampling.
  • Mamba Model Support: We now support loading Mamba models! RWKV5/6 soon to follow.
  • Embeddings for the batched endpoint: You can now run embedding models via the batched endpoint. Please see examples/openai_api/batched_inference.md (there's also a batch-file sketch after this list).
  • Fix: chunked prefill with v2 block manager: Chunked prefill now works correctly with the v2 block manager, which we use for speculative decoding.
  • Image embeddings input for VLMs: You can now directly pass image embeddings for vision language models.
  • Better progress bar: We now have a progress bar for loading individual modules within a model.
  • Logit softcapping on FA2: Logit softcapping now works on FA2. It won't work with Gemma 2 yet, because the current FA2 implementation doesn't support SWA (sliding window attention).
  • LoRA loading/unloading endpoints: We now have API endpoints for loading (/v1/lora/load) and unloading (/v1/lora/unload) LoRA adapters on the fly (example below).
  • Soft Prompt loading/unloading endpoints: Same as above, but at /v1/soft_prompt/{load,unload}.
  • Tensor Parallel Chameleon: The Chameleon model had an issue when loading at TP>1. It's fixed now.
  • Initial support for Audio Modality: Only the framework is in place now, but we should be able to support models like Whisper pretty soon.
  • Solar model support: There's a new model from Upstage with a new architecture, SolarForCausalLM. It's supported now.
  • Aphrodite Plugin System: We now have a plugin system! It's a bit rudimentary for now, but I'll add docs later.
  • Multi-host TPU: You can now run Aphrodite on multi-host TPUs.
  • Mistral Tokenizer Mode: When loading Mistral models, it's recommended to use --tokenizer-mode mistral, which uses Mistral's own native tokenizer implementation. You might get garbage output otherwise, since their newer models use Mistral-specific tokens (sketch below).
  • Quantization for draft models: You can now pass quantization configs to draft models in speculative decoding, e.g. to load them in FP8: just pass --speculative-model-quantization=fp8 (see below).
  • Prompt logprobs in OpenAI API: You can now request prompt logprobs in the API (example below).
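
A minimal offline-API sketch of the TPU INT8 loading above. The quantization kwarg is assumed to accept the same identifier as the -q server flag; the model name is just an example.

```python
# Hedged sketch: load an FP16 checkpoint as INT8 on the fly on TPU.
# Assumes the LLM constructor takes the same identifier as `-q tpu_int8`.
from aphrodite import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # example FP16 model
    quantization="tpu_int8",                        # quantize weights on load
)
outputs = llm.generate(["Hello from TPU!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```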
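
The temperature_last request field is taken straight from the notes above; the port (Aphrodite's default, 2242) and model name are illustrative.

```python
import requests

resp = requests.post(
    "http://localhost:2242/v1/completions",  # default Aphrodite API port
    json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",  # whatever you launched
        "prompt": "Once upon a time",
        "max_tokens": 64,
        "min_p": 0.1,              # min_p benefits the most from temperature-last
        "temperature": 1.5,
        "temperature_last": True,  # apply temperature after the other samplers
    },
)
print(resp.json()["choices"][0]["text"])
```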
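
For the batched embeddings, here's a sketch of building a batch input file. The JSONL layout follows the OpenAI batch-file convention and is an assumption here; examples/openai_api/batched_inference.md is the authoritative reference.

```python
import json

# Two embedding requests in OpenAI-style batch (JSONL) form. Field names
# are assumed; see examples/openai_api/batched_inference.md for the real format.
batch = [
    {"custom_id": "req-1", "method": "POST", "url": "/v1/embeddings",
     "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "Hello world"}},
    {"custom_id": "req-2", "method": "POST", "url": "/v1/embeddings",
     "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "Goodbye world"}},
]
with open("embeddings_batch.jsonl", "w") as f:
    for req in batch:
        f.write(json.dumps(req) + "\n")
```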
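
The LoRA endpoint paths come from the notes above; the payload field names (lora_name, lora_path) are assumptions and may differ from the actual schema.

```python
import requests

base = "http://localhost:2242"  # default Aphrodite API port

# Load an adapter at runtime (payload field names are assumptions):
requests.post(f"{base}/v1/lora/load", json={
    "lora_name": "my-adapter",            # assumed field name
    "lora_path": "/adapters/my-adapter",  # assumed field name
})

# ...and unload it when you're done:
requests.post(f"{base}/v1/lora/unload", json={
    "lora_name": "my-adapter",  # assumed field name
})
```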
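
An offline equivalent of --tokenizer-mode mistral; the tokenizer_mode kwarg is assumed to mirror the CLI flag, and the model name is an example.

```python
from aphrodite import LLM

llm = LLM(
    model="mistralai/Mistral-Nemo-Instruct-2407",  # example newer Mistral model
    tokenizer_mode="mistral",  # use Mistral's native tokenizer implementation
)
```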
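
A sketch of draft-model quantization in the offline API. Only --speculative-model-quantization=fp8 is confirmed above; the speculative_* kwarg names and the model pairing are assumptions.

```python
from aphrodite import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",             # target model (example)
    speculative_model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # draft model (example)
    num_speculative_tokens=5,              # assumed kwarg, mirrors the CLI flag
    speculative_model_quantization="fp8",  # quantize only the draft model
)
```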
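
And a prompt-logprobs request. The prompt_logprobs field name is an assumption; the notes above only confirm the feature exists.

```python
import requests

resp = requests.post(
    "http://localhost:2242/v1/completions",
    json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "The capital of France is",
        "max_tokens": 1,
        "prompt_logprobs": 1,  # assumed field name: logprobs for each prompt token
    },
)
print(resp.json())
```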

For the full list of changes, see here:

https://github.com/PygmalionAI/aphrodite-engine/releases/tag/v0.6.1
