AlpinDale / release_v0.6.1_notes.md
Created September 12, 2024 04:12
Aphrodite Engine v0.6.1 Release Notes

Aphrodite Engine - v0.6.1

We're back on track again with quicker releases. This time, we have a few interesting changes:

  • Better Async Cancellation: Request terminations on the async engine (and the API) are a lot more streamlined now. Mostly a dev experience improvement.
  • RSLoRA: We now support RSLoRA adapters. They load the same way as any other regular LoRA.
  • Pipeline Parallel for LoRA: You can now use pipeline parallelism with LoRA! You should've been able to, but there was a bug preventing it.
  • API server health check improvements: If you ping /health and the engine is dead, it'll terminate the server too.
  • Remove max_num_batched_tokens limitation for LoRA: A leftover guard from before we switched to Triton LoRA kernels.
  • INT8 quantization for TPU: You can now load FP16 models in INT8 on-the-fly for TPUs. Just launch your model with -q tpu_int8.
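The TPU INT8 path only needs the quantization flag at launch. A minimal sketch, assuming the `aphrodite run` entry point and a placeholder model name (substitute your own; on older installs the equivalent is the `python -m aphrodite.endpoints.openai.api_server` module invocation):

```shell
# Serve an FP16 checkpoint on TPU, quantizing weights to INT8 on the fly.
# "meta-llama/Meta-Llama-3-8B-Instruct" is a placeholder model name.
aphrodite run meta-llama/Meta-Llama-3-8B-Instruct -q tpu_int8
```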
AlpinDale / merge_lora.py
Last active March 11, 2024 19:36
Merging a LoRA adapter into a model
"""
Script for merging PEFT LoRA weights into the base model. Uses code from https://github.com/eugenepentland/landmark-attention-qlora/blob/main/llama/merge_peft.py
Usage: python merge_lora.py [-h] [--base_model_name_or_path BASE_MODEL_NAME_OR_PATH] [--peft_model_path PEFT_MODEL_PATH] [--output_dir OUTPUT_DIR] [--device DEVICE]
[--push_to_hub]
"""
import argparse
import logging
import os

import torch
from tqdm import tqdm