We're back on track again with quicker releases. This time, we have a few interesting changes:
- Better Async Cancellation: Request cancellation on the async engine (and the API server) is much more streamlined now. Mostly a dev experience improvement; see the cancellation sketch below the list.
- RSLoRA: We now support RSLoRA adapters. They load the same way as any regular LoRA adapter; see the LoRA sketch below.
- Pipeline Parallel for LoRA: You can now use pipeline parallelism with LoRA! This should have worked all along, but a bug was preventing it. The LoRA sketch below combines the two features.
- API server health check improvements: If you ping `/health` and the engine is dead, it'll now terminate the server too (see the polling sketch below).
- Remove `max_num_batched_tokens` limitation for LoRA: A leftover guard from before we switched to the Triton LoRA kernels.
- INT8 quantization for TPU: You can now load FP16 models in INT8 on the fly on TPUs. Just launch your model with `-q tpu_int8` (Python equivalent sketched below).
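
To show what the cancellation flow looks like from the caller's side, here's a minimal sketch against a vLLM-style async engine API; the module paths, class names, and model are assumptions, so swap in whatever your install exposes:

```python
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams


async def main():
    # Assumed vLLM-style async engine; adjust the model for your setup.
    engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))
    params = SamplingParams(max_tokens=512)

    async def consume():
        # generate() is an async generator that streams partial outputs.
        async for _ in engine.generate("Once upon a time,", params, request_id="req-1"):
            pass

    task = asyncio.create_task(consume())
    await asyncio.sleep(1.0)

    # Cancelling the consumer task is all it takes: the abort propagates
    # into the engine and frees the request instead of leaving it running.
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass


asyncio.run(main())
```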
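The two LoRA items combine naturally. Here's a minimal offline sketch, assuming a vLLM-style `LLM` entry point and `LoRARequest`; the model, adapter path, and parallelism degree are placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# LoRA now composes with pipeline parallelism, and RSLoRA adapters load
# through the exact same path as regular LoRA adapters.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder base model
    enable_lora=True,
    pipeline_parallel_size=2,  # split the layers across two devices
)

outputs = llm.generate(
    "Write a haiku about GPUs.",
    SamplingParams(max_tokens=64),
    # The directory may contain an RSLoRA adapter; no special flag needed.
    lora_request=LoRARequest("my-adapter", 1, "/path/to/adapter"),
)
print(outputs[0].outputs[0].text)
```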
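And a tiny watchdog sketch for the health check behavior; the host and port are assumptions for a default local deployment:

```python
import time

import requests

HEALTH_URL = "http://localhost:8000/health"  # assumed default host/port

while True:
    try:
        resp = requests.get(HEALTH_URL, timeout=5)
        print("healthy" if resp.ok else f"unhealthy: HTTP {resp.status_code}")
    except requests.ConnectionError:
        # With this release, a dead engine takes the server down with it,
        # so a refused connection is a reliable "restart me" signal.
        print("server is down")
        break
    time.sleep(10)
```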
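Finally, the offline equivalent of the `-q tpu_int8` launch flag, assuming the quantization name maps to the Python `quantization` argument the way other methods do:

```python
from vllm import LLM

# The FP16 checkpoint is quantized to INT8 on the fly at load time.
# quantization="tpu_int8" mirrors the `-q tpu_int8` CLI flag; the exact
# keyword name here is an assumption based on other quantization methods.
llm = LLM(model="meta-llama/Llama-2-7b-hf", quantization="tpu_int8")
```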