Last active: April 17, 2025 17:10
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "id": "d6015308", | |
| "metadata": {}, | |
| "source": [ | |
| "Phone calls are often stored as mono audio with a low sample rate, to save space. I had some audio files of this kind (.wav, 8 kHz, mono), but to process them I needed stereo audio instead, with one speaker per channel.\n", | |
| "\n", | |
| "To achieve that, after converting these files from 8 kHz to **16 kHz**, I used **pyannote-audio** for *speaker diarization*. This Python library let me separate the two speakers into two mono channels, which I then merged into a single stereo .wav track using **pydub**." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "6556eba5", | |
| "metadata": {}, | |
| "source": [ | |
| "For pyannote installation: \n", | |
| "https://github.com/pyannote/pyannote-audio\n", | |
| "\n", | |
| "If you have problems with the conda environment (dead kernel in Jupyter / Intel mkl-fatal-error), these instructions solved my issue: \n", | |
| "https://stackoverflow.com/questions/71306262/conda-env-broken-after-installing-pytorch-on-m1-intel-mkl-fatal-error" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "id": "ee0c3fe2", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "!pip install -q speechbrain\n", | |
| "!pip install pydub" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 1, | |
| "id": "4c5e7541", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "from pyannote.audio import Pipeline\n", | |
| "\n", | |
| "from pydub import AudioSegment\n", | |
| "\n", | |
| "OWN_FILE = {'audio': './my-mono-file.wav'}\n", | |
| "\n", | |
| "# sample rate (pydub calls it the frame rate), in Hz\n", | |
| "FR = 16000" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "49108bc7", | |
| "metadata": {}, | |
| "source": [ | |
| "pyannote.audio uses a transformer-based model to perform speaker segmentation and diarization. I will use the model as is, without fine-tuning. A GPU is highly recommended; otherwise processing will be quite slow for long audio files. An alternative is to split them into several batches, but keep in mind that the speaker labels will not correspond across batches." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 3, | |
| "id": "ff6b852a", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "<pyannote.core.annotation.Annotation at 0x7f8bf03afb50>" | |
| ] | |
| }, | |
| "execution_count": 3, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "# note: recent pyannote versions require a Hugging Face access token, e.g.\n", | |
| "# Pipeline.from_pretrained('pyannote/speaker-diarization', use_auth_token='...')\n", | |
| "pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization')\n", | |
| "\n", | |
| "diarization = pipeline(OWN_FILE)\n", | |
| "diarization" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "1ba95d6c", | |
| "metadata": {}, | |
| "source": [ | |
| "The following function trims the chunks shown above, one speaker at a time, inserting silent segments wherever that speaker is not talking." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 4, | |
| "id": "b4afb22a", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "def trimming(diario):\n", | |
| "    # concatenate one speaker's segments, padding the gaps with silence;\n", | |
| "    # relies on the globals `sound` (the full recording) and `FR`\n", | |
| "    diario = list(diario.itersegments())\n", | |
| "    \n", | |
| "    speaker = []\n", | |
| "    for i, turn in enumerate(diario):\n", | |
| "        \n", | |
| "        # pyannote timestamps are in seconds, pydub slices in milliseconds\n", | |
| "        trim = sound[turn.start*1000:turn.end*1000]\n", | |
| "        \n", | |
| "        if i == 0:\n", | |
| "            # leading silence before the speaker's first turn\n", | |
| "            silence = AudioSegment.silent(duration=turn.start*1000, frame_rate=FR)\n", | |
| "            trim = silence + trim\n", | |
| "        \n", | |
| "        if i + 1 < len(diario):\n", | |
| "            # silence between this turn and the speaker's next one\n", | |
| "            silence = AudioSegment.silent(duration=diario[i+1].start*1000 - turn.end*1000, frame_rate=FR)\n", | |
| "            trim = trim + silence\n", | |
| "        \n", | |
| "        speaker.append(trim)\n", | |
| "    \n", | |
| "    return sum(speaker)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 6, | |
| "id": "38923e1d", | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "sound = AudioSegment.from_file(OWN_FILE['audio'])\n", | |
| "\n", | |
| "dsp0 = diarization.subset(set([\"SPEAKER_00\"]))\n", | |
| "right = trimming(dsp0)\n", | |
| "\n", | |
| "dsp1 = diarization.subset(set([\"SPEAKER_01\"]))\n", | |
| "left = trimming(dsp1)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "25e1efcc", | |
| "metadata": {}, | |
| "source": [ | |
| "Now we have the left channel (speaker 1) and the right channel (speaker 0). There is no guarantee that their lengths match exactly, since segmentation relies on a probabilistic model. However, to merge two mono tracks into a stereo one, they must have not only the same length but exactly the same frame count." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 7, | |
| "id": "45153e7d", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "1050912.0\n", | |
| "1050912.0\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "def reframing(right, left):\n", | |
| "    # pad the shorter track with trailing silence so both tracks\n", | |
| "    # end up with exactly the same frame count\n", | |
| "    l_count = left.frame_count()\n", | |
| "    r_count = right.frame_count()\n", | |
| "    \n", | |
| "    if l_count > r_count:\n", | |
| "        diff = ((l_count - r_count) / FR) * 1000  # missing frames, in ms\n", | |
| "        right = right + AudioSegment.silent(duration=diff, frame_rate=FR)\n", | |
| "    else:\n", | |
| "        diff = ((r_count - l_count) / FR) * 1000\n", | |
| "        left = left + AudioSegment.silent(duration=diff, frame_rate=FR)\n", | |
| "    \n", | |
| "    print(right.frame_count())\n", | |
| "    print(left.frame_count())\n", | |
| "    return right, left\n", | |
| "\n", | |
| "right, left = reframing(right, left)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "id": "28b09df4", | |
| "metadata": {}, | |
| "source": [ | |
| "The reframing function above appends silence to the end of the shorter track. This way of equalizing the frame counts is quite empirical; if it doesn't work for you, consider pydub's .overlay method (https://github.com/jiaaro/pydub/blob/master/API.markdown#audiosegmentoverlay)\n", | |
| "\n", | |
| "Finally, we combine the two channels and export the new stereo .wav!" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 9, | |
| "id": "fac5580a", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "<_io.BufferedRandom name='./new-stereo-file.wav'>" | |
| ] | |
| }, | |
| "execution_count": 9, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "stereo_sound = AudioSegment.from_mono_audiosegments(left, right)\n", | |
| "stereo_sound.export(\"./new-stereo-file.wav\", format=\"wav\")" | |
| ] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python [conda env:pyannote_bis] *", | |
| "language": "python", | |
| "name": "conda-env-pyannote_bis-py" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.8.13" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 5 | |
| } |