@AstraBert
Created December 5, 2024 02:23
SenTrEv_Practical_Showcase.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"gpuType": "T4",
"authorship_tag": "ABX9TyOGsy5L0AvxdyBR+IifDWUJ",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/AstraBert/d4c32e9c6e45986700eb38a70268e435/sentrev_practical_showcase.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"## What is SenTrEv?\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://raw.githubusercontent.com/AstraBert/SenTrEv/main/logo.png\" alt=\"SenTrEv Logo\">\n",
"</div>\n",
"\n",
"**SenTrEv** (**Sen**tence **Tr**ansformers **Ev**aluator) is a python package that is aimed at running simple evaluation tests to help you choose the best embedding model for Retrieval Augmented Generation (RAG) with your PDF documents."
],
"metadata": {
"id": "04pjNncbFpwI"
}
},
{
"cell_type": "markdown",
"source": [
"## Applicability\n",
"\n",
"- Text encoders/embedders loaded through the class `SentenceTransformer` in the python package [`sentence_transformers`](https://sbert.net/)\n",
"- PDF documents (single and multiple uploads supported)\n",
"- [Qdrant](https://qdrant.tech) vector databases (both local and on cloud)"
],
"metadata": {
"id": "Mcpmdg0FGxJV"
}
},
{
"cell_type": "markdown",
"source": [
"## Installation\n",
"\n",
"You can install the package using `pip`:"
],
"metadata": {
"id": "UPFLES7nG9hp"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "knuoPKVdFlW9"
},
"outputs": [],
"source": [
"! python3 -m pip install --quiet sentrev"
]
},
{
"cell_type": "markdown",
"source": [
"## Evaluation workflow\n",
"\n",
"1. The PDFs text is extracted via a [PyPDF](https://pypdf.readthedocs.io/en/stable/) wrapper offered by [LangChain](https://langchain.com).\n",
"2. The extracted text is chunked according to an (optionally) user-specified chunking size (default is 1000)\n",
"3. After the PDF text extraction and chunking phase, the chunks are reduced according to an (optionally) user-defined percentage (default is 25%), which is randomly extracted at any point of each chunk.\n",
"4. The reduced chunks are mapped to their original ones in a dictionary\n",
"5. Each model encodes the original chunks and uploads the vectors to the Qdrant vector storage\n",
"6. The reduced chunks are then used as queries for retrieval\n",
"7. Starting from retrieval results, accuracy, time and carbon emissions statistics are calculated and plotted.\n",
"\n",
"![workflow](https://raw.githubusercontent.com/AstraBert/SenTrEv-case-study/main/imgs/SenTrEv_Eval_Workflow.png)\n",
"\n",
"## Evaluation metrics\n",
"\n",
"- **Success rate**: defined as the number retrieval operation in which the correct context was retrieved ranking top among all the retrieved contexts, out of the total retrieval operations:\n",
"\n",
" $SR = \\frac{Ncorrect}{Ntot}$\n",
"\n",
"- **Mean Reciprocal Ranking (MRR)**: MRR defines how high in ranking the correct context is placed among the retrieved results. MRR@10 will be used, meaning that for each retrieval operation 10 items are returned and an evaluation is carried out for the ranking of the correct context, which will be then normalized between 0 and 1 (already implemented in SenTrEv). An MRR of 1 means that the correct context was ranked first, whereas an MRR of 0 means that it wasn't retrieved. MRR is calculated with the following general equation:\n",
"\n",
" $MRR = 1 - \\frac{ranking - 1}{Nretrieved}$\n",
"\n",
" > _Keep in mind that **ranking is 1-based** (meaning that the first position is 1, the second is 2 and so on...)_\n",
"\n",
" When the correct context is not retrieved, MRR is automatically set to 0. MRR is calculated for each retrieval operation, then the average and standard deviation are calculated and reported.\n",
"- **Time performance**: for each retrieval operation the time performance in seconds is calculated: the average and standard deviation are then reported.\n",
"- **Carbon emissions**: Carbon emissions are calculated in gCO2eq (grams of CO2 equivalent) through the Python library [`codecarbon`](https://codecarbon.io/) and were evaluated for the Austrian region. They are reported for the global computational load of all the retrieval operations."
],
"metadata": {
"id": "9XQ10f61HXpz"
}
},
{
"cell_type": "markdown",
"source": [
"## Let's get our hands dirty with some code!🚀"
],
"metadata": {
"id": "BSmdRj89Jh-W"
}
},
{
"cell_type": "code",
"source": [
"# IMPORT NECESSARY LIBRARIES\n",
"from sentence_transformers import SentenceTransformer\n",
"from qdrant_client import QdrantClient\n",
"from sentrev.evaluator import evaluate_rag"
],
"metadata": {
"id": "lsvFvDWeJgyR"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# CREATE QDRANT CLIENT\n",
"\n",
"## Local option (uncomment if you are running Qdrant locally)\n",
"## client = QdrantClient(\"http://localhost:6333\")\n",
"\n",
"## On-cloud option:\n",
"from google.colab import userdata\n",
"qdrant_url = userdata.get(\"iuss_QDRANT_URL\")\n",
"qdrant_api = userdata.get(\"iuss_QDRANT_API_KEY\")\n",
"client = QdrantClient(url=qdrant_url, api_key=qdrant_api)"
],
"metadata": {
"id": "ECyGL4bwJV3b"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# LOAD SENTENCE-TRANSFORMERS TEXT ENCODERS\n",
"\n",
"encoder1 = SentenceTransformer('sentence-transformers/all-mpnet-base-v2', device=\"cuda\")\n",
"encoder2 = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2', device=\"cuda\")\n",
"encoder3 = SentenceTransformer('sentence-transformers/LaBSE', device=\"cuda\")\n",
"\n",
"## Create a list of the encoders\n",
"encoders = [encoder1, encoder2, encoder3]\n",
"\n",
"## Create a dictionary that maps each encoder to its name\n",
"\n",
"encoder_to_names = {\n",
" encoder1: 'all-mpnet-base-v2',\n",
" encoder2: 'all-MiniLM-L12-v2',\n",
" encoder3: 'LaBSE',\n",
"}"
],
"metadata": {
"id": "gAM5I3rHLtDY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"### WHAT ARE THESE MODELS?\n",
"\n",
"| Model | Base Model | Number of Parameters | Reference |\n",
"| ----------------- | ------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------- |\n",
"| all-MiniLM-L12-v2 | MiniLM-L12-H384-uncased by Microsoft | 1B | **HuggingFace**: https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2; **Paper**: Wang et al., 2020 |\n",
"| all-mpnet-base-v2 | mpnet-base by Microsoft | 1B | **HuggingFace**: https://huggingface.co/sentence-transformers/all-mpnet-base-v2; **Paper**: Song et al., 2020 |\n",
"| LaBSE | LaBSE by Google | 17B English sentence pairs + 6B Multi-Lingual Sentence Pairs | **HuggingFace**: https://huggingface.co/sentence-transformers/LaBSE; **Paper**: Feng et al., 2020 |"
],
"metadata": {
"id": "PiSdWNdYM0Z9"
}
},
{
"cell_type": "code",
"source": [
"# GET THE DATA\n",
"! wget https://raw.githubusercontent.com/AstraBert/SenTrEv-case-study/main/data/attention_is_all_you_need.pdf\n",
"! wget https://raw.githubusercontent.com/AstraBert/SenTrEv-case-study/main/data/generative_adversaria_nets.pdf\n",
"! mkdir data\n",
"! mv *.pdf data"
],
"metadata": {
"id": "cS40VlqGM-EM"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# LIST YOUR PDFs\n",
"pdfs = [\"data/attention_is_all_you_need.pdf\", \"data/generative_adversaria_nets.pdf\"]"
],
"metadata": {
"id": "AaPVRrkQNx8U"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# KILL RUNNING CODECARBON INSTANCES TO AVOID ERRORS\n",
"\n",
"! rm -rf /tmp/.codecarbon.lock"
],
"metadata": {
"id": "JFVliUvHOjsi"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# EVALUATE RETRIEVAL\n",
"\n",
"evaluate_rag(pdfs, encoders, encoder_to_names, client, csv_path = \"output.csv\", chunking_size = 1500, text_percentage=0.6, distance=\"euclid\", mrr=10, carbon_tracking=\"AUT\", plot=True)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "-iR7qEDKOGHU",
"outputId": "e74517fd-b531-43ce-b251-8b988b755075"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"/usr/local/lib/python3.10/dist-packages/codecarbon/output_methods/file.py:52: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.\n",
" df = pd.concat([df, pd.DataFrame.from_records([dict(total.values)])])\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1000x500 with 1 Axes>"
],
"image/png": "\n"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1000x500 with 1 Axes>"
],
"image/png": "\n"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1000x500 with 1 Axes>"
],
"image/png": "\n"
},
"metadata": {}
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1000x500 with 1 Axes>"
],
"image/png": "\n"
},
"metadata": {}
}
]
},
{
"cell_type": "code",
"source": [
"import pandas\n",
"\n",
"df = pandas.read_csv(\"output.csv\")\n",
"df"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 143
},
"id": "s1oMfJTISTiu",
"outputId": "2874dcb6-bf82-4ed2-ae48-6a159b452454"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" encoder average_time stdev_time success_rate average_mrr \\\n",
"0 all-mpnet-base-v2 0.088780 0.008858 0.563910 0.863963 \n",
"1 all-MiniLM-L12-v2 0.084532 0.006249 0.618779 0.899888 \n",
"2 LaBSE 0.084858 0.007063 0.788931 0.922809 \n",
"\n",
" stdev_mrr carbon_emissions(g_CO2eq) \n",
"0 0.247843 0.300420 \n",
"1 0.204298 0.573104 \n",
"2 0.203723 0.858473 "
],
"text/html": [
"\n",
" <div id=\"df-ad879b3b-a134-457a-a0c7-7ee8d7be502f\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>encoder</th>\n",
" <th>average_time</th>\n",
" <th>stdev_time</th>\n",
" <th>success_rate</th>\n",
" <th>average_mrr</th>\n",
" <th>stdev_mrr</th>\n",
" <th>carbon_emissions(g_CO2eq)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>all-mpnet-base-v2</td>\n",
" <td>0.088780</td>\n",
" <td>0.008858</td>\n",
" <td>0.563910</td>\n",
" <td>0.863963</td>\n",
" <td>0.247843</td>\n",
" <td>0.300420</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>all-MiniLM-L12-v2</td>\n",
" <td>0.084532</td>\n",
" <td>0.006249</td>\n",
" <td>0.618779</td>\n",
" <td>0.899888</td>\n",
" <td>0.204298</td>\n",
" <td>0.573104</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>LaBSE</td>\n",
" <td>0.084858</td>\n",
" <td>0.007063</td>\n",
" <td>0.788931</td>\n",
" <td>0.922809</td>\n",
" <td>0.203723</td>\n",
" <td>0.858473</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-ad879b3b-a134-457a-a0c7-7ee8d7be502f')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-ad879b3b-a134-457a-a0c7-7ee8d7be502f button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-ad879b3b-a134-457a-a0c7-7ee8d7be502f');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-15a429d4-99ce-4a6b-8bb6-524d449b5228\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-15a429d4-99ce-4a6b-8bb6-524d449b5228')\"\n",
" title=\"Suggest charts\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-15a429d4-99ce-4a6b-8bb6-524d449b5228 button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" <div id=\"id_07774427-35a9-45fa-9ec5-f9aa1833d372\">\n",
" <style>\n",
" .colab-df-generate {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-generate:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-generate {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-generate:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
" <button class=\"colab-df-generate\" onclick=\"generateWithVariable('df')\"\n",
" title=\"Generate code using this dataframe.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M7,19H8.4L18.45,9,17,7.55,7,17.6ZM5,21V16.75L18.45,3.32a2,2,0,0,1,2.83,0l1.4,1.43a1.91,1.91,0,0,1,.58,1.4,1.91,1.91,0,0,1-.58,1.4L9.25,21ZM18.45,9,17,7.55Zm-12,3A5.31,5.31,0,0,0,4.9,8.1,5.31,5.31,0,0,0,1,6.5,5.31,5.31,0,0,0,4.9,4.9,5.31,5.31,0,0,0,6.5,1,5.31,5.31,0,0,0,8.1,4.9,5.31,5.31,0,0,0,12,6.5,5.46,5.46,0,0,0,6.5,12Z\"/>\n",
" </svg>\n",
" </button>\n",
" <script>\n",
" (() => {\n",
" const buttonEl =\n",
" document.querySelector('#id_07774427-35a9-45fa-9ec5-f9aa1833d372 button.colab-df-generate');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" buttonEl.onclick = () => {\n",
" google.colab.notebook.generateWithVariable('df');\n",
" }\n",
" })();\n",
" </script>\n",
" </div>\n",
"\n",
" </div>\n",
" </div>\n"
]
},
"metadata": {},
"execution_count": 9
}
]
},
{
"cell_type": "markdown",
"source": [
"## Other use cases for SenTrEv\n",
"\n",
"### 1. Upload PDFs to Qdrant\n",
"\n",
"You can use SenTrEv to chunk, vectorize and upload your PDFs to a Qdrant database.\n",
"\n",
"You can also play around with the `chunking_size` argument (default is 1000) and with the `distance` argument (default is `cosine`).\n"
],
"metadata": {
"id": "HThUNcMGSd7N"
}
},
{
"cell_type": "code",
"source": [
"from sentrev.evaluator import upload_pdfs\n",
"\n",
"pdfs.reverse()\n",
"\n",
"data, collection_name = upload_pdfs(pdfs=pdfs, encoder=encoder1, client=client)"
],
"metadata": {
"id": "ehL4IlecS1Ox"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"#### 2. Implement semantic search on a Qdrant collection\n",
"\n",
"You can also search already-existent collections in a Qdrant database with SenTrEv.\n",
"\n",
"The results will be returned as a list of payloads (the metadata you uploaded to the Qdrant collection along with the vector points).\n",
"\n",
"If you used SenTrEv `upload_pdfs` function, you should be able to access the results in this way:\n",
"\n",
"```python\n",
"text = res[0][\"text\"]\n",
"source = res[0][\"source\"]\n",
"page = res[0][\"page\"]\n",
"```\n"
],
"metadata": {
"id": "c20ATJ8JUNF1"
}
},
{
"cell_type": "code",
"source": [
"from sentrev.utils import NeuralSearcher\n",
"text = \"\"\" Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two\n",
"sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network\"\"\"\n",
"searcher = NeuralSearcher(client=client, model=encoder1, collection_name=collection_name)\n",
"res = searcher.search(text, limit=5)"
],
"metadata": {
"id": "n43He0u5ULQo"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"text = res[0][\"text\"]\n",
"source = res[0][\"source\"]\n",
"page = res[0][\"page\"]"
],
"metadata": {
"id": "T1VCARZQU8Lv"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"print(\"TEXT: \"+text+\"\\nSOURCE: \"+source+\"\\nPAGE: \"+page)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "_B3uvhFiVDXB",
"outputId": "a8491b86-af0f-485c-f2ec-bdb6526ebceb"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"TEXT: Encoder: The encoder is composed of a stack of N= 6 identical layers. Each layer has two\n",
"SOURCE: data/generative_adversaria_nets_results.pdf\n",
"PAGE: 2\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"### And that's all: Thanks for your attention!😊"
],
"metadata": {
"id": "mpEm2aitTXXb"
}
}
]
}
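The success rate and MRR definitions from the "Evaluation metrics" section can be sketched in a few lines of plain Python. This is only an illustration of the formulas given there, not SenTrEv's actual implementation: the names `rank_score` and `summarize` are hypothetical, a missing result is represented here as `None`, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def rank_score(rank, n_retrieved=10):
    """Score one retrieval operation with MRR = 1 - (ranking - 1) / N_retrieved.

    `rank` is the 1-based position of the correct context among the retrieved
    items, or None when it was not retrieved (the score is then 0).
    """
    if rank is None:
        return 0.0
    return 1 - (rank - 1) / n_retrieved


def summarize(ranks, n_retrieved=10):
    """Aggregate per-operation ranks into success rate and MRR statistics."""
    scores = [rank_score(r, n_retrieved) for r in ranks]
    # Success rate: share of operations where the correct context ranked first.
    success_rate = sum(1 for r in ranks if r == 1) / len(ranks)
    return {
        "success_rate": success_rate,
        "average_mrr": mean(scores),
        # Assumption: sample standard deviation; the package may differ.
        "stdev_mrr": stdev(scores) if len(scores) > 1 else 0.0,
    }


# Hypothetical ranks for five retrieval operations (None = not retrieved).
ranks = [1, 1, 3, None, 2]
summary = summarize(ranks)
print(summary)
```

With these made-up ranks, two of the five operations place the correct context first, so the success rate is 0.4, while the per-operation scores 1, 1, 0.8, 0 and 0.9 average to 0.74. The output columns `success_rate`, `average_mrr` and `stdev_mrr` in `output.csv` can be read the same way.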