{ "cells": [ { "cell_type": "markdown", "id": "a0f21cb1-fbc8-4282-b902-f47d92974df8", "metadata": {}, "source": [ "# Pre-requisites" ] }, { "cell_type": "markdown", "id": "5f625807-0707-4e2f-a0e0-8fbcdf08c865", "metadata": {}, "source": [ "## Why TEI\n", "There are 2 **unsung** challenges with RAG at scale:\n", "1. Getting the embeddings efficiently\n", "1. Efficient ingestion into the vector DB\n", "\n", "The issue with `1.` is that there are techniques, but they are not widely *applied*. TEI addresses several aspects:\n", "- Token-Based Dynamic Batching\n", "- Using the latest optimizations (Flash Attention, Candle and cuBLASLt)\n", "- Fast loading with safetensors\n", "\n", "The issue with `2.` is that it takes a bit of planning. We won't go much into that side of things here, though." ] }, { "cell_type": "markdown", "id": "3102abce-ea42-4da6-8c98-c6dd4edf7f0b", "metadata": {}, "source": [ "## Start TEI Locally\n", "Run [TEI](https://github.com/huggingface/text-embeddings-inference#docker). I have this running in an nvidia-docker container, but you can install it as you like. Note that I ran this in a different terminal for monitoring and separation.\n", "\n", "Note that with `--pull always`, the command always pulls the latest image. TEI is at a very early stage at the time of writing.\n", "\n", "I chose [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) based on the STS ar-ar performance on [mteb/leaderboard](https://huggingface.co/spaces/mteb/leaderboard): it's the top performer and half the size of second place! TEI is fast, but this will make our life easier for storage and retrieval.\n", "\n", "I use the `revision=refs/pr/8` because this has the pull request with [safetensors](https://github.com/huggingface/safetensors), which is required by TEI. 
Check out the [pull request](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/discussions/8) if you want to use a different embedding model and it doesn't have safetensors." ] }, { "cell_type": "code", "execution_count": 1, "id": "7e873652-8257-4aae-92bc-94e1bac54b73", "metadata": { "tags": [] }, "outputs": [], "source": [ "%%bash\n", "\n", "# volume=$PWD/tei\n", "# model=sentence-transformers/paraphrase-multilingual-minilm-l12-v2\n", "# revision=refs/pr/8\n", "# docker run \\\n", "# --gpus all \\\n", "# -p 8080:80 \\\n", "# -v $volume:/data \\\n", "# -v /home/ec2-user/.cache/huggingface/token:/root/.cache/huggingface/token \\\n", "# --pull always \\\n", "# ghcr.io/huggingface/text-embeddings-inference:latest \\\n", "# --model-id $model \\\n", "# --revision $revision \\\n", "# --pooling mean \\\n", "# --max-batch-tokens 65536" ] }, { "cell_type": "markdown", "id": "51959ef4-186e-4a32-826a-731813eaf593", "metadata": {}, "source": [ "### Test Endpoint" ] }, { "cell_type": "code", "execution_count": 2, "id": "52edfc97-5b6f-44f9-8d89-8578cf79fae9", "metadata": { "tags": [] }, "outputs": [], "source": [ "%%bash\n", "\n", "# response_code=$(curl -s -o /dev/null -w \"%{http_code}\" 127.0.0.1:8080/embed \\\n", "# -X POST \\\n", "# -d '{\"inputs\":\"What is Deep Learning?\"}' \\\n", "# -H 'Content-Type: application/json')\n", "\n", "# if [ \"$response_code\" -eq 200 ]; then\n", "# echo \"passed\"\n", "# else\n", "# echo \"failed\"\n", "# fi" ] }, { "cell_type": "markdown", "id": "e9d6b54a-02bd-49aa-b180-27a7ab90154e", "metadata": {}, "source": [ "## Start TEI with Inference Endpoints\n", "Another option is to run TEI on [Inference Endpoints](https://huggingface.co/inference-endpoints). It's cheap and fast. It took me less than 5 minutes to get it up and running!\n", "\n", "Check here for a [comprehensive guide](https://huggingface.co/blog/inference-endpoints-embeddings#3-deploy-embedding-model-as-inference-endpoint). 
Make sure to set these options **IN ORDER**:\n", "1. Model Repository = `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`\n", "1. Name your endpoint\n", "1. Choose a GPU. I chose `Nvidia A10G`, which is **$1.3/hr**.\n", "1. Advanced Configuration\n", "    1. Task = `Sentence Embeddings`\n", "    1. Revision (based on [this pull request for safetensors](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/discussions/8)) = `a21e6630`\n", "    1. Container Type = `Text Embeddings Inference`\n", " \n", "Set the other options as you prefer." ] }, { "cell_type": "markdown", "id": "ec78c98a-6b7b-4689-8ef8-582c3fcdf66e", "metadata": {}, "source": [ "### Test Endpoint" ] }, { "cell_type": "code", "execution_count": 3, "id": "a69e2ee1-67f2-4f0a-b496-02f5415a52ca", "metadata": { "tags": [] }, "outputs": [ { "name": "stdin", "output_type": "stream", "text": [ "What is your API_URL? ········\n", "What is your BEARER TOKEN? Check your endpoint. ········\n" ] } ], "source": [ "import getpass\n", "API_URL = getpass.getpass(prompt='What is your API_URL?')\n", "bearer_token = getpass.getpass(prompt='What is your BEARER TOKEN? 
Check your endpoint.')" ] }, { "cell_type": "code", "execution_count": 4, "id": "949d6bf8-804f-496b-a59a-834483cc7073", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Constants\n", "HEADERS = {\n", "\t\"Authorization\": f\"Bearer {bearer_token}\",\n", "\t\"Content-Type\": \"application/json\"\n", "}\n", "MAX_WORKERS = 512" ] }, { "cell_type": "code", "execution_count": 5, "id": "d00b4af1-8fbc-4f7a-8a78-e1c52dd77a66", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.0047912598, -0.03164673, -0.018051147, -0.057739258, -0.04498291]...\n" ] } ], "source": [ "import requests\n", "\n", "\n", "def query(payload):\n", "\tresponse = requests.post(API_URL, headers=HEADERS, json=payload)\n", "\treturn response.json()\n", "\t\n", "output = query({\n", "\t\"inputs\": \"This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music!\",\n", "})\n", "print(f'{output[0][:5]}...')" ] }, { "cell_type": "markdown", "id": "b1b28232-b65d-41ce-88de-fd70b93a528d", "metadata": {}, "source": [ "# Imports" ] }, { "cell_type": "code", "execution_count": 6, "id": "abb5186b-ee67-4e1e-882d-3d8d5b4575d4", "metadata": { "tags": [] }, "outputs": [], "source": [ "import asyncio\n", "from pathlib import Path\n", "import json\n", "import time\n", "\n", "\n", "from aiohttp import ClientSession, ClientTimeout\n", "from tqdm.notebook import tqdm" ] }, { "cell_type": "code", "execution_count": 7, "id": "c4b82ea2-8b30-4c2e-99f0-9a30f2f1bfb7", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/ec2-user/arabic-wiki\n" ] } ], "source": [ "proj_dir = Path.cwd().parent\n", "print(proj_dir)" ] }, { "cell_type": "markdown", "id": "76119e74-f601-436d-a253-63c5a19d1c83", "metadata": {}, "source": [ "# Config" ] }, { "cell_type": "code", "execution_count": 8, "id": "f6f74545-54a7-4f41-9f02-96964e1417f0", "metadata": { "tags": [] }, 
"outputs": [], "source": [ "files_in = list((proj_dir / 'data/processed/').glob('*.ndjson'))\n", "folder_out = proj_dir / 'data/embedded/'\n", "folder_out_str = str(folder_out)" ] }, { "cell_type": "markdown", "id": "5e73235d-6274-4958-9e57-977afeeb5f1b", "metadata": {}, "source": [ "# Embed\n", "## Strategy\n", "TEI allows multiple concurrent requests, so it's important that we don't waste the potential we have. I used the default `max-concurrent-requests` value of `512`, so I want to use that many `MAX_WORKERS`.\n", "\n", "I'm using an `async` way of making requests that uses `aiohttp` as well as a nice progress bar." ] }, { "cell_type": "markdown", "id": "cf3da8cc-1651-4704-9091-39c2a1b835be", "metadata": {}, "source": [ "Note that I'm using `'truncate':True` as even with our `350` word split earlier, there are always exceptions. It's important that as this scales we have as few issues as possible when embedding." ] }, { "cell_type": "code", "execution_count": 9, "id": "e455dd52-aad3-4313-8738-03141ee5152a", "metadata": { "tags": [] }, "outputs": [], "source": [ "async def request(document, semaphore):\n", "    # Semaphore guard\n", "    async with semaphore:\n", "        payload = {\n", "            \"inputs\": document['content'],\n", "            \"truncate\": True\n", "        }\n", "\n", "        timeout = ClientTimeout(total=10)  # Set a timeout for requests (10 seconds here)\n", "\n", "        async with ClientSession(timeout=timeout, headers=HEADERS) as session:\n", "            async with session.post(API_URL, json=payload) as resp:\n", "                if resp.status != 200:\n", "                    raise RuntimeError(await resp.text())\n", "                result = await resp.json()\n", "\n", "        document['embedding'] = result[0]  # Assuming the API's output can be directly assigned\n", "        return document\n", "\n", "async def main(documents):\n", "    # Semaphore to limit concurrent requests. 
Adjust the number as needed.\n", "    semaphore = asyncio.BoundedSemaphore(MAX_WORKERS)\n", "\n", "    # Creating a list of tasks\n", "    tasks = [request(document, semaphore) for document in documents]\n", "\n", "    # Using tqdm to show progress. It's been integrated into the async loop.\n", "    for f in tqdm(asyncio.as_completed(tasks), total=len(documents)):\n", "        await f\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "f0d17264-72dc-40be-aa46-17cde38c8189", "metadata": { "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c4b7384336ad4c39a417a54a5a00a4ad", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "0b034dc636df440594550f56dc152c8b", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/243068 [00:00