{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a0f21cb1-fbc8-4282-b902-f47d92974df8",
   "metadata": {},
   "source": [
    "# Pre-requisites"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f625807-0707-4e2f-a0e0-8fbcdf08c865",
   "metadata": {},
   "source": [
    "## Why TEI\n",
    "There are 2 **unsung** challenges with RAG at scale:\n",
    "1. Getting the embeddings efficiently\n",
    "1. Efficient ingestion into the vector DB\n",
    "\n",
    "The issue with `1.` is that techniques for it exist but are not widely *applied*. TEI addresses several of them:\n",
    "- Token Based Dynamic Batching\n",
    "- Using the latest optimizations (Flash Attention, Candle and cuBLASLt)\n",
    "- Fast loading with safetensors\n",
    "\n",
    "The issue with `2.` is that it takes a bit of planning. We won't go much into that side of things here, though."
   ]
  },
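  {
   "cell_type": "markdown",
   "id": "sketch-token-batching-md",
   "metadata": {},
   "source": [
    "To make *token based dynamic batching* concrete, here is a minimal sketch of the idea. It is not TEI's actual implementation. It greedily groups inputs so that the total token count of each batch stays within a budget. The helper name `batch_by_token_budget` is hypothetical, and token counts are approximated by whitespace-separated words, which is an assumption; a real server would count tokens with the model's tokenizer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-token-batching-code",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def batch_by_token_budget(texts, max_batch_tokens):\n",
    "    # Hypothetical helper: greedily group texts under a per-batch token budget\n",
    "    batches, current, current_tokens = [], [], 0\n",
    "    for text in texts:\n",
    "        n_tokens = len(text.split())  # Assumption: one word is roughly one token\n",
    "        # Start a new batch when adding this text would exceed the budget\n",
    "        if current and current_tokens + n_tokens > max_batch_tokens:\n",
    "            batches.append(current)\n",
    "            current, current_tokens = [], 0\n",
    "        current.append(text)\n",
    "        current_tokens += n_tokens\n",
    "    if current:\n",
    "        batches.append(current)\n",
    "    return batches\n",
    "\n",
    "\n",
    "# Made-up inputs with 2, 3, 6 and 1 words\n",
    "texts = ['a b', 'c d e', 'f f f f f f', 'g']\n",
    "print([len(batch) for batch in batch_by_token_budget(texts, max_batch_tokens=6)])"
   ]
  },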
  {
   "cell_type": "markdown",
   "id": "3102abce-ea42-4da6-8c98-c6dd4edf7f0b",
   "metadata": {},
   "source": [
    "## Start TEI Locally\n",
    "Run [TEI](https://github.com/huggingface/text-embeddings-inference#docker). I have it running in an nvidia-docker container, but you can install it however you like. Note that I ran this in a separate terminal for monitoring and separation.\n",
    "\n",
    "Note that, as written, the command always pulls the latest image. TEI is at a very early stage at the time of writing.\n",
    "\n",
    "I chose [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) based on its STS ar-ar performance on [mteb/leaderboard](https://huggingface.co/spaces/mteb/leaderboard): it's the top performer and half the size of second place! TEI is fast, but a smaller model will also make our lives easier for storage and retrieval.\n",
    "\n",
    "I use `revision=refs/pr/8` because that pull request adds [safetensors](https://github.com/huggingface/safetensors) weights, which TEI requires. Check out the [pull request](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/discussions/8) if you want to use a different embedding model and it doesn't have safetensors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e873652-8257-4aae-92bc-94e1bac54b73",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "# volume=$PWD/tei\n",
    "# model=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2\n",
    "# revision=refs/pr/8\n",
    "# docker run \\\n",
    "#     --gpus all \\\n",
    "#     -p 8080:80 \\\n",
    "#     -v $volume:/data \\\n",
    "#     -v /home/ec2-user/.cache/huggingface/token:/root/.cache/huggingface/token \\\n",
    "#     --pull always \\\n",
    "#     ghcr.io/huggingface/text-embeddings-inference:latest \\\n",
    "#     --model-id $model \\\n",
    "#     --revision $revision \\\n",
    "#     --pooling mean \\\n",
    "#     --max-batch-tokens 65536"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51959ef4-186e-4a32-826a-731813eaf593",
   "metadata": {},
   "source": [
    "### Test Endpoint"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52edfc97-5b6f-44f9-8d89-8578cf79fae9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "# response_code=$(curl -s -o /dev/null -w \"%{http_code}\" 127.0.0.1:8080/embed \\\n",
    "#     -X POST \\\n",
    "#     -d '{\"inputs\":\"What is Deep Learning?\"}' \\\n",
    "#     -H 'Content-Type: application/json')\n",
    "\n",
    "# if [ \"$response_code\" -eq 200 ]; then\n",
    "#     echo \"passed\"\n",
    "# else\n",
    "#     echo \"failed\"\n",
    "# fi"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9d6b54a-02bd-49aa-b180-27a7ab90154e",
   "metadata": {},
   "source": [
    "## Start TEI with Inference Endpoints\n",
    "Another option is to run TEI on Inference Endpoints. It's cheap and fast. It took me less than 5 minutes to get it up and running!\n",
    "\n",
    "Check here for a [guide](https://huggingface.co/blog/inference-endpoints-embeddings#3-deploy-embedding-model-as-inference-endpoint). Make sure to set these options in order:\n",
    "1. Model Repository = sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2\n",
    "1. Name your endpoint\n",
    "1. Choose a GPU\n",
    "1. Advanced Configuration\n",
    "    1. Task = Sentence Embeddings\n",
    "    1. Revision (based on [this pull request for safetensors](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/discussions/8)) = a21e6630\n",
    "    1. Container Type = Text Embeddings Inference\n",
    "    \n",
    "Set the other options as you prefer."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec78c98a-6b7b-4689-8ef8-582c3fcdf66e",
   "metadata": {},
   "source": [
    "### Test Endpoint"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a69e2ee1-67f2-4f0a-b496-02f5415a52ca",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      "What is your BEARER TOKEN? Check your endpoint. ········\n",
      "What is your API_URL? ········\n"
     ]
    }
   ],
   "source": [
    "import getpass\n",
    "bearer_token = getpass.getpass(prompt='What is your BEARER TOKEN? Check your endpoint.')\n",
    "API_URL = getpass.getpass(prompt='What is your API_URL?')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "949d6bf8-804f-496b-a59a-834483cc7073",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Constants\n",
    "HEADERS = {\n",
    "\t\"Authorization\": f\"Bearer {bearer_token}\",\n",
    "\t\"Content-Type\": \"application/json\"\n",
    "}\n",
    "MAX_WORKERS = 512"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "d00b4af1-8fbc-4f7a-8a78-e1c52dd77a66",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.0047912598, -0.03164673, -0.018051147, -0.057739258, -0.04498291]...\n"
     ]
    }
   ],
   "source": [
    "import requests\n",
    "\n",
    "\n",
    "def query(payload):\n",
    "\tresponse = requests.post(API_URL, headers=HEADERS, json=payload)\n",
    "\treturn response.json()\n",
    "\t\n",
    "output = query({\n",
    "\t\"inputs\": \"This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music!\",\n",
    "})\n",
    "print(f'{output[0][:5]}...')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1b28232-b65d-41ce-88de-fd70b93a528d",
   "metadata": {},
   "source": [
    "# Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "abb5186b-ee67-4e1e-882d-3d8d5b4575d4",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import asyncio\n",
    "from pathlib import Path\n",
    "import json\n",
    "import time\n",
    "\n",
    "\n",
    "from aiohttp import ClientSession, ClientTimeout\n",
    "from tqdm.notebook import tqdm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "c4b82ea2-8b30-4c2e-99f0-9a30f2f1bfb7",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/ec2-user/arabic-wiki\n"
     ]
    }
   ],
   "source": [
    "proj_dir = Path.cwd().parent\n",
    "print(proj_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76119e74-f601-436d-a253-63c5a19d1c83",
   "metadata": {},
   "source": [
    "# Config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f6f74545-54a7-4f41-9f02-96964e1417f0",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "files_in = list((proj_dir / 'data/processed/').glob('*.ndjson'))\n",
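    "folder_out = proj_dir / 'data/embedded/'\n",
    "folder_out.mkdir(parents=True, exist_ok=True)  # Make sure the output folder exists before writing"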
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e73235d-6274-4958-9e57-977afeeb5f1b",
   "metadata": {},
   "source": [
    "# Embed\n",
    "## Strategy\n",
    "TEI allows multiple concurrent requests, so it's important that we don't waste that potential. I used the default `max-concurrent-requests` value of `512`, so I want to use that many `MAX_WORKERS`.\n",
    "\n",
    "I'm using an `async` approach to making requests, built on `aiohttp`, along with a nice progress bar."
   ]
  },
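  {
   "cell_type": "markdown",
   "id": "sketch-bounded-concurrency-md",
   "metadata": {},
   "source": [
    "Before pointing this at a real endpoint, the concurrency pattern can be tried in isolation. The following is a toy sketch, assuming a fake request that just sleeps: the names `fake_request` and `demo` are hypothetical, and the small `max_workers=4` is only for illustration. It shows that a `BoundedSemaphore` caps how many tasks are in flight at once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-bounded-concurrency-code",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import asyncio\n",
    "\n",
    "\n",
    "async def fake_request(i, semaphore, state):\n",
    "    # Stand-in for an HTTP call; tracks how many tasks hold the semaphore at once\n",
    "    async with semaphore:\n",
    "        state['active'] += 1\n",
    "        state['peak'] = max(state['peak'], state['active'])\n",
    "        await asyncio.sleep(0.01)\n",
    "        state['active'] -= 1\n",
    "    return i\n",
    "\n",
    "\n",
    "async def demo(n_tasks=20, max_workers=4):\n",
    "    semaphore = asyncio.BoundedSemaphore(max_workers)\n",
    "    state = {'active': 0, 'peak': 0}\n",
    "    results = await asyncio.gather(*(fake_request(i, semaphore, state) for i in range(n_tasks)))\n",
    "    return len(results), state['peak']\n",
    "\n",
    "\n",
    "# Top-level await works in Jupyter; use asyncio.run(demo()) in a plain script\n",
    "n_done, peak = await demo()\n",
    "print(f'completed={n_done}, peak concurrency={peak}')"
   ]
  },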
  {
   "cell_type": "markdown",
   "id": "cf3da8cc-1651-4704-9091-39c2a1b835be",
   "metadata": {},
   "source": [
    "Note that I'm using `'truncate': True` because, even with our `350`-word split earlier, there are always exceptions. It's important that, as this scales, we have as few issues as possible when embedding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "e455dd52-aad3-4313-8738-03141ee5152a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "async def request(document, semaphore):\n",
    "    # Semaphore guard\n",
    "    async with semaphore:\n",
    "        payload = {\n",
    "            \"inputs\": document['content'],\n",
    "            \"truncate\": True\n",
    "        }\n",
    "        \n",
    "        timeout = ClientTimeout(total=10)  # Set a timeout for requests (10 seconds here)\n",
    "\n",
    "        async with ClientSession(timeout=timeout, headers=HEADERS) as session:\n",
    "            async with session.post(API_URL, json=payload) as resp:\n",
    "                if resp.status != 200:\n",
    "                    raise RuntimeError(await resp.text())\n",
    "                result = await resp.json()\n",
    "                \n",
    "        document['embedding'] = result[0]  # Assuming the API's output can be directly assigned\n",
    "        return document\n",
    "\n",
    "async def main(documents):\n",
    "    # Semaphore to limit concurrent requests. Adjust the number as needed.\n",
    "    semaphore = asyncio.BoundedSemaphore(MAX_WORKERS)\n",
    "\n",
    "    # Creating a list of tasks\n",
    "    tasks = [request(document, semaphore) for document in documents]\n",
    "    \n",
    "    # Using tqdm to show progress. It's been integrated into the async loop.\n",
    "    for f in tqdm(asyncio.as_completed(tasks), total=len(documents)):\n",
    "        await f\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "f0d17264-72dc-40be-aa46-17cde38c8189",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "1db8949409284a7cbeec2638ed197f59",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "0it [00:00, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "5945500ccf8649988918e2633269cb7b",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/243068 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 1: Embeddings = 243068 documents = 243068\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "0cf8121a116f49fba72095fee46ef49d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/104065 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 2: Embeddings = 104065 documents = 104065\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "8f94983077854b5f9ab512f7d429eb55",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/123154 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 3: Embeddings = 123154 documents = 123154\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "3d2932212e6b4323a377ff23758e7af7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/135965 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 4: Embeddings = 135965 documents = 135965\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "3de41d88c8bb439591925de045d8afe8",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/99138 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 5: Embeddings = 99138 documents = 99138\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "604a4f3b1baf429687ac00aa63778cdf",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/83678 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 6: Embeddings = 83678 documents = 83678\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ecc69a2b763c4296b3a1fa35b15477aa",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/30573 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 7: Embeddings = 30573 documents = 30573\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "bbdc03d2ca5a4099b412c1767b3d394c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/78957 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 8: Embeddings = 78957 documents = 78957\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d4c64dfc612c4b7986de5385f5d88ba7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/86327 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 9: Embeddings = 86327 documents = 86327\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d05e2884021143e0baf595a86725466a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/83111 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 10: Embeddings = 83111 documents = 83111\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "47f3537175d740aba9e2dc7de4c89fec",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/92664 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 11: Embeddings = 92664 documents = 92664\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "02b89207ee8c407db4ad8045b3634243",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/66404 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 12: Embeddings = 66404 documents = 66404\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e1a96dfd66644007a39a5fef38e008ab",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/62844 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 13: Embeddings = 62844 documents = 62844\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "3bef95b8dff044fa922e52e1e88b9813",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/59349 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 14: Embeddings = 59349 documents = 59349\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "eb74103e549c4b0386d19f5d475af812",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/52554 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 15: Embeddings = 52554 documents = 52554\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "41228a5bf8294c1e95f34b9376714543",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/34240 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 16: Embeddings = 34240 documents = 34240\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ba6eb3a975514d33ae3acac8859278d1",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/35933 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 17: Embeddings = 35933 documents = 35933\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "67ecca34d2a1414c9e9817d835fe2083",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/64575 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 18: Embeddings = 64575 documents = 64575\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "7fb632776adb4933b92c48f852b0ae6b",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/94244 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 19: Embeddings = 94244 documents = 94244\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "17d73c88d4334357854f852c9783bfdb",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/124472 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 20: Embeddings = 124472 documents = 124472\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "4dd7c3477a244d43b1d85417d4549eaa",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/121849 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 21: Embeddings = 121849 documents = 121849\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e8b657d57f584128ae5a7ee2ecf23c7f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/147110 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 22: Embeddings = 147110 documents = 147110\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f3465378528a425e8dc9d040a003588a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/70322 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch 23: Embeddings = 70322 documents = 70322\n",
      "6250.827601939993\n"
     ]
    }
   ],
   "source": [
    "start = time.perf_counter()\n",
    "for i, file_in in tqdm(enumerate(files_in)):\n",
    "\n",
    "    with open(file_in, 'r', encoding='utf-8') as f:\n",
    "        documents = [json.loads(line) for line in f]\n",
    "        \n",
    "    # Get embeddings\n",
    "    await main(documents)\n",
    "        \n",
    "    # Make sure we got it all\n",
    "    count = 0\n",
    "    for document in documents:\n",
    "        if document['embedding'] and len(document['embedding']) == 384:\n",
    "            count += 1\n",
    "    print(f'Batch {i+1}: Embeddings = {count} documents = {len(documents)}')\n",
    "\n",
    "    # Write to file\n",
    "    with open(folder_out/file_in.name, 'w', encoding='utf-8') as f:\n",
    "        for document in documents:\n",
    "            json_str = json.dumps(document, ensure_ascii=False)\n",
    "            f.write(json_str + '\\n')\n",
    "print(time.perf_counter() - start)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cdee2b1c-0493-4b3e-8ecb-9d79109c756e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "documents[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f90a0ed7-b5e9-4ae4-9e87-4c04875ebcc9",
   "metadata": {},
   "source": [
    "Let's double-check that we got all the embeddings we expected!"
   ]
  },
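  {
   "cell_type": "markdown",
   "id": "sketch-verify-embedded-md",
   "metadata": {},
   "source": [
    "A minimal sketch of such a check, assuming the embedded files were written to `folder_out` as NDJSON with one document per line and that each embedding has `384` dimensions, as in the loop above. The helper name `count_embeddings` is hypothetical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-verify-embedded-code",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import json\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def count_embeddings(path, dim=384):\n",
    "    # Return (number of documents, number of embeddings with the expected size)\n",
    "    n_docs, n_valid = 0, 0\n",
    "    with open(path, 'r', encoding='utf-8') as f:\n",
    "        for line in f:\n",
    "            if not line.strip():\n",
    "                continue  # Skip blank lines\n",
    "            document = json.loads(line)\n",
    "            n_docs += 1\n",
    "            embedding = document.get('embedding')\n",
    "            if embedding and len(embedding) == dim:\n",
    "                n_valid += 1\n",
    "    return n_docs, n_valid\n",
    "\n",
    "\n",
    "# Assumption: folder_out is the output folder defined in the Config section\n",
    "for file_out in sorted(Path(folder_out).glob('*.ndjson')):\n",
    "    n_docs, n_valid = count_embeddings(file_out)\n",
    "    print(f'{file_out.name}: Embeddings = {n_valid} documents = {n_docs}')"
   ]
  },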
  {
   "cell_type": "markdown",
   "id": "5b78bfa4-d365-4906-a71c-f444eabf6bf8",
   "metadata": {
    "tags": []
   },
   "source": [
    "Great, we can see that they match.\n",
    "\n",
    "The embedding loop above has already written our embeddings to file, one NDJSON file per batch."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc1e7cc5-b878-42bb-9fb4-e810f3f5006a",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Next Steps\n",
    "We need to import these embeddings into a vector DB."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}