Commit 867c18d
1 Parent(s): c606718

Adding more info around LanceDB

notebooks/05_vector_db.ipynb  +40 -14  CHANGED
@@ -6,9 +6,25 @@
    "metadata": {},
    "source": [
     "# Approach\n",
+    "## VectorDB\n",
     "There are a number of aspects of choosing a vector db that might be unique to your situation. You should think through your HW, utilization, latency requirements, scale, etc. before choosing. \n",
     "\n",
-    "
+    "I've been hearing a lot about LanceDB and wanted to check it out. It's newer and may or may not be good for **your** use-case. I'm attracted by its fast ingestion, cuda-assisted indexing, and portability. It has some drawbacks: it doesn't support HNSW yet, and it could change significantly given how early it is.\n",
+    "\n",
+    "\n",
+    "You will be blown away by how fast ingestion + indexing is with LanceDB. \n",
+    "\n",
+    "## Ingestion Strategy\n",
+    "I used the ~100k-document `.ndjson` files in sequence to upload. After uploading I index.\n",
+    "\n",
+    "## Indexing\n",
+    "The algorithm used is `IVF_PQ`. I ignore the `PQ` part because I want better recall. Recall is important: since Jais only has a 2k context window, I can't put my top 10 documents for RAG in my prompt; it will be my top 3 (512\*3 + query + instructions ~ 2k). For many use-cases it's worth the trade-off, as you get much faster retrieval with not much performance loss. \n",
+    "\n",
+    "More partitions means faster retrieval but slower indexing. I chose 384 sub_vectors to be equal to my embedding dimension size. \n",
+    "\n",
+    "```tbl.create_index(num_partitions=1024, num_sub_vectors=384, accelerator=\"cuda\")```\n",
+    "\n",
+    "Read more about it [here](https://lancedb.github.io/lancedb/ann_indexes/)."
    ]
   },
   {
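As a rough sketch of the indexing step the new cell describes, assuming `tbl` is an already-populated LanceDB table; the parameter values are the ones quoted above, while `query_vector` and the `nprobes` value are illustrative (the query-time knobs are documented on the linked ANN indexes page):

```python
# Index build as quoted in the cell above; `tbl` is assumed to be a populated
# LanceDB table. More num_partitions -> faster retrieval but slower indexing;
# num_sub_vectors=384 matches the embedding dimension.
tbl.create_index(
    num_partitions=1024,
    num_sub_vectors=384,
    accelerator="cuda",  # GPU-assisted IVF_PQ build
)

# At query time, recall vs. speed can be tuned with nprobes. limit(3) keeps the
# top 3 passages so the RAG prompt fits Jais' 2k-token window
# (3 * 512 tokens + query + instructions ≈ 2k).
results = tbl.search(query_vector).nprobes(20).limit(3).to_pandas()  # query_vector: an embedded query
```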
@@ -97,7 +113,7 @@
   },
   "source": [
    "# Setup\n",
-   "
+   "To work with LanceDB we want to create the table before ingesting the first batch. To create a table we need at least 1 doc."
   ]
  },
  {
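Since creating the table needs at least one doc, a minimal sketch of that first step; the field names, path, and zero vector are illustrative assumptions (the notebook itself connects to `proj_dir/".lancedb"` and passes an existing `document`):

```python
import lancedb

# Connect to a local LanceDB directory (illustrative path).
db = lancedb.connect("./.lancedb")

# LanceDB infers the table schema from the first record it receives,
# so a single prepared document is enough to create the table.
first_doc = {
    "id": "doc-0",                 # assumed field
    "content": "a wiki passage",   # assumed field
    "vector": [0.0] * 384,         # 'embedding' renamed to 'vector', 384-dim per the notebook
}
tbl = db.create_table("arabic-wiki", [first_doc])
```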
@@ -115,14 +131,6 @@
    " document['vector'] = document.pop('embedding')"
   ]
  },
- {
-  "cell_type": "markdown",
-  "id": "98aec715-8d97-439e-99c0-0eff63df386b",
-  "metadata": {},
-  "source": [
-   "Convert the dictionaries to `Documents`"
-  ]
- },
  {
   "cell_type": "code",
   "execution_count": 6,
@@ -170,9 +178,7 @@
   "id": "676f644c-fb09-4d17-89ba-30c92aad8777",
   "metadata": {},
   "source": [
-   "
-   "\n",
-   "Note that if you are doing this at scale, you should use a proper instance and not saving to file. You should also take a [measured ingestion](https://qdrant.tech/documentation/tutorials/bulk-upload/) approach instead of using a convenient loader. "
+   "Here we create the db and the table."
   ]
  },
  {
@@ -187,11 +193,23 @@
    "from lancedb.embeddings.registry import EmbeddingFunctionRegistry\n",
    "from lancedb.embeddings.sentence_transformers import SentenceTransformerEmbeddings\n",
    "\n",
-    "\n",
    "db = lancedb.connect(proj_dir/\".lancedb\")\n",
    "tbl = db.create_table('arabic-wiki', [document])"
   ]
  },
+ {
+  "cell_type": "markdown",
+  "id": "502f7cb9-32cf-4b32-8cb3-b021e02bd06c",
+  "metadata": {},
+  "source": [
+   "For each file we:\n",
+   "- Read the `ndjson` into a list of documents\n",
+   "- Replace 'embedding' with 'vector' to be compatible with LanceDB\n",
+   "- Write the docs to the table\n",
+   "\n",
+   "After that we index with a cuda accelerator."
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": 8,
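A sketch of the per-file loop the new markdown cell summarizes, assuming `tbl` was created as in the cell above; the file location is hypothetical and the notebook's own code cells may differ:

```python
import json
from pathlib import Path

# Assumes `tbl` already exists, created from one document as shown above
# (tbl = db.create_table('arabic-wiki', [document])).
files = sorted(Path("embedded").glob("*.ndjson"))  # the ~100k-document shards, hypothetical path

for path in files:
    # Read the ndjson shard into a list of documents.
    docs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            # Replace 'embedding' with 'vector' so LanceDB treats it as the vector column.
            doc["vector"] = doc.pop("embedding")
            docs.append(doc)

    # Write the batch to the table.
    tbl.add(docs)

# After all files are written, index with the cuda accelerator.
tbl.create_index(num_partitions=1024, num_sub_vectors=384, accelerator="cuda")
```

Appending batch by batch and indexing once at the end matches the "upload in sequence, then index" strategy described in the first hunk.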
@@ -818,6 +836,14 @@
    " "
   ]
  },
+ {
+  "cell_type": "markdown",
+  "id": "179af522-84ca-4985-9ca4-ffd1bde487eb",
+  "metadata": {},
+  "source": [
+   "It's crazy how fast it was: 42 minutes to ingest and index >2M documents. Let's run a test to make sure it worked!"
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": 9,
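And a sketch of the kind of sanity check the closing cell calls for, assuming `tbl` from above; the embedding model and the query are assumptions (any 384-dim multilingual sentence-transformers model would play the same role):

```python
from sentence_transformers import SentenceTransformer

# Assumed 384-dim multilingual model; the notebook's actual embedder may differ.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

query = "ما هي عاصمة المغرب؟"  # "What is the capital of Morocco?"
query_vec = model.encode(query)

# Top 3 passages, matching the 2k-token prompt budget discussed earlier.
hits = tbl.search(query_vec).limit(3).to_pandas()
print(hits.head())
```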