Commit 867c18d
1 Parent(s): c606718

Adding more info around LanceDB

notebooks/05_vector_db.ipynb  +40 -14  CHANGED
@@ -6,9 +6,25 @@
    "metadata": {},
    "source": [
     "# Approach\n",
+    "## VectorDB\n",
     "There are a number of aspects of choosing a vector db that might be unique to your situation. You should think through your HW, utilization, latency requirements, scale, etc. before choosing. \n",
     "\n",
-    "
+    "I've been hearing a lot about LanceDB and wanted to check it out. It's newer and may or may not be good for **your** use-case. I'm attracted by its fast ingestion, cuda-assisted indexing, and portability. It has some drawbacks: it doesn't support HNSW yet, and it could change significantly given how early it is.\n",
+    "\n",
+    "\n",
+    "You will be blown away by how fast ingestion + indexing is with LanceDB. \n",
+    "\n",
+    "## Ingestion Strategy\n",
+    "I used the ~100k-document `.ndjson` files in sequence to upload. After uploading I index.\n",
+    "\n",
+    "## Indexing\n",
+    "The algorithm used is `IVF_PQ`. I ignore the `PQ` part because I want better recall. Recall is important: since Jais only has a 2k context window, I can't put my top 10 documents for RAG in my prompt; it will be my top 3 (512\*3 + query + instructions ~ 2k). For many use-cases it's worth the trade-off, as you get much faster retrieval with not much performance loss. \n",
+    "\n",
+    "More partitions means faster retrieval but slower indexing. I chose 384 sub_vectors to be equal to my embedding dimension size. \n",
+    "\n",
+    "```tbl.create_index(num_partitions=1024, num_sub_vectors=384, accelerator=\"cuda\")```\n",
+    "\n",
+    "Read more about it [here](https://lancedb.github.io/lancedb/ann_indexes/)."
    ]
   },
   {
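As a rough sketch of the indexing step the new cell describes, assuming `tbl` is an already-populated LanceDB table; the parameter values are the ones quoted above, while `query_vector` and the `nprobes` value are illustrative (the query-time knobs are documented on the linked ANN indexes page):

```python
# Index build as quoted in the cell above; `tbl` is assumed to be a populated
# LanceDB table. More num_partitions -> faster retrieval but slower indexing;
# num_sub_vectors=384 matches the embedding dimension.
tbl.create_index(
    num_partitions=1024,
    num_sub_vectors=384,
    accelerator="cuda",  # GPU-assisted IVF_PQ build
)

# At query time, recall vs. speed can be tuned with nprobes. limit(3) keeps the
# top 3 passages so the RAG prompt fits Jais' 2k-token window
# (3 * 512 tokens + query + instructions ≈ 2k).
results = tbl.search(query_vector).nprobes(20).limit(3).to_pandas()  # query_vector: an embedded query
```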
@@ -97,7 +113,7 @@
   },
   "source": [
    "# Setup\n",
-   "
+   "To work with LanceDB we want to create the table before ingesting the first batch. To create a table we need at least 1 doc."
   ]
  },
  {
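Since creating the table needs at least one doc, a minimal sketch of that first step; the field names, path, and zero vector are illustrative assumptions (the notebook itself connects to `proj_dir/".lancedb"` and passes an existing `document`):

```python
import lancedb

# Connect to a local LanceDB directory (illustrative path).
db = lancedb.connect("./.lancedb")

# LanceDB infers the table schema from the first record it receives,
# so a single prepared document is enough to create the table.
first_doc = {
    "id": "doc-0",                 # assumed field
    "content": "a wiki passage",   # assumed field
    "vector": [0.0] * 384,         # 'embedding' renamed to 'vector', 384-dim per the notebook
}
tbl = db.create_table("arabic-wiki", [first_doc])
```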
@@ -115,14 +131,6 @@
    " document['vector'] = document.pop('embedding')"
   ]
  },
- {
-  "cell_type": "markdown",
-  "id": "98aec715-8d97-439e-99c0-0eff63df386b",
-  "metadata": {},
-  "source": [
-   "Convert the dictionaries to `Documents`"
-  ]
- },
  {
   "cell_type": "code",
   "execution_count": 6,
@@ -170,9 +178,7 @@
   "id": "676f644c-fb09-4d17-89ba-30c92aad8777",
   "metadata": {},
   "source": [
-   "
-   "\n",
-   "Note that if you are doing this at scale, you should use a proper instance and not saving to file. You should also take a [measured ingestion](https://qdrant.tech/documentation/tutorials/bulk-upload/) approach instead of using a convenient loader. "
+   "Here we create the db and the table."
   ]
  },
  {
@@ -187,11 +193,23 @@
    "from lancedb.embeddings.registry import EmbeddingFunctionRegistry\n",
    "from lancedb.embeddings.sentence_transformers import SentenceTransformerEmbeddings\n",
    "\n",
-    "\n",
    "db = lancedb.connect(proj_dir/\".lancedb\")\n",
    "tbl = db.create_table('arabic-wiki', [document])"
   ]
  },
+ {
+  "cell_type": "markdown",
+  "id": "502f7cb9-32cf-4b32-8cb3-b021e02bd06c",
+  "metadata": {},
+  "source": [
+   "For each file we:\n",
+   "- Read the `ndjson` into a list of documents\n",
+   "- Replace 'embedding' with 'vector' to be compatible with LanceDB\n",
+   "- Write the docs to the table\n",
+   "\n",
+   "After that we index with a cuda accelerator."
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": 8,
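A sketch of the per-file loop the new markdown cell summarizes, assuming `tbl` was created as in the cell above; the file location is hypothetical and the notebook's own code cells may differ:

```python
import json
from pathlib import Path

# Assumes `tbl` already exists, created from one document as shown above
# (tbl = db.create_table('arabic-wiki', [document])).
files = sorted(Path("embedded").glob("*.ndjson"))  # the ~100k-document shards, hypothetical path

for path in files:
    # Read the ndjson shard into a list of documents.
    docs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            # Replace 'embedding' with 'vector' so LanceDB treats it as the vector column.
            doc["vector"] = doc.pop("embedding")
            docs.append(doc)

    # Write the batch to the table.
    tbl.add(docs)

# After all files are written, index with the cuda accelerator.
tbl.create_index(num_partitions=1024, num_sub_vectors=384, accelerator="cuda")
```

Appending batch by batch and indexing once at the end matches the "upload in sequence, then index" strategy described in the first hunk.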
@@ -818,6 +836,14 @@
    " "
   ]
  },
+ {
+  "cell_type": "markdown",
+  "id": "179af522-84ca-4985-9ca4-ffd1bde487eb",
+  "metadata": {},
+  "source": [
+   "It's crazy how fast it was: 42 minutes to ingest and index >2M documents. Let's run a test to make sure it worked!"
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": 9,
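And a sketch of the kind of sanity check the closing cell calls for, assuming `tbl` from above; the embedding model and the query are assumptions (any 384-dim multilingual sentence-transformers model would play the same role):

```python
from sentence_transformers import SentenceTransformer

# Assumed 384-dim multilingual model; the notebook's actual embedder may differ.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

query = "ما هي عاصمة المغرب؟"  # "What is the capital of Morocco?"
query_vec = model.encode(query)

# Top 3 passages, matching the 2k-token prompt budget discussed earlier.
hits = tbl.search(query_vec).limit(3).to_pandas()
print(hits.head())
```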