Update README.md
README.md

---

### **System Requirements**

| Precision | **Total VRAM Usage** | **VRAM Per GPU (with 2 GPUs)** | **VRAM Per GPU (with 4 GPUs)** |
|-----------|----------------------|--------------------------------|--------------------------------|
| **FP32 (Full Precision)** | ~24GB | ~12GB | ~6GB |
| **FP16 (Half Precision)** | **~14GB** | **~7GB** | **~3.5GB** |
| **8-bit Quantization** | ~8GB | ~4GB | ~2GB |
| **4-bit Quantization** | ~4GB | ~2GB | ~1GB |

**Important Notes:**
- **Multi-GPU setups** distribute model memory usage across the available GPUs.
- Using **`device_map="auto"`** in `transformers` automatically balances memory across devices (see the sketch below).
- **Pre-quantized releases (8-bit, 4-bit)** are planned to further reduce VRAM requirements; in the meantime, the model can be quantized on the fly with **bitsandbytes**, as shown in the next section.

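As a companion to the notes above, here is a minimal sketch of a multi-GPU FP16 load; `torch_dtype` and `device_map` are standard `transformers` arguments, and nothing below is specific to this repository beyond the model name.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "luvGPT/deepseek-uncensored-lore"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# FP16 weights (~14GB total, per the table above); device_map="auto"
# shards the layers across all visible GPUs automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
```
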
---
### **Loading the Model in 4-bit and 8-bit Quantization**
To reduce memory usage, you can load the model using **4-bit or 8-bit quantization** via **bitsandbytes**.
#### **Install Required Dependencies**
```bash
pip install transformers accelerate bitsandbytes
```

#### **Load Model in 8-bit Quantization**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "luvGPT/deepseek-uncensored-lore"

# Define quantization config for 8-bit loading
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model in 8-bit mode
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config
)
```
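#### **Load Model in 4-bit Quantization**
A similar sketch for 4-bit loading; the NF4 and compute-dtype settings below are common `BitsAndBytesConfig` options rather than values tested for this model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "luvGPT/deepseek-uncensored-lore"

# 4-bit NF4 quantization with FP16 compute (typical bitsandbytes settings)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model in 4-bit mode
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config
)
```

Generation then works exactly as with the full-precision model.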
---
### **Future Work**
- **GGUF Format Support**: We plan to provide a **GGUF-quantized version** of this model, making it compatible with **llama.cpp** and other lightweight inference frameworks.
- **Fine-tuning & Alignment**: Exploring reinforcement learning and user feedback loops to improve storytelling accuracy and coherence.
- **Optimized Inference**: Integrating FlashAttention and Triton optimizations for even faster performance.

## Limitations
- **Bias**: Outputs may reflect biases present in the original DeepSeek model or training dataset.
- **Context Length**: Limited to 1,000 tokens per sequence, so longer prompts should be truncated before generation (see the sketch below).

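A minimal sketch of how to stay within this limit at inference time; it assumes `model` and `tokenizer` are loaded as in the examples above, and `MAX_CONTEXT`/`MAX_NEW_TOKENS` are illustrative values, not project constants.

```python
# Assumes `model` and `tokenizer` are already loaded (see the quantization examples).
MAX_CONTEXT = 1000      # training context window (see Limitations)
MAX_NEW_TOKENS = 200    # illustrative generation budget

# Truncate the prompt so prompt + generated text fits in the context window.
inputs = tokenizer(
    "Your prompt here ...",
    return_tensors="pt",
    truncation=True,
    max_length=MAX_CONTEXT - MAX_NEW_TOKENS,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```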