Update README.md
README.md

---

### **System Requirements**

| Precision | **Total VRAM Usage** | **VRAM Per GPU (with 2 GPUs)** | **VRAM Per GPU (with 4 GPUs)** |
|-----------|----------------------|--------------------------------|--------------------------------|
| **FP32 (Full Precision)** | ~24GB | ~12GB | ~6GB |
| **FP16 (Half Precision)** | **~14GB** | **~7GB** | **~3.5GB** |
| **8-bit Quantization** | ~8GB | ~4GB | ~2GB |
| **4-bit Quantization** | ~4GB | ~2GB | ~1GB |

**Important Notes:**
- **Multi-GPU setups** distribute model memory usage across the available GPUs.
- Using **`device_map="auto"`** in `transformers` automatically balances memory across devices (see the sketch below).
- **Pre-quantized releases (8-bit, 4-bit)** are planned to further reduce VRAM requirements; in the meantime, the model can be quantized on the fly with **bitsandbytes**, as shown in the next section.

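As a companion to the notes above, here is a minimal sketch of a multi-GPU FP16 load; `torch_dtype` and `device_map` are standard `transformers` arguments, and nothing below is specific to this repository beyond the model name.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "luvGPT/deepseek-uncensored-lore"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# FP16 weights (~14GB total, per the table above); device_map="auto"
# shards the layers across all visible GPUs automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
```
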
---
### **Loading the Model in 4-bit and 8-bit Quantization**
To reduce memory usage, you can load the model using **4-bit or 8-bit quantization** via **bitsandbytes**.
#### **Install Required Dependencies**
```bash
pip install transformers accelerate bitsandbytes
```

#### **Load Model in 8-bit Quantization**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "luvGPT/deepseek-uncensored-lore"

# Define quantization config for 8-bit loading
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model in 8-bit mode
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config
)
```
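#### **Load Model in 4-bit Quantization**
A similar sketch for 4-bit loading; the NF4 and compute-dtype settings below are common `BitsAndBytesConfig` options rather than values tested for this model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "luvGPT/deepseek-uncensored-lore"

# 4-bit NF4 quantization with FP16 compute (typical bitsandbytes settings)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model in 4-bit mode
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config
)
```

Generation then works exactly as with the full-precision model.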
---
### **Future Work**
- **GGUF Format Support**: We plan to provide a **GGUF-quantized version** of this model, making it compatible with **llama.cpp** and other lightweight inference frameworks.
- **Fine-tuning & Alignment**: Exploring reinforcement learning and user feedback loops to improve storytelling accuracy and coherence.
- **Optimized Inference**: Integrating FlashAttention and Triton optimizations for even faster performance.

## Limitations
- **Bias**: Outputs may reflect biases present in the original DeepSeek model or training dataset.
- **Context Length**: Limited to 1,000 tokens per sequence, so longer prompts should be truncated before generation (see the sketch below).

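A minimal sketch of how to stay within this limit at inference time; it assumes `model` and `tokenizer` are loaded as in the examples above, and `MAX_CONTEXT`/`MAX_NEW_TOKENS` are illustrative values, not project constants.

```python
# Assumes `model` and `tokenizer` are already loaded (see the quantization examples).
MAX_CONTEXT = 1000      # training context window (see Limitations)
MAX_NEW_TOKENS = 200    # illustrative generation budget

# Truncate the prompt so prompt + generated text fits in the context window.
inputs = tokenizer(
    "Your prompt here ...",
    return_tensors="pt",
    truncation=True,
    max_length=MAX_CONTEXT - MAX_NEW_TOKENS,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```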