
Hugging Face Implementation Plan

Overview

This document outlines the plan to rebuild the RAG system using Hugging Face's models and capabilities instead of Google Cloud services, while preserving the original cloud implementation as a separate option.

Repository Links

Migration Strategy

The key difference in our approach is to replace all Google Cloud dependencies with Hugging Face models and tools:

  1. Replace Google's Document AI → Use Hugging Face document-understanding models (e.g. microsoft/layoutlm-base-uncased, paired with an OCR engine for raw text extraction)
  2. Replace Vertex AI → Use Hugging Face embedding models (e.g. sentence-transformers/all-MiniLM-L6-v2)
  3. Replace BigQuery → Use a FAISS/Chroma vector store with local storage or Hugging Face Datasets
  4. Replace Cloud Storage → Use Hugging Face's persistent storage
  5. Replace Cloud Run → Use Hugging Face Spaces' continuous execution

Implementation Steps

  1. Set Up New Architecture:

    • Create a revised Dockerfile for Hugging Face Spaces (see the sketch after this list)
    • Set up persistent storage (20GB purchased)
    • Configure the A100 GPU via accelerate (available to Pro users)
  2. Replace Text Processing Pipeline:

    • Create a new OCR module using Transformers document models
    • Implement a chunking system using pure Python
    • Add text cleaning and processing without DocumentAI
  3. Replace Vector Database:

    • Implement FAISS/Chroma for vector storage
    • Use Hugging Face Datasets for persistent indexed storage
    • Create a migration utility to move existing data out of BigQuery
  4. Replace Embedding System:

    • Use sentence-transformers models for embeddings
    • Implement similarity search using FAISS/Chroma
    • Create a compatible API to replace Vertex AI functions
  5. Update Application Layer:

    • Port the existing Flask app to run on Hugging Face Spaces
    • Update file handling to use local storage
    • Create model caching for better performance
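
A minimal Dockerfile sketch for step 1, assuming a Docker-SDK Space that starts the app from hf_app.py on port 7860; the base image and file layout are illustrative, not the confirmed setup:

# Hypothetical Dockerfile for the Hugging Face Space
FROM python:3.10-slim

WORKDIR /app

# Install the HF-specific dependencies first to leverage Docker layer caching
COPY requirements_hf.txt .
RUN pip install --no-cache-dir -r requirements_hf.txt

# Copy the application code
COPY . .

# Spaces route traffic to the port declared as app_port (7860)
EXPOSE 7860

CMD ["python", "hf_app.py"]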

Key Components

  1. Text Processing:
# New approach using Hugging Face models
from transformers import AutoTokenizer
from datasets import Dataset

def chunk_text(text_content, tokenizer, max_tokens=256):
    """Split text into chunks of at most max_tokens tokens each."""
    tokens = tokenizer.encode(text_content, add_special_tokens=False)
    return [
        tokenizer.decode(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

def process_text(text_content):
    """Process text using Hugging Face models."""
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    # Chunk the text on token boundaries
    chunks = chunk_text(text_content, tokenizer)

    # Store in a persistent dataset on disk
    dataset = Dataset.from_dict({"text": chunks})
    dataset.save_to_disk("./data/chunks")

    return dataset
  2. Vector Storage:
# New approach using FAISS
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

class FAISSVectorStore:
    def __init__(self):
        self.model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
        self.dimension = self.model.get_sentence_embedding_dimension()
        self.index = faiss.IndexFlatL2(self.dimension)
        self.texts = []
        
    def add_texts(self, texts):
        """Embed the texts and add them to the FAISS index."""
        embeddings = self.model.encode(texts)
        self.index.add(np.array(embeddings, dtype=np.float32))
        self.texts.extend(texts)

    def search(self, query, k=5):
        """Return the k stored texts nearest to the query."""
        query_embedding = self.model.encode([query])[0]
        distances, indices = self.index.search(
            np.array([query_embedding], dtype=np.float32), k
        )
        # FAISS pads results with -1 when fewer than k vectors are indexed
        return [self.texts[i] for i in indices[0] if i != -1]
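
A short usage sketch of the store defined above (the sample strings are illustrative):

# Example usage of FAISSVectorStore
store = FAISSVectorStore()
store.add_texts(["Paris is the capital of France.", "FAISS enables fast similarity search."])
print(store.search("What is the capital of France?", k=1))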
  3. Hugging Face Space Configuration (the YAML front matter of the Space's README.md):
title: RAG Document Processing
emoji: 📄
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
models:
  - sentence-transformers/all-MiniLM-L6-v2
  - facebook/bart-large-cnn
license: apache-2.0

Automation Plan

  1. Background Processing:

    • Implement a file watcher for the persistent storage directory (see the sketch after this list)
    • Process files automatically as they land in the upload directory
    • Use Gradio/Streamlit for the UI, with a background task system
  2. Scheduled Tasks:

    • Use GitHub Actions in the companion GitHub repository for scheduling (see the workflow sketch after this list)
    • Run index maintenance tasks periodically
    • Implement a file-processing queue for batch operations
  3. GitHub Integration:

    • Push processed data to GitHub repository as backup
    • Use GitHub to store model configuration
    • Implement version control for processed data
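
A minimal sketch of the watcher from item 1, assuming a simple polling loop over a persistent-storage path; UPLOAD_DIR and the process_file callback are illustrative names, not confirmed paths:

# Hypothetical polling watcher for the upload directory
import os
import time

UPLOAD_DIR = "/data/uploads"  # assumed persistent-storage path
PROCESSED = set()

def watch(process_file, interval=10):
    """Poll UPLOAD_DIR and hand each new file to process_file."""
    while True:
        for name in os.listdir(UPLOAD_DIR):
            path = os.path.join(UPLOAD_DIR, name)
            if path not in PROCESSED and os.path.isfile(path):
                process_file(path)
                PROCESSED.add(path)
        time.sleep(interval)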
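
And a sketch of item 2 as a scheduled GitHub Actions workflow; the cron time and Space id are placeholders, HF_TOKEN is assumed to be set as a repository secret, and restart_space (a real huggingface_hub call) stands in for whatever maintenance task is run:

# .github/workflows/maintenance.yml (hypothetical)
name: nightly-maintenance
on:
  schedule:
    - cron: "0 3 * * *"  # daily at 03:00 UTC
jobs:
  maintain:
    runs-on: ubuntu-latest
    steps:
      - run: pip install huggingface_hub
      - run: python -c "from huggingface_hub import HfApi; HfApi(token='${{ secrets.HF_TOKEN }}').restart_space('Ultronprime/cloud-rag-webhook')"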

Required Libraries

transformers==4.40.0
datasets==2.17.1
sentence-transformers==2.3.1
faiss-cpu==1.7.4  # or faiss-gpu for CUDA support
gradio==4.19.2
streamlit==1.32.0
langchain==0.1.5
torch==2.1.2
accelerate==0.28.0

Hardware Requirements

  • Use the A100 quota included with Hugging Face Pro (ZeroGPU)
  • Configure model inference for optimal performance on GPU
  • Set up model caching to reduce memory usage and load time (see the sketch below)
  • Utilize Hugging Face's persistent storage (20GB)
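
A minimal caching sketch for the last two points, assuming one shared embedding model per process; get_embedder is an illustrative name:

# Hypothetical model cache: load each model once and reuse it
from functools import lru_cache
from sentence_transformers import SentenceTransformer

@lru_cache(maxsize=2)
def get_embedder(name="sentence-transformers/all-MiniLM-L6-v2"):
    """Load an embedding model once; later calls return the cached instance."""
    return SentenceTransformer(name)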

Project Goals

Create a fully self-contained RAG system on Hugging Face:

  1. Process text files automatically
  2. Generate embeddings with Hugging Face models
  3. Store vectors in FAISS/Chroma on persistent storage
  4. Query the data with a simple API (sketched below)
  5. Run continuously "under the hood"
  6. Utilize Hugging Face Pro benefits (A100 GPU, persistent storage)
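
To make goals 2-5 concrete, a sketch of the query path: retrieve the nearest chunks from the FAISS store above, then condense them with facebook/bart-large-cnn, the summarizer listed in the Space config; rag_query and its store parameter are illustrative names:

# Hypothetical query flow: retrieve with FAISS, condense with BART
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def rag_query(store, question, k=3):
    """Fetch the k nearest chunks and summarize them into an answer."""
    context = " ".join(store.search(question, k=k))
    return summarizer(context, max_length=130, min_length=30)[0]["summary_text"]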

Implementation Files

We'll create the following new files to implement the Hugging Face version:

  1. hf_process_text.py - Text processing with HF models
  2. hf_embeddings.py - Embedding generation with sentence-transformers
  3. hf_vector_store.py - FAISS/Chroma implementation
  4. hf_app.py - Gradio/Streamlit interface
  5. hf_rag_query.py - Query interface for HF models
  6. requirements_hf.txt - HF-specific dependencies

This will allow us to maintain both implementations in parallel.
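
For reference, a minimal sketch of what hf_app.py could look like with Gradio; the import assumes the hf_vector_store module planned above:

# Hypothetical hf_app.py: Gradio front end over the vector store
import gradio as gr
from hf_vector_store import FAISSVectorStore  # assumed module from this plan

store = FAISSVectorStore()

def answer(question):
    """Return the top matching chunks for a question."""
    return "\n\n".join(store.search(question, k=3))

demo = gr.Interface(fn=answer, inputs="text", outputs="text",
                    title="RAG Document Processing")
demo.launch(server_name="0.0.0.0", server_port=7860)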