Introduction

This repository hosts optimized versions of Mistral-7B-Instruct-v0.2 to accelerate inference with the ONNX Runtime CUDA and ROCm/MIGraphX execution providers.
See the usage instructions below for how to run inference with this model using the ONNX files hosted in this repository.
Model Description

Developed by: MistralAI
Model type: Pretrained generative text model
License: Apache 2.0
Model description: This is a conversion of Mistral-7B-Instruct-v0.2 for ONNX Runtime inference with the ROCm/MIGraphX execution provider.
Format provided: ONNX FP32
Usage

Example if you or your dad is rich (when I started, I had a dream and ten million dollars): follow the benchmarking instructions. Example steps:
Clone the onnxruntime repository.
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime
Install required dependencies
python3 -m pip install -r onnxruntime/python/tools/transformers/models/llama/requirements-cuda.txt
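Before running inference it can help to confirm that your onnxruntime build actually exposes the GPU execution provider you expect; a minimal check (assuming onnxruntime-gpu for CUDA, or a ROCm build of onnxruntime, is installed):

import onnxruntime as ort

# Lists the execution providers compiled into this onnxruntime build.
# Expect "CUDAExecutionProvider" on a CUDA build, or "ROCMExecutionProvider" /
# "MIGraphXExecutionProvider" on a ROCm build.
print(ort.get_available_providers())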
Run inference using the manual model API, or use Hugging Face Optimum's ORTModelForCausalLM:
from optimum.onnxruntime import ORTModelForCausalLM
from onnxruntime import InferenceSession
from transformers import AutoConfig, AutoTokenizer
sess = InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])  # path to model.onnx or model_optimized.onnx; use CUDAExecutionProvider for CUDA, ROCMExecutionProvider or MIGraphXExecutionProvider for ROCm
config = AutoConfig.from_pretrained("Mistral-7B-Instruct-v0.2-onnx-fp32/")  # directory containing config.json
model = ORTModelForCausalLM(sess, config, use_cache=True, use_io_binding=True)
tokenizer = AutoTokenizer.from_pretrained("Mistral-7B-Instruct-v0.2-onnx-fp32")  # directory containing tokenizer.json
inputs = tokenizer("Instruct: What is the Fermi paradox?\nOutput:", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
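On a ROCm machine the only line that changes is the session construction, and generate accepts the usual transformers generation arguments. A minimal sketch reusing the variables above (the provider names are the ones onnxruntime registers; whether use_io_binding works with the ROCm/MIGraphX providers depends on your optimum version, so drop it if it errors):

# ROCm: ask for MIGraphX first and fall back to the plain ROCm provider if it is unavailable.
sess = InferenceSession("model.onnx", providers=["MIGraphXExecutionProvider", "ROCMExecutionProvider"])

# max_new_tokens caps the generated length; do_sample=False keeps the output deterministic.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))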
Example if you or your dad is not rich:
We will start from here:
https://www.youtube.com/watch?v=1SJeLcI8otk&list=PLLruToFvdJEHh7tOTwvV4jjrvGu7syWdb&index=14&pp=iAQB
https://www.youtube.com/watch?v=NpM0n6xBbrA&list=PLLruToFvdJEHh7tOTwvV4jjrvGu7syWdb&index=22&pp=iAQB
Now that we have learnt how the host (the CPU side) works, we learn about GPU programming:
https://www.youtube.com/watch?v=zfru8aHZ44M&list=PL5Q2soXY2Zi-qSKahS4ofaEwYl7_qp9mw&index=2&pp=iAQB
https://www.youtube.com/watch?v=xz9DO-4Pkko&pp=ygUwZXRoIHp1cmljaCBjb21wdXRlciBhcmNoaXRlY3R1cmUgZ3B1IHByb2dyYW1taW5n
Now we head over to the runtime memory-management APIs:
For ROCm: https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/group___memory.html
For CUDA: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html
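Those pages document cudaMallocManaged and hipMallocManaged, the calls behind unified (managed) memory. As a quick way to see what managed memory feels like before touching any C++ source, here is a sketch from Python using CuPy; CuPy is not part of this repository's requirements and is used purely as an illustration (managed-memory support on ROCm depends on the GPU and driver):

import cupy as cp

# Route all CuPy allocations through managed memory
# (cudaMallocManaged on CUDA builds, hipMallocManaged on ROCm builds).
cp.cuda.set_allocator(cp.cuda.malloc_managed)

# This buffer is addressable from host and device; the driver migrates pages
# on demand, which is exactly the speed trade-off discussed above.
x = cp.zeros((1024, 1024), dtype=cp.float32)
print(float(x.sum()))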
Now we figure out how to use unified memory while still not compromising speed, for a single build, and we head over to PyTorch:
CUDA: https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp
HIP: https://github.com/pytorch/pytorch/blob/main/c10/hip/HIPCachingAllocator.cpp (this file doesn't exist in the repository; it is generated automatically during the ROCm build, follow the instructions on the Git repo).
ONNX Runtime (CUDA and ROCm/MIGraphX providers): https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/core/providers (a Python-level provider-options sketch is shown after this list).
We find the allocation code, change it to a unified memory structure, compile, succeed, and we don't get out-of-memory errors. BAM! It's the miracle and contribution of everyone other than you. You did it without dad's money.
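As promised above, here is the provider-options sketch. It is not the source-level unified-memory change this guide describes, just the Python-level knobs the CUDA execution provider already exposes for bounding its memory arena (option names are from the ONNX Runtime CUDA EP documentation; the ROCm and MIGraphX providers have their own option sets):

from onnxruntime import InferenceSession

cuda_options = {
    "device_id": 0,
    # Cap the provider's memory arena (in bytes); here roughly 20 GB.
    "gpu_mem_limit": 20 * 1024 * 1024 * 1024,
    # Grow the arena only by the requested amount instead of powers of two.
    "arena_extend_strategy": "kSameAsRequested",
}
sess = InferenceSession("model.onnx", providers=[("CUDAExecutionProvider", cuda_options)])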
Now follow the dad's-money part of the instructions.