@ImranzamanML on Hugging Face: "Today lets discuss about 32-bit (FP32) and 16-bit (FP16) floating-point!…"

ImranzamanML

posted an update Oct 21, 2024

Today let's discuss 32-bit (FP32) and 16-bit (FP16) floating-point!

Floating-point numbers are used to represent real numbers (like decimals) and they consist of three parts:

Sign bit: 
Indicates whether the number is positive (0) or negative (1).
Exponent:
Determines the scale of the number (i.e., how large or small it is, by shifting the binary point).
Mantissa (or fraction): 
Represents the actual digits of the number.
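
To make the three parts concrete, here is a minimal sketch of how they combine into a value. It assumes the standard IEEE 754 layout for normal numbers (an implicit leading 1 in front of the mantissa, and an exponent stored with a bias); the function name `decode_fields` is made up for illustration.

```python
def decode_fields(sign, exponent, fraction, exp_bits, frac_bits):
    """Value of a normal IEEE 754 number from its three stored fields.

    Assumes a normal number: zero, subnormals, infinity and NaN are not handled.
    """
    bias = 2 ** (exp_bits - 1) - 1               # 127 for 8 exponent bits, 15 for 5
    significand = 1 + fraction / 2 ** frac_bits  # implicit leading 1 plus the stored fraction
    return (-1) ** sign * significand * 2.0 ** (exponent - bias)

# FP32-style fields: sign 0, stored exponent 124, fraction 01000...0 (23 bits)
value = decode_fields(0, 124, 0b01000000000000000000000, exp_bits=8, frac_bits=23)
print(value)  # 1.25 * 2**-3 = 0.15625
```

The same function works for FP16 by passing `exp_bits=5, frac_bits=10`.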

32-bit Floating Point (FP32)
Total bits: 32 bits
Sign bit: 1 bit
Exponent: 8 bits
Mantissa: 23 bits
For example:
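A number like -15.375 (binary -1111.011, i.e. -1.111011 × 2^3) would be represented as:
Sign bit: 1 (negative number)
Exponent: 3, stored after being adjusted by a bias (127 in FP32) as 130, i.e. 10000010.
Mantissa: 11101100000000000000000, the binary digits after the leading 1 (which is implicit and not stored), padded with zeros to 23 bits.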
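
If you want to check this yourself, here is a small sketch using Python's standard `struct` module to pull the three fields out of the FP32 encoding. The helper name `fp32_fields` is just for illustration.

```python
import struct

def fp32_fields(x):
    """Return (sign, stored exponent, mantissa) of x encoded as FP32."""
    # Pack as a big-endian 32-bit float, then reread the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits, still including the bias of 127
    mantissa = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, mantissa

sign, exponent, mantissa = fp32_fields(-15.375)
print(sign, f"{exponent:08b}", f"{mantissa:023b}")
# 1 10000010 11101100000000000000000
print(exponent - 127)  # 3, the unbiased exponent
```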

16-bit Floating Point (FP16)
Total bits: 16 bits
Sign bit: 1 bit
Exponent: 5 bits
Mantissa: 10 bits
Example:
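A number like -15.375 (-1.111011 × 2^3 in binary) would be stored similarly:
Sign bit: 1 (negative number)
Exponent: Uses 5 bits with a bias of 15, so 3 is stored as 18, i.e. 10010. The fewer exponent bits limit the range compared to FP32.
Mantissa: Only 10 bits for precision: 1110110000. This particular number still fits exactly.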
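
The FP16 layout can be inspected the same way. This sketch relies on the `"e"` (half-precision) format code of Python's standard `struct` module; the helper name `fp16_fields` is made up for illustration.

```python
import struct

def fp16_fields(x):
    """Return (sign, stored exponent, mantissa) of x encoded as FP16."""
    # Pack as a big-endian 16-bit float, then reread the same 2 bytes as an unsigned int.
    (bits,) = struct.unpack(">H", struct.pack(">e", x))
    sign = bits >> 15                # top bit
    exponent = (bits >> 10) & 0x1F   # next 5 bits, still including the bias of 15
    mantissa = bits & 0x3FF          # low 10 bits
    return sign, exponent, mantissa

sign, exponent, mantissa = fp16_fields(-15.375)
print(sign, f"{exponent:05b}", f"{mantissa:010b}")
# 1 10010 1110110000
```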

Precision and Range
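FP32: Higher precision and larger range: about 7 significant decimal digits, with a largest finite value of about 3.4 × 10^38.
FP16: Less precision (around 3 significant decimal digits) and a smaller range (largest finite value 65504), but it uses half the memory and is often faster on hardware with native FP16 support.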
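
A quick way to see the difference is to round-trip a value through each format. This is only a sketch using Python's standard `struct` module (format codes `"f"` for FP32 and `"e"` for FP16); the helper name `round_trip` is made up for illustration.

```python
import struct

def round_trip(x, fmt):
    """Store x in the given struct float format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

x = 0.1
fp32 = round_trip(x, "<f")
fp16 = round_trip(x, "<e")
print(f"FP32 error: {abs(fp32 - x):.2e}")  # on the order of 1e-9
print(f"FP16 error: {abs(fp16 - x):.2e}")  # on the order of 1e-5

# Range: 65504 is the largest finite FP16 value; 70000 does not fit.
print(round_trip(65504.0, "<e"))
try:
    struct.pack("<e", 70000.0)
    fp16_overflow = False
except OverflowError:
    fp16_overflow = True
print("70000 overflows FP16:", fp16_overflow)
print(round_trip(70000.0, "<f"))  # fits easily in FP32
```

Note that `struct` raises an error on overflow, whereas numeric libraries and GPUs typically produce infinity instead.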

John6666

Oct 21, 2024

An image that may help explain.
https://huggingface.co/blog/4bit-transformers-bitsandbytes

ImranzamanML

Oct 22, 2024

That is very useful. Thanks!

Smorty100

Oct 21, 2024

AI newcomers when they realize that Q8 stands for 8-bit quant 🤯
