Nvidia details efficiency of the NVFP4 format for LLM training — new paper reveals how NVFP4 offers benefits over FP8 and BF16

When Nvidia began disclosing details about its new 4-bit floating point format, NVFP4, earlier this year, it stated that while the format is mainly designed for inference, it could also be used for AI training without significant loss of accuracy. The company has now released a paper describing how it trained a 12-billion-parameter model on a 10-trillion-token dataset using NVFP4, with several supporting techniques, and achieved results that closely match those of an FP8 baseline.

(Image credit: Nvidia)

Blackwell and NVFP4: A match made in heaven

Nvidia’s NVFP4 is a purpose-built 4-bit floating point format developed for the Blackwell GPU architecture, aimed at improving the efficiency of both training and inference. It combines a highly compact data representation with a multi-level scaling strategy, achieving accuracy close to BF16 while substantially reducing compute and memory requirements.

(Image credit: Nvidia)

Structurally, NVFP4 adheres to the same E2M1 layout used in standard FP4 formats (1 sign bit, 2 exponent bits, and 1 mantissa bit), which lets it encode values between −6 and +6. To overcome the inherently limited dynamic range of a 4-bit format, Nvidia introduces a hierarchical scaling mechanism: every 16-element block of FP4 values is assigned a dedicated scale factor stored in FP8 using the E4M3 layout, while in parallel an FP32 scale factor is applied globally across the full tensor. Nvidia claims this two-tier system keeps quantization noise low without sacrificing the performance efficiency a 4-bit format offers.
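To make the two-level scheme concrete, below is a minimal NumPy sketch of an NVFP4-style quantize-and-decode round trip. The function names, the E2M1 value grid, and the rule for choosing the global scale (mapping the tensor's absolute maximum so that per-block scales land within E4M3 range) are illustrative assumptions, not Nvidia's published implementation.

```python
import numpy as np

# Magnitudes representable by E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX = 6.0     # largest E2M1 magnitude
E4M3_MAX = 448.0  # largest finite FP8 E4M3 magnitude

def round_to_fp4(x):
    """Round every element to the nearest E2M1-representable value."""
    idx = np.argmin(np.abs(np.abs(x)[..., None] - FP4_GRID), axis=-1)
    return np.sign(x) * FP4_GRID[idx]

def round_to_e4m3(x):
    """Approximate E4M3 rounding: 4 significant mantissa bits, clamped to the
    E4M3 max. (Subnormals and NaN handling are ignored for brevity.)"""
    m, e = np.frexp(x)              # x = m * 2**e with |m| in [0.5, 1)
    m = np.round(m * 16.0) / 16.0   # keep 4 significant bits
    return np.clip(np.ldexp(m, e), -E4M3_MAX, E4M3_MAX)

def nvfp4_round_trip(t, block_size=16):
    """Quantize a float32 tensor to NVFP4-style values and decode it back.
    Assumes t.size is a multiple of block_size and t is not all zeros."""
    t = t.reshape(-1, block_size)
    # Level 1: a single FP32 scale, chosen so block scales fit the E4M3 range.
    s_global = np.float32(np.abs(t).max() / (FP4_MAX * E4M3_MAX))
    # Level 2: one scale per 16-element block, stored in FP8 E4M3.
    amax_block = np.abs(t).max(axis=1, keepdims=True)
    s_block = round_to_e4m3(amax_block / (FP4_MAX * s_global))
    s_block = np.where(s_block == 0.0, 1.0, s_block)  # guard all-zero blocks
    # Encode: divide out both scales, snap to the E2M1 grid.
    q = round_to_fp4(t / (s_block * s_global))
    # Decode: multiply the scales back in.
    return (q * s_block * s_global).reshape(-1)

x = np.random.randn(64).astype(np.float32)
print("max round-trip error:", np.abs(nvfp4_round_trip(x) - x).max())
```

In actual NVFP4 hardware, the 4-bit codes and FP8 block scales are what get stored and fed to Blackwell's tensor cores; the round trip above only illustrates how the two scales bound quantization error. As a back-of-the-envelope check on the memory side: 4 bits per element plus an 8-bit E4M3 scale shared by 16 elements works out to 4 + 8/16 = 4.5 bits per value, versus 16 bits for BF16, roughly a 3.6x storage reduction (the per-tensor FP32 scale is negligible).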


