Bfloat16 Processing for Neural Networks

Abstract

Bfloat16 ("BF16") is a new floating-point format tailored specifically for high-performance processing of neural networks, and it will be supported by major CPU and GPU architectures as well as by neural network accelerators. This paper proposes a possible implementation of a BF16 multiply-accumulate operation that relaxes several features of the IEEE floating-point standard to afford low-cost hardware implementations. Specifically, subnormals are flushed to zero; only a single, non-standard rounding mode (Round-Odd) is supported; NaNs are not propagated; and IEEE exception flags are not provided. The paper shows that this approach achieves the same network-level accuracy as IEEE single-precision arithmetic ("FP32") at less than half the datapath area cost and with greater throughput.
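As an illustration of two of the relaxations listed above, the following is a minimal software sketch of narrowing an FP32 value to BF16 with Round-Odd and flush-to-zero. It is not the hardware datapath proposed in this paper. It assumes BF16 is the upper 16 bits of the FP32 encoding, and it takes Round-Odd to mean "truncate, then set the least significant bit if any discarded bit was non-zero". The function names are hypothetical, and the NaN and exception-flag behavior is not modeled.

```python
import struct


def fp32_bits(x: float) -> int:
    """Return the 32-bit IEEE single-precision encoding of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def fp32_to_bf16_round_odd(x: float) -> int:
    """Narrow x (first rounded to FP32) to a 16-bit BF16 encoding.

    Assumptions of this sketch, not taken from the paper:
    - subnormal inputs are flushed to a zero that keeps the input sign;
    - Round-Odd truncates and ORs a sticky bit into the result LSB.
    """
    bits = fp32_bits(x)
    sign = bits & 0x80000000
    exponent = (bits >> 23) & 0xFF

    # Flush-to-zero: an all-zero exponent field means zero or subnormal.
    if exponent == 0:
        return sign >> 16

    upper = bits >> 16           # sign, 8 exponent bits, top 7 fraction bits
    if bits & 0xFFFF:            # any discarded bit set means the result is inexact
        upper |= 1               # force the LSB to 1 (round to odd)
    return upper


def bf16_to_float(h: int) -> float:
    """Widen a 16-bit BF16 encoding back to a Python float (exact)."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]
```

For example, `fp32_to_bf16_round_odd(1.0 + 2**-10)` yields the encoding `0x3F81`, which widens to 1.0078125, whereas an exactly representable input such as 1.5 is returned unchanged as `0x3FC0`. Because an inexact result only ever sets the least significant bit, this form of rounding needs no carry propagation into the exponent.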