Hopper tensor cores
Nvidia's Hopper tensor cores perform FP16 multiply-accumulate operations and additionally support FP8 inputs. Explain how one might use this hardware to implement a W4A16 linear layer (4-bit weights, 16-bit activations). What are the throughput and memory gains compared to a full FP16 linear layer and to a W8A16 linear layer? What are the problems or costs of this approach?
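For concreteness, below is a minimal CUDA sketch of the dequantize-then-MMA approach such an answer might describe. It assumes symmetric 4-bit quantization with one FP16 scale per output channel, weights packed two nibbles per byte in row-major [K, N/2] order, and dimensions that are multiples of 16; the kernel name w4a16_gemm and this layout are illustrative choices, not a reference implementation. It uses the portable wmma API with one warp per 16x16 output tile; reaching peak Hopper throughput would instead require warp-group MMA (wgmma), e.g. via CUTLASS.

```cuda
// Hypothetical W4A16 GEMM tile kernel: C[M,N] = A[M,K] (FP16) * dequant(Wq)[K,N].
// Wq: packed 4-bit weights, row-major [K, N/2]; byte j of row k holds
// columns 2j (low nibble) and 2j+1 (high nibble), offset-8 encoding.
// scale: one FP16 scale per output column (per-channel symmetric quantization).
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void w4a16_gemm(const half* A, const unsigned char* Wq,
                           const half* scale, half* C, int M, int N, int K) {
    int tile_n = blockIdx.x * 16;   // output-column tile
    int tile_m = blockIdx.y * 16;   // output-row tile
    int lane   = threadIdx.x;       // one 32-thread warp per block

    __shared__ half Bs[16 * 16];    // dequantized weight tile, row-major K x N

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> acc;  // FP16 accumulate
    wmma::fill_fragment(acc, __float2half(0.0f));

    for (int k0 = 0; k0 < K; k0 += 16) {
        // Unpack this 16x16 weight tile: 128 packed bytes, 4 bytes per lane.
        for (int i = lane; i < 128; i += 32) {
            int r = i / 8, b = i % 8;                 // tile row, byte within row
            unsigned char q = Wq[(k0 + r) * (N / 2) + tile_n / 2 + b];
            int lo = 2 * b, hi = 2 * b + 1;
            // Symmetric dequantization: w = (q - 8) * scale[column].
            Bs[r * 16 + lo] = __hmul(__int2half_rn((q & 0xF) - 8), scale[tile_n + lo]);
            Bs[r * 16 + hi] = __hmul(__int2half_rn((q >> 4) - 8), scale[tile_n + hi]);
        }
        __syncwarp();  // make the dequantized tile visible to the whole warp
        wmma::load_matrix_sync(a_frag, A + tile_m * K + k0, K);
        wmma::load_matrix_sync(b_frag, Bs, 16);
        wmma::mma_sync(acc, a_frag, b_frag, acc);  // FP16 tensor-core MMA
        __syncwarp();  // finish reading Bs before the next iteration overwrites it
    }
    wmma::store_matrix_sync(C + tile_m * N + tile_n, acc, N, wmma::mem_row_major);
}

// Launch example (one warp per output tile):
//   w4a16_gemm<<<dim3(N / 16, M / 16), 32>>>(A, Wq, scale, C, M, N, K);
```

Note the design point the sketch makes explicit: each 4-bit weight is unpacked and scaled to FP16 in shared memory immediately before the FP16 MMA consumes it, so the tensor-core math itself still runs at the ordinary FP16 rate, and any gain must come from moving less weight data through memory, at the cost of the per-tile dequantization work and quantization error.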