
In fact, it’s an open question whether larger batch sizes are better. https://twitter.com/jeremyphoward/status/1189643170377658369...

Seconding all of your questions! Details about successful 1.5B training are really hard to come by.

In case it’s helpful, here are some details of how a Chinese 1.5b GPT-2 was trained: https://github.com/imcaspar/gpt2-ml

It looks like they used a batch size of 2 on a TPUv3-256 pod. It took 50 hours and 99,000 training steps, which works out to roughly 1.1 examples per second (2 × 99,000 examples over 50 hours).
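For anyone who wants to sanity-check throughput numbers like these, here's a rough back-of-the-envelope sketch in Python. It assumes the batch size of 2 is the global batch per step, which is an assumption on my part, not something the repo states explicitly:

  # Rough throughput estimate from the gpt2-ml numbers above.
  # Assumes batch_size is the global batch per step (an assumption).
  steps = 99_000
  batch_size = 2
  wall_clock_hours = 50

  examples = steps * batch_size        # 198,000 examples total
  seconds = wall_clock_hours * 3600    # 180,000 seconds
  print(examples / seconds)            # ~1.1 examples per second

If the 2 were actually a per-core batch size across 256 cores, the effective throughput would be far higher, so it's worth checking how the repo counts it.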



Agreed, there doesn’t seem to be a consensus. Thanks for the links.



