Seconding all of your questions! Details about successful 1.5B training are really hard to come by.
In case it’s helpful, here are some details of how a Chinese 1.5B GPT-2 was trained: https://github.com/imcaspar/gpt2-ml
It looks like they used a batch size of 2 on a TPUv3-256 pod. Training took 50 hours and 99,000 steps, which works out to roughly 1.1 examples per second.
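For reference, here's the back-of-the-envelope throughput calculation behind that figure (a minimal sketch; the batch size, step count, and wall-clock time come from the repo as stated above, the per-second numbers are just derived arithmetic):

```python
# Throughput estimate from the gpt2-ml training run figures.
train_hours = 50
train_steps = 99_000
batch_size = 2  # examples per step, as reported

seconds = train_hours * 3600
steps_per_sec = train_steps / seconds          # ~0.55 steps/s
examples_per_sec = steps_per_sec * batch_size  # ~1.1 examples/s

print(f"{steps_per_sec:.2f} steps/s, {examples_per_sec:.2f} examples/s")
```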