Writing an LLM from scratch, part 20 – starting training, and cross entropy loss (gilesthomas.com)
41 points by gpjt 35 days ago | 3 comments


There's more than one way to do self-supervised training.

This is the approach the author has taken:

    Training corpus: "The fat cat sat on the mat"

    Input -> Label
    --------------
    "The" -> " fat"
    "The fat" -> " cat"
    "The fat cat" -> " sat"

Hugging Face's Trainer class takes a different approach: the label is the same as the input shifted left by one position, padded with the <ignore> token (-100).

    Training corpus: "The fat cat sat on the mat"
    Input (7 tokens): "The fat cat sat on the mat"
    Output logit (7 tokens): "mat fat sat on fat mat and"
    Shifted label (7 tokens): "fat cat sat on the mat <ignore>"

Cross entropy is then calculated between the output logits and the shifted labels. At least, this is my understanding after reviewing the code.
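
In PyTorch terms, the shift-and-mask computation looks roughly like this (a sketch of that understanding, not the exact Trainer internals; in Hugging Face's causal-LM models the shift actually happens inside the model's forward pass when labels are passed, with -100 as the ignore index):

    import torch.nn.functional as F

    def causal_lm_loss(logits, labels, ignore_index=-100):
        # logits: (batch, seq_len, vocab_size), labels: (batch, seq_len).
        # Drop the last logit (nothing comes after the final token) and
        # drop the first label (nothing predicts it).
        shift_logits = logits[:, :-1, :]
        shift_labels = labels[:, 1:]
        return F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=ignore_index,  # <ignore> positions contribute no loss
        )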


The two approaches are equivalent (it's always next-token prediction), but the latter is far more efficient: it computes the loss for all N tokens in a single forward pass.
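
A quick way to convince yourself of both points (equivalence and the single forward pass), assuming a GPT-2 checkpoint from transformers purely for illustration:

    import torch
    import torch.nn.functional as F
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    ids = tokenizer("The fat cat sat on the mat", return_tensors="pt").input_ids

    with torch.no_grad():
        # One forward pass: passing labels=input_ids makes the model shift
        # internally and average the per-position next-token losses.
        single_pass_loss = model(ids, labels=ids).loss

        # Per-prefix approach: one forward pass per predicted token.
        losses = []
        for i in range(1, ids.size(1)):
            logits = model(ids[:, :i]).logits[:, -1]      # prediction for token i
            losses.append(F.cross_entropy(logits, ids[:, i]))
        per_prefix_loss = torch.stack(losses).mean()

    print(single_pass_loss.item(), per_prefix_loss.item())  # should match closely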




