Writing an LLM from scratch, part 20 – starting training, and cross entropy loss (gilesthomas.com)
41 points by gpjt 35 days ago | 3 comments


There's more than one way to do self-supervised training.

This is the approach the author has taken:

    Training corpus: "The fat cat sat on the mat"

    Input -> Label
    --------------
    "The" -> " fat"
    "The fat" -> " cat"
    "The fat cat" -> " sat"

Hugging Face's Trainer class takes a different approach: the label is the same as the input shifted left by one position, padded with the <ignore> token (-100).

    Training corpus: "The fat cat sat on the mat"
    Input (7 tokens): "The fat cat sat on the mat"
    Output logit (7 tokens): "mat fat sat on fat mat and"
    Shifted label (7 tokens): "fat cat sat on the mat <ignore>"

Cross entropy is then calculated between the output logits and the shifted labels. At least, this is my understanding after reviewing the code.
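
In PyTorch terms, the shift-and-mask computation looks roughly like this (a sketch of that understanding, not the exact Trainer internals; in Hugging Face's causal-LM models the shift actually happens inside the model's forward pass when labels are passed, with -100 as the ignore index):

    import torch.nn.functional as F

    def causal_lm_loss(logits, labels, ignore_index=-100):
        # logits: (batch, seq_len, vocab_size), labels: (batch, seq_len).
        # Drop the last logit (nothing comes after the final token) and
        # drop the first label (nothing predicts it).
        shift_logits = logits[:, :-1, :]
        shift_labels = labels[:, 1:]
        return F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=ignore_index,  # <ignore> positions contribute no loss
        )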


The two approaches are equivalent (it's always next-token prediction), but the latter is far more efficient: it computes the loss for all N tokens in a single forward pass.
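
A quick way to convince yourself of both points (equivalence and the single forward pass), assuming a GPT-2 checkpoint from transformers purely for illustration:

    import torch
    import torch.nn.functional as F
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    ids = tokenizer("The fat cat sat on the mat", return_tensors="pt").input_ids

    with torch.no_grad():
        # One forward pass: passing labels=input_ids makes the model shift
        # internally and average the per-position next-token losses.
        single_pass_loss = model(ids, labels=ids).loss

        # Per-prefix approach: one forward pass per predicted token.
        losses = []
        for i in range(1, ids.size(1)):
            logits = model(ids[:, :i]).logits[:, -1]      # prediction for token i
            losses.append(F.cross_entropy(logits, ids[:, i]))
        per_prefix_loss = torch.stack(losses).mean()

    print(single_pass_loss.item(), per_prefix_loss.item())  # should match closely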




