Someone mentions temperature in the context of algorithms and I can't stop thinking: cool, simulated annealing. I haven't seen temperature used in any other family of algorithms before this.
If you squint, it's the same thing. Simulated annealing generally attempts to sample from the Boltzmann distribution. (Presumably because actual annealing is a thermodynamic process, and you can often think of the annealed system as a sample from the Boltzmann distribution.)
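To make "sampling from the Boltzmann distribution" concrete, here's a minimal sketch of simulated annealing with the standard Metropolis acceptance rule (function names and the toy energy function are my own; the cooling schedule is an arbitrary choice):

```python
import math
import random

def simulated_annealing(energy, x0, steps=10_000, t_start=5.0, t_end=0.01):
    """Minimize `energy` by Metropolis sampling at a decreasing temperature.

    At a fixed temperature T, the acceptance rule exp(-(E_new - E_old) / T)
    makes the chain sample the Boltzmann distribution p(x) ~ exp(-E(x) / T).
    Lowering T concentrates that distribution on the low-energy states.
    """
    x, e = x0, energy(x0)
    for i in range(steps):
        # geometric cooling schedule from t_start down to t_end
        t = t_start * (t_end / t_start) ** (i / steps)
        x_new = x + random.gauss(0.0, 1.0)  # propose a nearby state
        e_new = energy(x_new)
        # always accept downhill moves; accept uphill moves with Boltzmann prob
        if e_new <= e or random.random() < math.exp(-(e_new - e) / t):
            x, e = x_new, e_new
    return x, e

random.seed(0)
# toy problem: minimize (x - 3)^2, so the minimum is at x = 3
x_min, e_min = simulated_annealing(lambda x: (x - 3.0) ** 2, x0=-10.0)
```

At high temperature nearly every move is accepted and the walk explores freely; as T drops, uphill moves become exponentially unlikely and the sample freezes near a minimum.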
And softmax is exactly the function that maps energies to the corresponding normalized probabilities under the Boltzmann distribution. Transformers are generally treated as modeling the probabilities of strings, and those probabilities are expressed as energies under the Boltzmann distribution (i.e., logits are on a log scale). Asking your favorite model a question works by sampling from the Boltzmann distribution based on the energies (log probabilities) the model predicts, and you can sample that distribution at any temperature you like.
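The energies-to-probabilities mapping is just softmax with a temperature knob; a minimal sketch (function names are my own, and the logits are made-up numbers):

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Boltzmann probabilities from energies: p_i ~ exp(logit_i / T)."""
    # subtract the max before exponentiating for numerical stability;
    # this cancels in the normalization and doesn't change the result
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample_token(logits, temperature=1.0):
    """Sample an index from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.1]
# low temperature sharpens the distribution toward the argmax;
# high temperature flattens it toward uniform
cold = softmax(logits, temperature=0.1)
hot = softmax(logits, temperature=10.0)
```

Temperature 0 is greedy decoding (always the argmax), temperature 1 samples the model's distribution as-is, and higher temperatures flatten it, which is exactly the role temperature plays in the annealing acceptance rule.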