Ohhh, you know what? I think this makes sense, since it's basically like making prototypes from the training set, except instead it compares against every training example and then averages. Cool. I didn't know that would be remotely close to 90%, but it makes sense to me.
I wonder if something like a weighted max_mean would perform better, or maybe L1. Or maybe L2 is ideal; it's at the center of a lot of information theory, after all! ;PPPP
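For the curious, here's a minimal sketch of the idea being described: classify a point by averaging its distance to every training example of each class and picking the class with the smallest mean. All names here are my own (nothing from a specific library), and the `ord` parameter lets you swap L2 for L1 to play with the question above.

```python
import numpy as np

def mean_distance_classify(X_train, y_train, X_test, ord=2):
    # For each test point: compute its distance to EVERY training example,
    # average those distances within each class, and predict the class
    # whose mean distance is smallest. With ord=2 this uses L2 distance;
    # ord=1 gives L1. (Hypothetical helper, written for illustration.)
    classes = np.unique(y_train)
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, ord=ord, axis=1)
        mean_per_class = [dists[y_train == c].mean() for c in classes]
        preds.append(classes[np.argmin(mean_per_class)])
    return np.array(preds)

# Tiny sanity check on two well-separated blobs.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(0.0, 0.5, (50, 2)),   # class 0 around (0, 0)
    rng.normal(3.0, 0.5, (50, 2)),   # class 1 around (3, 3)
])
y_train = np.array([0] * 50 + [1] * 50)
preds = mean_distance_classify(X_train, y_train, np.array([[0, 0], [3, 3]]))
```

Note this isn't quite the same as a prototype (centroid) classifier: averaging squared L2 distances adds each class's spread on top of the distance to its centroid, so tighter classes get a slight edge.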