I think it’s the other way around? I haven’t run into any ML code that could benefit from multithreading that wasn’t already written in C++, but I have often run into server tasks that could use a polling thread, etc.
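For what it’s worth, here’s a minimal sketch of the kind of polling thread I mean (the poll_upstream function, queue, and one-second interval are made up for illustration). Since waits like time.sleep release the GIL, this pattern already works fine in today’s CPython:

    import queue
    import threading
    import time

    updates: "queue.Queue[str]" = queue.Queue()

    def poll_upstream(stop: threading.Event) -> None:
        # I/O-style waiting releases the GIL, so this daemon thread
        # doesn't block the rest of the server.
        while not stop.is_set():
            updates.put(f"poll at {time.time():.0f}")  # stand-in for a real check
            time.sleep(1.0)

    stop = threading.Event()
    threading.Thread(target=poll_upstream, args=(stop,), daemon=True).start()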
All the ML code is written in lower-level languages, and that’s very unlikely to change, GIL or no.
Yeah, you're right: even though CUDA is async, doing any preprocessing in Python can be harder if you don't have shared memory (the start-up latency of multiprocessing isn't a problem in this context). I've only ever encountered "embarrassingly parallel" data-feeding problems, where the memory overhead of multiprocessing was small, but I could see other situations. Comment retracted.
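Edit: for anyone curious, a rough sketch of the shared-memory workaround I was alluding to, using multiprocessing.shared_memory; the batch shape and the "preprocessing" step are placeholders, not anything from a real pipeline:

    import numpy as np
    from multiprocessing import Process
    from multiprocessing.shared_memory import SharedMemory

    SHAPE, DTYPE = (4, 1024), np.float32  # hypothetical batch dimensions

    def preprocess(shm_name: str) -> None:
        # Attach to the existing block and write the batch in place,
        # so nothing gets pickled or copied between processes.
        shm = SharedMemory(name=shm_name)
        batch = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
        batch[:] = np.random.rand(*SHAPE)  # stand-in for real preprocessing
        shm.close()

    if __name__ == "__main__":
        nbytes = int(np.prod(SHAPE)) * np.dtype(DTYPE).itemsize
        shm = SharedMemory(create=True, size=nbytes)
        worker = Process(target=preprocess, args=(shm.name,))
        worker.start()
        worker.join()
        batch = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
        print(batch.mean())  # consume the batch without copying it back
        shm.close()
        shm.unlink()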