I have a more meta question related to Python, and only remotely related to this post. Given how much Python is loved for machine learning and deep learning, I am wondering if the other, more traditional use cases Python served could suffer. What about future development of the language itself: is ML big enough to influence the direction and priorities of the language? This is far-fetched, but if a critical mass of companies and people start using Python for ML exclusively, could it hurt Python as a "general purpose language"?
One can always fork Python to keep it "general" if it ever comes to that. In fact, I would argue both the ML and non-ML domains would benefit from such a fork. The reason Python is used in ML is not that it's intrinsically good for it, but that it happened to have a good set of libraries and a good community at the right time and place.
Having a specialized subset of Python fine-tuned/"compiled" for ML would be a good path for the language to evolve along. I would not mind having all kinds of ML-related algebraic operations as part of the language so the code reads more naturally. And beyond that, every framework would have to support that language and would thus become much more interchangeable. Just as you can have CPython, PyPy, and others that converge on the same language, imagine TensorFlow, PyTorch, and the like converging on a single language with built-in ML operations.
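Python has actually taken a small step in this direction already: PEP 465 added a dedicated matrix-multiplication operator (@) to the core language, and NumPy and PyTorch both implement it, so the same expression already dispatches naturally to either framework. A minimal illustration:

    import numpy as np
    import torch

    a_np, b_np = np.ones((2, 3)), np.ones((3, 2))
    a_t, b_t = torch.ones(2, 3), torch.ones(3, 2)

    # The same source expression dispatches to each framework's own
    # implementation via __matmul__, so the code reads the same
    # whichever backend holds the data.
    print(a_np @ b_np)  # 2x2 NumPy array of 3.0
    print(a_t @ b_t)    # 2x2 PyTorch tensor of 3.0

More built-in operators along these lines would give exactly the kind of cross-framework convergence described above.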
I think the hidden variable here is the affinity of academics for Python. I've never seen so much Python as in universities. It mostly draws converts from MATLAB, and the feel of the language fits nicely onto what they're used to. It's probably the most popular language in education (having replaced Smalltalk and Scheme) and it'll probably stick around for a while.
Other traditional use cases for Python haven't been affected by the popularity of ML so much as by the shift towards static typing in corporate settings and competition from newer convenience-oriented languages like Go and Elixir, as well as the growth of the JavaScript ecosystem. My first real job involved porting an app from Python to Go.
We use it for a lot of back-end services that used to be built in C# and ASP. Partly because Python has better-supported libraries for a lot of the things we need (even things related directly to Microsoft tech), but also because it's just really productive. It's not as performant as .NET Core, but we don't need that as much as we need shorter development/operations time. We can buy more iron, but we quite literally can't find enough techies.
Luckily everyone seems to treat Python as a first-class citizen, so it integrates perfectly with our Microsoft-heavy tech stack.
This is great to see! This framework lets you define functions that can then be run on many machines, gathering the output in a fault-tolerant, scale-out way. However, having only an AWS example is very limiting; I hope Azure and GCP examples get added soon. Better docs on how the infrastructure works underneath (Ray? Kubernetes?) would also be appreciated. I'd love to see an example that trains ImageNet in just a few minutes when cheap spot instances are available in the cloud of your choice.
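For readers who haven't seen the pattern: assuming Ray really is the layer underneath (the docs leave this unclear, hence the question above), the remote-function style being described looks roughly like this sketch in plain Ray, not this framework's own API:

    import ray

    ray.init()  # connects to an existing cluster, or starts a local one

    @ray.remote
    def score_chunk(chunk):
        # Placeholder work; Ray retries the task if a node dies,
        # which is where the fault tolerance comes from.
        return sum(chunk)

    # Fan out across machines, then gather the outputs back.
    futures = [score_chunk.remote(range(i, i + 10)) for i in range(0, 100, 10)]
    print(ray.get(futures))

score_chunk here is just a stand-in for a real training or scoring function.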
At a high level, this looks similar to Spark's barrier mode for TensorFlow / Horovod, except this system relies on etcd, which k8s folks know has some limitations and admin costs...
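For comparison, barrier mode is exposed in PySpark roughly like this (a minimal sketch, with train_partition as a hypothetical stand-in for a real training step):

    from pyspark.sql import SparkSession
    from pyspark import BarrierTaskContext

    spark = SparkSession.builder.appName("barrier-demo").getOrCreate()

    def train_partition(iterator):
        # Barrier tasks are gang-scheduled: all start together, and
        # ctx.getTaskInfos() exposes peer addresses for wiring up
        # an MPI/NCCL ring the way Horovod needs.
        ctx = BarrierTaskContext.get()
        ctx.barrier()  # every task in the stage waits here
        yield sum(iterator)  # stand-in for an actual training step

    rdd = spark.sparkContext.parallelize(range(1000), numSlices=4)
    print(rdd.barrier().mapPartitions(train_partition).collect())

Notably, Spark handles the coordination itself rather than delegating group membership to an external store like etcd.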
For a modeling-focused project, one will still probably do better with a single multi-GPU machine than with the complexity of elastic scaling.
I'm actually not sure why everyone is investing in their own scaling frameworks.
Would it have been hard to enhance Dask and leverage that? For example, there has been a huge conversation and a lot of work in Dask specifically to support PyTorch - https://github.com/dask/distributed/issues/2581
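For what it's worth, the generic fan-out/gather primitives are already there in dask.distributed; the linked issue is mostly about the PyTorch-specific coordination on top. A minimal sketch (train_shard is a hypothetical placeholder for a per-worker training step):

    from dask.distributed import Client

    # Client() with no address spins up a local cluster;
    # point it at a scheduler to use real machines.
    client = Client()

    def train_shard(shard_id):
        # Placeholder for a per-worker PyTorch training step.
        return shard_id * shard_id

    futures = client.map(train_shard, range(8))  # fan out to workers
    print(client.gather(futures))                # collect the results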
https://aws.amazon.com/blogs/machine-learning/introducing-dy...
Good to see more work in this direction, and we would be happy to collaborate on this.