More

smhx · on Aug 16, 2024

the author got a couple of things wrong, that are worth pointing out:

1. PyTorch is going all-in on torch.compile -- Dynamo is the frontend, Inductor is the backend -- with a strong default Inductor codegen powered by OpenAI Triton (which now has CPU, NVIDIA GPU and AMD GPU backends). The author's view that PyTorch is building towards a multi-backend future isn't really where things are going. PyTorch supports extensibility of backends (including XLA), but there's disproportionate effort into the default path. torch.compile is 2 years old, XLA is 7 years old. Compilers take a few years to mature. torch.compile will get there (and we have reasonable measures that the compiler is on track to maturity).

2. PyTorch/XLA exists, mainly to drive a TPU backend for PyTorch, as Google gives no other real way to access the TPU. It's not great to try shoe-in XLA as a backend into PyTorch -- as XLA fundamentally doesn't have the flexibility that PyTorch supports by default (especially dynamic shapes). PyTorch on TPUs is unlikely to ever have the experience of JAX on TPUs, almost by definition.

3. JAX was developed at Google, not at Deepmind.

n7g · on Aug 17, 2024

Hey, thanks for actually engaging with the blog's points instead of "Google kills everything it touches" :)

1. I'm well aware of the PyTorch stack, but this point:

> PyTorch is building towards a multi-backend future isn't really where things are going

>PyTorch supports extensibility of backends (including XLA)

Is my problem. Those backends just never integrate well as I mentioned in the blogpost. I'm not sure if you've ever gone into the weeds, but there are so many (often undocumented) sharp edges when using different backends that they never really work well. For example, how bad Torch:XLA is and the nightmare inducing bugs & errors with it.

> torch.compile is 2 years old, XLA is 7 years old. Compilers take a few years to mature

That was one of my major points - I don't think leaning on torch.compile is the best idea. A compiler would inherently place restrictions that you have to work-around.

This is not dynamic, nor flexible - and it flies in the face of torch's core philosophies just so they can offer more performance to the big labs using PyTorch. For various reasons, I dislike pandering to the rich guy instead of being an independent, open-source entity.

2. Torch/XLA is indeed primarily meant for TPUs - like the quoted announcement, where they declare to be ditching TF:XLA in favour of OpenXLA. But there's still a very real effort to get it working on GPUs - infact, a lab on twitter declared that they're using Torch/XLA on GPUs and will soon™ release details.

XLA's GPU support is great, its compatible across different hardware, its optimized and mature. In short, its a great alternative to the often buggy torch.compile stack - if you fix the torch integration.

So I won't be surprised if in the long-term they lean on XLA. Whether that's a good direction or not is upto the devs to decide unfortunately - not the community.

3. Thank you for pointing that out. I'm not sure about the history of JAX (maybe might make for a good blogpost for JAX devs to write someday), but it seems that it was indeed developed at Google research, though also heavily supported + maintained by DeepMind.

Appreciate you giving the time to comment here though :)

smhx · on Aug 19, 2024

If you're the author, unfortunately I have to say that the blog is not well-written -- misinformed about some of the claims and has a repugnant click-baity title. you're getting the attention and clicks, but probably losing a lot of trust among people. I didn't engage out of choice, but because of a duty to respond to FUD.

> > torch.compile is 2 years old, XLA is 7 years old. Compilers take a few years to mature

> That was one of my major points - I don't think leaning on torch.compile is the best idea. A compiler would inherently place restrictions that you have to work-around.

There are plenty of compilers that place restrictions that you barely notice. gcc, clang, nvcc -- they're fairly flexible, and "dynamic". Adding constraints doesn't mean you have to give up on important flexibility.

> This is not dynamic, nor flexible - and it flies in the face of torch's core philosophies just so they can offer more performance to the big labs using PyTorch. For various reasons, I dislike pandering to the rich guy instead of being an independent, open-source entity.

I think this is an assumption you've made largely without evidence. I'm not entirely sure what your point is. The way torch.compile is measured for success publicly (even in the announcement blogpost and Conference Keynote, link https://pytorch.org/get-started/pytorch-2.0/ ) is by measuring on a bunch of popular PyTorch-based github repos in the wild + popular HuggingFace models + the TIMM vision benchmark. They're curated here https://github.com/pytorch/benchmark . Your claim that its to mainly favor large labs is pretty puzzling.

torch.compile is both dynamic and flexible because: 1. it supports dynamic shapes, 2. it allows incremental compilation (you dont need to compile the parts that you wish to keep in uncompilable python -- probably using random arbitrary python packages, etc.). there is a trade-off between dynamic, flexible and performance, i.e. more dynamic and flexible means we don't have enough information to extract better performance, but that's an acceptable trade-off when you need the flexibility to express your ideas more than you need the speed.

> XLA's GPU support is great, its compatible across different hardware, its optimized and mature. In short, its a great alternative to the often buggy torch.compile stack - if you fix the torch integration.

If you are an XLA maximalist, that's fine. I am not. There isn't evidence to prove out either opinions. PyTorch will never be nicely compatible with XLA until XLA has significant constraints that are incompatible with PyTorch's User Experience model. The PyTorch devs have given clear written-down feedback to the XLA project on what it takes for XLA+PyTorch to get better, and its been a few years and the XLA project prioritizes other things.

n7g · on Aug 20, 2024

> There are plenty of compilers that place restrictions that you barely notice. gcc, clang, nvcc -- they're fairly flexible, and "dynamic"

In the context of scientific computing - this is completely, blatantly false. We're not lowering low-level IR to machine code. We want to perform certain mathematical processes often distributed on a large number of nodes. There's a difference between ensuring optimization (i.e no I/O bottlenecks, adequate synchronization between processes, overlapping computation with comms) vs. simply transforming a program to a different representation.

This is classic [false analogy](https://simple.wikipedia.org/wiki/False_analogy)

Adding constraints does mean that you give up on flexibility precisely because you have to work around them. For example, XLA is constrained intentionally against dynamic-loops because you lose a lot of performance and suffer a huge overhead. So the API forces you to think about it statically (like you can work around it with fancier methods like using checkpointing and leveraging a tree-verse algorithm)

I'll need more clarification regarding this point, because I don't know what dev in which universe will not regard "constraints" as flying against the face of flexibility.

> popular HuggingFace models + the TIMM vision benchmark

Ah yes, benchmark it on models that are entirely static LLMs or convnet-hybrids. Clearly, high requirement on dynamicness and flexibility there.

(I'm sorry but that statement alone has lost you any credibility for me.)

> Your claim that its to mainly favor large labs is pretty puzzling.

Because large labs often play with the safest models, which often involves scaling them up (OAI, FAIR, GDM etc.) and those tend to be self-attention/transformer like workloads. The devs have been pretty transparent about this - you can DM them if you want - but their entire stack is optimized for these usecases.

And ofcourse, that won't involve considering for research workloads which tend to be highly non-standard, dynamic and rather complex and much, much harder to optimize for.

This is where the "favouring big labs" comes from.

> 1. it supports dynamic shapes

I agree that in the specifically narrow respect of dynamic shapes, it's better than XLA.

But then it also misses a lot of the optimization features XLA has such as its new cost model and Latency Hiding Scheduler (LHS) stack which is far better at async overlapping of comms, computations and even IO (as its lazy).

> there is a trade-off between dynamic, flexible and performance

Exactly. Similarly, there's a difference in the features offered by each particular compiler. Torch's compiler's strengths may be XLA's weakness, and vice-versa.

But its not perfect - no software can be, and compilers certainly aren't exceptions. My issue is that the compiler is being considered at all in torch.

There are use-cases where the torch.compile stack fails completely (not sure how much you hang around more research-oriented forums) wherein there are some features that simply do not work with torch.compile. I cited FSDP as the more egregious one because its so common in everyone's workflow.

That's the problem. Torch is optimizing their compiler stack for certain workloads, with a lot of new features relying on them (look at newly proposed DTensor API for example).

If I'm a researcher with a non-standard workload, I should be able to enjoy those new features without relying on the compiler - because otherwise, it'd be painful for me to fix/restrict my code for that stack.

In short, I'm being bottlenecked by the compiler's capabilities preventing me to fully utilize all features. This is what I don't like. This is why torch should never be leaning at a compiler at all.

It 'looks' like a mere tradeoff, but reality is just not as simple as that.

> XLA:GPU

I don't particularly care if torch uses whatever compiler stack the devs choose - that's beside the point. Really, I just don't like the compiler-integrated approach at all. The choice of the specific stack doesn't matter.

lunaticd · on Aug 17, 2024

3. The project started under a Harvard affiliated Github org during the course of PhDs. These same people later joined Google where it continued to be developed and over time adopted more and more in place of TensorFlow.

smhx · on Aug 1, 2024

it's not a new PyTorch feature.

It's just a showcase of existing PyTorch features (including libtorch) as an end-to-end example.

On the server-side it uses libtorch, and on mobile, it uses PyTorch's executorch runtime (that's optimized for edge)

BaculumMeumEst · on Aug 1, 2024

Did not know executorch existed! That's so cool! I have it on my bucket list to tinker with running LLMs on wearables after I'm a little further along in learning, great to see official tooling for that!

https://github.com/pytorch/executorch

sunshinesfbay · on Aug 1, 2024

I think this is not about new Pytorch features, although it requires the latest Pytorch and Executorch making me think that some features in pytorch and executorch got extended optimized for this use case?

What makes this cool is that you can use the same model and the same library and apply to server, desktop, laptop and mobile on iOS and Android, with a variety of quantization schemes and other features.

Definitely still some rough edges as I'd expect from any first software release!

smhx · on May 18, 2024

that's a direct implication that they're waiting for a liquidity event before they speak

smhx · on April 22, 2024

most of the GenAI players use both PyTorch and JAX, depending on the hardware they are running on. Character, Anthro, Midjourney, etc. are dual shops (they use both). xAI only uses JAX afaik.

smhx · on Dec 7, 2023

You've created a superior llama/mistral-derivative model -- like https://old.reddit.com/r/LocalLLaMA/comments/17vcr9d/llm_com...

How can you convince the world to use it (and pay you)?

Step 1: You need a 3rd party to approve that this model is safe and responsible. the Purple Llama project starts to bridge this gap!

Step 2: You need to prove non-sketchy data-lineage. This is yet unsolved.

Step 3: You need to partner with a cloud service that hosts your model in a robust API and (maybe) provides liability limits to the API user. This is yet unsolved.

smhx · on May 20, 2023

>What's a fair benchmark?

the absolute golden benchmarks are https://github.com/pytorch/benchmark They are a diverse set of userland code taken from github as-is and made into benchmarks.

smhx · on Jan 1, 2023

> So technically, if you are pulling the older version of pytorch-nightly (specifically 2.0.0.dev20221230), it will still pull that compromised dependency (because torch have explicit version lock to it).

All PyTorch nightlies with this dependency have been deleted

shanipribadi · on Jan 1, 2023

@smhx are you sure? at the time of this comment, I was still able to download 2.0.0.dev20221230

  pip3 download torch==2.0.0.dev20221230+cpu --extra-index-url https://download.pytorch.org/whl/nightly/cpu

and on extracting the wheel, METADATA still have

  Requires-Dist: torchtriton (==2.0.0+0d7e753227) ; extra == 'dynamo'

The package dated 20221231 has pytorch-triton already (so should be safe now)

Although I guess this is low risk, because people normally would download nightlies without pinning to a particular version/date.

But in case there are people that does pin their version, and cache those vulnerable versions (locally or on their own proxies/private repositories), they could still be affected.

I recommend to get PyPA to yank the 2.0.0.dev20221230 version in pypi, and possibly amend the post to remind people to purge their caches not just on their local but also on their proxies/private repos/mirrors (mainly for the torchtriton package) and to immediately stop using any pytorch nightlies dated before Dec 31 2022 (mainly any pytorch nighlies that has a pin on torchtriton==2.0.0+0d7e753227, not just between 25 Dec to 30 Dec).

smhx · on Jan 1, 2023

thanks for the heads-up, looks like we didn't yank the CPU wheels on those dates. will get to them in the next set of working hours, as its an unlikely scenario (not only do you have to install the wheel of a specific date, you also have to specify the undocumented feature flag [dynamo])

smhx · on Jan 1, 2023

This is the last one. It was also the first one.

upwardbound · on Jan 1, 2023

thank you.

smhx · on Jan 1, 2023

this only affects the nightly pytorch. the stable pytorch build doesn't depend on `torchtriton`.

the nightly pytorch moved to depend on our own secured `pytorch-triton` now, secured on PyPI and our nightly channel.

smhx · on Jan 1, 2023

the version numbers are different, even though the package names are the same. A stable version will have a version number such as `1.13.0`, where as a nightly version will have the date in the version number, such as `2.0.0.dev20221230`. You can check this either with `pip list | grep torch` or via `python -c "import torch; print(torch.__version__)"`

If you installed `torch` via the instructions to install the nightly version specifically, then you get the nightly version. By default, you get the stable version of `torch`.