I love the idea, that's the future. However, you should be aware that the explanation of the second law of thermodynamics generated by the LLM in your App Store screenshot is wrong: the LLM has it backwards. Energy transfers from less stable states to more stable states, not the reverse. (I use LLMs for science education apps like https://apps.apple.com/fr/app/explayn-learn-chemistry/id6448..., so I am quite used to spotting that kind of error in LLM output...)
Local, app-embedded, purpose-built expert models are clearly the future in my mind, for a variety of reasons. Looking at the TPUs in Android devices and the Neural Engine in Apple hardware, it's pretty clear.
Xcode already ships ML tooling (Create ML / Core ML), for example, that can not only embed and integrate models in apps but also fine-tune them, etc. It's obvious to me that at some point most apps will have models embedded in the app (or on the device) for specific purposes.
No AI can compare to humans, and even we specialize. You wouldn't hire a plumber to perform brain surgery and you wouldn't hire a neurosurgeon to fix your toilet. Mixture of experts is a thing in AI models, of course, but when we look at how we primarily interact with technology, the functionality it provides is generally pretty well siloed to specific purposes.
A small model trained/tuned for a specific domain and context, working on your on-device data, would likely do nearly as well as, if not better than, even ChatGPT for some applications. Think of the next version of device keyboards doing RAG+LLM over your text messages to generate replies. Stack that up with speech-to-text, vision, multimodal models, and who knows what else, and yeah, interesting.
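To make that concrete, here is a minimal sketch of that retrieve-then-generate loop over local messages. The embed() and generate() functions are hypothetical placeholders for whatever small on-device embedding model and LLM the keyboard would actually ship; only the retrieval plumbing is real:

    # Minimal on-device RAG sketch: retrieve similar past messages, then
    # prompt a local model with them. embed()/generate() are placeholders.
    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Stand-in for a small on-device embedding model.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.standard_normal(384)
        return v / np.linalg.norm(v)

    def generate(prompt: str) -> str:
        # Stand-in for a small on-device LLM.
        return "Sounds good, see you at 7!"

    def suggest_reply(history: list[str], incoming: str, k: int = 3) -> str:
        index = np.stack([embed(m) for m in history])   # built incrementally in practice
        scores = index @ embed(incoming)                # cosine similarity (unit vectors)
        context = [history[i] for i in np.argsort(scores)[::-1][:k]]
        prompt = ("Relevant past messages:\n- " + "\n- ".join(context)
                  + f"\n\nNew message: {incoming}\nDraft a short reply:")
        return generate(prompt)

    print(suggest_reply(["Dinner Friday?", "Running late", "Love that place"],
                        "Are we still on for tonight?"))

Everything here stays on the device: the message index, the retrieval, and the generation.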
Throw in the automatic scaling, latency, and privacy and the wins really stack up.
Some random app developer can integrate a model in their application and scale higher with better performance than ChatGPT without setting money on fire.
> Local, app-embedded, purpose-built expert models are clearly the future in my mind, for a variety of reasons. Looking at the TPUs in Android devices and the Neural Engine in Apple hardware, it's pretty clear.
I think that's only true for delay-intolerant or privacy-focused features. For most situations, a remote model running on an external server will outperform a local model. There is no thermal, battery, or memory headroom for the local model to ever do better. The cost is a mere hundred milliseconds of delay, at most.
I expect most models triggered on consumer devices to run remotely, with a degraded local service option in case of connection problems.
Snapchat filters, iPhone photo processing/speech to text/always-on Hey Siri/OCR/object detection and segmentation - there are countless applications and functionality doing this on device today (and for years). For something like the RAG approach I mentioned, syncing and coordinating your local content with a remote API would be more taxing on the battery, just in terms of the radio, than what we already see from on-device neural engines and TPUs as leveraged by the functionality I described.
These applications would also likely be very upload-heavy (photo/video inference - massive upload, tiny JSON response), which could very likely end up taxing cell networks further. Even RAG is thousands of tokens in and a few hundred out (in most cases).
There's also the issue of Nvidia GPUs having > 1 yr lead times and cloud providers running out of available GPUs. LLMs especially use tremendous resources for training, and that growth is leading to more and more contention for available GPU resources. People are going to be looking more and more to save the cloud and the big GPUs for what you really need to do there - big training runs.
Plus, not everyone can burn $1m/day like ChatGPT.
If AI keeps expanding and eating more and more functionality, the remote-first approach just isn't sustainable.
There will likely always be some sort of blend (with the serious heavy lifting being in the cloud, of course), but it's going to shift more and more to local and on-device. There's just no other way.
> Snapchat filters, iPhone photo processing/speech to text/always-on Hey Siri/OCR/object detection and segmentation - there are countless applications and functionality doing this on device today (and for years)
But those are peanuts compared to what will be possible in the (near) future. You think content-aware fill is neat? Wait until you can zoom out of a photo 50% or completely change the angle.
That'll cost gobs of processing power and thus time and battery, much more than a 20MB burst transfer of a photo and the back-synced modifications.
> If AI keeps expanding and eating more and more functionality, the remote-first approach just isn't sustainable.
It’ll definitely create a large moat around companies with lots of money or extremely efficient proprietary models.
> That'll cost gobs of processing power and thus time and battery
The exact same thing was said about the functionality we're describing, yet there it is. Imagine describing that to someone in 2010 who's already complaining about iPhone battery life. The response would be a carbon copy of yours.
In the five years from the iPhone 8 to the iPhone 14, TOPS on the Neural Engine went from 0.6 to 17 [0]. The iPhone 15 more than doubled that and stands at 35 TOPS [1]. Battery life is better than ever, and that's a 58x gain just in the Neural Engine, not even counting GPU, CPU, performance cores, etc.
Over that same period of time Nvidia GPUs only increased about 9x [2] - they're already pushing the fundamentals much harder, so it's a law-of-large-numbers-ish situation.
So yeah, I won't have to wait long for zooming out of a photo 50%, completely changing the angle, or who knows what else to be done locally. In fact, for these use cases, increasingly advanced optics, processing, sensors outside the visible range, etc. make my point even stronger - even more data going to the cloud when the device is best suited to be doing the work anyway.
Look at it this way - Apple sold over 97 million iPhones in 2023. Assuming the lower average of 17 TOPS per device, that's 1,649,000,000 combined TOPS out there.
Cloud providers benefit from optimization and inherent oversubscription, but by comparison Nvidia sold somewhere around 500,000,000 TFLOPS worth of H100s last year.
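A quick back-of-envelope with those same figures, keeping in mind that phone NPU INT8 TOPS and datacenter FP TFLOPS are not an apples-to-apples comparison and that the H100 total is the rough estimate above:

    # Back-of-envelope using the figures from this thread.
    iphones_sold_2023 = 97e6              # units
    tops_per_iphone = 17                  # lower-end Neural Engine figure
    fleet_tops = iphones_sold_2023 * tops_per_iphone
    h100_tflops_sold = 5e8                # rough estimate from the comment above

    print(f"iPhone fleet: {fleet_tops:.3e} TOPS")          # ~1.649e+09
    print(f"H100s sold:   {h100_tflops_sold:.3e} TFLOPS")
    print(f"ratio:        {fleet_tops / h100_tflops_sold:.1f}x")   # ~3.3x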
Mainframe and serial terminal to desktop to thin client and terminal server - around and around we go.
Stability is actually defined by having a lower energy level. That explains why energy can only flow from a less stable system to a more stable one: the more stable system has no available energy to give.
Using blocks lets you keep good performance on GPUs while giving some flexibility in the pruning pattern. And when you remove entirely empty rows and columns, the pruned matrices are actually pretty dense, so it's competitive with structured pruning for speedup, but less "aggressive" on the network during the pruning process.
Disclaimer: I am the main co-author.
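For readers unfamiliar with the idea, here is a generic illustration of block pruning (score blocks, zero the weak ones, drop rows/columns that end up empty). This is a sketch of the general technique, not the paper's actual algorithm or scoring criterion:

    # Generic block-pruning illustration: zero out low-norm blocks of a weight
    # matrix, then drop rows/columns that became entirely empty so what remains
    # is a smaller, mostly dense matrix.
    import numpy as np

    def block_prune(w: np.ndarray, block: int = 32, keep_ratio: float = 0.25) -> np.ndarray:
        rows, cols = w.shape
        assert rows % block == 0 and cols % block == 0
        # View the matrix as a grid of (block x block) tiles and score each tile.
        tiles = w.reshape(rows // block, block, cols // block, block)
        scores = np.linalg.norm(tiles, axis=(1, 3))          # Frobenius norm per tile
        # Keep only the highest-scoring tiles.
        mask = (scores >= np.quantile(scores, 1.0 - keep_ratio))[:, None, :, None]
        pruned = (tiles * mask).reshape(rows, cols)
        # Remove rows and columns that are now entirely zero.
        keep_rows = ~np.all(pruned == 0, axis=1)
        keep_cols = ~np.all(pruned == 0, axis=0)
        return pruned[keep_rows][:, keep_cols]

    w = np.random.randn(256, 256)
    print(block_prune(w).shape)   # rows/cols that ended up all-zero are dropped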
Thanks for this! I have been working on something similar for an upcoming education app: record a course and play it back in the same app, with a compact file format (1MB per minute, could be much less with some tricks). You can see a demo here: https://youtu.be/zcHAzQXm3Hg (and more at https://explayn.me). I will definitely check out your lib, and will be happy to switch if it's better!
Thanks for the kind words! Yes, file size for live action, or for shipping directly in an app, is always an issue. I may actually contribute some code; my file format is no rocket science, just a bare set of floats with some metas, not even diff encoding between frames, so it's quite easy to parse across languages and platforms.
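As a minimal sketch of what "a bare set of floats with some metas" can look like - the header fields and layout here are hypothetical, not the actual format - a small fixed header followed by one flat array of float32 values per frame:

    # Hypothetical "header + flat float32 frames" layout, for illustration only.
    import struct

    def write_take(path, fps, values_per_frame, frames):
        with open(path, "wb") as f:
            # Header: magic, fps, floats per frame, frame count.
            f.write(struct.pack("<4sIII", b"ANIM", fps, values_per_frame, len(frames)))
            for frame in frames:
                f.write(struct.pack(f"<{values_per_frame}f", *frame))

    def read_take(path):
        with open(path, "rb") as f:
            magic, fps, n, count = struct.unpack("<4sIII", f.read(16))
            frames = [struct.unpack(f"<{n}f", f.read(4 * n)) for _ in range(count)]
        return fps, frames

    write_take("take.anim", 30, 6, [[0.0] * 6 for _ in range(90)])
    print(read_take("take.anim")[0])   # 30

Anything with that shape is trivial to re-implement on another platform, which is the point.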
For a body with bones, since bones can't stretch, all you need is the rotation at each joint. You can get away with 10 or 11 bits per axis.
So for a full body you should be able to compress to about 200 bytes per frame for 50 joints. That would mean roughly 360k for 1 minute of animation at 30fps (200 bytes x 30 fps x 60 s). Interpolate to get 60fps. That doesn't include faces.
If you do faces like Apple does, which IIUC is just N morph targets where N is like 15? Those are 1 weight each, and you could easily make those 1 byte per weight or less, so that's 27k for 1 minute of animation.
Both of those could probably be compressed further by storing deltas (like Draco does) or fitting curves, for lots more compression.
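Here's a rough sketch of just the quantization step behind those numbers (the angle encoding is a simplification): 50 joints x 3 axes x 11 bits works out to about 207 bytes per frame before any delta or curve fitting.

    # Quantize each joint rotation to 11 bits per axis and bit-pack a frame.
    import math

    BITS, JOINTS, AXES = 11, 50, 3

    def quantize(angle_rad: float) -> int:
        # Map [-pi, pi) onto an 11-bit integer.
        return int((angle_rad + math.pi) / (2 * math.pi) * (2**BITS - 1)) & (2**BITS - 1)

    def pack_frame(angles):
        # angles: JOINTS * AXES values in radians, packed little-endian.
        word = 0
        for i, a in enumerate(angles):
            word |= quantize(a) << (i * BITS)
        return word.to_bytes((JOINTS * AXES * BITS + 7) // 8, "little")

    packed = pack_frame([0.1] * (JOINTS * AXES))
    print(len(packed), "bytes per frame")                            # 207
    print(len(packed) * 30 * 60 // 1000, "kB per minute at 30 fps")  # ~372

Storing per-axis deltas between consecutive frames (small integers that compress well) or fitting curves, as suggested above, shrinks this a lot further.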
Thanks! Yes, that's the way to go. There is just a trade-off between code simplicity/specialization and compression performance. Another point is the ability to use the format in memory without too much decoding or overhead when opening the file, random access for fast-forward, etc. (which delta encoding complicates). (And I actually need stretch too, because my NPC interacts with objects, so the arm/hand bones have to be at their exact place at replay - that's 32 bones just for the fingers.)
Take a look at the MCAP file format (https://mcap.dev); we invented it for the robotics industry, but it's a generic write-optimized container format for time-series data. Since it's a container format, you need to choose a serialization format as well, such as FlatBuffers or Protobuf.
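For reference, writing animation frames with the mcap Python package looks roughly like this - this is from memory of the official writer example, so double-check the signatures against the docs at https://mcap.dev, and the topic/schema names are made up:

    # Rough sketch of logging frames into an MCAP file with JSON-encoded messages.
    import json, time
    from mcap.writer import Writer

    with open("take.mcap", "wb") as stream:
        writer = Writer(stream)
        writer.start()
        schema_id = writer.register_schema(
            name="Frame",
            encoding="jsonschema",
            data=json.dumps({"type": "object"}).encode(),
        )
        channel_id = writer.register_channel(
            schema_id=schema_id,
            topic="/animation/frames",
            message_encoding="json",
        )
        for i in range(90):  # 3 seconds at 30 fps
            now = time.time_ns()
            writer.add_message(
                channel_id=channel_id,
                log_time=now,
                publish_time=now,
                data=json.dumps({"frame": i, "values": [0.0] * 6}).encode(),
            )
        writer.finish()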
The efficiency of solar panels on satellites is about 40%. With a solar constant of ~1360 W/m2, that means roughly 550 W/m2 if correctly oriented. That's quite a lot of power, even if you spend about half of the time in darkness while orbiting Earth. So usually there's no need for huge panels. (Updated: more like 40%, not 50%.)
I don't think you have the right numbers in mind talking about the compute you need for AI. The prices are getting lower and lower of course, but you still need tons of money to train the kind of networks that make the news.
I don't think you have the right numbers in mind when talking about the networks used in academic work. The majority of networks used in publications are good old references like VGG.
Example? I have yet to see something actually deployed by one of the big tech companies that could not be trained by students on a university cluster. I also think you underestimate grant funding. I worked at a state school a couple years back in a research lab that had over a million dollars in grant funds specifically for equipment and outside compute (not for salaries or new hires) and this is not at all abnormal.
State of the art CV models (image, not video) can cost 3-figure dollars per training run.
State of the art language models can cost 5-figure dollars per training run.
There are a lot of variables in play here, so your mileage will definitely vary (how much data, how long you're willing to wait, whether you really need to train from scratch, etc.), and these should only be considered very rough ballpark numbers. However, those are real numbers for SotA models on gold-standard benchmark datasets using cost-optimized cloud ML training resources.
At 5-figures per training run, the list of people who can be innovators in the LM research space is very small (fine-tuning on top of a SotA LM is a different, more affordable matter).
Sure, but 3-figure and 5-figure runs certainly do not eliminate universities (see my comment above). Not to mention, as I have said, most good universities maintain on-premise clusters capable of training these, drastically reducing that cost (and in the worst case, training just takes longer).
It really does. You've got to remember that a good SotA paper takes hundreds of training runs, at least.
I can't go into detail about budgets, but suffice to say if you think $1M is a university compute budget that lets you be a competitive research team on the cutting edge, you are __severely__ underestimating the amount of compute that leading corporate researchers are using. Orders of magnitude off.
On-prem is good for a bit, until you're 18 months into your 3-year purchase cycle, you're on K80s while the major research leaders are running V100s and TPUs, and you can't even fit the SotA model in your GPUs' memory any more.
Longer to train can mean weeks or even months for one experiment - that iteration speed makes it so hard to stay on the cutting edge.
And this is before considering things like neural architecture search and internet scale image/video/speech datasets where costs skyrocket.
The boundary between corporate research and academia is incredibly porous and a big part of that is the cost of research (compute, but also things like data labelling and staffing ML talent).
Your goalposts moved a few figures. Furthermore, $1 million+ was not a university compute budget - that was money for a single lab on campus (at a general state school, no less) on a specific project.
You still have yet to provide any concrete sources to back up your claims. We're talking about contributing to research here. If multi-million-dollar training jobs are what it takes to be at the cutting edge, you should be able to provide ample sources for that claim.
- "Some of the models are so big that even in MILA we can’t run them because we don’t have the infrastructure for that. Only a few companies can run these very big models they’re talking about" [1]. NOTE: MILA is a very good AI research center and, while I don't know too much about him, that person being quoted has great credentials so I would generally trust them.
- "the current version of OpenAI Five has consumed 800 petaflop/s-days" [2].
- Check out the Green AI paper. They have good numbers on the amount of compute needed to train a model, and you can translate that into costs.
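For a sense of scale, here is that petaflop/s-day figure converted into more familiar units; the 100 TFLOP/s sustained per V100 is my own optimistic assumption, and real utilization is lower:

    # Convert "800 petaflop/s-days" into total FLOPs and rough V100-days.
    total_flops = 800 * 1e15 * 86_400        # ~6.9e22 FLOPs
    v100_sustained = 100e12                  # FLOP/s, mixed precision, optimistic
    gpu_days = total_flops / v100_sustained / 86_400
    print(f"{total_flops:.1e} FLOPs ~= {gpu_days:,.0f} V100-days at full utilization")  # ~8,000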
I'm not an expert in on-prem ML costs, but I know many of the world's best on-prem ML users use the cloud to handle the variability of their workloads so I don't think on-prem is a magic bullet cost wise.
$1M annually per project (vs per lab) isn't bad at all. It's also way out of whack with what I saw when I was doing AI research in academia, but that was pre deep learning revolution, so what do I know.
Re: the moving goalposts - the distinction is between the cost of a training run and the cost of a paper-worthy research result. Due to inherent variability, architecture search, hyperparameter search, and possibly data cleaning work, the total cost is a couple of orders of magnitude more than the cost of a single training run (the multiple will vary a lot by project and lab).
I understand why you don't trust what I'm saying. I wish I could give hard numbers, but I'm limited in what I can say publicly so this is the best I can do.
There is no such thing as a dumb rocket: to get down, they are using almost exactly the same stuff they use to get up, weight-wise: the gimbaled engines and the nitrogen attitude thrusters. The only additional devices are the grid fins, to aerodynamically control the descent, and the legs. Then you have to write the right software, and that does not weigh much more ;-)
That and a little extra fuel. But, as you say, probably less weight than you might think. Also, getting dunked in water probably adds quite a bit of new stress, so you would need a stronger rocket shell, which is more weight.
Seems like everything else in history has really been a missile. This thing can land, so it seems like it's literally the first rocket. Do we have a definition of missile vs. rocket?