Hacker News | tmostak's comments

Even without NVLink C2C, a GPU with an x16 PCIe 5.0 link to the host gets 128 GB/sec of bidirectional bandwidth in theory and 100+ GB/sec in practice (half that in each direction), so you still come out ahead with pipelining.

Of course, prefix sums are often used within a series of other operators, so if those are already computed on the GPU you come out further ahead still.
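
A rough back-of-envelope sketch of the pipelining point, with assumed (not measured) numbers: if host-to-device copies are chunked and overlapped with the scan kernel, end-to-end time approaches the time of the slower stage, which here is the PCIe transfer.

    # Rough pipelining estimate; all numbers below are assumptions, not measurements.
    data_gb = 64.0          # hypothetical input size in GB
    pcie_h2d_gbps = 50.0    # assumed realistic one-direction x16 PCIe 5.0 bandwidth
    gpu_scan_gbps = 1000.0  # assumed on-GPU scan throughput (HBM-bound)

    transfer_s = data_gb / pcie_h2d_gbps
    compute_s = data_gb / gpu_scan_gbps
    pipelined_s = max(transfer_s, compute_s)  # with overlap, the slower stage dominates
    print(f"pipelined: {pipelined_s:.2f}s, effective {data_gb / pipelined_s:.0f} GB/s")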


Haha... GPUs are great. But do you mean to suggest we should swap a single ARM core for a top-line GPU with 10k+ cores and compare numbers on that basis? Surely not.

Let's consider this in terms of throughput-per-$ so we have a fungible measurement unit. I think we're all agreed that this problem's bottleneck is the host memory<->compute bus so the question is: for $1 which server architecture lets you pump more data from memory to a compute core?

It looks like you can get an H100 GPU with x16 PCIe 5.0 (128 GB/s theoretical, 100 GB/s realistic) for $1.99/hr from RunPod.

With an m8g.8xlarge instance (32 ARM CPU cores) you should get much better RAM<->CPU throughput (175 GB/s realistic) for $1.44/hr from AWS.
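
Putting both options on the same GB/s-per-dollar-hour scale, using only the prices and bandwidth figures quoted above (not fresh measurements):

    # GB/s of memory->compute bandwidth per $/hr, using the figures quoted above.
    options = {
        "H100 via x16 PCIe 5.0 (RunPod)": (100.0, 1.99),   # realistic GB/s, $/hr
        "m8g.8xlarge, 32 ARM cores (AWS)": (175.0, 1.44),
    }
    for name, (gbps, price) in options.items():
        print(f"{name}: {gbps / price:.0f} GB/s per $/hr")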


GH200 is $1.50/hr at Lambda and can do 450 GB/s to the GPU. Seems cheaper still?


We've made extensive use of perfect hashing in HeavyDB (formerly MapD/OmniSciDB), and it has definitely been a core part of achieving strong group by and join performance.

You can use perfect hashes not only for the usual suspects of contiguous integer and dictionary-encoded string ranges, but also for use cases like binned numeric and date ranges (epoch seconds binned per year can use a perfect hash range of one bin per year across a very wide range of timestamps), and you can even handle arbitrary expressions if you propagate the ranges correctly.

Obviously you need a good "baseline" hash path to fall back on, but it's surprising how many real-world use cases you can profitably cover with perfect hashing.
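
As a toy illustration of the year-binning case mentioned above (not HeavyDB's actual code): once the column's min and max year are known, the "hash" is just the year index, so every group gets its own slot and there is no collision handling at all.

    # Toy sketch: a "perfect hash" group-by count over epoch seconds binned per year.
    from datetime import datetime, timezone

    def year_of(epoch_s):
        return datetime.fromtimestamp(epoch_s, tz=timezone.utc).year

    def perfect_hash_group_count(epoch_seconds, min_year, max_year):
        counts = [0] * (max_year - min_year + 1)   # one slot per year bin, no probing
        for ts in epoch_seconds:
            counts[year_of(ts) - min_year] += 1    # direct index into the slot
        return counts

    counts = perfect_hash_group_count([0, 1, 31536000, 1700000000], 1970, 2023)
    print({1970 + i: c for i, c in enumerate(counts) if c})   # {1970: 2, 1971: 1, 2023: 1}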


So in HeavyDB, do you build perfect hashes on the fly for queries? I've only ever seen perfect hashes used at 'build time', when the keys are already known and fixed (like keywords in a compiler).


I had the same question! I have never heard of runtime perfect hashing. (Admittedly, I haven’t read the paper yet.)


In the DSA theory literature there is so-called "dynamic perfect hashing", but I don't think it's ever been implemented, and its use case is served by high-load-factor techniques like bucketized cuckoo hashing.


In the appendix of the survey, there are three references on dynamic perfect hashing. I think the only actual implementation of a dynamic PHF is a variant of perfect hashing through fingerprinting in the paper "Perfect Hashing for Network Applications". However, that implementation is not fully dynamic and needs to be rebuilt if the key set changes too much.


All of those modern algorithms, even relatively older ones like CHD, can find a perfect hash function over millions of keys in less than a second.[1] Periodically rebuilding a function can be more than fast enough, depending on your use case.

Last time I tried gperf, 8-10 years ago, it took hours or even days to build a hash function CHD could do in seconds or less. If someone's idea of perfect hash function construction cost is gperf (at least gperf circa 2015)... welcome to the future.

[1] See my implementation of CHD: https://25thandclement.com/~william/projects/phf.html
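
Not CHD itself, but a minimal sketch of the periodic-rebuild idea: pick random seeds until a seeded hash is collision-free over the current key set, and redo that whenever the keys change. This naive retry only makes sense for small key sets; libraries like the one linked above scale the real algorithms to millions of keys.

    # Naive perfect-hash-by-retry sketch; illustrative only.
    # (Python's str hash is salted per process, so this only holds within one run.)
    import random

    def build_perfect_hash(keys, slots=None, max_tries=10000):
        slots = slots or 2 * len(keys)            # load factor ~0.5 keeps retries low
        for _ in range(max_tries):
            seed = random.getrandbits(32)
            idx = {hash((seed, k)) % slots for k in keys}
            if len(idx) == len(keys):             # no collisions: perfect for this key set
                return seed, slots
        raise RuntimeError("no collision-free seed found; grow the table")

    keys = ["select", "from", "where", "group", "order"]
    seed, slots = build_perfect_hash(keys)
    table = [None] * slots
    for k in keys:
        table[hash((seed, k)) % slots] = k        # rebuild like this when the keys change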


The article had a reference to using it to compress data in a table (maybe similar in spirit to using a foreign key to a small table). I could also see it being useful for compression dictionaries, but again that's not really a run-time use (and I'm sure I'm not the first to think of it).


This looks amazing!

Just looking through the code a bit, it seems that the model supports a (custom) attention mechanism both between features and between rows (the code uses the term "items")? If so, does the attention between rows help improve accuracy significantly?

Generally, for standard regression and classification use cases, rows (observations) are treated as independent, but I'm guessing cross-row attention might help the model see the gestalt of the data in some way that improves accuracy even when the independence assumption holds?


Author here: The new introduction of attention between features did make a big impact compared to the first variant of TabPFN. The old model treated every feature position as distinct, as if it mattered whether something was feature 5 or feature 15, but in practice features are typically more-or-less permutation invariant. So the logic is similar to why a CNN is better for images than an MLP.
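
A rough sketch of the two attention axes as I read them (shapes and layer choices below are assumptions, not TabPFN's actual architecture): with one dataset laid out as (rows, features, d), feature attention treats each row as a batch element, and row attention does the same after a transpose.

    # Sketch of alternating attention over features and over rows for tabular data.
    import torch
    import torch.nn as nn

    rows, feats, d = 128, 10, 64
    x = torch.randn(rows, feats, d)               # one dataset: rows x features x embedding

    feat_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
    row_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    # Attention between features: each row is a "batch" element, features are the sequence.
    x, _ = feat_attn(x, x, x)

    # Attention between rows (items): transpose so each feature is a "batch" element.
    xt = x.transpose(0, 1)                        # (features, rows, d)
    xt, _ = row_attn(xt, xt, xt)
    x = xt.transpose(0, 1)                        # back to (rows, features, d)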


Speculating, cross-row might give you information where you are in that row distribution.


You should be able to train/full-fine-tune (i.e. full weight updates, not LoRA) a much larger model with 96GB of VRAM. I have generally been able to do a full fine-tune (which is equivalent to training a model from scratch) of 34B-parameter models at full bf16 using 8x A100 servers (640GB of VRAM) if I enable gradient checkpointing, meaning a 96GB VRAM box should be able to handle models of up to ~5B parameters. Of course, if you use LoRA, you should be able to go much larger than this, depending on your rank.
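
Back-of-envelope for this, assuming Adam with bf16 weights and grads plus fp32 master weights and two fp32 optimizer states (roughly 16 bytes per parameter), and that gradient checkpointing keeps activations small. This is a rule of thumb, not a measurement:

    # Rough VRAM estimate for full bf16 fine-tuning with Adam and gradient checkpointing.
    # 2 (bf16 weights) + 2 (bf16 grads) + 4 (fp32 master) + 8 (fp32 Adam m, v) = 16 B/param.
    bytes_per_param = 16

    def max_params(vram_gb, overhead=0.85):       # leave headroom for activations/buffers
        return vram_gb * overhead * 1e9 / bytes_per_param

    for vram in (96, 640):
        print(f"{vram} GB VRAM -> ~{max_params(vram) / 1e9:.1f}B params")
    # ~5B for 96 GB and ~34B for 640 GB, consistent with the numbers above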


This assumes that you can linearly scale up the number of TPUs to get equal performance to Nvidia cards for less cost. Like most things distributed, this is unlikely to be the case.


This is absolutely the case, TPUs scale very well: https://github.com/google/maxtext .


The repo mentions a Karpathy tweet from Jan 2023. Andrej has recently created llm.c, and the same model trained about 32x faster on the same Nvidia hardware mentioned in the tweet. I don't think the performance estimate the repo used (based on that early tweet) was accurate for the performance of the Nvidia hardware itself.


Are you measuring tokens/sec or words per second?

The difference matters, as in my experience Llama 3, by virtue of its giant vocabulary, generally tokenizes text with 20-25% fewer tokens than something like Mistral. So even if it's 18% slower in terms of tokens/second, it may, depending on the text content, actually output a given body of text faster.
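
One way to check this yourself, assuming you have tokenizer access for both models on the Hugging Face hub (the model IDs and the sample file below are placeholders, and both repos may be gated):

    # Compare tokens-per-word for two tokenizers on the same sample text.
    from transformers import AutoTokenizer

    sample = open("sample.txt").read()             # any representative text
    n_words = len(sample.split())
    for model_id in ("meta-llama/Meta-Llama-3-8B", "mistralai/Mistral-7B-v0.1"):
        tok = AutoTokenizer.from_pretrained(model_id)
        n_tokens = len(tok(sample)["input_ids"])
        print(f"{model_id}: {n_tokens / n_words:.2f} tokens/word")

    # words/sec = tokens/sec divided by tokens/word, which is the number that
    # actually matters for wall-clock output speed.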


But it's likely to be much slower than what you'd get with a backend like llama.cpp on CPU (particularly if you're running on a Mac, but I think on Linux as well), as well as not supporting features like CPU offloading.


Are there benchmarks? A 2x speedup would not be enough for me to return to C++ hell, but 5x might be, in some circumstances.


I think the biggest selling point of ollama (llama.cpp) is quantization: for a slight hit in quality (with q8 or q4) you can get a significant performance boost.


Does ollama/llama.cpp provide low-bit operations (AVX or CUDA kernels) to speed up inference, or just model compression with inference still done in fp16?

My understanding is that modern quantization algorithms are typically implemented in PyTorch.


Sorry I don't know much about this topic.

The only thing I know (from using it) is that with quantization I can fit models like Llama 2 13B in my 24GB of VRAM when I use q8 (16GB) instead of fp16 (26GB). This means I can get nearly the full quality of Llama 2 13B's output while still being able to use only my GPU, without needing to do very slow inference on CPU+RAM alone.

And the models are quantized before inference, so I'd only download 16GB for Llama 2 13B q8 instead of the full 26GB, which means it's not done on the fly.
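
The sizes follow directly from bits per weight; roughly, since llama.cpp's quant formats carry a small amount of per-block overhead that these figures ignore:

    # Approximate size of a 13B-parameter model at different precisions.
    # Bits-per-weight values are approximate; quant formats add per-block scales.
    params = 13e9
    for name, bits in (("fp16", 16), ("q8", 8.5), ("q4", 4.5)):
        print(f"{name}: ~{params * bits / 8 / 1e9:.0f} GB")
    # ballpark: fp16 ~26 GB, q8 ~14 GB, q4 ~7 GB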


As an aside, even GPT-4 level quality does not feel satisfactory to me lately. I can't imagine willingly using models as dumb as Llama 2 13B. What do you do with it?


Yeah, I agree. Every time a new model is released I download the highest quantization (or fp16) that fits into my VRAM, test it out with a few prompts, and then realize that downloadable models are still not as good as the closed ones (except speed-wise).

I don't know why I still do it, but every time I read so many comments about how good model X is and how it outperforms everything else, I want to see it for myself.


There's a Python binding for llama.cpp which is actively maintained and has worked well for me: https://github.com/abetlen/llama-cpp-python
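
A minimal usage sketch of that binding (the GGUF path below is a placeholder for whatever model file you have locally):

    # Minimal llama-cpp-python example; the model path is a placeholder.
    from llama_cpp import Llama

    llm = Llama(model_path="./models/llama-2-13b.Q8_0.gguf", n_ctx=2048)
    out = llm("Q: What is a perfect hash function? A:", max_tokens=128, stop=["Q:"])
    print(out["choices"][0]["text"])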


Thank you, it's been a major team effort!


More info can be found here: https://www.heavy.ai/heavyiq/overview


HEAVY.AI | SQL Analyst/Wrangler | Part-time or Full-time | Remote

HEAVY.AI builds a GPU-accelerated analytics platform that allows users to interactively query and visualize billions of records of data in milliseconds.

We’re looking for someone who really knows SQL. If you can decipher schemas, figure out what’s wrong with SQL statements and correct them, as well as generate queries in response to user questions, we'd love to talk to you.

The work would initially be on contract, but could lead to full-time employment. Geospatial analytics, data science background, and Python programming skills would be very useful to have as well, but are not absolute requirements.

If interested please reach out to pey.silvester@heavy[dot]ai.

