> But for just the cost of doubling our space, we can use two Bloom filters!
We can optimize the hash function to make it more space-efficient.
Instead of using remainders to locate filter positions, we can use a Mersenne prime mask (like, say, 31), but in this case I have a feeling the best hash function to use would be to mask with (2^1)-1.
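A minimal sketch of the mask trick, with the filter size, hash, and key all invented for illustration: with a power-of-two filter size, `h & (m - 1)` stands in for the remainder, and the (2^1)-1 mask of the joke would leave you exactly two slots to set.

```python
import hashlib

M = 32        # power-of-two filter size, so h % M == h & (M - 1)
MASK = M - 1  # 31, i.e. 0b11111 (a (2^1)-1 mask leaves only slots {0, 1})

def positions(key: bytes, k: int = 3):
    """Slice k 32-bit words out of one digest and mask each into the filter."""
    digest = hashlib.sha256(key).digest()
    for i in range(k):
        word = int.from_bytes(digest[4 * i : 4 * i + 4], "little")
        yield word & MASK          # cheap stand-in for word % M

bits = 0
for p in positions(b"hello"):
    bits |= 1 << p                 # set the k filter bits for this key
print(f"{bits:032b}")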
This is roughly what my startup is doing, automating financials.
We didn't pick this because it was super technical, but because the finance team is the team closest to the CEO that is both overstaffed and overworked at the same time - there are 3-4 days of crunch time, for which you retain 6 people to get it done fast.
This was the org full of extremely methodical, smart people who constantly told us, "We'll buy anything that means I'm not editing spreadsheets during my kid's gymnastics class."
The trouble is that the UI each customer wants has zero overlap with what the others want; if we actually added a drop-down for every special thing one person asked for, the product would look like a cockpit and no new customer would be able to do anything with it.
The AI bit is really about making the required interface complexity invisible (but also hard to discover).
In a world where OpenAI is Intel and Anthropic is AMD, we're working on a new Excel.
However, to build something like that, you need a high-quality message-passing, co-operatively multi-tasking AI kernel, and you have to optimize your L1 caches ("context") well.
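A toy sketch of that shape, assuming asyncio queues stand in for the message passing and each task's inbox stands in for its working context (the agent name and messages are invented):

```python
import asyncio

async def agent(name: str, inbox: asyncio.Queue, outbox: asyncio.Queue):
    # Cooperative multitasking: the agent yields at every await, and its
    # working "context" (the L1 cache, loosely) is just what is in its inbox.
    while (msg := await inbox.get()) is not None:
        await outbox.put(f"{name}: handled {msg!r}")

async def main():
    inbox, outbox = asyncio.Queue(), asyncio.Queue()
    worker = asyncio.create_task(agent("planner", inbox, outbox))
    await inbox.put("reconcile ledger")   # message passing, not shared state
    print(await outbox.get())
    await inbox.put(None)                 # poison pill: clean shutdown
    await worker

asyncio.run(main())
```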
> Well, if you don't fsync, you'll go fast, but you'll go even faster piping customer data to /dev/null, too.
The trouble is that you need to optimize specifically for fsyncs, because usually it is either no brakes or handbrake.
The middle ground of multi-transaction group-commit fsync seems to have disappeared because of SSDs and the massive IOPS you can pull off in general, but now the cost is syscall context switches.
Two minutes is a bit too much, though (also, fdatasync vs fsync).
IOPS only solves throughput, not latency. You still need to saturate the drive's internal parallelism to get good throughput from SSDs, and that requires batching. Also, even double-digit-microsecond write latency per transaction commit would limit you to only ~10K TPS. It's just not feasible to issue an individual synchronous write for every transaction commit, even on NVMe.
tl;dr "multi-transaction group-commit fsync" is alive and well
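A minimal sketch of that group-commit shape (the log path, record format, and drain-the-queue batching policy are all simplified assumptions): many committers block on one writer thread, which amortizes a single fsync over however many transactions piled up while the previous sync was in flight.

```python
import os
import queue
import threading

class GroupCommitLog:
    """Batch many transaction commits into one fsync (group-commit sketch)."""

    def __init__(self, path: str):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.q: queue.Queue = queue.Queue()
        threading.Thread(target=self._writer, daemon=True).start()

    def commit(self, record: bytes) -> None:
        done = threading.Event()
        self.q.put((record, done))
        done.wait()                        # returns only once durable

    def _writer(self) -> None:
        while True:
            batch = [self.q.get()]         # block for the first commit...
            while not self.q.empty():      # ...then drain whatever piled up
                batch.append(self.q.get_nowait())
            os.write(self.fd, b"".join(rec for rec, _ in batch))
            os.fsync(self.fd)              # one syscall amortized over N txns
            for _, done in batch:
                done.set()
```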
Netflix is a different creature because of streaming and time shifting.
They don't care whether people watch a pilot episode or binge the last 3 seasons when a show takes off.
The quality metric is therefore all over the place; it is a mildly moderated popularity contest.
If people watch "Love is Blind", you'll get more of those.
On the other hand, this means they can take a slightly bigger risk than an ad-funded TV network, because you're more likely to switch to a different Netflix show you like and keep paying for the service than to switch to a different channel that pays a different TV network.
As long as something sticks, the revenue numbers stay, even if the ROI is shaky.
Black Mirror: Bandersnatch, for example, was impossible to do on TV, but Netflix could do it.
Also, if GoT had been a Netflix show, they'd have cancelled it at Season 6 and we'd be lamenting the loss of whatever wonders it would have reached by Season 9.
> For double/bigint joins that leads to observable differences between joins and plain comparisons, which is very bad.
This was one of the bigger hidden performance issues when I was working on Hive - the default coercion goes to Double, which has a bad hashCode implementation [1] and causes join keys to cluster and chain, so every miss on the hashtable probed that many slots away from the original index.
The hashCode itself was smeared so that values within machine epsilon of each other land in the same hash bucket, letting .equals do its join, but all of this really messed things up for the folks who needed 22-digit numeric keys (eventually the Decimal implementation handled it by adding a big fixed integer).
Double join keys were one of the red flags in a SQL query - mostly, if you see them, someone messed something up.
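The clustering can be reproduced roughly like this (the sketch mimics plain Java Double.hashCode, not Hive's smeared variant, and the 1024-slot table is an arbitrary choice):

```python
import struct

def java_double_hashcode(v: float) -> int:
    """Java's Double.hashCode(): XOR-fold the 64 IEEE-754 bits down to 32."""
    bits = struct.unpack(">q", struct.pack(">d", v))[0] & 0xFFFFFFFFFFFFFFFF
    return (bits ^ (bits >> 32)) & 0xFFFFFFFF

# Integer-valued doubles below 2^21 have an all-zero low word, so the fold
# just returns the high word; mask that into a 1024-slot table and almost
# everything lands on a handful of chained buckets.
buckets = {java_double_hashcode(float(k)) & 1023 for k in range(100_000)}
print(len(buckets))   # 64 distinct buckets out of 1024 for 100k keys
```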
> trauma that our parents, or grandparents experienced could lead to behavior modifications and poorer outcomes in us
The nurture part of this is already well established; this is the nature part of it.
However, this is not a net-positive for the folks who already discriminate.
The "faults in our genes" thinking assumes that this is not redeemable by policy changes, so it goes back to eugenics and usually suggests cutting such people out of the gene pool.
The "better nurture" proponents for the next generation (free school lunches, early intervention and magnet schools) will now have to swim up this waterfall before arguing more investment into the uplifting traumatized populations.
We need to believe that Change (with a capital C) is possible right away if start right now.
I would think it's the opposite. Intervention prevents further sliding. The alternative - genocide - is expensive; genocides are generally a luxury of states benefiting from a theft-based windfall.
The useful part is that duckdb is so easy to use as a client with an embedded server - duckdb is a great client (+ a library).
Similar to how git can serve a repo from a plain http server with no git installed on it (git update-server-info).
The frozen part is what Iceberg promised in the beginning: a move away from Hive's mutable metastore.
Point at a manifest file plus parquet/orc files, and all you need to query them is S3 API calls (there is no metadata/table server; the server is the client).
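Roughly like this, assuming the httpfs extension and a placeholder bucket layout - everything below is plain GET/LIST calls against object storage, with no metastore in the request path:

```python
import duckdb   # the client is the library is the "server"

con = duckdb.connect()
con.execute("INSTALL httpfs")   # teaches duckdb the S3 API
con.execute("LOAD httpfs")

# Hypothetical bucket and path; the glob is resolved with S3 LIST calls
# and the parquet footers/pages are fetched with ranged GETs.
con.sql("""
    SELECT count(*)
    FROM read_parquet('s3://example-bucket/frozen-lake/**/*.parquet')
""").show()
```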
> Creating and publishing a Frozen DuckLake with about 11 billion rows, stored in 4,030 S3-based Parquet files took about 22 minutes on my MacBook
Hard to pin down how much of it is CPU and how much is IO from S3, but doing something like HLL over all the columns and rows is pretty heavy on the CPU.
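For a sense of that stats pass: approx_count_distinct is DuckDB's HyperLogLog-backed aggregate, and running one per column over every row is the CPU-bound part, independent of how fast S3 hands over the bytes (file glob and column names below are placeholders):

```python
import duckdb

con = duckdb.connect()
# One HLL sketch per column, fed every row: this is where the CPU goes.
con.execute("""
    SELECT approx_count_distinct(col_a),
           approx_count_distinct(col_b)
    FROM read_parquet('data/*.parquet')
""").fetchall()
```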
> will try to learn more about normal sockets to see if I could perhaps make them work with the app.
There's a whole skit in the vein of "What have the Romans ever done for us?" about ZeroMQ [1], which has probably been lost to the search index by now.
As someone who has held a socket wrench before and fought tcp_cork and DSACK, WebSockets isn't a bad abstraction to sit on top of, especially if you are intending to throw TLS in there anyway.
Low-level sockets are like assembly: you can use them, but they are a whole box of complexity (you might use them completely raw sometimes, like the tickle ACK in the ctdb [2] implementation).
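A taste of that box, using hand-rolled length-prefix framing (WebSockets gives you message boundaries, masking, and a clean TLS story for free):

```python
import socket
import struct

def send_msg(sock: socket.socket, payload: bytes) -> None:
    # Raw TCP is a byte stream with no message boundaries: you length-prefix
    # every payload yourself or you will eventually read half a message.
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    # recv() is allowed to return short reads; loop until n bytes or EOF.
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf.extend(chunk)
    return bytes(buf)

def recv_msg(sock: socket.socket) -> bytes:
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```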