> But for just the cost of doubling our space, we can use two Bloom filters!
We can optimize the hash function to make it more space-efficient.
Instead of using remainders to locate filter positions, we can use a Mersenne prime mask (like, say, 31), but in this case I have a feeling the best hash function to use would be to mask with (2^1)-1.
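A minimal sketch of the mask trick, with the filter size, hash, and key all invented for illustration: with a power-of-two filter size, `h & (m - 1)` stands in for the remainder, and the (2^1)-1 mask of the joke would leave you exactly two slots to set.

```python
import hashlib

M = 32        # power-of-two filter size, so h % M == h & (M - 1)
MASK = M - 1  # 31, i.e. 0b11111 (a (2^1)-1 mask leaves only slots {0, 1})

def positions(key: bytes, k: int = 3):
    """Slice k 32-bit words out of one digest and mask each into the filter."""
    digest = hashlib.sha256(key).digest()
    for i in range(k):
        word = int.from_bytes(digest[4 * i : 4 * i + 4], "little")
        yield word & MASK          # cheap stand-in for word % M

bits = 0
for p in positions(b"hello"):
    bits |= 1 << p                 # set the k filter bits for this key
print(f"{bits:032b}")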
This is roughly what my startup is doing, automating financials.
We didn't pick this because it was super technical, but because the finance team is the team closest to the CEO that is both overstaffed and overworked at the same time - there are 3-4 days of crunch time, for which you retain 6 people to get it done fast.
This was the org full of extremely methodical, smart people who constantly told us, "We'll buy anything that means I'm not editing spreadsheets during my kid's gymnastics class."
The trouble is that the UI each customer wants has zero overlap with what the others want; if we actually added a drop-down for every special thing one person asked for, the product would look like a cockpit and no new customer would be able to do anything with it.
The AI bit is really about making the required interface complexity invisible (but also hard to discover).
In a world where OpenAI is Intel and Anthropic is AMD, we're working on a new Excel.
However, to build something like that, you need a high-quality message-passing, co-operatively multi-tasking AI kernel, and you have to optimize your L1 caches ("context") well.
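A toy sketch of that shape, assuming asyncio queues stand in for the message passing and each task's inbox stands in for its working context (the agent name and messages are invented):

```python
import asyncio

async def agent(name: str, inbox: asyncio.Queue, outbox: asyncio.Queue):
    # Cooperative multitasking: the agent yields at every await, and its
    # working "context" (the L1 cache, loosely) is just what is in its inbox.
    while (msg := await inbox.get()) is not None:
        await outbox.put(f"{name}: handled {msg!r}")

async def main():
    inbox, outbox = asyncio.Queue(), asyncio.Queue()
    worker = asyncio.create_task(agent("planner", inbox, outbox))
    await inbox.put("reconcile ledger")   # message passing, not shared state
    print(await outbox.get())
    await inbox.put(None)                 # poison pill: clean shutdown
    await worker

asyncio.run(main())
```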
> Well, if you don't fsync, you'll go fast, but you'll go even faster piping customer data to /dev/null, too.
The trouble is that you need to optimize specifically for fsyncs, because usually it is either no brakes or handbrake.
The middle ground of multi-transaction group-commit fsync seems to have disappeared because of SSDs and the massive IOPS you can pull off in general, but now the cost is syscall context switches.
Two minutes is a bit too much, though (also, fdatasync vs fsync).
IOPS only solves throughput, not latency. You still need to saturate the drive's internal parallelism to get good throughput from SSDs, and that requires batching. Also, even double-digit-microsecond write latency per transaction commit would limit you to only ~10K TPS. It's just not feasible to issue an individual synchronous write for every transaction commit, even on NVMe.
tl;dr "multi-transaction group-commit fsync" is alive and well
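A minimal sketch of that group-commit shape (the log path, record format, and drain-the-queue batching policy are all simplified assumptions): many committers block on one writer thread, which amortizes a single fsync over however many transactions piled up while the previous sync was in flight.

```python
import os
import queue
import threading

class GroupCommitLog:
    """Batch many transaction commits into one fsync (group-commit sketch)."""

    def __init__(self, path: str):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.q: queue.Queue = queue.Queue()
        threading.Thread(target=self._writer, daemon=True).start()

    def commit(self, record: bytes) -> None:
        done = threading.Event()
        self.q.put((record, done))
        done.wait()                        # returns only once durable

    def _writer(self) -> None:
        while True:
            batch = [self.q.get()]         # block for the first commit...
            while not self.q.empty():      # ...then drain whatever piled up
                batch.append(self.q.get_nowait())
            os.write(self.fd, b"".join(rec for rec, _ in batch))
            os.fsync(self.fd)              # one syscall amortized over N txns
            for _, done in batch:
                done.set()
```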
Netflix is a different creature because of streaming and time shifting.
They don't care whether people watch a pilot episode or binge the last 3 seasons when a show takes off.
The quality metric is therefore all over the place; it is a mildly moderated popularity contest.
If people watch "Love is Blind", you'll get more of those.
On the other hand, this means they can take a slightly bigger risk than an ad-funded TV network, because you're more likely to switch to a different Netflix show you like and keep paying for the service than to switch to a different channel that pays a different TV network.
As long as something sticks, the revenue numbers stay, even if the ROI is shaky.
Black Mirror: Bandersnatch, for example, was impossible to do on TV, but Netflix could do it.
Also, if GoT had been a Netflix show, they'd have cancelled it at Season 6 and we'd be lamenting the loss of whatever wonders it would have reached by Season 9.
> For double/bigint joins that leads to observable differences between joins and plain comparisons, which is very bad.
This was one of the bigger hidden performance issues when I was working on Hive - the default coercion goes to Double, which has a bad hashCode implementation [1] and causes join keys to cluster and chain, so every miss on the hashtable probed that many slots away from the original index.
The hashCode itself was smeared so that values within machine epsilon of each other land in the same hash bucket, letting .equals do its join, but all of this really messed things up for the folks who needed 22-digit numeric keys (eventually the Decimal implementation handled it by adding a big fixed integer).
Double join keys were one of the red flags in a SQL query - mostly, if you see them, someone messed something up.
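The clustering can be reproduced roughly like this (the sketch mimics plain Java Double.hashCode, not Hive's smeared variant, and the 1024-slot table is an arbitrary choice):

```python
import struct

def java_double_hashcode(v: float) -> int:
    """Java's Double.hashCode(): XOR-fold the 64 IEEE-754 bits down to 32."""
    bits = struct.unpack(">q", struct.pack(">d", v))[0] & 0xFFFFFFFFFFFFFFFF
    return (bits ^ (bits >> 32)) & 0xFFFFFFFF

# Integer-valued doubles below 2^21 have an all-zero low word, so the fold
# just returns the high word; mask that into a 1024-slot table and almost
# everything lands on a handful of chained buckets.
buckets = {java_double_hashcode(float(k)) & 1023 for k in range(100_000)}
print(len(buckets))   # 64 distinct buckets out of 1024 for 100k keys
```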
> trauma that our parents, or grandparents experienced could lead to behavior modifications and poorer outcomes in us
The nurture part of this is already well established; this is the nature part of it.
However, this is not a net-positive for the folks who already discriminate.
The "faults in our genes" thinking assumes that this is not redeemable by policy changes, so it goes back to eugenics and usually suggests cutting such people out of the gene pool.
The "better nurture" proponents for the next generation (free school lunches, early intervention and magnet schools) will now have to swim up this waterfall before arguing more investment into the uplifting traumatized populations.
We need to believe that Change (with a capital C) is possible right away if start right now.
I would think it's the opposite. Intervention prevents further sliding. The alternative - genocide - is expensive; genocides are generally a luxury of states benefiting from a theft-based windfall.
The useful part is that duckdb is so easy to use as a client with an embedded server - duckdb is a great client (+ a library).
Similar to how git can serve a repo from a plain http server with no git installed on it (git update-server-info).
The frozen part is what Iceberg promised in the beginning: a move away from Hive's mutable metastore.
Point at a manifest file plus parquet/orc files, and all you need to query them is S3 API calls (there is no metadata/table server; the server is the client).
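Roughly like this, assuming the httpfs extension and a placeholder bucket layout - everything below is plain GET/LIST calls against object storage, with no metastore in the request path:

```python
import duckdb   # the client is the library is the "server"

con = duckdb.connect()
con.execute("INSTALL httpfs")   # teaches duckdb the S3 API
con.execute("LOAD httpfs")

# Hypothetical bucket and path; the glob is resolved with S3 LIST calls
# and the parquet footers/pages are fetched with ranged GETs.
con.sql("""
    SELECT count(*)
    FROM read_parquet('s3://example-bucket/frozen-lake/**/*.parquet')
""").show()
```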
> Creating and publishing a Frozen DuckLake with about 11 billion rows, stored in 4,030 S3-based Parquet files took about 22 minutes on my MacBook
Hard to pin down how much of it is CPU and how much is IO from S3, but doing something like HLL over all the columns and rows is pretty heavy on the CPU.
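For a sense of that stats pass: approx_count_distinct is DuckDB's HyperLogLog-backed aggregate, and running one per column over every row is the CPU-bound part, independent of how fast S3 hands over the bytes (file glob and column names below are placeholders):

```python
import duckdb

con = duckdb.connect()
# One HLL sketch per column, fed every row: this is where the CPU goes.
con.execute("""
    SELECT approx_count_distinct(col_a),
           approx_count_distinct(col_b)
    FROM read_parquet('data/*.parquet')
""").fetchall()
```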
> will try to learn more about normal sockets to see if I could perhaps make them work with the app.
There's a whole skit in the vein of "What have the Romans ever done for us?" about ZeroMQ [1], which has probably been lost to the search index by now.
As someone who has held a socket wrench before and fought tcp_cork and DSACK, WebSockets isn't a bad abstraction to sit on top of, especially if you are intending to throw TLS in there anyway.
Low-level sockets are like assembly: you can use them, but they are a whole box of complexity (you might use them completely raw sometimes, like the tickle ACK in the ctdb [2] implementation).
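A taste of that box, using hand-rolled length-prefix framing (WebSockets gives you message boundaries, masking, and a clean TLS story for free):

```python
import socket
import struct

def send_msg(sock: socket.socket, payload: bytes) -> None:
    # Raw TCP is a byte stream with no message boundaries: you length-prefix
    # every payload yourself or you will eventually read half a message.
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    # recv() is allowed to return short reads; loop until n bytes or EOF.
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf.extend(chunk)
    return bytes(buf)

def recv_msg(sock: socket.socket) -> bytes:
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```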