
PostgresTV livestreams weekly one-hour "vibe coding" sessions in which experienced PostgreSQL contributors implement small PostgreSQL features with Cursor (mostly the claude-4-sonnet model). [0] is their latest stream.

I personally have not watched much, but it sounds just like what you are looking for!

[0] https://www.youtube.com/watch?v=3MleDtXZUlM


Yes, good enough for me, thanks. I look forward to watching it. This is particularly interesting as it’s a group stream.


> I default to a headless browser

Headless browsers consume orders of magnitude more resources, and execute far more requests (e.g. fetching images) than a common webscraping job would require. Having run webscraping at scale myself, the cost of operating headless browsers made us only use them as a last resort.


Blocking all image/video/CSS requests is the rule of thumb when working with headless browsers via CDP.
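As a rough sketch of the rule of thumb above, here is what asset blocking can look like with Playwright's Python API (route interception; the resource-type strings follow Playwright's documentation, and the helper names are my own):

```python
# Block asset requests in a headless browser session.
# BLOCKED_TYPES and the helper names are illustrative, not from the thread.
BLOCKED_TYPES = {"image", "media", "stylesheet", "font"}

def should_block(resource_type: str) -> bool:
    """Return True for asset requests a scraper can usually skip."""
    return resource_type in BLOCKED_TYPES

def install_blocking(page):
    """Attach a catch-all route handler to a Playwright page that aborts
    asset requests and lets everything else through."""
    page.route(
        "**/*",
        lambda route: route.abort()
        if should_block(route.request.resource_type)
        else route.continue_(),
    )
```

Calling `install_blocking(page)` before `page.goto(...)` keeps the document and XHR/fetch traffic while dropping the heavy assets, which is where most of the bandwidth savings come from.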


Speaking as a person who has played on both offense and defense: this is a heuristic that's not used frequently enough by defenders. Clients that load a single HTML/JSON endpoint without loading css or image resources associated with the endpoints are likely bots (or user agents with a fully loaded cache, but defenders control what gets cached by legit clients and how). Bot data thriftiness is a huge signal.


Even legitimate users might want to disable CSS and pictures and whatever, and I often do when I just want to read the document.

Blind users also might have no use for the pictures, and another possibility is that if the document is longer than the screen, so a picture is out of view, the user might program the client software to lazy-load it, etc.


Indeed, that's why it's one heuristic/signal among many others.


As a high-load systems engineer you'd want to offload asset serving to a CDN, which makes detection slightly more complicated. The easy way is to attach an image onload handler with client-side JS, but that would give a high yield of false positives. I personally have never seen such an approach and doubt it's useful in many cases.


Unless organization policy forces you to, you do not have to put all resources behind a CDN. As a matter of fact, getting this heuristic to work requires a non-optimal caching strategy of one or more real or decoy resources - CDN or not. "Easy" is not an option for the bot/anti-bot arms race, all the low hanging fruit is now gone when fighting a determined adversary on either end.

> I personally have never seen such approach and doubt its useful for many concerns.

It's an arms race and defenders are not keen on sharing their secret sauce, though I can't be the only one who thought of this rather basic bot characteristic; multiple abuse teams probably realized it decades ago. It works pretty well against the low-resource scrapers with fake UA strings and all the right TLS handshakes. It won't work against headless browsers, which cost scrapers more in resources and bandwidth, and there are specific countermeasures for headless browsers [1], and counter-countermeasures. It's a cat and mouse game.

1. E.g. mouse movement, made famous as one signal evaluated by Google's reCAPTCHA v2, monitor resolution and window size/position, and Canvas rendering, all of which have been gradually degraded by browser anti-fingerprinting efforts. The bot war is fought on the long tail.


So you maintain a table of domains and how to access them?

How do you build that table and keep it up to date? Manually?


From https://www.codeofhonor.com/blog/the-starcraft-path-finding-...

> whenever harvesters are on their way to get minerals, or when they’re on the way back carrying those minerals, they ignore collisions with other units

Maybe that's part of the fluid behaviour you noticed.

The blog contains more technical posts about StarCraft 2, so you might find that "special sauce" somewhere else in there :)


This is about harvesters, which is only the case when they are mining minerals / gas.

The commenter was talking about army movements, units which actually collide with each other.


The blog is interesting, but it is about SC1. Pathfinding in SC1 is so terrible that moving units where you want is part of what makes one a pro player.


Not sure if this is what the above comment means by "atomic", but a shortcoming of Postgres' JSON support is that it will have to rewrite an entire JSON object every time a part of it gets updated, no matter how many keys the update really affected. E.g. if I update an integer in a 100MB JSON object, Postgres will write ~100MB (plus WAL, TOAST overhead, etc.), not just a few bytes. I imagine this can be a no-go for certain use cases.
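A toy illustration of the write amplification described above (pure Python, no database; the document and function are made up for the example): updating one key still means rewriting the full serialized value, much as Postgres rewrites the whole jsonb datum.

```python
import json

# A large-ish document: ~100 KB of payload plus a small metadata header.
doc = {"meta": {"version": 1}, "payload": ["x" * 100] * 1000}

def update_and_measure(document, new_version):
    """Change a single small field, then return how many bytes the
    full re-serialized document occupies (what actually gets rewritten)."""
    document["meta"]["version"] = new_version  # logical change: a few bytes
    return len(json.dumps(document))           # bytes rewritten: the whole doc

rewritten = update_and_measure(doc, 2)
```

The logical change is a couple of bytes, but `rewritten` is on the order of the whole document, which is the amplification the comment is pointing at.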


It drives me batty to see people store 100MB JSON objects with a predictable internal structure as single records in an RDB rather than destructuring them and storing the fields as columns of a single record. Like, yes, you can design it the worst possible way like that, but why? But I see it all the time.


Because schemas. The whole point of nosql is that you can alter your data model without having to reload the whole database


Actually, that's the whole point of RDBs: that you can alter your data model (in most cases) just by a simple DDL+DML query. And it is with NoSQL that you have to manually download all the affected data from the DB, run the transformation with consistency checks, and upload it back. Or, alternatively, you have to write your business logic so that it can work with/transform on-demand all the different versions of data objects, which to my taste is even more of a nightmarish scenario.
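The "simple DDL+DML" migration mentioned above can be sketched end to end; SQLite is used here purely as a self-contained stand-in (the table, column names, and backfill rule are invented for the example; Postgres DDL would look almost identical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO users (name) VALUES ('alice'), ('bob')")

# DDL: extend the data model in place...
con.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
# ...DML: backfill every existing row in one statement.
con.execute("UPDATE users SET display_name = upper(name)")

rows = con.execute(
    "SELECT name, display_name FROM users ORDER BY id"
).fetchall()
```

No dump/transform/reload cycle: the database applies the transformation to all rows itself, which is the contrast with the NoSQL workflow described in the comment.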


Which is great in the early stages of development, but people actually deploy like this


The benefits of going schemaless in the early stages of development are highly suspect in my experience. The time that one might save in data modeling and migrations comes out from the other end with shittier code that’s harder to reason about.


My perspective is that using NoSQL does not save time in data modeling and migrations. Moreover, one has to pay in increased time for these activities, because (a) in most cases, data has to follow some model in order to be processable anyway, the question is whether we formally document and enforce it at a relational storage, or leave it to external means (which we have to implement) to benefit from some specifically-optimized non-relational storage, (b) NoSQL DBs return data (almost) as stored, one cannot rearrange results as freely as with SQL queries, not even close, thus much more careful design is required (effectively, one has to design not only schema but also the appropriate denormalization of it), (c) migrations are manual and painful, so one had better arrive at the right design at once rather than iterate on it.

That is, of course, if one doesn't want to deal with piles of shitty code and even more shitty data.


It's not an issue with size. It's an issue with race conditions. With Mongo I can update a.b and a.c concurrently from different nodes and both writes will set the right values.

You can't do that with PG JSONB unless you lock the row for reading...
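The race being described can be modeled in a few lines of plain Python (an in-memory toy, not either database's actual machinery): two writers each read the whole document, modify one key, and write the full value back, versus field-level updates that only touch their own key.

```python
import copy

store = {"doc": {"b": 0, "c": 0}}

def full_replace_update(key, value):
    """Read-modify-write of the ENTIRE document (PG-jsonb style)."""
    snapshot = copy.deepcopy(store["doc"])  # read whole document
    snapshot[key] = value
    return snapshot

# Both writers read before either writes back:
w1 = full_replace_update("b", 1)
w2 = full_replace_update("c", 2)
store["doc"] = w1
store["doc"] = w2                      # clobbers w1: "b" is back to 0
after_replace = dict(store["doc"])

# Field-level updates (Mongo $set style) touch only their own key:
store["doc"] = {"b": 0, "c": 0}
store["doc"]["b"] = 1
store["doc"]["c"] = 2                  # both updates survive
```

Without row locking or a compare-and-swap, the full-replace path loses one writer's change; the field-level path does not, which is the point the comment is making.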


Yes but that simplified write complexity means you are pushing a ton of read complexity out to your application.


What?? That's an insane argument. That's like saying if one client sets column X to 1 and another client concurrently sets column Y to 2, one client's writes will be LOST. It shouldn't, and it doesn't. If it did, nobody would use Postgres. This issue only exists with PG's JSON impl.


What?? That’s an insane way to describe what I’m talking about. Data/transaction isolation is very complex and extremely specific to every use case, which is why database engines worth anything let you describe to them what your needs are. Hence why when one client writes to Y they specify what they think X should be if relevant and get notified to try again if the assumptions are wrong. An advantage of specifying your data and transaction model up front is that it will surface these subtle issues to you before they destructively lose important information in an unrecoverable manner.

https://en.wikipedia.org/wiki/Isolation_(database_systems)


So updating one column on a table is destructive and can lose important information now? :D

How do you increment a value in a column while other people write to the database? You don't grab the whole damn row and rewrite it...
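The increment in question is exactly the case SQL handles without the client ever fetching the row; sketched with SQLite as a stand-in (table and column names invented; the `UPDATE` statement is the same in Postgres):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, n INTEGER)")
con.execute("INSERT INTO counters VALUES ('hits', 0)")

# The read-modify-write happens inside the database engine, per statement,
# so concurrent clients never grab and rewrite the whole row themselves.
for _ in range(3):
    con.execute("UPDATE counters SET n = n + 1 WHERE name = 'hits'")

(n,) = con.execute("SELECT n FROM counters WHERE name = 'hits'").fetchone()
```

`SET n = n + 1` expresses the increment declaratively, which is what makes it safe under concurrency in a way that client-side read-then-write is not.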


Also this statement:

> which is why database engines worth anything let you describe to them what your needs are

Contradicts your argument.

Mongo has atomic updates to update specific fields, or you can do replaceOne() etc to replace the whole document.

While PG only gives you "replace" with JSON.

So I guess postgres isn't worth anything. :)


A JSON object which is 100 MB after compression is quite a huge thing.


I see they don't mention Citus (https://github.com/citusdata/citus), which is already a fairly mature native Postgres extension. From the details given in the article, it sounds like they just reimplemented it.

I wonder if they were unaware of it or disregarded it for a reason; I am currently in a similar situation to the one described in the blog, trying to shard a massive Postgres DB.


I have worked on teams that have sharded and partitioned PostgreSQL ourselves, somewhat like Figma (Postgres 9.4-ish time frame), as well as on teams that have utilized Citus. I am a strong proponent of Citus and point colleagues in that direction frequently, but depending on how long ago Figma was considering this path, I will say that there were some very interesting limitations to Citus not that long ago.

For example, it was only 2 years ago that Citus allowed joining data in "local" tables with data retrieved from distributed tables (https://www.citusdata.com/updates/v11-0). In the same major update, Citus enabled _any_ node to handle queries; previously all queries (whether or not they modified data) had to go through the "coordinator" node in your cluster. This could turn into a pretty significant bottleneck, which had ramifications for your cluster administration and for choices about how to shape your data (what goes into local tables, reference tables, or distributed tables).

Again, huge fan of Citus, but it's not a magic bullet that makes it so you no longer have to think about scale when using Postgres. It makes it _much_ easier and adds some killer features that push complexity down the stack such that it is _almost_ completely abstracted from application logic. But you still have to be cognizant of it, sometimes even altering your data model to accommodate it.


You also benefit from the tailwind of the CitusData team making continued improvement to the extension, whereas an in-house system depends on your company's ability to hire and retain people to maintain + improve the in-house system.

It's hard to account for the value of benefits that have yet to accrue, but this kind of analysis, even if you pretty heavily-discount that future value, tilts the ROI in favor of solutions like Citus, IMO. Especially if your time horizon is 5+ or 10+ years out.

Like you said, if they made this decision 3-ish years ago, you would have had to be pretty trusting of that future value. A choice made today hinges less on that variable.


Huh, I would have thought the opposite. Companies at Figma size are easily able to hire talent to maintain a core part of their engineering stack. On the other hand, they retain no control of Citus decision making. Those tailwinds could easily have been headwinds if they went in a direction that did not suit Figma.


I think this is true for things higher up the "stack", but doesn't necessarily apply to tech like Postgres [and Citus, IMO].

The line separating "build in-house" vs "use OSS" exists, and it's at a different layer of the stack in every company. IMO, for most companies in 2024, the line puts Citus on the same side as Postgres.

FWIW, I would have assumed that Citus would be on the other end of the line, until I had to look into Citus for work for a similar reason that Figma did. You can pick and choose among the orthogonal ideas they implement that most cleanly apply to the present stage of your business, and I would've chosen to build things the same way they did (TBH, Figma's choices superficially appear to be 1:1 to Citus's choices).


I thought of that as well. The only thing I could think of is that they mentioned that they don't want to move off of RDS, and there is 0% chance of Citus coming to AWS since Microsoft bought them.


Before clicking on the article I assumed it was Citus, and was surprised when it wasn’t.

Maybe because CitusData was bought by Microsoft around the same time, so Microsoft could create “Azure Cosmos DB for Postgres Cluster”, yet another one of Microsoft’s typical product naming crapshoots.


> yet another one of Microsoft’s typical product naming crapshoots.

Well said. I haven't seen any company as terrible as Microsoft at naming things. Anyone know why?


Naming things is hard.

At a previous employer, I saw several cool-ish open source projects instantly doomed to obscurity by picking a name that either completely duplicated the name of an existing OSS project or was guaranteed to have terrible SEO for another reason.

However, Microsoft seems to have a unique crossover of fragmented business units and centralized marketing. That's why you end up with the Azure -> Subproject -> Actual Product/Service word soup. Previously, they did this with the Windows Live brand from 2005-2012, and with "Xbox" for a wide range of gaming projects (many of which were on PC).


related, Microsoft on Microsoft marketing:

https://www.youtube.com/watch?v=EUXnJraKM3k


The committee wanted Cosmos, Azure, and Postgres all in the name and wouldn't compromise.


AWS is putting up a good fight


Figma uses AWS RDS, and RDS doesn't list Citus as a supported extension.


This is my guess of why they didn't use Citus. They weren't interested in the options of (1) going multi-cloud [DB in Azure Cosmos / Backend(s) in AWS] (2) going all-in on Azure [DB in Azure Cosmos / Backend(s) in Azure] (3) self-managing Postgres+Citus in EC2.

It'd be interesting to compare the expected capex of developing this in-house solution + the opex of maintaining it vs the same categories of expected costs for option (3) – because I imagine that's probably the most palatable option.

They also may have pre-paid for dedicated RDS instances for the next X years (before this horizontal scaling initiative began, to boot), as AWS allows companies to do this at a pretty steep discount rate, which would probably tilt them away from (3).


Especially because Option 3 lets you go waaaay farther on vertical scaling, since you can get native NVMe drives (they mentioned hitting IOPS limits for RDS), more exotic instance classes with far more RAM, and do stuff like ZFS for native compression and snapshots.


I would love to see a comparison of the major PostgreSQL services such as Citus, EDB, Crunchy, Neon, and some OSS distributions/packages


how "massive" is massive in your case?


I've had CitusDB running across 68 bare metal machines (40 vCPU, 768GiB RAM, 20TiB of storage each + 40Gb network links) and it ran decently well.

Not sure what your definition of massive is, I think Spanner would easily beat it.

Also, it's very use-case dependent; you can't "just use" Citus for everything. It's not quite as flexible as a bog-standard pgsql install due to the way it shards, so you have to be a tad more careful with your data model.


Is there a reason there's comparatively little storage in your machines in relation to RAM or even CPUs?

Do your machines do compute heavy loads or something?

For a DB I'd expect a lot more storage per node


NVMe SSDs aren't so large unfortunately.

a 1U server has capacity for 8 drives; we used 2 slots for the OS (RAID1) and 2 slots for the WAL volume, leaving only 4 slots in RAID10.

So I'm already cheating a little and claiming WAL storage was part of total storage.


Shouldn't you try to get something with PCIe bifurcation in this case?

I doubt you're saturating the PCIe bus bandwidth on any of them?

I imagine your DB is extremely high performance, though!


What is your definition of "decently well", and is your primary cluster (without replicas) above 1PB?


They said 20TiB * 68, which I think is 1.5PB.


That could be all of the nodes, or just the primaries without replicas.


Around ten heavily-updated (50-400k updated rows/min) tables ranging between 500M and 5B rows, with a couple tables over 40B rows each (5TB each IIRC).


Where's the fun in that? I'm not being snarky either. Maybe it's not the best decision business-wise, but I guarantee it was more challenging and more fun. There's something to be said for that.


> they'll be encrypted

Since some months (years?) ago, WhatsApp lets you set up your own encryption password for the DB backup. I set one up and used https://github.com/ElDavoo/wa-crypt-tools to get access to the decrypted SQLite and run some analytics over my messages :)

