
I read this and I have to wonder: did anyone ever think it was reasonable that a cluster that apparently needed only 120GB of memory was consuming 1.2TB just for logging (or whatever Vector does)?


We're a much smaller-scale company, and the cost we lose on these things is insignificant compared to what's in this story. Yesterday I was improving the process for creating databases in our Azure and I stumbled upon a subscription that was running 7 MSSQL servers for 12 databases. These weren't elastic, and each was paying for a license that we don't have to pay because we qualify for the base cost through our contract with our Microsoft partner. This company has some of the tightest control over their cloud infrastructure of any organisation I've worked with.

This is anecdotal, but if my experiences aren't unique, then there's a widespread lack of reasonableness in DevOps.


Isn't that mostly down to the fact that the vast majority of devs explicitly don't want to do anything wrt Ops?

DevOps has - ever since its originally well-meaning inception (by Netflix, iirc?) - been implemented across our industry as an effective cost-cutting measure, forcing devs who didn't see it as their job to handle it anyway.

Which consequently means they're not interfacing with it whatsoever. They do as little as they can get away with, which inevitably means things are being done with borderline malicious compliance... Or just complete incompetence.

I'm not even sure I'd blame these devs in particular. They just saw it become a quick bonus generator for the MBA in charge of the rebranding, while offloading more responsibilities onto their shoulders.

DevOps made total sense in the work culture where the concept was conceived - Netflix was well known at that point for only ever employing senior devs. But in the context of the average 9-5 dev, who often knows a lot less than even some enthusiastic juniors... Let's just say whether it's successful in practice is incredibly dicey.


I politely disagree. I spent maybe 8 hours over a week rightsizing a handful of heavy deployments from a previous team and reduced their peak resource usage by implementing better scaling policies. Before the new scaling policy, the service would quite frequently scale out, and the new pods would sit idle and ultimately get terminated without ever responding to a request.

The service dashboards already existed, all I had to do was a bit of load testing and read the graphs.

It's not too much extra work to make sure you're scaling efficiently.
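
For concreteness: the HorizontalPodAutoscaler computes desired replicas as ceil(currentReplicas * currentMetric / targetMetric), so a target set well below what the service actually needs fans out pods that never see traffic. A rough Python sketch of the arithmetic (the numbers here are made up for illustration):

    import math

    def desired_replicas(current: int, current_metric: float, target_metric: float) -> int:
        # Kubernetes HPA formula: ceil(currentReplicas * currentMetric / targetMetric)
        return math.ceil(current * (current_metric / target_metric))

    # A brief spike to 90% CPU with an aggressive 30% target triples the fleet...
    print(desired_replicas(4, 0.90, 0.30))  # -> 12
    # ...while a 70% target scales far less for the same spike.
    print(desired_replicas(4, 0.90, 0.70))  # -> 6

The idle-then-terminated pods under the old policy are exactly what you'd expect from a target like that.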


You disagree, but then cite another example of low-hanging fruit that nobody took action on until you came along?

Did you accidentally respond to the wrong comment? Because if anything you're giving another example of "most devs not wanting to interface with ops, hence letting it slide until someone bothers to pick up their slack"...


The first time my director asked me if I'd ever heard of DevOps, I said, "Sure, doing two jobs for one paycheck." I'm a software developer, buddy. I write the programs. Leave me out of running them.


> Leave me out of running them.

This is how customers end up with too-expensive Rube Goldberg machines.

You have to take some interest in how your code will run in production, even if you don't personally "operate" it.


Here's the extent of my interest: I take my understanding of your use case and specifications, then I write source code that generates as few instructions as possible to suit your needs while still being comprehensible to the next maintainer.

The app should write records to a database? Fine. Here's where you configure the connection. The app in production is slow because the database server is weak? Not my problem, talk to your DBA.

The app should expose an HTTP endpoint for liveness probes? Fine. It's served from the path you specified. You reused it for an external outage check, and that's reporting the service as down because the route timed out after your ops team screwed up the reverse proxy? Literally not my problem, I could not care less.
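
For reference, the endpoint itself really is that trivial; here's a minimal sketch using Python's standard library, with a hypothetical /healthz path standing in for "the path you specified":

    from http.server import BaseHTTPRequestHandler, HTTPServer

    HEALTH_PATH = "/healthz"  # hypothetical; in practice, whatever path ops asked for

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == HEALTH_PATH:
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"ok")
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        # The probe only proves the process answers HTTP; it says nothing about
        # reverse proxies or anything else sitting in front of it.
        HTTPServer(("", 8080), Handler).serve_forever()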


Allow me to politely pick apart the "Not my problem, talk to your DBA" comment from the perspective of someone who's worn every IT hat there is.

Okay, so, what is the DBA to do? Double the server capacity to "see if that helps"?

It didn't, and now the opex of the single most expensive cloud server is 2x what it was and is starting to dwarf everything else... combined.

Maybe it's "just" a bad query. Which one? Under what circumstances? Is it supposed to be doing that much work because that's what the app needs, or is it an error that it's sucking down a gigabyte of data every few minutes?

How is the DBA to know what the use cases are?

The best tools for solving these runtime performance problems are modern APM tools like Azure App Insights, OpenTelemetry, and the like.

Some of these products can be injected into precompiled apps using "codeless attach" methods, and this works... okay at best.

So SysOps takes your code, layers on an APM, sees a long list of potential issues... and the developers "don't care" because they think that this is a SysOps thing.

But if the developer takes an interest and is an involved party, then they can integrate the APM software development kit, "enrich" the logged data, log user names, internal business metadata, etc... They log on to the APM web portal and investigate how their app is running in production, with real-world users instead of synthetic tests, with real data, with "noisy neighbours", and all that.

Now if Bob's queries are slowing down the entire platform, it's a trivial matter to track this down and fix Bob's custom report SQL query that is sucking down SELECT * FROM "MassiveReportView" and killing the entire server.

Troubleshooting, performance, security, etc... are all end-to-end things. Nobody can work in isolation and expect a good end result.
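
To make "enrich the logged data" concrete, here is roughly what it looks like with the OpenTelemetry Python SDK; the function and attribute names are hypothetical, and the stub stands in for the real database call:

    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    def run_report_query(report_id: str) -> list:
        # Stand-in for the real database call.
        return [{"report_id": report_id}]

    def generate_report(user_name: str, report_id: str) -> list:
        # Wrap the business operation in a span and attach business metadata,
        # so a slow trace in the APM portal can be tied to a user and a report.
        with tracer.start_as_current_span("generate_report") as span:
            span.set_attribute("app.user_name", user_name)
            span.set_attribute("app.report_id", report_id)
            rows = run_report_query(report_id)
            span.set_attribute("app.row_count", len(rows))
            return rows

Only the developer knows which attributes are worth attaching, which is exactly why they need to be an involved party.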


DBAs don't necessarily need telemetry in an app to diagnose an issue with the app's behavior. They can run a trace and see some SELECT is running a thousand times a second and deduce that it's being called in a loop over the result set of an earlier query. And they'd be right to say hey, this is an app issue, open a ticket with the developer.

If you put that responsibility on the developer--meaning you expect the dev to diagnose an issue that they introduced in the first place--what kind of result do you think you're going to get?

Layering these demands takes away from the overall quality of the application in my experience. You want an app developer to learn all about Prometheus so the app can have an endpoint with all these custom metrics, okay, and you want structured logging and expect the dev to learn how to use Kibana effectively? All that's a huge cognitive burden that eats a slice of the same pie (their brains) as domain knowledge, language & runtime knowledge, etc.
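
For context, the kind of instrumentation being asked for looks roughly like this with the Python prometheus_client (the metric names here are invented):

    from prometheus_client import Counter, Histogram, start_http_server

    REPORTS_GENERATED = Counter(
        "reports_generated_total", "Reports generated, by type", ["report_type"]
    )
    REPORT_LATENCY = Histogram(
        "report_generation_seconds", "Time spent generating a report"
    )

    @REPORT_LATENCY.time()
    def generate_report(report_type: str) -> None:
        REPORTS_GENERATED.labels(report_type=report_type).inc()
        # ... the actual work ...

    if __name__ == "__main__":
        start_http_server(8000)  # serves /metrics on :8000 for Prometheus to scrape
        generate_report("sales")

It isn't much code, but someone still has to know which numbers are worth exposing and what they mean, and that's the cognitive load in question.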

Get maybe one app developer to specialize, get maybe one app developer to cross-train with ops or monitoring even. But leave most of us out of it.

When you flip that expectation of developer involvement in operations, it exposes how unreasonable that arrangement is. Hey, DBA, the app is sucking up resources. Why don't you crack open an IDE and write a patch for it? What do you mean you don't know Go, what do you mean you don't use Git? Every DBA should know how to attach a debugger to a remote process, shouldn't they?

It's just exploitative. Or at least that's been my experience, so there's my bias.


Half joking: this is how you get N+1 queries all around the codebase.
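
For anyone who hasn't seen the term: an N+1 is one query for the parent rows plus one more query per row. A small sqlite3 sketch of the anti-pattern and the single-query fix (table and column names are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
        CREATE TABLE order_items (order_id INTEGER, sku TEXT);
        INSERT INTO orders VALUES (1, 'alice'), (2, 'bob');
        INSERT INTO order_items VALUES (1, 'widget'), (1, 'gadget'), (2, 'widget');
    """)

    # N+1: one query for the orders, then one query per order -- the same SELECT
    # a DBA would see firing over and over in a trace.
    for order_id, customer in conn.execute("SELECT id, customer FROM orders"):
        items = conn.execute(
            "SELECT sku FROM order_items WHERE order_id = ?", (order_id,)
        ).fetchall()

    # The fix: one round trip, let the database do the join.
    rows = conn.execute(
        "SELECT o.customer, i.sku FROM orders o "
        "JOIN order_items i ON i.order_id = o.id"
    ).fetchall()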

Author here: You'd be surprised what you don't notice given enough nodes and slow enough resource growth over time! Even at its high-water mark, this daemonset was still a small portion of the total resource usage in these clusters.


I’m not sure if that makes it better or worse.


It seems realistic to me, commonplace even. Lots to do in a company like this one.


I didn't know what Render was when I skimmed the article at first, but after reading these comments, I had to check out what they do.

And they're a "Cloud Application Platform", meaning they manage deploys and infrastructure for other people. Their website says "Click, click, done." which is cool and quick and all, but to me it's kind of crazy that an organization that should be really engineering-focused and mature doesn't immediately notice 1.2TB being used and try to figure out why, when 120GB ended up being sufficient.

It gives much more of a "we're a startup, we're learning as we're running" vibe which, again, is cool and all, but hardly what people should use for hosting their own stuff.


If your report for the month is "I saved a terabyte of RAM usage across our cluster estate!" and I, as a manager, do some quick maths and say great, that's our income from 2 median customers. We lost 8 customers because we didn't launch feature foo in time, which is what you were supposed to be working on, so your contribution for the month is a massive loss to the company...

Does that frame things differently? There are times in your product lifecycle when you don't want your developers looking at things like this, and times when you do.


Reading comments like yours makes me realise that I should just leave commercial programming and never come back. Your framing is terrifying.

You know why? I'm not saying that what you said doesn't make sense. Of course it makes sense, financially so. But! You, the manager, one day come to me and my team and say "How could we allow 7TB of unused memory to sit around that we paid for?!" and we'll then have multiple follow-up meetings where we'll be scolded and "trained" on how to avoid things like this. We'll get sent articles and told to improve.

And believe me when I tell you, _all_ the techies in these meetings want to roll their eyes through it all. Because many of them likely asked "Can I take a closer look at our infra? It seems expensive and we could potentially optimise it" and were told no by managers like yourself.

As an engineer you just can't win. So I don't blame myself or any other techie who sometimes goes cowboy mode to find such problems without asking.

Finally, "my contribution for the month" is technological work and nothing else. If I wanted to be a cofounder or have a seat in the board so I have fiduciary duty, I would have said so. It's your job as a manager to put this barrier between stakeholders and front soldiers so the latter can do their thing without disruption, so the organisation can succeed.

Are you doing your job well?


7TB in an organization running probably petabytes of RAM total is easy to slip under the radar. There are a lot of systems and a lot of moving parts, and if it's not broke or triggering alarms, you probably don't care very much.


how large are the clusters then?


It probably doesn't help that the first line of treatment for any error is to blindly increase memory request/limit and claim it's fixed (preferably without looking at the logs once).


we have on-prem with heavy spikes (our batch workload can easily utilize the 20TB of memory in the cluster) and we just don't care much; we add 10% to the hardware requested every year. Compared to employing people or paying other vendors (relational databases with many TB-sized tables...), this is just irrelevant.

Sadly, devs are incentivized by that, and moving towards the cloud might be a fun story. Given the environment, I hope they scrap the effort sooner rather than later, buy some Oxide systems for the people who need to iterate faster than the usual process of getting a VM, and redeploy the 10% of the company currently occupied with the cloud (mind you: no real workload runs there yet...) to actually improving local processes...


Somewhat unrelated, but you just tied wasteful software design to high IT salaries, and also suggested a reason why Russian programmers might, on the whole, seem far more effective than we are.

I wonder: if MSFT had simply cut dev salaries by 50% in the '90s, would it have had any measurable effect on Windows quality by today?



