mgerdts's comments | Hacker News

What is up with fin? Is it really just writing an int 0 in the memory right after some variable present in libc or similar?

        extern fin;

        if(getpw(0, pwbuf))
                goto badpw;
        (&fin)[1] = 0;

Predecessor of

    extern FILE *stdin;

I’m guessing v4 C didn’t have structs yet. (v6 C does, but struct members are actually in the global namespace and are basically just sugar for an offset and a type cast; member access even worked on literals. That’s why structs from early unix APIs have prefixed member names, like st_mode.)
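
A rough modern-C sketch of what that sugar amounts to (hypothetical struct and values, just to illustrate that a member name boils down to an offset plus a type):

    /* Hedged sketch: "s.st_mode" in early-C terms is roughly
       "take the base address, add the member's offset, cast, dereference". */
    #include <stdio.h>
    #include <stddef.h>

    struct stat_like { int st_dev; int st_mode; };

    int main(void) {
        struct stat_like s = { 1, 0644 };
        char *base = (char *)&s;
        int mode = *(int *)(base + offsetof(struct stat_like, st_mode));
        printf("%o\n", mode);   /* prints 644, same as s.st_mode */
        return 0;
    }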

> I’m guessing v4 C didn’t have structs yet

There may have been an early C without structs (B had none), but according to Ken Thompson, the addition of structs to C was an important change, and a reason why his third attempt at rewriting UNIX from assembly into a portable language finally succeeded. Certainly by the time the recently recovered v4 tape was made, C had structs:

    ~/unix_v4$ cat usr/sys/proc.h
    struct proc {
            char    p_stat;
            char    p_flag;
            char    p_pri;
            char    p_sig;
            char    p_null;
            char    p_time;
            int     p_ttyp;
            int     p_pid;
            int     p_ppid;
            int     p_addr;
            int     p_size;
            int     p_wchan;
            int     *p_textp;
    } proc[NPROC];

    /* stat codes */
    #define SSLEEP  1
    #define SWAIT   2
    #define SRUN    3
    #define SIDL    4
    #define SZOMB   5

    /* flag codes */
    #define SLOAD   01
    #define SSYS    02
    #define SLOCK   04
    #define SSWAP   010


Heh. I had the same impulse but then didn't do it, upon refreshing the page your comment was there :)

According to the chatbot, the first word of `fin` is the file descriptor, the second its state. "Reset stdin’s flags to a clean state".

It seems pointless to issue flush commands when writing to an NVMe drive with a direct IO implementation that functions properly. The NVMe spec says:

> 6.8 Flush command

> …

> If a volatile write cache is not present or not enabled, then Flush commands shall complete successfully and have no effect.

And:

> 5.21.1.6 Volatile Write Cache

> …

> Note: If the controller is able to guarantee that data present in a write cache is written to non-volatile media on loss of power, then that write cache is considered non-volatile and this feature does not apply to that write cache.


If you know your application will only ever run against enterprise SSDs with power loss protection, then sending flush commands to the drive itself would indeed be pointless no-ops. But if it's a flush command that has effects somewhere between the application layer and the NVMe drive (eg. if you're not using direct IO), or if there's any possibility of the code being run on a consumer SSD (eg. a developer's laptop), then the flush commands are probably worth including; the performance hit on enterprise drives will be very small.

IOCTLs can tell you if write caching is enabled or not. Can they reliably tell you whether the write cache is volatile, though? Many drives with PLPs still report volatile write caches, or at least did when I was testing this a few years back.
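
On Linux the enabled/disabled part is easy to check yourself with an NVMe admin passthrough. A minimal sketch (assumes /dev/nvme0 exists and you have permission to issue admin commands); it answers "is a write cache enabled", not the harder "is it actually volatile" question:

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/nvme_ioctl.h>

    int main(void) {
        int fd = open("/dev/nvme0", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct nvme_admin_cmd cmd;
        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode = 0x0a;   /* Get Features */
        cmd.cdw10  = 0x06;   /* FID 06h: Volatile Write Cache */

        if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
            perror("NVME_IOCTL_ADMIN_CMD");
            return 1;
        }
        /* Completion DW0 bit 0 is WCE (write cache enable). */
        printf("volatile write cache: %s\n",
               (cmd.result & 1) ? "enabled" : "disabled or not present");
        close(fd);
        return 0;
    }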

What SSDs are reasonably performant without a volatile write cache? The standards you quote specify why it is necessary to issue flush!

Per the definition of volatile write cache in the standard I quoted, pretty much any TLC drive in the hyperscaler, datacenter, or enterprise product lineup will have great write performance. They have a DRAM cache that is battery-backed, and as such it is not a volatile write cache.

A specific somewhat dated example: Samsung 980 Pro (consumer client), PM9A1 (OEM client), and PM9A3 (datacenter) are very similar drives that have the same PCI ID and are all available as M.2. PM9A3 drives have power loss protection and the others don’t. The PM9A3 has very consistent write latency (on the order of 20 - 50 μs when not exceptionally busy) and very consistent throughput (up to 1.5 GB/s) regardless of how full it is. The same cannot be said of the client drives, which lack PLP but have tricks like TurboWrite (aka pseudo-SLC). When more than 30% of the NAND is erased, the client drives can take writes at 5 GB/s, but that rate falls off a cliff and gets wobbly when the pseudo-SLC cache fills.


Thanks! Yes, as the sibling noted, if you limit this to PLP drives it makes sense, but that is also a special case. Outside of the latency hit (which is significant in some cases), FLUSH is nearly free on those anyway.

The original idea of boot environments in Solaris came from Live Upgrade, which worked at least as far back as Solaris 8. Live Upgrade was not part of Solaris; rather, it was an add-on that came from the services or enterprise support parts of Sun.

Solaris 11 made boot environments a mandatory part of the OS, which was an obvious choice with the transition from UFS to ZFS for the root fs. This came into Solaris development a bit before Solaris 11, so it was present in OpenSolaris and lives on in many forms of illumos.


This article is a great read explaining how this trap happens.

https://www.yesigiveafig.com/p/part-1-my-life-is-a-lie


Datacenter storage will generally not be using M.2 client drives. Those employ optimizations that win many benchmarks but sacrifice consistency along multiple dimensions (power loss protection, write performance that degrades as they fill, perhaps others).

With SSDs, the write pattern is very important to read performance.

Datacenter and enterprise class drives tend to have a maximum transfer size of 128k, which is seemingly the NAND block size. A block is the thing that needs to be erased before rewriting.

Most drives seem to have an indirection unit size of 4k. If a write is not a multiple of the IU size or not aligned, the drive will have to do a read-modify-write. It is the IU size that is most relevant to filesystem block size.

If a small write happens atop a block that was fully written with one write, a read of that LBA range will lead to at least two NAND reads until garbage collection fixes it.

If all writes are done such that they are 128k aligned, sequential reads will be optimal and, with sufficient queue depth, random 128k reads may match sequential read speed. Depending on the drive, sequential reads may retain an edge due to the drive’s read ahead. My own benchmarks of gen4 U.2 drives generally back up these statements.

At these speeds, the OS or app performing buffered reads may lead to reduced speed because cache management becomes relatively expensive. Testing should be done with direct IO using libaio or similar.
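
As a minimal illustration of the direct IO requirements (Linux; assumes /dev/nvme0n1 exists and you have permission to open it with O_DIRECT), both the buffer and the offset need to be aligned or the kernel will reject the read. Real benchmarking would add libaio or io_uring with a deep queue:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        const size_t sz = 128 * 1024;          /* one full 128k transfer */
        void *buf;
        if (posix_memalign(&buf, 4096, sz))    /* align buffer to the 4k IU */
            return 1;

        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n = pread(fd, buf, sz, 0);     /* offset is also 128k aligned */
        if (n < 0) perror("pread");
        else printf("read %zd bytes\n", n);

        close(fd);
        free(buf);
        return 0;
    }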


This article is talking about SATA SSDs, not HDDs. While the NVMe spec does allow for NVMe HDDs, it seems silly to waste even one PCIe lane on an HDD. SATA HDDs continue to make sense.


And I'm saying that assuming M.2 slots are sufficient to replace SATA is folly, because that only covers SSDs.

And SATA SSDs do make sense: they are significantly more cost effective than NVMe and trivial to expand. Compare the simplicity, ease, and cost of building an array/pool of many disks comprised of either 2.5" SATA SSDs or M.2 NVMe and get back to me when you have a solution that can scale to 8, 14, or 60 disks as easily and cheaply as the SATA option can. There are many cases where the performance of SSDs going over AHCI (or SAS) is plenty and you don't need to pay the cost of going to full-on PCIe lanes per disk.


> And SATA SSDs do make sense, they are significantly more cost effective than NVMe

That doesn't seem to be what the vendors think, and they're probably in a better position to know what's selling well and how much it costs to build.

We're probably reaching the point where the up-front cost of qualifying new NAND with old SATA SSD controllers, and of updating the firmware to properly manage that new NAND, cannot be recouped by a year or two of sales of an updated SATA SSD.

SATA SSDs are a technological dead end that's no longer economically important for consumer storage or large scale datacenter deployments. The one remaining niche you've pointed to (low-performance storage servers) is not a large enough market to sustain anything like the product ecosystem that existed a decade ago for SATA SSDs.


In addition to my other comments about parallel IO and unbuffered IO, be aware that WS2022 has (had?) a rather slow NVMe driver. It has been improved in WS2025.


I just benchmarked this to death using a 24-core VM with two different kinds of NVMe storage.

Windows Server 2025 is somewhat better on reads but only at low parallelism.

There’s no difference on writes.


I just stumbled across this:

> Native NVMe is now generally available (GA) with an opt-in model (disabled by default as of October’s latest cumulative update for WS2025).

https://www.elevenforum.com/t/announcing-native-nvme-in-wind...


Robocopy has options for unbuffered IO (/J) and parallel operations (/MT:N) which could make it go much faster.

Performing parallel copies is probably the big win with less than 10 Gb/s of network bandwidth. This will allow SMB multichannel to use multiple connections, hiding some of the slowness you can get with a single TCP connection.

When doing more than 1-2 GB/s of IO the page cache can start to slow IO down. That’s when unbuffered (direct) IO starts to show a lot of benefit.
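
For example (hypothetical source and destination paths; /E copies subdirectories, /MT:32 uses 32 copy threads, /J requests unbuffered IO):

    robocopy D:\data \\nas\backup\data /E /J /MT:32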


The strange thing is, I did have /MT:32 on (added in a comment at the bottom of the page because I had to go to bed). I like to stick with defaults but I'm not that inept. /J probably shouldn't matter for my use case because 125 MBps just isn't that much in the grand scheme of things.


A workload that uses only a fraction of such a system can be corralled onto a single socket (or a portion thereof) and use local memory through the use of cgroups.

Most likely other workloads will also run on this machine. They can be similarly bound to meet their needs.

With kubernetes, CPU manager can be a big help.
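
A minimal cgroup v2 sketch of that corralling (assumes the cpuset controller is enabled in the parent's cgroup.subtree_control, that CPUs 0-23 and NUMA node 0 make up the local socket on this particular box, and $DB_PID is a placeholder for the workload's pid):

    mkdir /sys/fs/cgroup/dbwork
    echo 0-23 > /sys/fs/cgroup/dbwork/cpuset.cpus      # CPUs on socket 0
    echo 0 > /sys/fs/cgroup/dbwork/cpuset.mems         # memory from node 0 only
    echo $DB_PID > /sys/fs/cgroup/dbwork/cgroup.procs  # move the workload in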


That’s not the kind of software I had in mind. I mean single large logical systems—databases being likely the largest and most common—that can’t meaningfully be distributed & are still growing in size and workload scale.


This article misses several important points.

- Consumer drives like Samsung 980 Pro and WD SN 850 Black use TLC as SLC when about 30+% of the drive is erased. At that point you can burst write a bit less than 10% of the drive capacity at 5 GB/s. After that, it slows remarkably. If the filesystem doesn’t automatically trim free space, the drive will eventually be stuck in slow mode all the time.

- Write amplification factor (WAF) is not discussed. Random small writes and partial block deletions will trigger garbage collection, which ends up rewriting data to reclaim freed space in a NAND block.

- A drive with a lot of erased blocks can endure more TBW than one that has all user blocks with data. This is because garbage collection can be more efficient. Again, enable TRIM on your fs (see the sketch after this list).

- Overprovisioning can be used to increase a drive’s TBW. If before you write to your 0.3 DWPD 1024 GB drive, you partition it so you use only 960 GB, you now have a 1 DWPD drive.

- Per the NVMe spec, there are indicators of drive health in the SMART log page.

- Almost all current datacenter or enterprise drives support an OCP SMART log page. This allows you to observe things like the write amplification factor (WAF), rereads due to ECC errors, etc.
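
Regarding the TRIM points above, on Linux the periodic trim is usually just the following (assumes util-linux's fstrim; the systemd timer runs weekly on most distros):

    fstrim -av                           # trim free space on all mounted filesystems now
    systemctl enable --now fstrim.timer  # or let the scheduled timer handle it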


You’re also missing an important factor: Many drives now reserve some space that cannot be used by the consumer so they have extra space to work with. This is called factory overprovisioning.

> - Consumer drives like Samsung 980 Pro and WD SN 850 Black use TLC as SLC when about 30+% of the drive is erased. At that point you can burst write a bit less than 10% of the drive capacity at 5 GB/s. After that, it slows remarkably. If the filesystem doesn’t automatically trim free space, the drive will eventually be stuck in slow mode all the time.

This is true, but despite all of the controversy about this feature it’s hard to encounter this in practical consumer use patterns.

With the 980 Pro 1TB you can write 113GB before it slows down. (Source https://www.techpowerup.com/review/samsung-980-pro-1-tb-ssd/... ) So you need to be able to source that much data from another high speed SSD and then fill nearly 1/8th of the drive to encounter the slowdown. Even when it slows down you’re still writing at 1.5GB/sec. Also remember that the drive is factory overprovisioned so there is always some amount of space left to handle some of this burst writing.

For as much as this fact gets brought up, I doubt most consumers ever encounter this condition. Someone who is copying very large video files from one drive to another might encounter it on certain operations, but even in slow mode you’re filling the entire drive capacity in under 10 minutes.


> You’re also missing an important factor: Many drives now reserve some space that cannot be used by the consumer so they have extra space to work with. This is called factory overprovisioning.

This has always been the case, which is why even a decade ago the “pro” drives came in odd sizes like 120 GB vs 128 GB.

Products like that still exist today and the problem tends to show up as drives age and that pool shrinks.

DWPD and TB written like modern consumer drives use are just different ways of communicating that contract.

FWIW, if you do a drive-wide discard and then partition only 90% of the drive, you can dramatically reduce the garbage collection slowdown on consumer drives.

In the world of ML and containers you can hit that if you, say, have fstrim scheduled once a week to avoid the cost of online discards.

I would rather have visibility into the size of the reserve space through smart, but I doubt that will happen.


> You’re also missing an important factor: Many drives now reserve some space that cannot be used by the consumer so they have extra space to work with. This is called factory overprovisioning.

I think it is safe to say that all drives have this. Refer to the available spare field in the SMART log page (likely via smartctl -a) to see the percentage of factory overprovisioned blocks that are still available.

I hypothesize that as this OP space dwindles writes get slower because they are more likely to get bogged down behind garbage collection.

> I doubt most consumers ever encounter this condition. Someone who is copying very large video files from one drive to another might encounter it on certain operations

I agree. I agree so much that I question the assertion that drive slowness is a major factor in machines feeling slow. My slow laptop is about 5 years old. Firefox spikes to 100+% CPU for several seconds on most page loads. The drive is idle during that time. I place the vast majority of the blame on software bloat.

That said, I am aware of credible assertions that drive wear has contributed to measurable regression in VM boot time for a certain class of servers I’ve worked on.


PCIe 5.0 SSDs have now been available for 6+ months and you could back up your SSD at 15 GB/s, but:

> you’re still writing at 1.5GB/sec.

Except for a few seconds at the start, the whole process takes as long as if you had PCIe 2.0 (15+ years ago). With SSDs this fast, there is no chance of making a quick backup/restore. And during the restore you're too slow for the second time in a row.

It's crazy: back in the days of slow PCIe 1.0, fast SLC was in use rather than something like today's slow PLC. Now with PCIe 5.0, when you really need fast SLC, you get slow TLC, very slow QLC, or, even worse, PLC is coming.


> With the 980 Pro 1TB you can write 113GB before it slows down.

113GB is pretty easily reached with video files.

