`openat` has basically solved that since 2.6.16 (which came out in 2006). There are still some uncommon APIs that have been slow to gain `at` variants, but there's usually a workaround (for example, `getxattrat` and family were only added in 6.13 (this January), but they can be implemented in terms of `openat` + `fgetxattr`).
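Roughly like this (a sketch, not a drop-in replacement: `getxattrat_compat` is a made-up name, error handling is minimal, and it only works on files you can open for reading):

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/xattr.h>

/* Emulate a getxattrat()-style call with openat() + fgetxattr(). */
ssize_t getxattrat_compat(int dirfd, const char *path, const char *name,
                          void *value, size_t size)
{
    int fd = openat(dirfd, path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    ssize_t ret = fgetxattr(fd, name, value, size);
    close(fd);
    return ret;
}
```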
This is good because the current directory is conceptually process-wide for the user as well. If your program isn't a shell or something that performs a similar function, then you probably should not change the working directory at all.
If you need thread-specific local paths, just use one of the *at() variants that let you explicitly specify a directory handle that your path is relative to.
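E.g. (a minimal sketch; `/etc/hostname` is just an example path):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Each thread could hold its own directory handle like this one,
     * instead of relying on the process-wide working directory. */
    int dirfd = open("/etc", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dirfd < 0)
        return 1;

    /* Resolved relative to dirfd, not the process working directory. */
    int fd = openat(dirfd, "hostname", O_RDONLY);
    if (fd >= 0) {
        char buf[256];
        ssize_t n = read(fd, buf, sizeof buf - 1);
        if (n > 0) {
            buf[n] = '\0';
            fputs(buf, stdout);
        }
        close(fd);
    }
    close(dirfd);
    return 0;
}
```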
Technically, it's true, as far as I am aware. Android has low level functions which themselves then directly call Linux, but you are not allowed to make a direct Linux call.
In the Android-y part of Android, yes, they use ART, which is abstracted.
But, they also give app authors the NDK. How do you think Android knows whether it's libc or your app making a syscall from native code?
It doesn't - you're allowed access to full Linux-land. The syscall numbers and userland EABI for Android are the same as for any other Linux. They use a different `libc` from most GNU/Linux flavors (one of a few places this wonderful turn of phrase actually makes sense), but this is no different from Alpine Linux using `musl` instead of `glibc`, for example.
As such, you can use `musl`, `glibc`, or non-libc-based environments (Rust, Zig) on Android without issue. You can run any C you want, either by porting it to `bionic` (most Termux apps, although they support glibc now), statically linking your own runtime (Rust, Zig, etc.), or abusing the dynamic linker into making glibc work (termux-glibc, I think).
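For instance, this little sketch compiles unchanged with the NDK's clang or with any other Linux toolchain, and ends up at the same kernel entry point either way (SYS_getpid is about the most harmless syscall there is):

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    /* bionic provides syscall() and the usual SYS_* numbers, just like
     * glibc or musl; the kernel ABI underneath is plain Linux. */
    long pid = syscall(SYS_getpid);
    printf("pid via raw syscall: %ld\n", pid);
    return 0;
}
```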
Yes, but there are strict SELinux policies forbidding you from accessing certain "dangerous" stuff like execve(), which may end up killing native terminal emulators like Termux (despite Android not forbidding dynamic code execution elsewhere, Play Store policies aside).
Sure, and this VM solution is the exact path forward for “install stuff in a box” solutions as Android moves towards trying to enforce W^X, which is probably why they chose a full Linux VM as their demo app. Emulators, games, and apps with embedded JITs will be harder to deal with.
What do you imagine happens when you call fopen()? At some point there's a transition from user mode into kernel mode. What can the system do to prevent normal "app" code from making the transition but not "low level" code, when it's all user mode stuff running in the same process?
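For example (a sketch; /proc/version is purely an example path), all three of these end in the same openat syscall, and the kernel has no way to tell which layer of user-mode code set it up:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    FILE *f  = fopen("/proc/version", "r");       /* libc stdio wrapper */
    int  fd1 = open("/proc/version", O_RDONLY);   /* thin libc wrapper  */
    int  fd2 = (int)syscall(SYS_openat, AT_FDCWD, /* no wrapper at all  */
                            "/proc/version", O_RDONLY);

    printf("fopen=%p open=%d raw syscall=%d\n", (void *)f, fd1, fd2);

    if (f) fclose(f);
    if (fd1 >= 0) close(fd1);
    if (fd2 >= 0) close(fd2);
    return 0;
}
```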
What exactly is it you're "right" about? How do you think any libc wrapper or direct syscall works? Syscalls are filtered for security reasons on many platforms, including Android. I mean, a syscall is literally an API itself to get a kernel's internals to do something for you.
What you're suggesting is, as someone else said, nonsensical. Phrased without invincible ignorance, it's equivalent to saying, "Android prevents you from creating a new OS kernel to supplant its own control over hardware access," which is true of basically any OS. How helpful or secure would it be if you could make 0 guarantees about how your hardware is used?
You can do direct hardware access through drivers that are integrated/interfaced with the kernel somehow, but this is still not entirely arbitrary access, and it still goes through the kernel in a way.
1 - Core system firmware data/host platform configuration; typically contains serial and model numbers
2 - Extended or pluggable executable code; includes option ROMs on pluggable hardware
3 - Extended or pluggable firmware data; includes information about pluggable hardware
4 - Boot loader and additional drivers; binaries and extensions loaded by the boot loader
7 - SecureBoot state
8 - Commands and kernel command line
9 - All files read (including kernel image)
Now the problem is, 8 and 9 are, I would argue, the most important (since technically 7 probably covers everything else in that list?), whereas my kernel and initrd are not encrypted and my command line can just be edited (but normally wouldn't need to be). But I can't find any way to get grub, from a booted system, to simulate the output of those values so I can pre-seal the LUKS volume with the new values.
So in practice, I just always need to remember my password (bad), which means there's no way to make a reasonable assessment of system integrity on boot if I get prompted. (I'd also argue the UI experience here isn't good: if I'm being prompted for a password, that clevis boot script should output what changed at what level - i.e. if Secure Boot got turned off, or my UEFI firmware changed on me while I'm staying in a hotel, maybe I shouldn't unlock that disk.)
> You can mitigate this by including PCRs that sign the kernel and initrd.
No, that's not an effective mitigation. The signed kernel+initrd would still boot into the impersonated root.
> however it means whenever you update you need to unlock manually. On Redhat-based distros this can be done with PCRs 8 and 9, though IIRC this may change on other distros.
> Also AFAIK there is no standard way to guess the new PCRs on reboot so you can't pre-update them before rebooting. So you either need to unlock manually or use a network decryption like dracut-sshd.
For multiple users on the same server it was IMO well designed. Everyone had their ~ and could place whatever libraries/binaries/etc. in there and do whatever they wanted.
Package managers are way more modern than that, and their design does not by itself require root (see pip). You can in fact run most package managers without root; you just won't be able to modify system files. You can use them to install a chroot as a regular user, e.g. `zypper --installroot ~/tw install bash`.
FUSE doesn't really relate to single vs. multi-user AFAICT.
Users are perfectly sandboxed if you configure the system that way. Depending on the distribution that's even the default.
This is largely a package manager problem. There is a way to run Homebrew (the package manager widely used on macOS) on Linux in a rootless mode, and it will install packages into your home directory no problem.
It’s a good trick to have in your back pocket if you’re given an unprivileged user on a compute node and want to make use of modern tools.
> You can in fact run most package managers without root
It is very clear from the context that dehrmann was talking about Linux distro package managers (Apt, Yum, Dnf, Apk, etc.) and as far as I know they all require root, or at least I have never once seen someone use them without root.
I figured most package managers (brew, pip, nix, npm, etc.) are not actually one of the few Linux distro package managers. You listed them almost exhaustively after all (excepting pacman).
Right but as I said from the context it was clear he was talking about distro package managers, not language package managers.
Nix requires root (at least by default). Brew I'll give you - I didn't know you can use it on Linux. Do people actually do that enough that it works reliably?
> We love to praise Unix, but it wasn't built for modern multi-user use. FUSE was an after-thought. So were package managers, and they got added, but they require root.
The latency shows after how many cycles the result of an instruction can be consumed by another instruction, while the throughput shows how many such instructions can be issued per cycle, i.e. in parallel.
I believe the throughput shown in those tables is the total throughput for the whole CPU core, so it isn't immediately obvious which instructions have high throughput due to pipelining within an execution unit and which have high throughput due just to the core having several execution units capable of handling that instruction.
That's true, but another part of the tables shows how many "ports" the operation can be executed on, which is enough information to conclude whether an operation is pipelined.
For example, for many years Intel chips had a multiplier unit on a single port, with a latency of 3 cycles, but an inverse throughput of 1 cycle, so effectively pipelined across 3 stages.
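To make the latency/throughput distinction concrete, here's a crude C sketch (constants and iteration counts are arbitrary, and wall-clock timing is only indicative): a dependent chain of 64-bit multiplies is limited by the multiplier's latency, while independent chains can overlap in its pipeline and approach the 1-per-cycle throughput.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    const long N = 100000000;
    volatile uint64_t sink;

    /* Dependent chain: each multiply needs the previous result,
     * so it runs at roughly one multiply per latency. */
    uint64_t x = 12345;
    double t0 = seconds();
    for (long i = 0; i < N; i++)
        x = x * 0x9E3779B97F4A7C15ULL;
    double dep = seconds() - t0;
    sink = x;

    /* Four independent chains: the multiplies can overlap in the
     * pipelined unit, approaching one multiply per cycle. */
    uint64_t a = 1, b = 2, c = 3, d = 4;
    t0 = seconds();
    for (long i = 0; i < N; i++) {
        a *= 0x9E3779B97F4A7C15ULL;
        b *= 0xC2B2AE3D27D4EB4FULL;
        c *= 0x165667B19E3779F9ULL;
        d *= 0x27D4EB2F165667C5ULL;
    }
    double ind = seconds() - t0;
    sink = a ^ b ^ c ^ d;
    (void)sink;

    printf("dependent:   %.3f ns per multiply\n", dep / N * 1e9);
    printf("independent: %.3f ns per multiply\n", ind / (4.0 * N) * 1e9);
    return 0;
}
```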
In any case, I think uops.info [1] has replaced Agner for up-to-date and detailed information on instruction execution.
Yes. In the past new HW has been made available to the uops.info authors in order to run their benchmark suite and publish new numbers: I'm not sure if that just hasn't happened for the new stuff, or if they are not interested in updating it.
FWIW, there are two ideas of parallelism being conflated here. One is the parallel execution of the different sequential steps of an instruction (e.g. fetch, decode, operate, retire). That's "pipelining", and it's a different idea than decoding multiple instructions in a cycle and sending them to one of many execution units (which is usually just called "dispatch", though "out of order execution" tends to connote the same idea in practice).
The Fog tables try hard to show the former, not the latter. You measure dispatch parallelism with benchmarks, not microscopes.
Also IIRC there are still some non-pipelined units in Intel chips, like the division engine, which show latency numbers ~= to their execution time.
I don't think anyone is talking about "fetch, decode, operate, retire" pipelining (though that is certainly called pipelining): only pipelining within the execution of an instruction that takes multiple cycles just to execute (i.e., latency from input-ready to output-ready).
Pipelining in stages like fetch and decode is mostly hidden in these small benchmarks, but becomes visible when there are branch mispredictions, other types of flushes, I$ misses and so on.
> I don't think anyone is talking about "fetch, decode, operate, retire" pipelining (though that is certainly called pipelining): only pipelining within the execution of an instruction that takes multiple cycles just to execute (i.e., latency from input-ready to output-ready).
I'm curious what you think the distinction is? Those statements are equivalent. The circuit implementing "an instruction" can't work in a single cycle, so you break it up and overlap sequentially issued instructions. Exactly what they do will be different for different hardware, sure, clearly we've moved beyond the classic four stage Patterson pipeline. But that doesn't make it a different kind of pipelining!
We are interested in the software visible performance effects of pipelining. For small benchmarks that don't miss in the predictors or icache, this mostly means execution pipelining. That's the type of pipelining the article is discussing and the type of pipelining considered in instruction performance breakdowns considered by Agner, uops.info, simulated by LLVM-MCA, etc.
I.e., a lot of what you need to model for tight loops only depends on the execution latencies (as little as 1 cycle), and not on the full pipeline end-to-end latency (almost always more than 10 cycles on big OoO, maybe more than 20).
Adding to this: the distinction is that an entire "instruction pipeline" can be [and often is] decomposed into many different pipelined circuits. This article is specifically describing the fact that some execution units are pipelined.
Those are different notions of pipelining with different motivations: one is motivated by "instruction-level parallelism," and the other is motivated by "achieving higher clock rates." If 64-bit multiplication were not pipelined, the minimum achievable clock period would be constrained by "how long it takes for bits to propagate through your multiplier."
> one is motivated by "instruction-level parallelism," and the other is motivated by "achieving higher clock rates."
Which are exactly the same thing? For exactly the same reasons?
Sure, you can focus your investigation on one or the other but that doesn't change what they are or somehow change the motivations for why it is being done.
And you can have a shorter clock period than your non-pipelined multiplier just fine. Just that other uses of that multiplier would stall in the meantime.
An OoO design is qualitatively different from an in-order one because of renaming and dynamic scheduling, but the pipelining is essentially the same and for the same reasons.
> and it's a different idea than decoding multiple instructions in a cycle and sending them to one of many execution units (which is usually just called "dispatch", though "out of order execution
Being able to execute multiple instructions per cycle is more properly called superscalar execution, right? In-order designs are also capable of doing it, and the separate execution units do not even need to run in lockstep (consider the original P5 U and V pipes).
Right; it's easy to forget that superscalar CPU cores don't actually have to be in-order, but most of them are out-of-order because that's usually necessary to make good use of a wide superscalar core.
(What's the best-performing in-order general purpose CPU core? POWER6 was notably in-order and ran at quite high clock speeds for the time. Intel's first-gen Atom cores were in-order and came out around the same time as POWER6, but at half the clock speed. SPARC T3 ran at an even lower clock speed.)
The IBM Z10 came out a year later. It was co-designed with POWER6 as part of IBM's eClipz project, and shared a number of features / design choices, including in-order execution.
In-order parallel designs are "VLIW". The jargon indeed gets thick. :)
But as to OO: the whole idea of issuing sequential instructions in parallel means that the hardware needs to track dependencies between them so they can't race ahead of their inputs. And if you're going to do that anyway, allowing them to retire out of order is a big performance/transistor-count win as it allows the pipeline lengths to be different.
VLIW is again a different thing. It uses a single instruction that encodes multiple independent operations to simplify decoding and tracking, usually with exposed pipelines.
But you can have, for example, a classic in-order RISC design that allows for parallel execution. OoO renaming is not necessary for dependency tracking (in fact even scalar in order CPUs need dependency tracking to solve RAW and other hazards), it is "only" needed for executing around stalled instructions (while an in order design will stall the whole pipeline).
Again, P5 (i.e. the original Pentium) was a very traditional in-order design, yet could execute up to two instructions per cycle.
No it isn't. I'm being very deliberate here in refusing pedantry. In practice, "multiple dispatch" means "OO" in the same way that "VLIW" means "parallel in-order dispatch". Yes, you can imagine hypothetical CPUs that mix the distinction, but they'd be so weird that they'd never be built. Discussing the jargon without context only confuses things.
> you can have, for example, a classic in-order RISC design that allows for parallel execution.
Only by inventing VLIW, though, otherwise there's no way to tell the CPU how to order what it does. Which is my point; the ideas are joined at the hip. Note that the Pentium had two defined pipes with specific rules about how the pairing was encoded in the instruction stream. It was, in practice, a VLIW architecture (just one with a variable length encoding and where most of the available instruction bundles only filled one slot)! Pedantry hurts in this world, it doesn't help.
> Note that the Pentium had two defined pipes with specific rules about how the pairing was encoded in the instruction stream. It was, in practice, a VLIW architecture (just one with a variable length encoding and where most of the available instruction bundles only filled one slot)!
This is ridiculous. There are no nop-filled slots in the instruction stream, and you can't even be sure which instructions will issue together unless you trace backwards far enough to find a sequence of instructions that can only be executed on port 0 and thus provide a known synchronization point. The P5 only has one small thing in common with VLIW, and there's already a well-accepted name for that feature, and it isn't VLIW.
Meh. P5's decode algorithm looks very much like VLIW to me, and emphatically not like the 21264/P6 style of dispatch that came to dominate later. I find that notable, and in particular I find senseless adherence to jargon definitions[1] hurts and doesn't help in this sphere. Arguing about how to label technology instead of explaining what it does is a bad smell.
[1] That never really worked anyway. ia64, Transmeta's devices and Xtensa HiFi are all "VLIW" by your definition yet work nothing like each other.
The point was exactly that "VLIW" as a term has basically no meaning. What it "means" in practice is parallel in-order dispatch (and nothing about the instruction format), which is what I said upthread.
VLIW means Very Long Instruction Word. It is a property of the instruction set, not of the processor that implements it.
You could have a VLIW ISA that is implemented by a processor that "unrolls" each instruction word and mostly executes the constituent instructions serially.
> Also IIRC there are still some non-pipelined units in Intel chips, like the division engine, which show latency numbers ~= to their execution time
I don't think that's accurate. That latency exists because the execution unit is pipelined. If it were not pipelined, there would be no latency. The latency corresponds to the fact that "doing division" is distributed across multiple clock cycles.
Division is complicated by the fact that it is a complex micro-coded operation with many component micro-operations. Many or all of those micro-operations may in fact be pipelined (e.g., 3/1 lat/itput), but the overall effect of executing a large number of them looks not very pipelined at all (e.g., 20 of them on a single EU would have 22/20 lat/itput, basically not pipelined when examined at that level).
Sorry, correcting myself here: it's cut across multiple cycles but not pipelined. Maybe I confused this with multiplication?
If it were pipelined, you'd expect to be able to schedule DIV every cycle, but I don't think that's the case. Plus, 99% of the time the pipeline would just be doing nothing because normal programs aren't doing 18 DIV instructions in a row :^)
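The same kind of crude test as the multiply one above works here (just a sketch; the divisors are derived from argc so the compiler can't replace the DIV with a multiply-by-reciprocal, and note that integer division latency on modern cores also depends on operand values): how much faster the independent version runs per operation tells you how pipelined, or not, the divider really is on your core.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(int argc, char **argv)
{
    (void)argv;
    const long N = 10000000;
    volatile uint64_t sink;

    /* Divisors unknown at compile time, so real DIV instructions are emitted. */
    uint64_t q1 = 3 + (uint64_t)argc, q2 = 5 + (uint64_t)argc,
             q3 = 7 + (uint64_t)argc, q4 = 11 + (uint64_t)argc;

    /* Dependent: each division consumes the previous quotient. */
    uint64_t x = UINT64_MAX;
    double t0 = seconds();
    for (long i = 0; i < N; i++)
        x = x / q1 + UINT64_MAX / 2;   /* keep the operand large */
    double dep = seconds() - t0;
    sink = x;

    /* Independent: four chains with no data dependence between them. */
    uint64_t a = UINT64_MAX, b = UINT64_MAX - 1, c = UINT64_MAX - 2, d = UINT64_MAX - 3;
    t0 = seconds();
    for (long i = 0; i < N; i++) {
        a = a / q1 + UINT64_MAX / 2;
        b = b / q2 + UINT64_MAX / 2;
        c = c / q3 + UINT64_MAX / 2;
        d = d / q4 + UINT64_MAX / 2;
    }
    double ind = seconds() - t0;
    sink = a ^ b ^ c ^ d;
    (void)sink;

    printf("dependent:   %.2f ns per divide\n", dep / N * 1e9);
    printf("independent: %.2f ns per divide\n", ind / (4.0 * N) * 1e9);
    return 0;
}
```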
Wine doesn't do any of that as far as I know. It loads a Windows PE file and its (placeholder) DLL dependencies, and starts executing at the main entrypoint. Box64 does translation and has some specific placeholder DLLs that intercept API calls and call native code rather than emulating the entire process.
If what you describe is being done, it's being done by Box64. I don't know enough about aarch64 to know what ARM64X and ARM64EC do, though. I can find [a Github comment](https://github.com/ptitSeb/box64/pull/858#issuecomment-16057...) where one of the authors states that the goal is to implement ARM64EC, but I wouldn't know what the status is on that.
Wine does support ARM64EC and I'm pretty sure you can actually use an x86 emulator with it too (at the very least Hangover should support this, minimizing the amount of emulated code needed.)
And Kalpa is just that, with Plasma as the DE.