
> Chinese novels are on the other side of the spectrum. The sentences simply can't be very long, and often don't have any connecting words between sentences. The readers have to infer.

There is no grammatical ceiling on sentence length in Sinitic languages: Chinese languages (all of them) can form long sentences, and they all possess a great many connecting words. Computational work on Chinese explicitly talks about «long Chinese sentences» and how to parse them[0].

However, many Chinese varieties and writing styles often rely more on parataxis[1] than English does, so relations between clauses are more often (but not always) conveyed by meaning, word order, aspect, punctuation, and discourse context, rather than by obligatory overt conjunctions. That is a tendency, not an inability.

[0] https://nlpr.ia.ac.cn/2005papers/gjhy/gh77.pdf

[1] https://hub.hku.hk/bitstream/10722/127800/1/Content.pdf


Sure. You can try to create arbitrarily long sentences with nested clauses in Chinese. Just like in English you can create arbitrarily long sentences like: "I live in a house which was built by the builders who were hired by the owner who came from England on a steamship which was built...".

But it feels unnatural. So most Chinese sentences are fairly short as a result. And it's also why commas, stops, and even spacing between words are a fairly recent invention. They are simply not needed when the text is formed of implicitly connected statements that don't need to be deeply nested.

To give an example, here's our favorite long-winded Ishmael: "Yes, here were a set of sea-dogs, many of whom without the slightest bashfulness had boarded great whales on the high seas—entire strangers to them—and duelled them dead without winking; and yet, here they sat at a social breakfast table—all of the same calling, all of kindred tastes—looking round as sheepishly at each other as though they had never been out of sight of some sheepfold among the Green Mountains." The Chinese translation is: "是的,这里坐着的是一群老水手,其中有很多人,在怒海中会毫不畏怯地登到巨鲸的背上——那可是他们一无所知的东西啊——眼都不眨地把鲸鱼斗死;然而,这时他们一起坐在公共的早餐桌上——同样的职业,同样的癖好——他们却互相羞怯地打量着对方,仿佛是绿山山从未出过羊圈的绵羊"

Or word-for-word: "Yes, here sitting [people] are the group of old sailors, among them there are many people, [who] in the middle of the raging sea can/will without fear on the whale's back climb. That whales were something they knew nothing about".

The subordinate clauses become almost stand-alone statements, and it's up to the reader to connect them.


> Some English translations of Russian literature can run into the absurd (sentences at half a page long), but even then there is a beauty to it.

C. K. Scott Moncrieff and Terence Kilmartin’s translation of Marcel Proust’s «In Search of Lost Time (Remembrance of Things Past)» contains nearly half-page long sentences.

Many modern readers complain about the substantial difficulty in following such sentences, although I personally find them delightful.


Meet TIMI – the Technology Independent Machine Interface of IBM's i Series (née AS/400) – a 1980s design that defines pointers as 128-bit values[0].

It has allowed the AS/400 to have a single-level store, which means that «memory» and «disk» live in one conceptual address space.

A pointer can carry more than just an address – object identity, type, authority metadata. The AS/400 uses tagged 16-byte pointers to stop arbitrary pointer fabrication, which supports isolation without relying on the per-process address-space model the way UNIX does.

Such a «fat pointer» approach is conceptually close to modern capability systems (for example, CHERI's 128-bit capabilities), which exist for similar [safety] reasons.

[0] 128-bit pointers in the machine interface, not a 128-bit hardware virtual address space though.
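
For illustration only, here is a minimal sketch of the kind of metadata such a fat pointer can carry. The layout and field names are hypothetical – this is not the actual MI or CHERI encoding – but it shows why 128 bits is a natural size once identity and authority travel with the address:

    #include <stdint.h>

    /* Hypothetical 16-byte fat pointer: the fields and their widths are
       illustrative only. The hardware tag bit lives outside the pointer
       itself and is cleared by any ordinary (non-pointer) store, so a
       value forged with plain integer writes is never a valid pointer. */
    typedef struct {
        uint64_t address;      /* where the object lives                  */
        uint32_t object_id;    /* which object this pointer refers to     */
        uint16_t type;         /* space pointer, system pointer, ...      */
        uint16_t authority;    /* what the holder may do with the object  */
    } fat_pointer_t;

    _Static_assert(sizeof(fat_pointer_t) == 16, "fat pointers are 128-bit");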


is this still used in IBM hardware?

It is, although TIMI does not exist in the hardware – it is a virtual architecture that has been implemented multiple times on different hardware (i.e. CPUs – IMPI, IBM RS64, POWER, and only the heavens know which CPU IBM uses today).

The software written for this virtual architecture, on the other hand, has not changed and continues to run on modern IBM iSeries systems, even when it originates from 1989 – this is accomplished through static binary translation, or AOT in modern parlance, which recompiles the virtual ISA into the target ISA at startup.


> But we don't have a linear address space, unless you're working with a tiny MCU.

We actually do, albeit for a brief period of time – upon a cold start of the system, when the MMU is not yet active, no address translation is performed and the entire memory space is treated as a single linear, contiguous block (even if there are physical holes in it).

When a system is powered on, the CPU runs in privileged mode so that an operating system kernel can set up the MMU and activate it, which happens early in the boot sequence. Until then, virtual memory is not available.
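
For a concrete flavour of that hand-over, a minimal sketch assuming RV64 with Sv39 paging and hypothetical csr_write_satp() / sfence_vma_all() helpers supplied by the platform code:

    #include <stdint.h>

    #define SATP_MODE_SV39 (8ULL << 60)   /* Sv39: 39-bit virtual addresses */

    extern uint64_t root_page_table[512] __attribute__((aligned(4096)));
    extern void csr_write_satp(uint64_t value);  /* hypothetical CSR helper */
    extern void sfence_vma_all(void);            /* hypothetical TLB flush  */

    /* Until this runs, every address the CPU issues is a physical address,
       so the array's address below is already a physical address.         */
    void enable_translation(void)
    {
        uint64_t root_ppn = (uint64_t)(uintptr_t)&root_page_table >> 12;
        csr_write_satp(SATP_MODE_SV39 | root_ppn);
        sfence_vma_all();
        /* From here on, every fetch, load and store goes through the MMU. */
    }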


Those holes can be arbitrarily large, though, especially in weirder environments (e.g., memory-mapped optane and similar). Linear address space implies some degree of contiguity, I think.

Indeed. It can get ever weirder in the embedded world where a ROM, an E(E)PROM or a device may get mapped into an arbitrary slice of physical address space, anywhere within its bounds. It has become less common, though.

But mapping devices at the top of the physical address space remains a rather widespread practice.


And it's not uncommon for devices to be mapped multiple times in the address space! The different aliases provide slightly different ways of accessing it.

For example, 0x000-0x0ff providing linear access to memory bank A, 0x100-0x1ff linear access to bank B, but 0x200-0x3ff providing striped access across the two banks, with even-addressed words coming from bank A and odd ones from bank B.

Similarly, 0x000-0x0ff accessing memory through a cache, but 0x100-0x1ff accessing the same memory directly. Or 0x000-0x0ff overwriting data, 0x100-0x1ff setting bits (OR with current content), and 0x200-0x2ff clearing bits.
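
A minimal sketch of what such aliasing looks like from the software side, with made-up base addresses and a hypothetical peripheral (the write-to-set / write-to-clear alias pattern is common in real GPIO and interrupt controllers):

    #include <stdint.h>

    /* Hypothetical aliases of the same 32-bit data register. The addresses
       are illustrative only; real parts document them in the datasheet.   */
    #define REG_DATA   (*(volatile uint32_t *)0x40000000u) /* read/overwrite */
    #define REG_SET    (*(volatile uint32_t *)0x40000100u) /* OR into data   */
    #define REG_CLEAR  (*(volatile uint32_t *)0x40000200u) /* AND-NOT data   */

    void example(void)
    {
        REG_DATA  = 0x000000FFu;  /* plain overwrite                        */
        REG_SET   = 0x00000100u;  /* sets bit 8 without a read-modify-write */
        REG_CLEAR = 0x00000001u;  /* clears bit 0, again without a RMW      */
    }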


> Because the attempts at segmented or object-oriented address spaces failed miserably.

> That is false. In the Intel World, we first had the iAPX 432, which was an object-capability design. To say it failed miserably is overselling its success by a good margin.

I would further posit that segmented and object-oriented address spaces have failed and will continue to fail for as long as we have a separation into two distinct classes of storage: ephemeral (DRAM) and persistent storage / backing store (disks, flash storage, etc.) as opposed to having a single, unified concept of nearly infinite (at least logically if not physically), always-on just memory where everything is – essentially – an object.

Intel's Optane has given us a brief glimpse into what such a future could look like but, alas, that particular version of the future has not panned out.

Linear address space makes perfect sense for size-constrained DRAM, and makes little to no sense for the backing store where a file system is instead entrusted with implementing an object-like address space (files, directories are the objects, and the file system is the address space).

Once a new, successful memory technology emerges, we might see a resurgence of the segmented or object-oriented address space models, but until then, it will remain a pipe dream.


I don't see how any amount of memory technology can overcome the physical realities of locality. The closer you want the data to be to your processor, the less space you'll have to fit it. So there will always be a hierarchy where a smaller amount of data can have less latency, and there will always be an advantage to cramming as much data as you can at the top of the hierarchy.

while that's true, CPUs already have automatically managed caches. it's not too much of a stretch to imagine a world in which RAM is automatically managed as well and you don't have a distinction between RAM and persistent storage. in a spinning rust world, that never would have been possible, but with modern nvme, it's plausible.

CPUs manage it, but ensuring your data structures are friendly to how they manage caches is one of the keys to fast programs – which some of us care about.

«Memory technology» as in «a single tech» that blends RAM and disk into just «memory» and obviates the need for the disk as a distinct concept.

One can conjure up RAM that has become exabytes large and does not lose data after a system shutdown. In such a unified memory model everything is local, promptly available to, and directly addressable by the CPU.

Please do note that multi-level CPU caches still have their place in this scenario.

In fact, this has been successfully done in the AS/400 (or i Series), which I have mentioned elsewhere in the thread. It works well and is highly performant.


> «Memory technology» as in «a single tech» that blends RAM and disk into just «memory» and obviates the need for the disk as a distinct concept.

That already exists. Swap memory, mmap, disk paging, and so on.

Virtual memory is mostly fine for what it is, and it has been used in practice for decades. The problem that comes up is latency. Access time is limited by the speed of light [1]. And for that reason, CPU manufacturers continue to increase the capacities of the faster, closer memories (specifically registers and L1 cache).

[1] https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
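
For what it's worth, the mmap route can be sketched in a few lines – a file on disk shows up as ordinary memory and the kernel pages it in and out behind your back. The file name is made up and error handling is omitted; the file is assumed to exist and be at least a page long:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("state.bin", O_RDWR);          /* hypothetical file  */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);

        strcpy(p, "looks like RAM, lives on disk");  /* plain memory write */
        msync(p, 4096, MS_SYNC);                     /* force it durable   */

        munmap(p, 4096);
        close(fd);
        return 0;
    }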


I shudder to think about the impact of concurrent data structures fsync'ing on every write because the programmer can't reason about whether the data is in memory where a handful of atomic fences/barriers are enough to reason about the correctness of the operations, or on disk where those operations simply do not exist.

Also linear regions make a ton of sense for disk, and not just for performance. WAL-based systems are the cornerstone of many databases and require the ability to reserve linear regions.
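
As a rough sketch of why a linear region matters there – a WAL append is just «write the record at the current end of the region, then make it durable before acknowledging». File name and record framing are made up, and real WALs add headers, checksums and group commit:

    #include <fcntl.h>
    #include <stddef.h>
    #include <unistd.h>

    int wal_open(const char *path)
    {
        return open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    }

    /* Append one record and only return once it is durable on the device. */
    int wal_append(int log_fd, const void *record, size_t len)
    {
        if (write(log_fd, record, len) != (ssize_t)len)
            return -1;
        return fsync(log_fd);   /* durability barrier, not just visibility */
    }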


Linear regions are mostly a figment of the imagination in real life, but they are a convenient abstraction.

Linear regions are nearly impossible to guarantee, unless the underlying hardware has specific, controller-level provisions.

  1) For RAM, the MMU will obscure the physical address of a memory page, which can come from a completely separate memory bank. It is up to the VMM implementation and its heuristics to ensure contiguous allocation, to coalesce unrelated free pages into a new, large allocation, or to map in a free page from a «distant» location.

  2) Disks (the spinning rust variety) are not that different. A freed block can be provided from the start of the disk. However, a sophisticated file system like XFS or ZFS, and others like them, will do its best to allocate contiguous blocks.

  3) Flash storage (SSDs, NVMe) simply «lies» about the physical blocks, and does so for a few reasons (garbage collection and the transparent reallocation of ailing blocks, to name a few). If I understand it correctly, the physical «block» numbers are known only to the drive's flash translation layer and are hidden from the host entirely.
The only practical way I can think of to guarantee a contiguous allocation of blocks unfortunately involves a conventional hard drive with a dedicated partition created just for the WAL. In fact, this is how Oracle installations used to work – they required a dedicated raw device to bypass both the VMM and the file system.
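
A minimal sketch of that raw-device approach on Linux, assuming a hypothetical dedicated partition at /dev/sdb1 and using O_DIRECT to bypass the page cache (alignment requirements are device-specific – 4096 is a common safe value – and error handling is omitted):

    #define _GNU_SOURCE          /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical dedicated WAL partition – no file system on it.  */
        int fd = open("/dev/sdb1", O_RDWR | O_DIRECT | O_SYNC);

        /* O_DIRECT requires buffers aligned to the logical block size.  */
        void *buf;
        posix_memalign(&buf, 4096, 4096);
        memset(buf, 0, 4096);

        pwrite(fd, buf, 4096, 0);   /* lands on the device, not in cache */

        free(buf);
        close(fd);
        return 0;
    }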

When RAM and disk(s) are logically the same concept, a WAL can be treated as an object of the «WAL» type, with properties specific to that object type to support WAL peculiarities.


Ultimately everything is an abstraction. The point I'm making is that linear regions are a useful abstraction for both disk and memory, but that's not enough to unify them. Particularly in that memory cares about the visibility of writes to other processes/threads, whereas disk cares about the durability of those writes. This is an important distinction that programmers need to differentiate between for correctness.

Perhaps a WAL was a bad example. Ultimately you need the ability to atomically reserve a region of a certain capacity and then commit it durably (or roll back). Perhaps there are other abstractions that can do this, but with linear memory and disk regions it's exceedingly easy.

Personally I think file I/O should have an atomic CAS operation on a fixed maximum number of bytes (just like shared memory between threads and processes) but afaik there is no standard way to do that.
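
For the shared-memory half of that comparison, a process-shared CAS is standard C11 once the value lives in a MAP_SHARED mapping – a minimal sketch with error handling omitted (no equivalent exists for plain file offsets, which is the gap being pointed out):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* One 64-bit slot shared between processes that map the same region. */
    typedef struct { _Atomic uint64_t value; } shared_slot;

    shared_slot *make_shared_slot(void)
    {
        /* An anonymous MAP_SHARED mapping is inherited across fork(); a
           file-backed MAP_SHARED mapping works between unrelated
           processes too.                                                 */
        return mmap(NULL, sizeof(shared_slot), PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    }

    int try_claim(shared_slot *slot, uint64_t expected, uint64_t desired)
    {
        /* Succeeds for exactly one caller racing on the same expected value. */
        return atomic_compare_exchange_strong(&slot->value, &expected, desired);
    }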


otoh, WAL systems are only necessary because storage devices present an interface of linear regions. the WAL system could move into the hardware.

> I don't think that anyone actually believes that writing code is only for junior developers.

That is, unquestionably, how it ought to be. However, the mainstream – regrettably – has devolved into a well-worn and intellectually stagnant trajectory, wherein senior developers are not merely encouraged but expected to abandon coding altogether, ascending instead into roles such as engineering managers (no offence – good engineering managers are important; it is the quality that has been diluted across the board), platform overseers (a new term for stage-gate keepers), or so-called solution architects (the ones steeped in compliance and governance who do not venture past that).

In this model, none of these roles is expected – and in some lamentable cases, they are explicitly forbidden[0] – to engage directly with code. The result is a sterile detachment from the very systems they are charged with overseeing.

Worse still, the industry actively incentivises ill-considered career leaps – for instance, elevating a developer with limited engineering depth into the position of a solution designer or architect. The outcome is as predictable as it is corrosive: individuals who can neither design nor architect.

The number of organisations in which expert-level coding proficiency remains the norm at senior or very senior levels has dwindled substantially over the past couple of decades – job ads explicitly call out management experience and knowledge of architectural frameworks of vacuous or limited usefulness (TOGAF and the like). There do remain rare islands in an ever-expanding ocean of managerial abstraction where architects who write code – not incessantly, but when need be – are still recognised as invaluable. Yet their presence is scarce.

The lamentable state of affairs has led to a piquant situation on the job market. In recent years, headhunters have started complaining about being unable to find an actually highly proficient, experienced, and, most importantly, technical architect. One's loss is another one's gain, or at least an opportunity, of course.

[0] Speaking from firsthand experience of watching a solution architect quit their job to run a bakery (yes) after the head of architecture they reported to explicitly demanded that the architect quit coding. The architect did quit, albeit in a different way.


> One reason is that using static binaries greatly simplifies the problem of establishing Binary Provenance, upon which security claims and many other important things rely.

It depends.

If it is a vulnerability stemming from libc, then every single binary has to be re-linked and redeployed, which can lead to a situation where something is accidentally left out due to an unaccounted-for artefact.

One solution could be bundling the binary, or multiple related binaries, with the operating system image, but that would incur a multidimensional overhead that would be unacceptable for most people – and then we would be talking about «an application binary statically linked into the operating system», so to speak.


> If it is a vulnerability stemming from libc, then every single binary has to be re-linked and redeployed, which can lead to a situation where something is accidentally left out due to an unaccounted-for artefact.

The whole point of Binary Provenance is that there are no unaccounted-for artifacts: Every build should produce binary provenance describing exactly how a given binary artifact was built: the inputs, the transformation, and the entity that performed the build. So, to use your example, you'll always know which artefacts were linked against that bad version of libc.

See https://google.github.io/building-secure-and-reliable-system...


I am well aware of and understand that.

However,

> […] which artefacts were linked against that bad version of libc.

There is one libc for the entire system (a physical server, a virtual one, etc.), including the application(s) that have been deployed into the operating environment.

In the case of the entire operating environment (the OS + applications) being statically linked against a libc, the entire operating environment has to be re-linked and redeployed as a single concerted effort.

In dynamically linked operating environments, only the libc needs to be updated.

The former is a substantially more laborious and inherently riskier effort unless the organisation has reached a scale where such deployment artefacts are fully disposable and the deployment process is fully automated. Not many organisations practically operate at that level of maturity and scale, FAANG and the like being the notable exception. It is often cited as an aspiration, yet the road to that level of maturity is winding and, in real life, fraught with shortcuts that result in binary provenance being ignored or rendered irrelevant. The expected aftermath is, of course, a security incident.


What is the point you're trying to make?

I claimed that Binary Provenance was important to organizations such as Google where it is important to know exactly what has gone into the artefacts that have been deployed into production. You then replied "it depends" but, when pressed, defended your claim by saying, in effect, that binary provenance doesn't work in organizations that have immature engineering practices where they don't actually follow the practice of enforcing Binary Provenance.

But I feel like we already knew that practices don't work unless organizations actually follow them.

So what was your point?


My point is that static linking, by itself, does not meaningfully improve binary provenance and is mostly expensive security theatre from a provenance standpoint, because a statically linked binary is more opaque from a component-attribution perspective – unless an inseparable SBOM (cryptographically tied to the binary) plus signed build attestations are present.

Static linking actually destroys the boundaries that a provenance consumer would normally want: global code optimisation, (sometimes heavy) inlining, LTO, dead-code elimination and the like erase the dependency identities and render them irrecoverable in a trustworthy way from the binary. It is harder to reason about and audit a single opaque blob than a set of separately versioned shared libraries.

Static linking, however, is very good at avoiding «shared/dynamic library dependency hell» which is a reliability and operability win. From a binary provenance standpoint, it is largely orthogonal.

Static linking can improve one narrow provenance-adjacent property: fewer moving parts at deploy and run time.

The «it depends» part of the comment concerned the FAANG-scale level of infrastructure and operational maturity at which an organisation can reliably enforce hermetic builds and dependency pinning across teams, produce and retain attestations and SBOMs bound to release artefacts, rebuild the world quickly on demand, and roll out safely with strong observability and rollback. Many organisations choose dynamic linking plus image sealing because it gives them similar provenance and incident-response properties with less rebuild pressure, at a substantially smaller cost.

So static linking mainly changes operational risk and deployment ergonomics, not the evidentiary quality of where the code came from and how it was produced; dynamic linking, on the other hand, may yield better provenance properties when the shared libraries themselves have strong identity and distribution provenance.

NB Please do note that the diatribe is not directed at you in any way – it is an off-hand remark about people who ascribe purported benefits to static linking because «Google does it», without taking into account the overall context, maturity and scale of the operating environment Google et al. operate in.


> […] 2GB maximum offsets in the .text section

… on the x86 ISA, because it encodes the 32-bit jump/call offset directly in the instruction.

Whilst most RISC architectures do allow PC-relative branches, the offset is relatively small, as 32-bit opcodes do not have enough room to squeeze a large offset in.

«Long» jumps and calls are indirect branches / calls done via registers where the entirety of 64 bits is available (address alignment rules apply in RISC architectures). The target address has to be loaded / calculated beforehand, though. Available in RISC and x86 64-bit architectures.
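
A small C sketch of the difference – a direct call reaches only a limited PC-relative window, while a call through a pointer lets the compiler materialise the full 64-bit target in a register first (typically movabs + an indirect call on x86-64, or an address-materialising sequence followed by blr on AArch64):

    #include <stdio.h>

    static void far_away(void) { puts("reached via an indirect call"); }

    int main(void)
    {
        /* Direct call: the linker fixes up a PC-relative displacement,
           which on x86-64 is a signed 32-bit field (hence the 2 GB limit). */
        far_away();

        /* Indirect call: the full 64-bit address is loaded into a register
           first, so the target can live anywhere in the address space.     */
        void (*volatile fp)(void) = far_away;
        fp();
        return 0;
    }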


They are after personal-use VPN clients for now, but corporate users will follow soon.

Using the corporate VPN for personal purposes, including social media, is generally against corporate policy and is frowned upon (at least officially) in most businesses and organisations. It is also fraught with complications and could lead to disciplinary action or other unpleasant consequences. Just because the policy is not enforced does not mean it won’t be in the future.

If governments start targeting personal VPNs, it is only a matter of time before businesses crack down on unauthorised corporate VPN use, as it will increase their risk of legal action stemming from employees' missteps or misdeeds.


> A subset of an ISA will be incompatible with the full ISA and therefore be a new ISA. No existing software will run on it. So this won't really help anyone.

This isn't an issue in any way. Vendors have routinely been removing rarely used instructions from the hardware and simulating them in software for decades, as part of ongoing ISA revisions.

Unimplemented instruction opcodes cause a CPU trap, whereupon the missing instruction(s) are emulated in the kernel's emulation layer.

In fact, this is what was frequently done for «budget» 80[34]86 systems that lacked the FPU – it was emulated. It was slow as a dog, but it worked.
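
A user-space analogue of that trap-and-emulate path, as a minimal sketch assuming Linux/x86-64 and a hypothetical emulate_one() that decodes the faulting instruction, applies its effect to the register file and returns its byte length (the kernel's own path works the same way, just inside the trap handler):

    #define _GNU_SOURCE           /* for REG_RIP in <ucontext.h> */
    #include <signal.h>
    #include <stddef.h>
    #include <ucontext.h>

    /* Hypothetical: decode + emulate one instruction, return its length. */
    extern size_t emulate_one(unsigned char *insn, ucontext_t *uc);

    static void on_sigill(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)si;
        ucontext_t *uc = ctx;
        unsigned char *pc = (unsigned char *)uc->uc_mcontext.gregs[REG_RIP];

        /* Emulate the unimplemented instruction, then skip over it so the
           program resumes as if the hardware had executed it.             */
        uc->uc_mcontext.gregs[REG_RIP] += (long long)emulate_one(pc, uc);
    }

    void install_trap_emulation(void)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = on_sigill;
        sa.sa_flags     = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGILL, &sa, NULL);
    }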

