Something I had pounded into me when I drove too slowly and cautiously during my first driving test, and failed.
Those Waymos weren't moving, which is a pretty egregious example of obstructing traffic.
An old rule of thumb is that every time a service expands by an order of magnitude there are new problems to solve. I suspect and hope this is just Waymo hitting one of those points, and that they will find a way to handle this more gracefully in the future.
Using OpenWrt, which a lot of home router firmware is built on, all I have to do is tell my router which offset to give my subnets within the prefix and it does the rest.
Both for subdividing the prefix from the ISP and for the ULA prefix I use for internal devices.
I have changed ISPs three times, I think, with no ill effects. It also copes when my ISP occasionally gives me a new prefix.
The only tweaking I had to do was when I went from an ISP that gave me a /48 to one that only gave me a /56. I had been greedy and was handing a /56 to my internal router. I changed that to a /60,
and the internal router updated its expectations about which subnets it could hand out, and all was good.
But I expect two layers of home routers without NAT is a bit of an exception.
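For anyone curious, this is roughly what that looks like in OpenWrt's /etc/config/network, using the ip6assign and ip6hint options; the interface names and hint values here are made up:

    # Take a /64 out of the delegated prefix for the LAN, at sub-prefix 1.
    config interface 'lan'
            option device 'br-lan'
            option proto 'static'
            option ip6assign '64'
            option ip6hint '1'

    # Guest network gets its own /64, at sub-prefix 2.
    config interface 'guest'
            option proto 'static'
            option ip6assign '64'
            option ip6hint '2'

ip6assign sets how large a chunk of the delegated prefix the interface gets, and ip6hint picks which chunk, so the same layout survives an ISP change or a new prefix.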
There is an aspect of the history of Forth and C I have been trying to wrap my head around.
The early B compiler was reported to generate threaded code (like Forth). The threaded code was abandoned fairly early in the port from the PDP-7 to the PDP-11, as it was deemed too slow to write an operating system in.
At which point Unix and C lost a very interesting size optimization, with the net result that Forth was more portable, and actually being ported between machines, circa 1970, while Unix had to wait until circa 1975 with Unix V6.
I have been trying to go back through the history to see if I could understand why threaded code was deemed too slow. Today the bulk of an executable is code that is not run often and would probably benefit from a smaller representation (less memory and cache space, making the system faster overall). So this is a practical question even today.
I found a copy of Unix for the PDP-7, reproduced from old printouts that were typed back in. If I have read the assembly correctly, the B compiler was not using an efficient form of threaded code at all.
The PDP-7 is an interesting machine. Its cells were 18 bits wide and the address bus was 12 bits wide, which meant there was room for an opcode and a full address in every cell.
As I read the B compiler, it was using a form of token threading with everything packed into a single 18-bit cell. The basic operations of B were tokens, and a token such as if was encoded with a full address in the same cell. Every token had to be decoded via a jump table; the address of the target code was then plugged into a jump instruction, which was immediately executed.
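For anyone who hasn't seen it, here is a rough sketch in C of token-threaded dispatch in general. This is not the B interpreter; the opcodes, names and toy program are made up. It just shows the per-token decode through a jump table that the scheme pays for:

    #include <stdio.h>

    enum { OP_LIT, OP_ADD, OP_PRINT, OP_HALT };

    /* Toy program: push 2, push 3, add, print, halt. */
    static int code[]  = { OP_LIT, 2, OP_LIT, 3, OP_ADD, OP_PRINT, OP_HALT };
    static int stack[16];
    static int *ip = code;   /* instruction pointer into the token stream */
    static int *sp = stack;  /* operand stack pointer, grows upward       */

    static void op_lit(void)   { *sp++ = *ip++; }           /* inline literal */
    static void op_add(void)   { sp--; sp[-1] += sp[0]; }
    static void op_print(void) { printf("%d\n", sp[-1]); }
    static void op_halt(void)  { }                          /* handled below  */

    /* The "jump table": the token value indexes the handler to run. */
    static void (*table[])(void) = { op_lit, op_add, op_print, op_halt };

    int main(void)
    {
        for (;;) {
            int token = *ip++;      /* fetch and decode the next token...   */
            if (token == OP_HALT)
                return 0;
            table[token]();         /* ...then an indirect jump to its code */
        }
    }

On the PDP-7 the decoded address was apparently patched into a jump instruction rather than called indirectly, but the per-token decode cost is the same idea.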
Given the width of the cells, I wonder what the conclusions about the performance of B would have been if subroutine threading, or a similar technique using jmp instructions, had been used instead.
Does anyone know if Forth suffers measurably in inner loops from having to call words that perform basic operations?
Is this where a Forth programmer would be accustomed to writing the inner loop in assembly to avoid the performance penalty?
I can probably shed some light on that. I've used Forth on 8 bit platforms (6502, 6809), 16 bit platforms (80286) and 32 bit platforms (68K), as well as assembly, and on the 16 and 32 bit platforms C. Putting these side-by-side, and assuming roughly equivalent programmer competence levels at the time, assembler would win out; C would get you to maybe half the speed of assembly on a good day, and Forth was about 10x slower than that.
Which was still incredibly fast for the day, given that Forth was compiled to an intermediate format with the Forth interpreter acting as a very primitive virtual machine. This interpretation step had considerable overhead; especially in inner loops with few instructions the overhead would be massive. For every instruction doing actual work you'd have a whole slew of them devoted to bookkeeping and stack management. What in C would compile to a few machine instructions (which a competent assembly programmer of the time would be able to significantly improve upon) would result in endless calls to lower and lower levels.
There were later Forth implementations that improved on this by compiling to native code but I never had access to those when I was still doing this.
For a lark I wrote a Forth in C rather than bootstrapping it through assembly, and it performed quite well. Forth is ridiculously easy to bring up; it is essentially a few afternoons' work to go from zero to highway speeds on a brand new board that you have a C compiler for. Which is one of the reasons it is still a favorite for initial board bring-up.
One area where Forth usually beat C by a comfortable margin was code size; Forth code tends to be extremely compact (and devoid of any luxury). Even on the smallest microcontrollers (the 8051 for instance, and later Microchip parts and such) you could get real work done in Forth.
I looked, and today SwiftForth from FORTH, Inc. has some version of a compiler that generates not-horrible code, so I expect the 10x slowdown compared to C is a thing of the past.
It is not so much the parts of the code that run infrequently that determine performance, but the very tiny <1% of all code that does run frequently, and that hot code should be running completely from cache anyway. So overall code size doesn't have an enormous impact on speed of execution.
The overhead of threading seems pretty obvious: call and return instructions are expensive compared to the cost of the one equivalent instruction that would have been executed in a compiled implementation. And placing arguments on a stack means that all operands have to be read from and written to memory, incurring additional ferocious overhead, whereas a compiler would enregister values, particularly in performance-critical code. It's not unreasonable to expect that Forth code is going to run at least an order of magnitude slower than compiled code.
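To make that concrete, a small sketch in C (made-up names, not from any particular Forth) of what a "+" primitive costs when the data stack lives in memory, compared with the single add a compiler emits once the operands are enregistered:

    #include <stdio.h>

    static long stack[16];
    static long *sp = stack;   /* data stack pointer, grows upward */

    /* Threaded-code "+": two memory reads, one memory write,
       and in a real system it is followed by another dispatch (NEXT). */
    static void prim_plus(void)
    {
        long b = *--sp;
        long a = *--sp;
        *sp++ = a + b;
    }

    int main(void)
    {
        *sp++ = 2;                 /* push operands through memory */
        *sp++ = 3;
        prim_plus();
        printf("%ld\n", sp[-1]);   /* 5 */

        /* Compiled code with enregistered values: the whole thing is
           one machine add, no stack traffic, no dispatch. */
        long a = 2, b = 3;
        printf("%ld\n", a + b);
        return 0;
    }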
Some of those could be partially fixed in hardware. Examples:
- returns can execute in effectively zero time
- when calling a word, the CPU could prefetch the cache line containing the next word to be called, on the assumption that it would be called soon (an assumption that would be correct for many “almost leaf” calls)
- the top of the stack could be kept in register-speed memory.
“The internal structure of the NC4016 is designed for single clock cycle instruction execution. All primitive operations except memory fetch, memory store, and long literal fetch execute in a single clock cycle. This requires many more on-chip interconnection paths than are present on the Canonical Stack Machine, but provides much better performance.
[…]
The NC4016 subroutine return bit allows combining a subroutine return with other instructions in a similar manner. This results in most subroutine exit instructions executing "for free" in combination with other instructions. An optimization that is performed by NC4016 compilers is tail-end recursion elimination. Tail-end recursion elimination involves replacing a subroutine call/subroutine exit instruction pair by an unconditional branch to the subroutine that would have been called.”
(¿Almost?) all modern hardware is designed for other language paradigms, though, so these won’t help with hardware you can buy.
> Does anyone know if Forth suffers measurably in inner loops from having to call words that perform basic operations?
Yes, the slowdown is of the order of 8× for DTC, ITC, and bytecode ("token threading"). Eliminating the jump table reduces the overhead a bit, but it's still order 8×.
The B compiler bundled a copy of the bytecode interpreter into each executable; that might have made it less appealing as a size optimization. For a big enough program it would still have won.
Subroutine threading is really just compact native code, but it still suffers from typically about 4× overhead for basic operations like dup, @, +, or exit (the traditional name for the runtime effect of ;). The primitive operations these execute are typically one or two cycles on a RISC such as a Cortex-M4, while a subroutine call and return are two more cycles, often plus two to four cycles of pipeline bubble (if the processor doesn't have good enough branch prediction). Presumably on the PDP-7 a subroutine call would have needed an additional memory cycle to store the return address into memory and another one to fetch it, plus two more memory cycles to fetch the call and return instructions. (I'm not familiar with the -7's instruction set, so correct me if I'm wrong.)
With respect to dup, though, commonly dup, drop, swap, and over represent operations that don't appear in optimized native code at all: they just tell the following operations which data to operate on, a purpose which is normally achieved by operand fields in native code. So the runtime overhead of stack-bytecode interpretation is worse than it appears at first: each bytecode instruction takes on the order of 4× or 8× the time of a native instruction doing the same thing, but you have to run about twice as many bytecode instructions because about half of them are stack manipulation. So your total slowdown is maybe 8× or 16×.
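A made-up word to illustrate, the sum of squares of the top two stack items:

    \ Of the six operations in the body, dup swap dup are pure stack
    \ shuffling: work that operand fields express for free in native code.
    : SUMSQ  ( a b -- a*a+b*b )  dup * swap dup * + ;
    2 3 SUMSQ .   \ prints 13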
You may also be interested in looking at the program dc, which IIRC was one of the programs Unix was originally written to run. It's a stack bytecode designed to be written by hand, like HP desk calculators of the time but with arbitrary precision.
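For a taste of it, in ordinary dc syntax (p prints the top of the stack, and the arbitrary precision shows in the second example):

    $ echo '2 3 + p' | dc
    5
    $ echo '2 100 ^ p' | dc
    1267650600228229401496703205376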
CLU implemented parameterized abstract data types, what we commonly call generics today.
The Liskov substitution principle in that context pretty much falls out naturally, as the entire point is to substitute types into your generic data structure.
Phi-functions and block arguments are just different views of the same thing. Sometimes it is more convenient to use one or the other when thinking of a problem.
If you lay out phi-functions and their parameters on a grid, you'd get a "phi-matrix" where the phi-functions are the rows and the block arguments are the columns.
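Schematically, in pseudo-IR (not any particular compiler's syntax):

    ;; phi form
    bb1: ...  jump bb3
    bb2: ...  jump bb3
    bb3: v = phi [a from bb1], [b from bb2]
         ...

    ;; block-argument form
    bb1: ...  jump bb3(a)
    bb2: ...  jump bb3(b)
    bb3(v):
         ...

Same information, just attached to the jump site instead of the join point.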
If you don't do an out-of-SSA transform before register allocation, and effectively treat block parameters like function calls, then you're pushing the complexity onto the register allocator.
An out-of-SSA transform before register allocation would coalesce not just registers but also variables in spill slots (thus avoiding memory-to-memory moves), and it would reduce the complexity of the parallel moves. A more advanced transform could also hoist moves out from before the hottest branch, which could potentially lead to un-splitting previously split critical edges.
I prefer the Oort cloud colonization plan. Oort clouds stretch out so far that neighboring stars' clouds practically touch.
Once the problem of living on a comet has been solved, you get something like the Polynesian islanders: people would move from one comet to the next, one generation at a time, for more living space and more resources.
I wonder what people in a generation ship would do when they arrive at a star and discover the comets of the star system are already inhabited.
I am not up to speed on these new algorithms. I still remember that a few years ago there was a lightweight cryptography algorithm championed by the NSA that had a subtle (possibly deliberate) flaw in it.
When dealing with cryptography it is always necessary to remember that cryptography is developed and operates in an adversarial environment.
Speck? To my knowledge there aren't any serious flaws despite a lot of public cryptanalysis. I think what sank Speck was that it came out a few years after the Dual_EC_DRBG fiasco and nobody was ready to trust an NSA-developed cipher yet, which is fair enough. The NSA burned their credibility for decades with Dual_EC_DRBG.
I mean, yeah, but also Simon and Speck aren't as good as the new generation of low-footprint designs like Ascon and Xoodyak. We know more about how to do these things now than we did 15 years ago.
In what ways is it better? Security margin or something? I thought Speck has held up pretty well to cryptanalysis (unlike you I'm not in the security field so maybe I'm wrong).
I quite liked the remarkable simplicity of Speck. Performance was better than Ascon in my limited testing. It seems like it should be smaller on-die or in bytes of code, and with possibly lower power consumption. And round key generation was possible to compute on-the-fly (reusing the round code!) for truly tiny processors.
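To illustrate the simplicity claim, here is a from-memory sketch of the Speck128/128 structure in C (check it against the published spec before trusting it; the values in main are arbitrary demo inputs, not a test vector):

    #include <stdint.h>
    #include <stdio.h>

    #define ROR64(x, r) (((x) >> (r)) | ((x) << (64 - (r))))
    #define ROL64(x, r) (((x) << (r)) | ((x) >> (64 - (r))))

    /* One Speck round with 64-bit words: rotate, add, xor key, rotate, xor. */
    static void speck_round(uint64_t *x, uint64_t *y, uint64_t k)
    {
        *x = (ROR64(*x, 8) + *y) ^ k;
        *y = ROL64(*y, 3) ^ *x;
    }

    /* Speck128/128: 32 rounds; the key schedule reuses the round function
       with the round index standing in for the key, so round keys can be
       derived on the fly. */
    static void speck128_encrypt(uint64_t ct[2], const uint64_t pt[2],
                                 const uint64_t key[2])
    {
        uint64_t y = pt[0], x = pt[1];
        uint64_t a = key[0], b = key[1];   /* a is the current round key */

        for (uint64_t i = 0; i < 32; i++) {
            speck_round(&x, &y, a);        /* encrypt one round           */
            speck_round(&b, &a, i);        /* schedule the next round key */
        }
        ct[0] = y;
        ct[1] = x;
    }

    int main(void)
    {
        const uint64_t key[2] = { 0, 0 };  /* arbitrary demo key       */
        const uint64_t pt[2]  = { 0, 0 };  /* arbitrary demo plaintext */
        uint64_t ct[2];

        speck128_encrypt(ct, pt, key);
        printf("%016llx %016llx\n",
               (unsigned long long)ct[1], (unsigned long long)ct[0]);
        return 0;
    }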