The conclusions section of the paper is a good summary:
"In the process, we learned ten lessons about DSAs and
DNNs in general and about DNN DSAs specifically that
shaped the design of TPUv4i:
1. Logic improves more quickly than wires and SRAM
⇒ TPUv4i has 4 MXUs per core vs 2 for TPUv3 and 1 for
TPUv1/v2.
2. Leverage existing compiler optimizations
⇒ TPUv4i evolved from TPUv3 instead of being a brand
new ISA.
3. Design for perf/TCO instead of perf/CapEx
⇒ TDP is low, CMEM/HBM are fast, and the die is not big.
4. Backwards ML compatibility enables rapid deployment
of trained DNNs
⇒ TPUv4i supports bf16 and avoids arithmetic problems by
looking like TPUv3 from the XLA compiler’s perspective.
5. Inference DSAs need air cooling for global scale
⇒ Its design and 1.0 GHz clock lowers its TDP to 175W.
6. Some inference apps need floating point arithmetic
⇒ It supports bf16 and int8, so quantization is optional.
7. Production inference normally needs multi-tenancy
⇒ TPUv4i’s HBM capacity can support multiple tenants.
8. DNNs grow ~1.5x annually in memory and compute
⇒ To support DNN growth, TPUv4i has 4 MXUs, fast on- and off-chip memory, and ICI to link 4 adjacent TPUs.
9. DNN workloads evolve with DNN breakthroughs
⇒ Its programmability and software stack help pace DNNs.
10. The inference SLO is P99 latency, not batch size
⇒ Backwards ML compatible training tailors DNNs to
TPUv4i, yielding batch sizes of 8–128 that raise throughput
and meet SLOs. Applications do not restrict batch size."
>>> 8. DNNs grow ~1.5x annually in memory and compute
Wow! That's a massive growth rate for ML. TPUv3 was already faster than the A100 in MLPerf. But this suggests a real breakthrough is needed to keep pace with future requirements. Each MXU already handles 16k ops per tick. And with the additional constraint of optimizing per watt rather than per dollar, it's quite the challenge ;)
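For scale, a quick back-of-the-envelope (assuming the 128x128 systolic MXU the TPUv3/v4i papers describe, with one multiply-accumulate per cell per cycle):

  #include <stdio.h>

  int main(void) {
    long macs_per_mxu  = 128L * 128;            /* 16,384 -- the "16k ops per tick" */
    long ops_per_cycle = 4 * macs_per_mxu * 2;  /* TPUv4i: 4 MXUs, MAC = 2 ops */
    double peak_tflops = ops_per_cycle * 1.0e9 / 1e12;  /* at the 1.0 GHz clock */
    printf("~%.0f peak TFLOPS\n", peak_tflops); /* ~131 TFLOPS bf16 */
    return 0;
  }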
> But this suggests a real breakthrough is needed to keep pace with future requirements.
Not necessarily. Look at papers like the lottery ticket hypothesis: big ML models may be doing better simply because gradient descent isn't doing a good enough job. Better optimizers would go further than just throwing compute at the problem. And even when you can throw compute at it, it's impractical to use something like GPT-3 all the time.
What is notable about their Itanium efforts is that they chose to use ELF and DWARF as their object file and debugging formats. I think this was actually quite important, as it made the x86 port far easier: LLVM has robust support for ELF & DWARF.
I think another thing which helped is that they wrote more of OpenVMS in C which avoided the need for an x86 PL/I compiler, etc.
I led a team to implement a CIFS server in VMS on Itanium, after we decided not to port ASV (Advanced Server, the CIFS server for Alpha), since it had a lot of hand-tuned assembly for a specific architecture.
We ported Samba and implemented a lot of C runtime functions to make the porting less intrusive. As part of porting Samba to VMS, I made an early port of the CVS client and streamlined merging upstream changes with our VMS-specific changes.
Some of the most infamous four-letter words we cussed all the time were the missing 'fork' and 'fcntl', and all the workarounds we had to put in to get Samba working...
I did meet quite a few very competent engineers from the DEC days as part of my work, and I also interacted with David Butenhof of POSIX threads fame while debugging a thread library that he and his team had developed, which was used by ASV.
I believe that Rice's theorem is about computability, not about whether or not it is possible to validate which CPU instructions a program can contain.
With certain restrictions, it is possible to do this: Google Native Client [1] has a verifier which checks that the programs it executes do not jump into the middle of other instructions, forbids run-time code generation inside such programs, etc.
(What other kinds of instructions? Genuinely asking.)
I don't think Rice's Theorem applies here. As a counterexample: On a hypothetical CPU where all instructions have fixed width (e.g. 32 bits), if accessing a register requires the instruction to have, say, the 10th bit set, and all other instructions don't, and if there is no way to generate new instructions (e.g. the CPU only allows execution from ROM), then it is trivial to check whether there is any instruction in ROM that has bit 10 set.
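To make that concrete, here is a minimal sketch of the checker for that made-up CPU (the 32-bit width and the bit-10 convention are of course just the hypothetical above):

  #include <stddef.h>
  #include <stdint.h>

  /* Fixed-width instructions, execution only from ROM, and the special
     register reachable only via instructions with bit 10 set: one
     linear scan over the ROM image decides the question. */
  int rom_touches_register(const uint32_t *rom, size_t n_instructions) {
    for (size_t i = 0; i < n_instructions; i++)
      if (rom[i] & (1u << 10))
        return 1;  /* an instruction that could touch the register exists */
    return 0;      /* provably, no such instruction can ever execute */
  }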
The next part I'm less sure how to state rigorously (I'm not in the field): in our hypothetical CPU, disallowing that instruction either leaves you Turing-complete or it doesn't. In the former case, you can still compute everything a Turing machine can.
You'd have to add one extra condition to your hypothetical CPU: that it can't execute unaligned instructions. Given that, then yes, that lets you bypass Rice's theorem, even though it is indeed still Turing-complete.
But the M1 does have a way to "generate new instructions" (i.e., JIT), so that counterexample doesn't hold for it.
Yes, indeed, I should have stated "cannot execute unaligned instructions". Or have said 8 bit instead, then it would be immediately obvious what I mean. (You cannot jump into the middle of a byte because you cannot even address it.)
But I wanted to show how Rice's Theorem does not generally apply here. You can make up other examples: a register that needs an instruction with a length of 1000 bytes, yet the ROM only has 512 bytes of space, etc...
As for JIT, also correct (hence my condition), though that's also a property of the OS and not just the M1 (on iOS, for example, there are far more restrictions on what code is allowed to JIT, as was stated in the thread already).
With the way Apple allows implementation of JIT on the M1 (with their custom MAP_JIT flag and pthread_jit_write_protect_np), it is actually possible to do this analysis even with JIT code. Since it enforces W^X (i.e. pages cannot be writable and executable at the same time), it gives the OS the opportunity to inspect the code synchronously before it is rendered executable. Rosetta 2's JIT support already relies on this kind of inspection to do translation of JIT apps.
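For reference, the usual shape of that dance looks something like this (a minimal sketch, assuming the process has the com.apple.security.cs.allow-jit entitlement; error handling omitted):

  #include <libkern/OSCacheControl.h>
  #include <pthread.h>
  #include <stdint.h>
  #include <sys/mman.h>

  int main(void) {
    /* MAP_JIT pages live under W^X: never writable and executable
       at the same time. */
    uint32_t *buf = mmap(NULL, 16384, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_JIT, -1, 0);

    pthread_jit_write_protect_np(0);    /* this thread: writable, not executable */
    buf[0] = 0xD65F03C0;                /* AArch64 RET */
    pthread_jit_write_protect_np(1);    /* flip back: executable, not writable */
    sys_icache_invalidate(buf, 16384);  /* discard stale instruction-cache lines */

    ((void (*)(void))buf)();            /* call the freshly generated code */
    return 0;
  }

The flip from writable back to executable is the natural point where the OS (or Rosetta 2) can inspect what was written.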
It does when running native ARM code (but not x86 code), but AFAIK nothing stops Apple from changing this to being kernel-mediated by updating libSystem in the ARM case as well. Of course, I doubt they would take the performance hit just to get rid of this issue.
1) the program does not contain an instruction that touches s3_5_c15_c10_1
2) the program contains an instruction that touches s3_5_c15_c10_1, but never executes that instruction
3) the program contains an instruction that touches s3_5_c15_c10_1, and uses it
Rice's theorem means we cannot tell whether a program will touch the register at runtime (as that's a dynamic property of the program). But that's because we cannot tell case 2 from case 3. It's perfectly decidable whether a program is in case 1 (as that's a static property of the program).
Any sound static analysis must have false positives -- but those are exactly the programs in case 2. It doesn't mean we end up blocking other kinds of instructions.
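For a fixed-width, aligned ISA like AArch64, the case-1 check really is just a scan over the image. A sketch (I reconstructed the MRS/MSR encodings from the architectural field layout for op0=3, op1=5, CRn=15, CRm=10, op2=1; treat the constants as an assumption and check them against the ARM ARM):

  #include <stddef.h>
  #include <stdint.h>

  /* Case 1 is decidable: does any aligned 32-bit word in the code image
     encode MRS or MSR on s3_5_c15_c10_1? Bit 21 selects MRS vs MSR and
     bits 4:0 are the Rt register, so both are masked out. */
  int contains_s3_5_c15_c10_1(const uint32_t *code, size_t n) {
    const uint32_t mask    = ~((1u << 21) | 0x1Fu);
    const uint32_t pattern = 0xD51DFA20u;  /* MSR form; MRS sets bit 21 */
    for (size_t i = 0; i < n; i++)
      if ((code[i] & mask) == pattern)
        return 1;  /* cases 2 or 3: the instruction is present */
    return 0;      /* case 1: the instruction is provably absent */
  }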
My reading of the C++ standard is that this behavior is effectively mandated and that one can write a program which can tell if an ABI observed the proposed optimization.
[expr.call]: "The lvalue-to-rvalue, array-to-pointer, and function-to-pointer standard conversions are performed on the argument expression."
[conv.lval]: "... if T has a class type, the conversion copy-initializes the result object from the glvalue."
The way a program can tell if a compiler is compliant with the standard is like so:
  struct S { int large[100]; };

  int compliant(struct S a, const struct S *b);
  int escape(const void *x);

  int bad() {
    struct S s;
    escape(&s);
    return compliant(s, &s);
  }

  int compliant(struct S a, const struct S *b) {
    int r = &a != b;
    escape(&a);
    escape(b);
    return r;
  }
There are three calls to 'escape'. A programmer may assume that the first and third calls to 'escape' observe a different object than the second call, and they may assume that 'compliant' returns '1'.
The compiler would be forced to create copies in that case. In general (using my proposed ABI), taking the address of an object will cause this, because it is possible to mutate an object through its address.
It's still a win because 1) you can avoid making copies in many places, and 2) code size decreases because the copy happens one time in the callee rather than many times for every caller.
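To illustrate the win (a hypothetical sketch of the proposed callee-copy convention, not any shipping ABI):

  struct S { int large[100]; };

  int consume(struct S a);  /* by-value in the source */

  int caller(void) {
    struct S s = {0};
    return consume(s);  /* proposed ABI: just pass &s -- the address of
                           's' never escaped before the call, so no
                           caller-side copy is needed */
  }

The callee then copies 'a' only if it mutates it or lets its address escape; a read-only callee simply reads the caller's object through the pointer.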
So the caller might have to copy if the pointer escapes, and the callee might have to copy if it needs to mutate the value. In practice, in many cases you might end up with more copies than with the "bad" ABI.
But similar things can already happen where copy elision is optional. It's one of the niches the C++ standard carves out regarding the as-if rule, and adding one for this new purpose is conceivable.
It is a project aimed at making the design of electronic logic easier.
Often, such hardware is written using hardware description languages [1] like Verilog or VHDL. These languages are very low level and, in the opinion of some, a little clumsy to use.
XLS aims to provide a system for High-level synthesis [2]. The benefit of such systems is that you can more easily map interesting algorithms to hardware without being super low level.
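As a flavor of what that buys you, this is the kind of plain code an HLS tool can turn into a pipelined datapath (illustrative C only, not XLS's actual input language; XLS is primarily driven by its own DSL, DSLX):

  #include <stdint.h>

  /* Fixed trip count and fixed-width integers: an HLS tool can fully
     unroll this into four multipliers feeding an adder tree, then
     pipeline it, with no hand-written Verilog. */
  uint32_t dot4(const uint16_t a[4], const uint16_t b[4]) {
    uint32_t acc = 0;
    for (int i = 0; i < 4; i++)
      acc += (uint32_t)a[i] * (uint32_t)b[i];
    return acc;
  }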
I remember years ago reading about Handel-C. A lot like Go with channels and threads and function calls. The way it synthesized the hardware was pretty simple conceptually. You could easily understand how the program flow was converted into a state machine in the hardware.
Not sure what happened to it. Maybe it did not optimize things enough.
I worked on the Handel-C compiler :-) Then later used the language for a few years.
Its approach was intentionally simple conceptually. You could tell at a glance how many synchronous clock cycles each step would take, and roughly what logic would be produced, so it worked quite well for deterministic I/O and simple logic.
I found it a bit of a pain for high-throughput pipelining though, and personally prefer a compiler that has more freedom to auto-balance pipelines and retime logic.
I think Handel-C occupied a middle ground between other HLS tools and Verilog/VHDL. It had the concise, C-like syntax of the former, with the predictability of the latter.
What happened to it was it transitioned from university to the spin-out company Celoxica, and then was eventually bought by Agility; then Mentor Graphics bought Handel-C while Agility folded, and Mentor seemed to mothball it.
For a while in the middle there was a decent business with great customers and a decent market cap, and something I'm not privy to resulted in the business folding. I don't think it failed due to insufficient code optimisation :-)
0 is unsigned. I would reject 1/inf — it would be NAN. If the user wants to play silly games with derivatives, computer algebra systems are that way: —>.
The catch here, in many ways, is that Google owns the advertiser marketplace. Literally, if you want to sell ads based on a user's location in common Android settings, you have to do it with Google's data.
So, in many ways they don't sell a person's data to advertisers, per se. However, they do control access to people located in a given area. Not sure there is much of a meaningful difference, for most people.
Yes, which is also in their interest, because once personal information is sold, someone else can use it without going through Google; they benefit by having more personal information than anyone else. Instead, they sell advertisers access to people selected using very personal information, with Google itself as the proxy. I don't see that being much better, and it is almost impossible to avoid. That is why Google is worried: if the government starts cracking down further on this, it will hit them too. I don't want Google selling any information about my location, directly or indirectly, and not the carriers either.
Facebook also doesn't sell personal information to advertisers. Both Facebook and Google let advertisers target ads based on location and other personal info, which is not the same thing as actually providing that information to third parties.
I suspect JumpCrisscross is referring to the data leak scandals via the API that have been in the news in the last year. Of course, this was also not Facebook selling information. Rather, it was Facebook allowing users to share information about their friends automatically and without the friends' consent via the API. Not great, but not selling user information either.
But they do. They let them target ads based on your location. It's the same thing as selling personal data: if someone clicks a location-targeted ad, the advertiser knows the person's location.
"Google Ads location targeting allows your ads to appear in the geographic locations that you choose: countries, areas within a country, a radius around a location, or location groups, which can include places of interest, your business locations, or tiered demographics.
Location targeting helps you focus your advertising on the areas where you'll find the right customers, and restrict it in areas where you won't. This specific type of targeting could help increase your return on investment (ROI) as a result."
In a sense, Google is spyware and adware at the same time. It's a shame these two terms are used less now than they used to be; they lost their meaning once so many apps, and your Android phone itself, became one.
And that's important. I'd rather just Google (which isn't great) have that information than literally anyone willing to pay $0.50 for it (which is infinitely worse).
You're not going to be able to uniquely target someone in a radius, since the minimum is 1 km (and I guess Google would artificially widen the radius to prevent uniquely identifying people in sparsely populated areas). You've also got to entice them to click, and that's notoriously difficult. I mean, it might work occasionally, but it's never going to be the shortest path to surveilling an individual. I would guess that ~nobody is successfully using ads to determine individual people's locations.
Yes. I'm not saying it is as bad as selling the information verbatim to banksters to collect someone's debt.
However, it's still selling personal information to ad buyers. And, I don't think you can say "nobody is successfully using ads to determine ad clickers' general location within an X-mile radius".
This surveillance thing is getting out of hand, and Google is annoyed that shady corporations are trying to gain ground, which might bring new laws.
> And, I don't think you can say "nobody is successfully using ads to determine ad clickers' general location within an X-mile radius".
Of course. And I'm sure that you can find some people who would get riled up that they incremented a counter in a location-oriented semi-anonymous bucket of clicks. But I think most people are substantially less riled up about that than they would be if Google were actually disclosing their individual locations directly to buyers. It seems like a lot of the anti-Google folks on here equivocate between these two, and I speculate it's because consciously or subconsciously they realize the latter narrative is much more emotionally compelling to a much broader segment of the population. To me, that's deceptive rhetoric.
There is location leakage but it's clearly not the same because:
- they don't get your location if you don't click on the ad. Since people don't click on ads very often, an advertiser only rarely gets a person's location, and without a name attached.
- Mobile phone companies were letting people query a user's location based on their real name. How would you do that with advertising?
We need to move beyond one-bit thinking. Location-revealing services aren't all the same.
Website owners already have your general location based on IP - they're not getting 'leaked' any information from location-based ad targeting that isn't already available to them when you visit the site.
"GPS: Accuracy varies depending on GPS signal and connection.
Wi-Fi: Accuracy should be similar to the access range of a typical Wi-Fi router.
Bluetooth: If Bluetooth and/or Bluetooth scanning are enabled on a device, a publicly broadcast Bluetooth signal can provide an accurate indication of location.
Google's cell ID (cell tower) location database: Used in the absence of Wi-Fi or GPS. Accuracy is dependent on how many cell towers are located within an area and available data, and some devices don't support cell ID location."