SuchAnonMuchWow's comments | Hacker News

Soldering dumps a ton of heat into the cell, which risks destroying it. That's why most cells are spot-welded: it's similar to soldering, but much quicker and localized to only the part of the metal that needs to be melted, so the heat doesn't have time to reach the cell itself.


3. >> I was resoundingly told that the absolute error in the numbers is too small to be a problem. Frankly, I did not believe this.

> I would personally also tell that to the author. But there is a much more important reason why correct rounding would be a tremendous advantage: reproducibility.

This is also what the author wants, based on his own experience, but fails to realize/state explicitly: "People on different machines were seeing different patterns being generated which meant that it broke an aspect of our multiplayer game."

So yes, the reasons mentioned as a rationale for more accurate functions are in fact a rationale for reproducibility across hardware and platforms. For example, going from 1 ulp errors to 0.6 ulp errors would not help the author at all, but having reproducible behavior would (even with an increased worst-case error).

Correctly rounded functions mean the rounding error is the smallest possible, and as a consequence every implementation will always return exactly the same results: this is the main reason why people (and the author) advocate for correctly rounded implementations.
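
As a small illustration in plain Python (not from the article; the choice of sin and the variable names are just for the example, and math.ulp needs Python 3.9+): the value below comes from whatever libm the platform links against, so its exact bits are not guaranteed to match across machines, whereas a correctly rounded sin would pin them down uniquely.

    import math

    x = 0.5
    y = math.sin(x)    # delegates to the platform's libm: usually accurate to ~1 ulp,
                       # but the exact bit pattern may differ from system to system
    print(y.hex())     # bit-exact representation; compare it across machines

    # math.ulp(y) is the spacing between adjacent floats at y, i.e. the scale on
    # which "1 ulp" vs "0.6 ulp" errors live. A correctly rounded sin returns the
    # true sin(x) rounded once to the nearest float, which is a unique value, so
    # every correctly rounded implementation produces identical bits.
    print(math.ulp(y))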


ARM has been moving away from small-area chips for a long time (see server SoCs, which are huge beasts), and is trying to become the standard platform for everyone who wants custom hardware.

In this space, chiplets make a lot of sense: you can have a compute chip with standard ARM cores that is reused across your products, and add an extra chiplet with custom IPs depending on each product's needs. That is for example what (as far as I'm aware) Huawei is doing: they reuse the chiplet with ARM cores in different products, then add for example an IO+crypto die to the SoC in their router/firewall products, etc.


More than the ISA, it's the memory interconnect that requires standardization. At the SoC level, ARM is already a de facto standard (ACE-Lite, CHI, ...), but it's only a standard for communication inside a chip, to interconnect various IPs.

I guess this standard aims to remain a standard interconnect even in multi-chiplet systems, to create/extend the whole ecosystem around ARM partners.


In addition to the other comments, the ISO C23 standard added the <stdbit.h> header to the standard library with a stdc_count_ones() function, so compiler support will become standard.


The article goes into great detail and gives several examples of CEOs borrowing for actual decades, so it passes the sniff test because it does actually happen.


The article gives examples of CEOs borrowing against their shares. The article provides no examples of CEOs rolling those loans over until their death.


Mathematicians have already explored exactly what you describe: it is the difference between classical logic and intuitionistic logic.

In classical logic, statements can be true in and of themselves even if there is no proof of them, but in intuitionistic logic statements are true only if there is a proof of them: the proof is what makes the statement true.

In intuitionistic logic, things are not as simple as "either there is a cow in the field, or there is none" because, as you said, for the knowledge "a cow is in the field" to be true, you need a proof of it. This brings lots of nuance: for example, "there isn't no cow in the field" is weaker knowledge than "there is a cow in the field".
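
A small Lean 4 sketch of that asymmetry (just an illustration, not something from the thread): the forward direction is constructive, while recovering the positive statement from the double negation needs a classical axiom.

    -- Constructively fine: from "there is a cow" we get "there isn't no cow".
    theorem p_implies_nnp (P : Prop) (hp : P) : ¬¬P :=
      fun hnp => hnp hp

    -- The converse (double-negation elimination) is not provable intuitionistically;
    -- Lean only accepts it via the classical axiom behind Classical.byContradiction.
    theorem nnp_implies_p (P : Prop) (hnnp : ¬¬P) : P :=
      Classical.byContradiction hnnp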


It is a fascinating topic. I spent a few hours on it once. I vaguely remember that the logic is very configurable and you have a lot of choices, like whether you take the law of excluded middle or not, I think, and things like that depending on your taste or problem. I might be wrong; it was 8 years ago and I spent a couple of weeks reading about it.

Also, no surprise the rabbit hole came from Haskell, where those types (huh) are attracted to this more foundational theory of computation.


The goal of this type of quantization is to move the multiplication by the fp32 rescale factor outside of the dot-product accumulation.

So the multiplications+additions are done in fp8/int8/int4/whatever (when the hardware supports those operators, of course) and accumulated in fp32 or similar, and only the final accumulator is multiplied by the rescale factor in fp32.
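
A rough numpy sketch of that structure, assuming int8 operands with an int32 accumulator and per-tensor fp32 scales (the function and variable names are made up for the example):

    import numpy as np

    def quantized_dot(a_q, b_q, scale_a, scale_b):
        # a_q, b_q: int8 vectors; scale_a, scale_b: fp32 per-tensor scale factors
        acc = np.dot(a_q.astype(np.int32), b_q.astype(np.int32))  # low-precision MACs, wide accumulator
        return (scale_a * scale_b) * acc                          # one rescale after accumulation, not per element

    # versus dequantizing first, which pays a float multiply per element:
    #   np.dot(a_q.astype(np.float32) * scale_a, b_q.astype(np.float32) * scale_b)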


It's worse than that: the energy gains are relative to computations made in fp32, but for fp8 the multipliers are really tiny and the adders/shifters represent the larger part of the operators (energy-wise and area-wise), so this paper will only bring small gains.

For fp8, the estimated gate count of the multipliers is 296 vs. 157 with their technique, so the power gain on the multipliers will be much lower (50% would be a more reasonable estimate), but again, for fp8 the additions in the dot products are a large part of the operations.

Overall, it's really disingenuous to claim an 80% power gain and a small drop in accuracy, when the power gain is only for fp32 operations and the small drop in accuracy is only for fp8 operators. They don't analyze the accuracy drop in fp32, and don't present the power saved for fp8 dot products.
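
Back-of-the-envelope, using only the gate counts quoted above plus a purely hypothetical share for the multipliers:

    gain_mul = 1 - 157 / 296           # ~0.47, i.e. roughly the ~50% multiplier saving estimated above

    # How much of an fp8 dot product the multipliers account for is not given in
    # the paper or in this thread; 0.4 below is a made-up value for illustration.
    mul_share = 0.4
    gain_total = mul_share * gain_mul  # ~0.19: nowhere near the headline 80%
    print(gain_mul, gain_total)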


I'm new to neural nets; when should one use fp8 vs fp16 vs fp32?


Basically no one uses FP32 at inference time. BF16/FP16 is typically considered unquantized, whereas FP8 is lightly quantized. That being said, there's typically pretty minimal quality loss at FP8 compared to 16-bit; Llama 3.1 405b, for example, only benchmarks around ~1% worse when run at FP8: https://blog.vllm.ai/2024/07/23/llama31.html

Every major inference provider other than Hyperbolic Labs runs Llama 3.1 405b at FP8, FWIW (e.g. Together, Fireworks, Lepton), so to compare against FP32 is misleading to say the least. Even Hyperbolic runs it at BF16.

Pretraining is typically done in FP32, although some labs (e.g. Character AI, RIP) apparently train in INT8: https://research.character.ai/optimizing-inference/


SambaNova does bf16


The higher the precision, the better. Use what works within your memory constraints.


With serious diminishing returns. At inference time there's no reason to use fp64, and you should probably use fp8 or less. The accuracy loss is far less than you'd expect. AFAIK Llama 3.2 3B at fp4 will outperform Llama 3.2 1B at fp32 in accuracy and speed, despite using 8x fewer bits of precision.


To help with circular imports, we switched a few years ago to lazily importing submodules on demand, and never switched back.

Just add to your __init__.py files:

    import importlib

    def __getattr__(submodule_name):
        return importlib.import_module('.' + submodule_name, __package__)
And then just import the root module and use it without ever needing to import individual submodules:

    import foo

    def bar():
        # foo.subfoo is imported when the function is first executed, not when it
        # is parsed, so no circular import happens.
        return foo.subfoo.bar()
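
For context, a minimal sketch (module names made up) of the circular layout this sidesteps: each submodule only imports the lightweight root package at the top, and the cross-module lookup happens lazily at call time through the __getattr__ above.

    # foo/a.py
    import foo                   # cheap: only the root package is imported here

    def f():
        return foo.b.g() + 1     # foo.b is resolved on the first call, not at import time

    # foo/b.py
    import foo

    def g():
        return 41                # could just as well call back into foo.a without a cycle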


Doesn't that mean your editor support is crap though?


Not at all. Sublime is perfectly fine with it.

I suspect that from the usage in the code, it knows that there is a module foo and a submodule subfoo with a function bar() in it, and it can look directly in the file for the definition of bar().

It would be another story if we used this opportunity to mangle the submodule names, for example, but that's the kind of hidden control flow that nobody wants in their codebase.

Also, it is not some dark art of imports or anything: it is pretty standard at this point, since it's one of the sanest ways of breaking circular dependencies between your modules, and the ability to overload a module's __getattr__ was introduced specifically for this use case. (I couldn't find the specific PEP that introduced it, sorry.)


It does, which is why this is more easily done by importing exact bits or using a single file

