More

jtrn · 2026-01-04T23:40:54 1767570054

Having worked as a therapist for years, treating thousands of client, and read more books and studies on this than I can count, I can for certain say that is MUCH MUCH simpler to change the environment than your personality to improve well beeing and general functioning.

To tired to pick a apart this article now. But this is feel good nonsense... Just one example, mindfulness is not even a fraction as effective as most people claim. It always fall apart when you do a proper study with actual measurable effects on life and happiness as outcome.

Also, 6 weeks is nothing. When I worked at inpatient unit we sometimes needed 6 weeks before patient reverted to baseline personality after admittance. This is just as silly as saying that you changed your lifestyle permanently with regard exercises after just a few weeks into your new year's resolution. You MIGHT have, but lets wait untill next year and see if the permanent claim is true.

jtrn · 2026-01-05T16:41:02 1767631262

Now I read it. And yes, it’s not very good.

The article conflates symptom management (state) with structural personality change (trait) and mistakes behavioral masking for genuine psychological shift.

Personality traits are defined by their stability and enduring nature, independent of active intervention. If a change requires constant, conscious maintenance (the "medicine"), it is by definition a coping strategy, not a personality trait. This indicates the underlying neurotic structure is still present, just temporarily suppressed. You don’t cure type one diabetes by taking insulin; you manage it.

Any new intervention (yoga, journaling) creates a temporary lift in mood and self-efficacy (the Placebo effect or Novelty effect). Measuring immediately at the peak of this novelty does not account for the very common regression to the mean that inevitably follows.

And here is a basic demonstration of why you do RCT with proper non-subjective measures for outcome. She writes: "I had wanted to change for the sake of this article... Answering questions like this helped push me up the percentiles." She asserts that answering the questions differently proved she changed, rather than proving she simply learned how to answer the test to get the desired result.

Introverts can behave like extroverts, but it costs them metabolic and psychic energy. Extroverts gain energy from it. She conflates social skills (which can be learned) with extroversion (a biological orientation toward reward sensitivity).

And even if I practice mindfulness for years myself, I’m extremely skeptical about over the top claims surrounding it. But a highly neurotic patient can learn mindfulness to manage panic attacks. They are still high in trait Neuroticism (highly sensitive to threat), but they have better "software" to handle the hardware. The article claims the hardware itself has been swapped out, which is an annoying oversimplification for readers.

So in summary, this article goes against everything I have seen in my practice, it doesn’t understand the concepts in question, and it’s not even internally logically coherent. So basically just as bad as every other mainstream article I ever read on psychology.

tailspin2019 · 2026-01-05T00:39:09 1767573549

Would you be willing to expand on what you mean by “change the environment”?

jtrn · 2026-01-05T16:52:03 1767631923

Think of it like moving a husky dog from Siberia to the Sahara. Nothing has changed with the dog, but it's not functioning quite as well after the move.

The classic examples are things like going to a library where everyone else is silent and studying, finding a job that suits your temperament, like accounting for the sensitive type and ambulance driving for the sensation-seeking type. Create a social contract for exercise meet-ups, and downgrade to a dumb phone to overcome doom scrolling. Having the router block all Internet traffic after 10 o’clock. And so on.

Moving from one place to another is immensely helpful for some. Changing friends, and even starting or stopping medication, counts as environmental changes in this instance, since it's part of the arena on which the personality plays itself out.

Just make sure to stay away from all this “nudge” people. They never have anything helpful to actually contribute to in this.

callc · 2026-01-05T01:08:43 1767575323

Not OP, but here’s what I think they mean: if you want to eat healthier then get rid of all the junk food in your home. Only buy healthy food. No more deciding between health and unhealthy. You made the choice for your future self by changing your environment.

ytoawwhra92 · 2026-01-05T03:10:06 1767582606

Move house/town/region/country. Get a new job. Make new friends. Leave bad romantic relationships. Get a pet. Go outside. Etc.

AbstractH24 · 2026-01-05T04:39:03 1767587943

Bring out different parts of you rather than change who you are.

pardon_me · 2026-01-05T11:52:53 1767613973

I agree. Approximately three months are necessary before starting to claim "personality changes" are at all permanent.

cainxinth · 2026-01-05T14:02:44 1767621764

Words are cheap. When someone tells me they’ve changed, I need to see at least a year of consistent behavior before I take that claim seriously. Far more often, what looks like change is just a honeymoon period that fades, with old habits resurfacing and regression to the mean taking over.

srid · 2026-01-05T03:24:43 1767583483

> mindfulness is not even a fraction as effective as most people claim. It always fall apart when you do a proper study with actual measurable effects on life and happiness as outcome.

This is interesting.

Is it also the case for those seeking bliss in the name of "jhana" (cf. Jhourney)?

jtrn · 2026-01-05T17:37:14 1767634634

Given that you asked about this topic, I assume you are interested in it, so I apologize in advance if this comes across as negative. I come at this as a ruthless pragmatist. My background is in Metacognitive Therapy (MCT) and long-term mindfulness, so I’ve read the literature. My only metric is: “What is your goal, and does this tool actually help you achieve it?” From that perspective, I have four major issues with the modern "Jhana" movement:

There are currently no large-scale, randomized controlled trials (RCTs) proving the clinical effectiveness of the "Jhourney" method or "Sutta Jhanas" for mental health in the general population. Until we see data, this is experimental, not medical.

Then it strikes me as "Drug-Free Hedonism." It is a classic case of Spiritual Materialism: instead of buying a Ferrari to feel good, you "buy" a Jhana state. It is still the ego seeking gratification, just using a different currency. The marketing language ("ecstatic," "orgasmic," "dopamine hit") explicitly invites a consumerist mindset. You aren't dissolving the self; you are just consuming a peak experience.

The "State vs. Trait" Fallacy In my observation, the people drawn to these niche practices are already optimizing/biohacking types. They are selecting a practice that reinforces their existing personality rather than transforming it. I see very little evidence that accessing these temporary states leads to permanent positive traits or behavioral changes once the "high" wears off.

On a personal moral level, I find the logic of extreme contemplative devotion flawed. Historically, the "true Buddhist" monastic model relies on others for food and sustenance. You can frame this as "spiritual focus," but a pragmatist could easily frame it as a lack of self-sufficiency—or even laziness. I rarely see a tangible improvement in productivity, self-sufficiency, or general functioning in people who dive deep into this mysticism.

That said, I do believe in the "software upgrade" of mindfulness—specifically the ability to step back, observe, and evaluate thoughts without engaging them (the core of Metacognitive Therapy). There is decent evidence for that. But that is a tool for functioning, which is very different from chasing bliss.

etyhhgfff · 2026-01-05T19:27:20 1767641240

The buddihst and meditation community is very aware that chasing bliss is not the goal. Jhanas are a tool. A good teacher would advice you to develop enough concentration and use that for insight meditation (vipassana). Mindfulness based stress reduction MBSR has good scientific support, and further studies and meta studies confirm this more often than not.

From personal experience I agree 6 weeks are far too short for any meaningful long teem change, but meditating over the past two years, 45 min daily, had a noticeable impact in my case and there are similar reports all over the internet.

port11 · 2026-01-10T08:12:32 1768032752

That’s much longer than my practice, so it makes me wonder if the less impressive results of meditation are caused by people like me that do 5–15 minutes.

I’ve deep trauma that psychotherapy helped, but I can’t say meditation does anything for me besides calming me down for the next hour or two.

etyhhgfff · 2026-01-11T19:45:43 1768160743

Maybe you try a three day meditation retreat and see if you can benefit from going much deeper.

I'm nowhere qualified to give an objective answer, but for me, doing a 15 min meditation feels more comparable to a power nap. I feel refreshed, but I won't gain insights about myself.

srid · 2026-01-06T15:46:51 1767714411

I appreciate your response, it didn't come across as negative at all.

Regarding peak experiences, not all are same: https://srid.ca/peak

I'll checkout MCT. My own style of awareness is https://srid.ca/aware-cum-attentive

jtrn · 2026-01-03T23:49:50 1767484190

No point in arguing with people who bring a snowball into Congress to disprove global warming.

jtrn · 2026-01-03T23:47:55 1767484075

Could view it as push/pull dynamics: pushed away by toxicity, pulled to good answers from AI.

jtrn · 2025-12-27T14:34:58 1766846098

I was thinking of testing it, but I have an irrational hatred for Conda.

optionalsquid · 2025-12-27T14:49:57 1766846997

You could use pixi instead, as a much nicer/saner alternative to conda: https://pixi.sh

Though in this particular case, you don't even need conda. You just need python 3.13 and a virtual environment. If you have uv installed, then it's even easier:

    git clone https://github.com/apple/ml-sharp.git
    cd ml-sharp
    uv sync
    uv run sharp

jtrn · 2025-12-27T22:33:44 1766874824

The hate is so irrational I can’t stop feeling that any project that even uses Conda HAS to be terrible. Like a chef that creates a recipe with shit as an ingredient. I could exchange the shit for sugar, but why bother, the chef is obviously insane. I’m really sorry if anyone that worked in this ever reads this. But Conda is just triggering me.

jtreminio · 2025-12-27T14:46:48 1766846808

You can simply use a `uv` env instead?

moron4hire · 2025-12-27T14:36:26 1766846186

You aren't being irrational.

nativeit · 2025-12-27T17:06:45 1766855205

Perhaps they lived outside of the kingdom, with an evil Stepmother who moved very slow, struggled with complex dependency collisions, and took up a bunch of unnecessary space? Such an experience could leave one very traumatized towards Conda, even though their real problems are the unresolved issues with their stepmother…

quleap · 2025-12-28T05:58:03 1766901483

I hate pip, a million times worse than conda

jtrn · 2025-12-25T16:28:28 1766680108

This is totally improper reporting of the study.

When enough words, framing, and unstated important premises are omitted, it crosses over from the realm of incomplete or misleading into plain outright lying in my worldview.

They claim "Reverses advanced AD in mice." What they did is reverse symptoms in genetic models.

They claim to "Restore NAD+ homeostasis," ignoring that NAD might not even be causally related to Alzheimer’s, just a side effect. It’s like saying we cured a house fire because we efficiently removed the ashes after the fire. It’s the Tau thing all over again.

The claim: "Conservative molecular signatures" when in reality, 5xFAD mice are poor predictors of human clinical success to such a degree that it’s statistically more common for mice studies to NOT transfer to human biology than to do so.

They also make unsupported claims like "Safer than NAD+ precursors (supplements)," when this is a pre-clinical assumption. No human toxicity trials are mentioned in this context, and there are always MASSIVE differences when switching to real human studies. It might be correct, but there’s no basis to say that based on this study.

Also, the senior author owns the company. The paper has the hallmarks of a "pitch deck" for the drug.

In short, it seems to me that the claim of "Full Neurological Recovery" is misleading to patients. It fails to prove that fixing NAD+ in humans will stop the disease, only that it works in mice engineered to have the disease, and only by assuming that their specific measure is a 1-to-1 with the clinical presentation of the disease. The results are likely the "best case scenario" presented to support the commercialization of P7C3-A20.

Here is the COMMON SENSE question peer-reviews should have asked. Is low NAD+ the fire, or just the ashes?

Why should we believe this works in humans when the last 500 'cures' in 5xFAD mice failed?

Are you regrowing a brain, or just cleaning up a dirty one?

How does one molecule fix five unconnected problems simultaneously? The Context: The drug fixed inflammation, blood-brain barrier, amyloid/tau (protein folding), and memory (neuronal signaling). Drugs rarely hit four distinct biological systems with 100% success....

Where is the toxicology report that proves 'safer than supplements'?

qnleigh · 2025-12-26T08:51:04 1766739064

You're asking good questions, but it's unreasonable to expect one paper to answer all of them. Probably the article should have emphasized more strongly that mouse models are imperfect, but they do show efficacy in two different mouse models, which counts for something.

> It fails to prove that fixing NAD+ in humans will stop the disease, only that it works in mice engineered to have the disease

This in particular is just not possible without clinical trials in humans. But you can't have a clinical trial without evidence of efficiency, so you need to start with mouse models, even if they are imperfect. Sadly we don't know if any of the existing mouse models are any good, but it's the best we've got.

jtrn · 2025-12-26T09:30:47 1766741447

They put themselves in the position of having to answer these questions given their claims. I’m not suggesting they should figure out everything about everything, but given the over the top claims made, these are the least they should be able to answer.

JumpCrisscross · 2025-12-25T19:31:11 1766691071

> the senior author owns the company

Unfortunately, the moment I saw reference to NAD+ I started looking for this. Thank you.

exmadscientist · 2025-12-25T18:33:19 1766687599

I can't upvote your comment enough.

Please... people, do not get your hopes up over this one PR pump piece released on Christmas Day. This is not a legit study. (It might, perhaps, somehow, still be correct, perhaps... even broken clocks are right twice a day... but it's still not a legit study.)

nandomrumber · 2025-12-25T19:38:15 1766691495

Stopped clocks.

Broken clocks can be broken in ways that causes them to be up to and including never correct.

;)

storus · 2025-12-25T23:28:55 1766705335

Well, NAD+ kinda has the potential to fix multiple things at once if those things are caused by energetic imbalance/deficit. Re-balancing NAD+ could just fix multiple failing systems that were running low on energy and not doing their job properly as a consequence.

jtrn · 2025-12-25T15:52:32 1766677952

Im a bit out of the loop with this, but hope its not like that time with python 3.14, when it was claimed a geometric mean speedup of about 9-15% over the standard interpreter when built with Clang 19. It turned out the results were inflated due to a bug in LLVM 19 that prevented proper "tail duplication" optimization in the baseline interpreter's dispatch loop. Actual gains was aprox 4%.

Edit: Read through it and have come to the conclusion that the post is 100% OK and properly framed: He explicitly says his approach is to "sharing early and making a fool of myself," prioritizing transparency and rapid iteration over ironclad verification upfront.

One could make an argument that he should have cross-compiler checks, independent audits, or delayed announcements until results are bulletproof across all platforms. But given that he is 100% transparent with his thinking and how he works, it's all good in the hood.

kenjin4096 · 2025-12-25T16:28:05 1766680085

Thanks :), that was indeed my intention. I think the previous 3.14 mistake was actually a good one on hindsight, because if I didn't publicize our work early, I wouldn't have caught the attention of Nelson. Nelson also probably wouldn't have spent one month digging into the Clang 19 bug. This also meant the bug wouldn't have been caught in the betas, and might've been out with the actual release, which would have been way worse. So this was all a happy accident on hindsight that I'm grateful for as it means overall CPython still benefited!

Also this time, I'm pretty confident because there are two perf improvements here: the dispatch logic, and the inlining. MSVC can actually convert switch-case interpreters to threaded code automatically if some conditions are met [1]. However, it does not seem to do that for the current CPython interpreter. In this case, I suspect the CPython interpreter loop is just too complicated to meet those conditions. The key point also that we would be relying on MSVC again to do its magic, but this tail calling approach gives more control to the writers of the C code. The inlining is pretty much impossible to convince MSVC to do except with `__forceinline` or changing things to use macros [2]. However, we don't just mark every function as forceinline in CPython as it might negatively affect other compilers.

[1]: https://github.com/faster-cpython/ideas/issues/183 [2]: https://github.com/python/cpython/issues/121263

jtrn · 2025-12-25T17:16:51 1766683011

I wish all self-promoting scientists and sensationalizing journalists had a fraction of the honesty and dedication to actual truth and proper communication of truths as you do. You seem to feel that it’s more important to be transparent about these kinds of technical details than other people are about their claims in clinical medical research. Thank you so much for all you do and the way you communicate about it.

Also, I’m not that familiar with the whole process, but I just wanted to say that I think you were too hard on yourself during the last performance drama. So thank you again and remember not to hold yourself to an impossible standard no one else does.

halflings · 2025-12-25T18:39:02 1766687942

+1, reading through the post, the PR updating the documentation... thanks for being transparent, but also don't be so hard on yourself!

That was a very niche error, that you promptly corrected, no need to be so apologetic about it! And thanks for all the hard work making Python faster!

kenjin4096 · 2025-12-25T17:31:23 1766683883

Thank you very much for the kind words, that means a lot to me!

haberman · 2025-12-25T19:03:13 1766689393

I’ll repeat what I said at that time: one of the benefits of the new design is that it’s less vulnerable to the whims of the optimizer: https://news.ycombinator.com/item?id=43322451

If getting the optimal code is relying on getting a pile of heuristics to go in your favor, you’re more vulnerable to the possibility that someday the heuristics will go the other way. Tail duplication is what we want in case, but it’s possible that a future version of the compiler could decide that it’s not desired because of the increased code size.

With the new design, the Python interpreter can express the desired shape of the machine code more directly, leaving it less vulnerable to the whims of the optimizer.

kenjin4096 · 2025-12-25T19:19:26 1766690366

Yeah, I believe that statement and it seems to hold true for MSVC as well. Thanks for your work inspiring all of this btw!

jtrn · 2025-12-24T23:17:10 1766618230

And more people turning up at the clinic doesn’t mean there are more sick people, but it’s an indication.

Funny that you ignored “indicates,” which is a perfectly valid inference here, and also it’s true.

If you want to be pedant, don’t self-own.

jtrn · 2025-12-24T22:23:48 1766615028

This is only relevant for those deeply involved in fundamental or early-stage battery research.

An energy density of 1270 Wh/L is indeed roughly double what is currently found in top-tier electric vehicles. However, as with many battery research avenues, it is not viable on a practical level unless a major breakthrough is discovered in addition.

Here is a list of all the issues that must be resolved before such battery technology is viable for commercial use.

It only lasts about 100 charge cycles before degrading to 80% capacity, which is not sufficient for commercial use. LiFePO4 reaches this after a minimum of 3000 cycles.

It uses silver. In addition to this likely being a deal-breaker for mass production, the paper probably downplays the mass loading of silver required to maintain that 99.6% efficiency.

Anode-free batteries have zero excess lithium. Every time you charge/discharge, you lose a tiny fraction of lithium to side reactions. The paper claims a Coulombic Efficiency of 99.6%. The fact that they hit ~82% suggests the degradation is severe and inevitable without a massive reservoir of extra lithium, which defeats the "energy density" gain.

Density suppression for 100 cycles is not proof of safety. Dendrites often grow slowly and trigger short circuits later in life (cycle 200+).

There is also the known problem with pouch cells and significant volume change ("breathing"). The paper quotes volumetric density including packaging, but does it account for the swelling that happens after 50 cycles? Often, these cells puff up like balloons, rendering them unusable in a tight battery pack.

They tested at 0.5C (2-hour charge). Fast charging (15-20 mins) typically destroys lithium metal anodes instantly by causing rapid dendrite growth. This technology is likely limited to slow-charging applications.

Finally, there is no mention of temperature effects on performance.

I don’t mean to be negative, and research like this is extremely important. But this research paper is not properly framed. It’s like an archaeologist finding a buried house and extrapolating that this could mean we found an entire city! Why can’t we just say that the archaeologist found an interesting house?

jtrn · 2025-12-24T10:08:14 1766570894

Shout out to Stirling PDF that can be self hosted and has a relatively robust and easy to use redaction tool. All for free.... For now....

jtrn · 2025-12-22T20:44:16 1766436256

My quickie: MoE model heavily optimized for coding agents, complex reasoning, and tool use. 358B/32B active. vLLM/SGLang only supported on the main branch of these engines, not the stable releases. Supports tool calling in OpenAI-style format. Multilingual English/Chinese primary. Context window: 200k. Claims Claude 3.5 Sonnet/GPT-5 level performance. 716GB in FP16, probably ca 220GB for Q4_K_M.

My most important takeaway is that, in theory, I could get a "relatively" cheap Mac Studio and run this locally, and get usable coding assistance without being dependent on any of the large LLM providers. Maybe utilizing Kimik2 in addition. I like that open-weight models are nipping at the feet of the proprietary models.

hasperdi · 2025-12-22T23:07:27 1766444847

I bought a second‑hand Mac Studio Ultra M1 with 128 GB of RAM, intending to run an LLM locally for coding. Unfortunately, it's just way too slow.

For instance, an 4‑bit quantized model of GLM 4.6 runs very slowly on my Mac. It's not only about tokens per second speed but also input processing, tokenization, and prompt loading; it takes so much time that it's testing my patience. People often mention about the TPS numbers, but they neglect to mention the input loading times.

jwitthuhn · 2025-12-23T02:16:15 1766456175

At 4 bits that model won't fit into 128GB so you're spilling over into swap which kills performance. I've gotten great results out of glm-4.5-air which is 4.5 distilled down to 110B params which can fit nicely at 8 bits or maybe 6 if you want a little more ram left over.

hasperdi · 2025-12-23T08:13:34 1766477614

Correction, my GLM-4.6 models are not Q4, I can only run lower ones eg:

- https://huggingface.co/unsloth/GLM-4.6-GGUF/blob/main/GLM-4.... - 84GB, Q1 - https://huggingface.co/unsloth/GLM-4.6-REAP-268B-A32B-GGUF/t... - 92GB, Q2

I ensure that there are enough RAM leftover ie limited context window setting, so no swapping.

As for GLM-4.5-Air, I run that daily, switching between noctrex/GLM-4.5-Air-REAP-82B-A12B-MXFP4_MOE-GGUF and kldzj/gpt-oss-120b-heretic

andai · 2025-12-23T11:27:50 1766489270

Are you getting any agentic out of gpt-oss-120b?

I can't tell if it's some bug regarding message formats or if it's just genuinely giving up, but it failed to complete most tasks I gave it.

nekitamo · 2025-12-23T11:58:39 1766491119

GPT-oss-120B was also completely failing for me, until someone on reddit pointed out that you need to pass back in the reasoning tokens when generating a response. One way to do this is described here:

https://openrouter.ai/docs/guides/best-practices/reasoning-t...

Once I did that it started functioning extremely well, and it's the main model I use for my homemade agents.

Many LLM libraries/services/frontends don't pass these reasoning tokens back to the model correctly, which is why people complain about this model so much. It also highlights the importance of rolling these things yourself and understanding what's going on under the hood, because there's so many broken implementations floating around.

hasperdi · 2025-12-23T14:03:49 1766498629

IIRC I did and failed but I didn't investigate further.

mechagodzilla · 2025-12-22T23:54:00 1766447640

I've been running the 'frontier' open-weight LLMs (mainly deepseek r1/v3) at home, and I find that they're best for asynchronous interactions. Give it a prompt and come back in 30-45 minutes to read the response. I've been running on a dual-socket 36-core Xeon with 768GB of RAM and it typically gets 1-2 tokens/sec. Great for research questions or coding prompts, not great for text auto-complete while programming.

christina97 · 2025-12-24T13:29:14 1766582954

Let's say 1.5tok/sec, and that your rig pulls 500 W. That's 10.8 tok/Wh, and assuming you pay, say 15c/kWh means you're paying in the vicinity of $13.8/mtok of output. Looking at R1 output costs on OpenRouter, it's costing about 5-7x as much as what you can pay for third party inference (which also produce tokens ~30x faster).

tyre · 2025-12-23T00:12:45 1766448765

Given the cost of the system, how long would it take to be less expensive than, for example, a $200/mo Claude Max subscription with Opus running?

mechagodzilla · 2025-12-23T00:23:13 1766449393

It's not really an apples-to-apples comparison - I enjoy playing around with LLMs, running different models, etc, and I place a relatively high premium on privacy. The computer itself was $2k about two years ago (and my employer reimbursed me for it), and 99% of my usage is for research questions which have relatively high output per input token. Using one for a coding assistant seems like it can run through a very high number of tokens with relatively few of them actually being used for anything. If I wanted a real-time coding assistant, I would probably be using something that fit in the 24GB of VRAM and would have very different cost/performance tradeoffs.

mark_l_watson · 2025-12-23T12:11:11 1766491871

For what it is worth, I do the same thing you do with local models: I have a few scripts that build prompts from my directions and the contents of one or more local source files. I start a local run and get some exercise, then return later for the results.

I own my computer, it is energy efficient Apple Silicon, and it is fun and feels good to do practical work in a local environment and be able to switch to commercial APIs for more capable models and much faster inference when I am in a hurry or need better models.

Off topic, but: I cringe when I see social media posts of people running many simultaneous agentic coding systems and spending a fortune in money and environmental energy costs. Maybe I just have ancient memories from using assembler language 50 years ago to get maximum value from hardware but I still believe in getting maximum utilization from hardware and wanting to be at least the ‘majority partner’ in AI agentic enhanced coding sessions: save tokens by thinking more on my own and being more precise in what I ask for.

Workaccount2 · 2025-12-23T02:58:07 1766458687

Never, local models are for hobby and (extreme) privacy concerns.

A less paranoid and much more economically efficient approach would be to just lease a server and run the models on that.

g947o · 2025-12-23T05:23:40 1766467420

This.

I spent quite some time on r/LocalLLaMA and yet need to see a convincing "success story" of productively using local models to replace GPT/Claude etc.

hasperdi · 2025-12-23T08:35:30 1766478930

I have several my own little success stories:

- For polishing Whisper speech to text output, so I can dictate things to my computer and get coherent sentences, or for shaping the dictation to specific format eg. "generate ffmpeg to convert mp4 video to flac with fade in and out, input file is myvideo.mp4 output is myaudio flac with pascal case" -> Whisper -> "generate ff mpeg to convert mp4 video to flak with fade in and out input file is my video mp4 output is my audio flak with pascal case" -> Local LLM -> "ffmpeg ..."

- Doing classification / selection type of work eg. classifying business leads based on the profile

Basically the win for local llm is that the running cost (in my case, second hand M1 Ultra) is so low, I can run large quantity of calls that don't need frontier models.

g947o · 2025-12-23T12:14:37 1766492077

My comment was not very clear. I specifically meant Claude Code/Codex like workflows where the agent generates/run code interactively with user feedback. My impression is that consumer grade hardware is still too slow for these things to work.

hasperdi · 2025-12-23T14:02:03 1766498523

You are right, consumer grade hardware is mostly too slow... although it's a relative thing right. For instance you can get Mac Studio Mx Ultra with 512GB RAM, run GLM-4.5-Air and have a bit of patience. It could work

FuckButtons · 2025-12-23T20:44:27 1766522667

I was able to run a batch job that lasted ~2 weeks of inference time on my m4 max by running it over night against a large dataset I wanted to mine. It cost me pennies in electricity and writing a simple python script as a scheduler.

dimava · 2025-12-23T17:23:20 1766510600

Tokens will cost same on Mac and on API because electricity is not free

And you can only generate like $20 of tokens a month

Cloud tokens made on TPU will always be cheaper and waaay faster then anything you can make at home

reissbaker · 2025-12-23T17:38:49 1766511529

This generally isn't true. Cloud vendors have to make back the cost of electricity and the cost of the GPUs. If you already bought the Mac for other purposes, also using it for LLM generation means your marginal cost is just the electricity.

Also, vendors need to make a profit! So tack a little extra on as well.

However, you're right that it will be much slower. Even just an 8xH100 can do 100+ tps for GLM-4.7 at FP8; no Mac can get anywhere close to that decode speed. And for long prompts (which are compute constrained) the difference will be even more stark.

foobar10000 · 2025-12-24T04:22:17 1766550137

A question on the 100+ tps - is this for short prompts? For large contexts that generate a chunk of tokens at context sizes at 120k+, I was seeing 30-50 - and that's with 95% KV cache hit rate. Am wondering if I'm simply doing something wrong here...

reissbaker · 2025-12-29T11:09:09 1767006549

Depends on how well the speculator predicts your prompts, assuming you're using speculative decoding — weird prompts are slower, but e.g. TypeScript code diffs should be very fast. For SGLang, you also want to use a larger chunked prefill size and larger max batch sizes for CUDA graphs than the defaults IME.

oceanplexian · 2025-12-23T11:20:10 1766488810

It doesn't matter if you spend $200, $20,000, or $200,000 a month on an Anthropic Subscription.

None of them will keep your data truly private and offline.

robotswantdata · 2025-12-23T10:12:58 1766484778

Yes they conveniently forget about disclosing prompt processing time. There is an affordable answer to this, will be open sourcing the design and sw soon.

hedgehog · 2025-12-23T00:45:45 1766450745

Have you tried Qwen3 Next 80B? It may run a lot faster, though I don't know how well it does coding tasks.

hasperdi · 2025-12-23T08:36:31 1766478991

I did, it works well... although it is not good enough for agentic coding

smcleod · 2025-12-23T12:27:45 1766492865

Need the M5 (max/ultra next year) with it's MATMUL instruction set that massively speeds up the prompt processing.

Reubend · 2025-12-22T23:35:42 1766446542

Anything except a 3bit quant of GLM 4.6 will exceed those 128 GB of RAM you mentioned, so of course it's slow for you. If you want good speeds, you'll at least need to store the entire thing in memory.

embedding-shape · 2025-12-22T20:47:00 1766436420

> Supports tool calling in OpenAI-style format

So Harmony? Or something older? Since Z.ai also claim the thinking mode does tool calling and reasoning interwoven, would make sense it was straight up OpenAI's Harmony.

> in theory, I could get a "relatively" cheap Mac Studio and run this locally

In practice, it'll be incredible slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.

biddit · 2025-12-22T20:58:01 1766437081

> In practice, it'll be incredible slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.

Yes, as someone who spent several thousand $ on a multi-GPU setup, the only reason to run local codegen inference right now is privacy or deep integration with the model itself.

It’s decidedly more cost efficient to use frontier model APIs. Frontier models trained to work with their tightly-coupled harnesses are worlds ahead of quantized models with generic harnesses.

theLiminator · 2025-12-22T21:02:38 1766437358

Yeah, I think without a setup that costs 10k+ you can't even get remotely close in performance to something like claude code with opus 4.5.

cmrdporcupine · 2025-12-22T21:09:13 1766437753

10k wouldn't even get you 1/4 of the way there. You couldn't even run this or DeepSeek 3.2 etc for that.

Esp with RAM prices now spiking.

coder543 · 2025-12-22T21:17:33 1766438253

$10k gets you a Mac Studio with 512GB of RAM, which definitely can run GLM-4.7 with normal, production-grade levels of quantization (in contrast to the extreme quantization that some people talk about).

The point in this thread is that it would likely be too slow due to prompt processing. (M5 Ultra might fix this with the GPU's new neural accelerators.)

embedding-shape · 2025-12-22T22:53:51 1766444031

> $10k gets you a Mac Studio with 512GB of RAM, which definitely can run GLM-4.7 with normal, production-grade levels of quantization (in contrast to the extreme quantization that some people talk about).

Please do give that a try and report back the prefill and decode speed. Unfortunately, I think again that what I wrote earlier will apply:

> In practice, it'll be incredible slow and you'll quickly regret spending that much money on it

I'd rather place that 10K on a RTX Pro 6000 if I was choosing between them.

rynn · 2025-12-22T23:28:08 1766446088

> Please do give that a try and report back the prefill and decode speed.

M4 Max here w/ 128GB RAM. Can confirm this is the bottleneck.

https://pastebin.com/2wJvWDEH

I weighed about a DGX Spark but thought the M4 would be competitive with equal RAM. Not so much.

cmrdporcupine · 2025-12-22T23:33:54 1766446434

I think the DGX Spark will likely underperform the M4 from what I've read.

However it will be better for training / fine tuning, etc. type workflows.

rynn · 2025-12-23T00:29:11 1766449751

> I think the DGX Spark will likely underperform the M4 from what I've read.

For the DGX benchmarks I found, the Spark was mostly beating the M4. It wasn't cut and dry.

coder543 · 2025-12-23T00:36:16 1766450176

The Spark has more compute, so it should be faster for prefill (prompt processing).

The M4 Max has double the memory bandwidth, so it should be faster for decode (token generation).

coder543 · 2025-12-22T23:10:59 1766445059

> I'd rather place that 10K on a RTX Pro 6000 if I was choosing between them.

One RTX Pro 6000 is not going to be able to run GLM-4.7, so it's not really a choice if that is the goal.

embedding-shape · 2025-12-23T09:05:38 1766480738

No, but the models you will be able to run, will run fast and many of them are Good Enough(tm) for quite a lot of tasks already. I mostly use GPT-OSS-120B and glm-4.5-air currently, both easily fit and run incredibly fast, and the runners haven't even yet been fully optimized for Blackwell so time will tell how fast it can go.

bigyabai · 2025-12-22T23:49:07 1766447347

You definitely could, the RTX Pro 6000 has 96 (!!!) gigs of memory. You could load 2 experts at once at an MXFP4 quant, or one expert at FP8.

coder543 · 2025-12-22T23:55:40 1766447740

No… that’s not how this works. 96GB sounds impressive on paper, but this model is far, far larger than that.

If you are running a REAP model (eliminating experts), then you are not running GLM-4.7 at that point — you’re running some other model which has poorly defined characteristics. If you are running GLM-4.7, you have to have all of the experts accessible. You don’t get to pick and choose.

If you have enough system RAM, you can offload some layers (not experts) to the GPU and keep the rest in system RAM, but the performance is asymptotically close to CPU-only. If you offload more than a handful of layers, then the GPU is mostly sitting around waiting for work. At which point, are you really running it “on” the RTX Pro 6000?

If you want to use RTX Pro 6000s to run GLM-4.7, then you really need 3 or 4 of them, which is a lot more than $10k.

And I don’t consider running a 1-bit superquant to be a valid thing here either. Much better off running a smaller model at that point. Quantization is often better than a smaller model, but only up to a point which that is beyond.

bigyabai · 2025-12-23T00:34:24 1766450064

You don't need a REAP-processed model to offload on a per-expert basis. All MoE models are inherently sparse, so you're only operating on a subset of activated layers when the prompt is being processed. It's more of a PCI bottleneck than a CPU one.

> And I don’t consider running a 1-bit superquant to be a valid thing here either.

I don't either. MXFP4 is scalar.

coder543 · 2025-12-23T00:43:34 1766450614

Yes, you can offload random experts to the GPU, but it will still be activating experts that are on the CPU, completely tanking performance. It won't suddenly make things fast. One of these GPUs is not enough for this model.

You're better off prioritizing the offload of the KV cache and attention layers to the GPU than trying to offload a specific expert or two, but the performance loss I was talking about earlier still means you're not offloading enough for a 96GB GPU to make things how they need to be. You need multiple, or you need a Mac Studio.

If someone buys one of these $8000 GPUs to run GLM-4.7, they're going to be immensely disappointed. This is my point.

embedding-shape · 2025-12-23T09:06:58 1766480818

> If someone buys one of these $8000 GPUs to run GLM-4.7, they're going to be immensely disappointed. This is my point.

Absolutely, same if they get a $10K Mac/Apple computer, immense disappointment ahead.

Best is of course to start looking at models that fit within 96GB, but that'd make too much sense.

virgildotcodes · 2025-12-23T11:07:47 1766488067

$10k is > 4 years of a $200/mo sub to models which are currently far better, continue to get upgraded frequently, and have improved tremendously in the last year alone.

This almost feels like a retro computing kind of hobby than anything aimed at genuine productivity.

embedding-shape · 2025-12-23T11:33:48 1766489628

I don't think the calculation is that simple. With your own hardware, there literally is no limits of runtime, or what models you use, or what tooling you use, or availability, all of those things are up to you.

Maybe I'm old school, but I prefer those benefits over some cost/benefit analysis across 4 years which by the time we're 20% through it, everything has changed.

But I also use this hardware for training my own models, not just inference and not just LLMs, I'd agree with you if we were talking about just LLM inference.

naasking · 2025-12-23T14:08:17 1766498897

They are better in some ways, but they're also neutered.

benjiro · 2025-12-22T22:06:35 1766441195

> $10k gets you a Mac Studio with 512GB of RAM

Because Apple has not adjusted their pricing yet for the new ram pricing reality. The moment they do, its not going to be a $10k system anymore but in the $15k+...

The amount of wafers going to AI is insane and will influence not just memory prices. Do not forget, the only reason why Apple is currently immunity to this, is because they tend to make long term contracts but the moment those expire ... then will push the costs down consumers.

tonyhart7 · 2025-12-22T22:20:52 1766442052

generous of you to predict apple only make it 50% expensive

reissbaker · 2025-12-22T20:53:44 1766436824

No, it's not Harmony; Z.ai has their own format, which they modified slightly for this release (by removing the required newlines from their previous format). You can see their tool call parsing code here: https://github.com/sgl-project/sglang/blob/34013d9d5a591e3c0...

embedding-shape · 2025-12-23T09:07:46 1766480866

Man, really? Why, just why? If it's similar, why not just the same? It's like they're purposefully adding more work for the ecosystem to support their special model instead of just trying to add more value to the ecosystem.

reissbaker · 2025-12-23T17:21:50 1766510510

The parser is a small part of running an LLM, and Zai's format is superior to Harmony: it avoids having the model escape JSON in most cases by using XML, so e.g. long code edits are more in-domain compared to pretraining data (where code is typically not nested in JSON and isn't JSON-escaped). FWIW almost everyone has their own format.

Also, Harmony is a mess. The common API specs adopted by the open-source community don't have developer roles, so including one is just bloat for the Responses API no one outside of OpenAI adopted. And why are there two types of hidden CoT reasoning? Harmony tool definition syntax invents a novel programming language that the model has never seen in training, so you need even more post-training to get it to work (Zai just uses JSON Schema). Etc etc. It's just bad.

Re: removing newlines from their old format, it's slightly annoying, but it does give a slight speed boost, since it removes one token per call and one token per argument. Not a huge difference, but not nothing, especially with parallel tool calls.

embedding-shape · 2025-12-23T18:50:37 1766515837

Sometimes worse is better, I don't really care what the specific format is, just that providers/model releasers would use more of the same, because compatibility sucks when everyone has their very own format. Conveniently for them, it gets harder to compare models when everyone has different formats too.

rz2k · 2025-12-22T21:33:29 1766439209

In practice the 4bit MLX version runs at 20t/s for general chat. Do you consider that too slow for practical use?

What example tasks would you try?

embedding-shape · 2025-12-23T09:10:06 1766481006

Whenever reasoning/thinking is involved, 20t/s is way too slow for most non-async tasks, yeah.

Translation, classification, whatever. If the response is 300 tokens for the reasoning and 50 tokens for the final reply, you're sitting and waiting 17,5 seconds for processing one item. In practice, you're also forgetting about prefill, prompt processing, tokenization and such. Please do share all relevant numbers :)

__natty__ · 2025-12-22T21:08:01 1766437681

I can imagine someone from the past reading this comment and having a moment of doubt

reissbaker · 2025-12-22T20:51:08 1766436668

s/Sonnet 3.5/Sonnet 4.5

The model output also IMO look significantly more beautiful than GLM-4.6; no doubt in part helped by ample distillation data from the closed-source models. Still, not complaining, I'd much prefer a cheap and open-source model vs. a more-expensive closed-source one.

Tepix · 2025-12-23T00:13:55 1766448835

I‘m going to try running it on two Strix Halo systems (256GB RAM total) networked via 2 USB4/TB3 ports.

cmrdporcupine · 2025-12-23T02:19:32 1766456372

Curious to see how this works out for you. Let us know.

pixelpoet · 2025-12-23T03:01:22 1766458882

Also curious with two Strix Halo machines at the ready for exactly this kind of usage

Tepix · 2025-12-23T20:23:49 1766521429

Don't wait for me. Donato Capitella has done this and created videos on his youtube channel at https://www.youtube.com/@donatocapitella

cmrdporcupine · 2025-12-23T21:20:03 1766524803

That's GLM 4.6 tho, not 4.7?

Still, informative. And stupidly I'd seen this video before. It sounds like the TLDR is: not quite.

Tepix · 2025-12-26T11:01:21 1766746881

It will probably be very similar in terms of speed.

mft_ · 2025-12-22T22:37:24 1766443044

I’m never clear, for these models with only a proportion active (32B here) to what extentt this reduces the RAM a system needs, if at all?

l9o · 2025-12-22T22:41:55 1766443315

RAM requirements stay the same. You need all 358B parameters loaded in memory, as which experts activate depends on each token dynamically. The benefit is compute: only ~32B params participate per forward pass, so you get much faster tok/s than a dense 358B would give you.

atq2119 · 2025-12-23T11:14:44 1766488484

The benefit is also RAM bandwidth. That probably adds to the confusion, but it matters a lot for decode. But yes, RAM capacity requirements stay the same.

deepsquirrelnet · 2025-12-22T22:42:40 1766443360

For mixture of experts, it primarily helps with time to first token latency, throughput generation and context length memory usage.

You still have to have enough RAM/VRAM to load the full parameters, but it scales much better for memory consumed from input context than a dense model of comparable size.

aurohacker · 2025-12-23T00:08:27 1766448507

Great answers here, in that, for MoE, there's compute saving but no memory savings even tho the network is super-sparse. Turns out, there is a paper on the topic of predicting in advance the experts to be used in the next few layers, "Accelerating Mixture-of-Experts language model inference via plug-and-play lookahead gate on a single GPU". As to its efficacy, I'd love to know...

noahbp · 2025-12-22T22:50:40 1766443840

It doesn't reduce the amount of RAM you need at all. It does reduce the amount of VRAM/HBM you need, however, since having all parameters/experts in one pass loaded on your GPU substantially increases token processing and generation speed, even if you have to load different experts for the next pass.

Technically you don't even need to have enough RAM to load the entire model, as some inference engines allow you to offload some layers to disk. Though even with top of the line SSDs, this won't be ideal unless you can accept very low single-digit token generation rates.

lumost · 2025-12-23T14:12:04 1766499124

This model is much stronger than 3.5 sonnet, 3.5 sonnet scored 49% on swe-bench verified vs. 72% here. This model is about 4 points ahead of sonnet4, but behind sonnet 4.5 by 4 points.

If I were to guess, we will see a convergence on measurable/perceptible coding ability sometime early next year without substantially updated benchmarks.

andai · 2025-12-23T11:25:49 1766489149

>heavily optimized for coding agents

I tested the previous one GLM-4.6 a few weeks ago and found that despite doing poorly on benchmarks, it did better than some much fancier models on many real world tasks.

Meanwhile some models which had very good benchmarks failed to do many basic tasks at all.

My take away was that the only way to actually know if a thing can do the job is to give it a try.

DeathArrow · 2025-12-23T09:53:44 1766483624

I think you will be much better with a couple of RTX 5090,4090 or 3090. I think Macs will be too slow for inference.

sa-code · 2025-12-23T02:50:58 1766458258

This is true assuming there will be updates consistently. One of the advantages of the proprietary models is that the are updated often EKG and the cutoff date moves into the future

This is important because libraries change, introduce new functionality, deprecate methods and rename things all the time, e.g. Polars.

whimsicalism · 2025-12-22T23:20:45 1766445645

commentators here are oddly obsessed with local serving imo, it's essentially never practical. it is okay to have to rent a GPU, but open weights are definitely good and important.

nutjob2 · 2025-12-23T00:05:41 1766448341

It's not odd, people don't want to be dependent and restricted by vendors, especially if they're running a business based on the tool.

What do you do when your vendor arbitrarily cuts you off from their service?

nl · 2025-12-23T01:00:04 1766451604

You switch to one of the many, many other vendors serving the same open model?

Zetaphor · 2025-12-23T03:30:15 1766460615

There can be quality differences across vendors for the same model due to things like quantization or configuration differences in their backend. By running locally you ensure you have consistency in addition to availability and privacy

whimsicalism · 2025-12-23T00:26:49 1766449609

i am not saying the desire to be uncoupled from token vendors is unreasonable, but you can rent cloud GPUs and run these models there. running on your own hardware is what seems a little fantastical at least for a reasonable TPS

pixelpoet · 2025-12-23T03:05:03 1766459103

I don't understand what is going on with people willing to give up their computing sovereignty. You should be able to own and run your own computation, permissionlessly as much as your electricity bill and reasonable usage goes. If you can't do it today, you should aim for it tomorrow.

Stop giving infinite power to these rent-seeking ghouls! Be grateful that open models / open source and semi-affordable personal computing still exists, and support it.

Pertinent example: imagine if two Strix Halo machines (2x128 GB) can run this model locally over fast ethernet. Wouldn't that be cool, compared to trying to get 256 GB of Nvidia-based VRAM in the cloud / on a subscription / whatever terms Nv wants?

RickyLahey · 2025-12-23T10:42:28 1766486548

i don't understand what is going on with people not training their own models

jtrn · 2025-12-23T11:30:21 1766489421

I think you and I have a different definition of "obsessed." Would you label anyone interested in repairing their own car as obsessed with DIY?

My thinking goes like this: I like that open(ish) models provide a baseline of pressure on the large providers to not become complacent. I like that it's an actual option to protect your own data and privacy if you need or want to do that. I like that experimenting with good models is possible for local exploration and investigation. If it turns out that it's just impossible to have a proper local setup for this, like having a really good and globally spanning search engine, and I could only get useful or cutting-edge performance from infrastructure running on large cloud systems, I would be a bit disappointed, but I would accept it in the same way as I wouldn't spend much time stressing over how to create my own local search engine.

Tepix · 2025-12-23T20:25:21 1766521521

I find it odd to give a company access to my source code. Why would I do that? It's not like they should be trusted more than necessary.