More

randomifcpfan · 2025-12-01T01:44:01 1764553441

Current frontier agents can one shot solve all 2024 AoC puzzles, just by pasting in the puzzle description and the input data.

From watching them work, they read the spec, write the code, run it on the examples, refine the code until it passes, and so on.

But we can’t tell whether the puzzle solutions are in the training data.

I’m looking forward to seeing how well current agents perform on 2025’s puzzles.

suddenlybananas · 2025-12-01T08:53:54 1764579234

They obviously have the puzzles in the training data, why are you acting like this is uncertain?

randomifcpfan · 2025-10-31T01:28:27 1761874107

https://en.wikipedia.org/wiki/Goobuntu

In 2018, Google replaced Goobuntu with gLinux, a Linux distribution based on Debian Testing

https://en.wikipedia.org/wiki/GLinux

randomifcpfan · 2025-08-04T04:19:49 1754281189

Here’s a study that found that for small problems Gemini is almost equally good at Python and Rust. Looking at the scores of all the languages tested, it seems that the popularity of the language is the most important factor:

https://jackpal.github.io/2025/03/29/Gemini_2.5_Pro_Advent_o...

whytevuhuni · 2025-08-04T06:21:38 1754288498

But isn't it the case that Python is vastly more popular than Rust?

If Gemini is equally good at them in spite of that, doesn't that mean it'd be better at Rust than at Python if it had equal training in both?

randomifcpfan · 2025-08-04T11:31:27 1754307087

The study points out, “Python and Rust are the two most popular languages used by Advent of Code participants. This may explain why Rust fares so well.”

whytevuhuni · 2025-08-04T15:35:58 1754321758

Ah, that makes a lot more sense. Thanks!

randomifcpfan · 2025-01-28T14:10:07 1738073407

In my application, code generation, the distilled DeepSeek models (7B to 70B) perform poorly. They imitate the reasoning of the r1 model, but their conclusions are not correct.

The real r1 model is great, better than o1, but the distilled models are not even as good as the base models that they were distilled from.

randomifcpfan · 2025-01-26T04:45:57 1737866757

The DeepSeek R1 paper explains how they trained their model in enough detail that people can replicate the process. Many people around the world are doing so, using various sizes of models and training data. Expect to see many posts like this over the next three months. The attempts that use small models will get done first. The larger models take much longer.

Small r1 style models are pretty limited, so this is interesting primarily from an “I reproduced the results” point of view, not a “here is a new model that’s useful” pov.

rahimnathwani · 2025-01-26T05:18:00 1737868680

From the Deepseek R1 paper:

  For distilled models, we apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance. Our primary goal here is to demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL stage to the broader research community.

The impression I got from the paper, although I don't think it was explicitly stated, is that they think distillation will work better than training the smaller models using RL (as OP did).

nielsole · 2025-01-26T05:42:52 1737870172

> We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models

I found this statement from the paper to be at odds with what you cited, but I guess they mean SFT+RL would be better than either just SFT and RL

rahimnathwani · 2025-01-26T06:10:03 1737871803

I think they're saying that some reasoning patterns which large models can learn using only RL (i.e. without the patterns existing in the training data), can't be learned by smaller models in the same way. They have to be 'taught' through examples provided during SFT.

randomifcpfan · 2024-12-28T20:12:59 1735416779

College degrees from reputable colleges used to serve this purpose, but grade inflation has greatly weakened this signal.

programjames · 2024-12-28T20:26:10 1735417570

You also want colleges to signal to their applicants, not force them to also signal for their alumni. The two will naturally be correlated, but you can do better by specializing.

randomifcpfan · on Sept 29, 2024

“You should consider using this in your requirements” implies that this is not a hard rule, it’s just an ignorable suggestion. It would be interesting to audit gov.uk web pages over time to see whether this advice is being followed.

philipwhiuk · on Sept 29, 2024

Having

> You should consider

Is Gov UK's way of allowing people internally to point to it and say 'Well, did you consider it?'.

UK Digital don't have any direct power to force change - they have to use sensible advice and internal process to encourage better design.

etothepii · on Sept 29, 2024

Don't forget the rules of British English that make it very clear that the grammatical construction: "you should consider" means "you must in all circumstances save for the immediate alternate outcome being a genocide."

randomifcpfan · on Sept 29, 2024

Thanks for explaining! That’s quite different from the US English (and RFC English) meaning of “should”.

etothepii · on Sept 29, 2024

This translation guide is usually helpful.

https://polish2english.files.wordpress.com/2011/11/55551980-...

I'm afraid I don't know what RFC English is and neither does Google.

valley_guy_12 · on Sept 29, 2024

Huh, for me the very first non-ad result for googleing RFC English is https://en.wikipedia.org/wiki/Request_for_Comments, which is the correct citation.

etothepii · on Sept 30, 2024

For me it was to do with Rugby Football.

layer8 · on Sept 29, 2024

To be fair, here is the RFC meaning:

   SHOULD   This word, or the adjective "RECOMMENDED", mean that there
   may exist valid reasons in particular circumstances to ignore a
   particular item, but the full implications must be understood and
   carefully weighed before choosing a different course.

It means you can’t simply ignore it, and instead have to have compelling reasons to justify any deviation.

valley_guy_12 · on Sept 29, 2024

Unfortunately, in many organizations, "the library we use doesn't follow this recommendation" is a valid compelling reason. Which means that in practice "SHOULD" effectively means "WOULD BE NICE IF".

randomifcpfan · on Sept 8, 2024

I remember seeing the PERQ at trade shows. The best thing about the PERQ was its monitor, which was unusually sharp for that era. It used a yellow-white long persistence phosphor. A CMU grad student friend told me that the monitor designer was “a close personal friend of the electron”, implying that the analog circuitry of the PERQ monitor was especially high quality.

randomifcpfan · on Aug 7, 2024

They identify 6 bugs/mistakes, of which, not doing staged releases, was the final mistake.

They stop short of identifying the real root issues of running at kernel level, and of not auto-backing-out updates that cause crashes, perhaps because those causes are harder to fix.

randomifcpfan · on June 16, 2024

A tool for updating bazel build target dependencies. It inspects build files and source code, then adds/removes dependencies from build targets as needed. It requires using global include paths in C/C++ sources. It is not perfect, but it is pretty nice!

rsc · on June 16, 2024

If you're using Go with Bazel, gazelle is available outside Google: https://github.com/bazelbuild/bazel-gazelle

Enabling tools like these was exactly the point of the enforced formatting. It worked extremely well.

dijit · on June 17, 2024

I should add that it seems Gazelle is being expanded to other programming languages other than Go.

For example: https://github.com/Calsign/gazelle_rust