> Give each call different tools. Make sub-agents talk to each other, summarize each other, collate and aggregate. Build tree structures out of them. Feed them back through the LLM to summarize them as a form of on-the-fly compression, whatever you like.
You propose increasing the complexity of interactions between these tools, and giving them access to external tools that have real-world impact? As a security researcher, I'm not sure how you can suggest that with a straight face, unless your goal is to have more vulnerable systems.
Most people can't manage to build robust and secure software using SOTA hosted "agents". Building their own may be a fun learning experience, but relying on a Rube Goldberg assembly of disparate "agents" communicating with each other and external tools is a recipe for disaster. Any token could trigger a cascade of hallucinations, wild tangents, ignored prompts, poisoned contexts, and similar issues that have plagued this tech since the beginning. Except that now you've wired them up to external tools, so maybe the system chooses to wipe your home directory for whatever reason.
People nonchalantly trusting nondeterministic tech with ever more real-world tasks should concern everyone. Today it's executing `ping` and `rm`; tomorrow it's managing nuclear launch systems.
(Mine was intended as irony, suggesting that the circle of development ideas would eventually close. I read the previous comments as satirically pointing out that the notion of "UNIX-like tools" owes its existence to the fact that there is actually such a thing as UNIX.)
Isn't the alternative far more likely? These tools were trained on the way people write in certain settings, which includes a lot of curated technical articles like this one, and we're seeing that echoed in their output.
There's no "LLM style". There's "human style mimicked by LLMs". If they default to a specific style, then that's on the human user who chooses to go with it, or, likely, doesn't care. They could just as well make it output text in the style of Shakespeare or a pirate, eschew emojis and bulleted lists, etc.
There is a "default LLM style", which is why I call it that. Or technically, one per LLM, but they seem to have converged pretty hard, since they're all evolving in the same environment.
It's trivial to prompt it out of that style. Word of how to do it, and that you should, has gotten around in academia, where the incentive not to get caught is high. So I don't call it "the LLM style". But if you don't prompt for anything in particular, yes, there is a very, very strong "default LLM style".
XSLT as a feature is being removed from web browsers, which is pretty significant. Sure, it can still be used in standalone tools and libraries, but having it in web browsers enabled a lot of functionality people have relied on for over two decades.
> hardwiring into the browser an implementation that's known to be insecure and is basically unmaintained is what's going away
So why not switch to a better maintained and more secure implementation? Firefox uses TransforMiix, which I haven't seen mentioned in any of Google's posts on the topic. I can't comment on whether it's an improvement, but it's certainly an option.
> The people doing the wailing/rending/gnashing about the removal of libxslt needed to step up to fix and maintain it.
Really? How about a trillion-dollar corporation steps up to sponsor the lone maintainer who has been doing a thankless job for decades? Or directly takes over maintenance?
They certainly have enough resources to maintain a core web library and fix all the security issues if they wanted to. The fact that they're deciding to remove the feature instead is a sign that they simply don't want to.
And I don't buy the excuse that XSLT is a niche feature. Their HTML bastardization AMP probably has even fewer users, and they're happily maintaining that abomination.
> It seems like something an extension ought to be capable of
I seriously doubt an extension implemented with the restricted MV3 API could do everything XSLT was used for.
> and if not, fix the extension API so it can.
Who? Try proposing a new extension API to a platform controlled by mega-corporations, and see how that goes.
Thanks for sharing! Your article illustrates well the benefits of this approach.
One drawback I see is that property-based tests inevitably need to be much more complex than example-based ones. This means bugs in the tests themselves are more likely, the tests are harder to maintain, and so on. You do mention that it's a lot of code, but I wonder if the complexity is worth it in the long run. I suppose that since testing these scenarios any other way would be even more tedious and error-prone, the answer is "yes". But it's something to keep in mind.
> One drawback I see is that property-based tests inevitably need to be much more complex than example-based ones.
I don't think that's true; I just think the complexity is more explicit (in code) rather than implicit (in the process of coming up with examples). Example-based testing usually involves defining conditions and properties to be tested, then constructing sets of examples to test them, examples which attempt to anticipate edge cases from the description of the requirements (black box) or from knowledge of how the code is implemented (white box).
Property-based testing involves defining the conditions and properties, writing code that generates the conditions, and, for each property, writing a bit of code that can refute it by passing if and only if it is true of the subject under test for a particular set of inputs.
With a library like Hypothesis, which has both good generators for basic types and good abstractions for combining and adapting generators, the latter seems to be less complex overall, as well as moving the complexity into a form where it is explicit and easy to maintain and adapt. Adapting example-based tests to requirements changes, by contrast, involves either throwing out examples and starting over, or revalidating and updating examples individually.
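To make that concrete, here's a small sketch of what combining and adapting Hypothesis generators looks like (the record shape is made up for illustration):

    from hypothesis import strategies as st

    # Adapt a basic generator: map non-negative integers to even values.
    even_ints = st.integers(min_value=0, max_value=100).map(lambda n: 2 * n)

    # Combine generators: build records from simpler pieces.
    users = st.builds(
        dict,
        name=st.text(min_size=1),
        age=st.integers(min_value=0, max_value=120),
        scores=st.lists(even_ints, max_size=10),
    )

When the requirements change, you adjust the generator in one place rather than reworking every hand-written example.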
> Property-based testing involves defining the conditions and properties, writing code that generates the conditions, and, for each property, writing a bit of code that can refute it by passing if and only if it is true of the subject under test for a particular set of inputs.
You're downplaying the amount of code required to properly set up a property-based test. In the linked article, the author implemented a state machine to accurately model the SUT. While this is not the most complex of systems, it is far from trivial, and certainly not a "bit of code". In my many years of example-based unit/integration/E2E testing, I've never had to implement something like that. The author admits that the team was reluctant to adopt PBT partly because of the amount of code.
This isn't to say that example-based tests are simple. There can be a lot of setup, mocking, stubbing, and helper code to support the test, but this is usually a smell that something is not right, whereas with PBT such scaffolding seems inevitable in some situations.
But then again, I can see how such tests can be invaluable for scenarios that would be very difficult, and likely more complex, to cover otherwise. So, as with many things, it's a tradeoff. I think PBT doesn't replace EBT, nor vice versa; they complement each other.
You're right, it's always a trade-off. One unexpected but very welcome side effect of having those stateful property tests is that we could use them to design high-fidelity stubs. I wrote a follow-up blog post about it: https://blog.tiserbox.com/posts/2024-07-08-make-good-stubs-w...
My experience is that the hard part of PBT is devising the generators, not the testing itself.
Since it came up in another thread (yes, it's trivial), a function `add` is no easier or harder to test with examples than with PBT. Here are some of the tests in both PBT style and example-based style:
    from hypothesis import given, strategies as st
    import pytest

    # Assumes a function `add(a, b)` under test.
    # Property-based: Hypothesis generates the integers.
    @given(st.integers())
    def test_left_identity_pbt(a):
        assert add(a, 0) == a

    # Example-based equivalent.
    def test_left_identity():
        assert add(10, 0) == 10

    @given(st.integers(), st.integers())
    def test_commutative_pbt(a, b):
        assert add(a, b) == add(b, a)

    # `examples` is a hand-curated list of (a, b) pairs.
    @pytest.mark.parametrize("a,b", examples)
    def test_commutative(a, b):
        assert add(a, b) == add(b, a)
They're the same test, but one is more comprehensive than the other. And you can use them together. Supposing you do find an error, you add it to your example-based tests to build out your regression test suite. This is how I try to get people into PBT in the first place: just take your existing example-based tests and build a generator. If they start failing, that means your examples weren't sufficiently comprehensive (not surprising). Because PBT systems like Hypothesis run so many tests, though, you may need to either restrict the number of generated examples for performance reasons or break up complex tests into a set of smaller but faster-running tests to get the benefit.
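For the restriction part, Hypothesis exposes a `settings` decorator; a minimal sketch, reusing the `add` example from above:

    from hypothesis import given, settings, strategies as st

    # Cap the number of generated cases when each one is expensive to run.
    @settings(max_examples=50)
    @given(st.integers(), st.integers())
    def test_commutative_capped(a, b):
        assert add(a, b) == add(b, a)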
Other things become much simpler too, or at least simpler to test comprehensively, like stateful and end-to-end tests (assuming you have a way to programmatically control your system). In one real-world case, I used Hypothesis to drive an application by sending a series of commands/queries and seeing how it behaved. There are so many possible sequences that manually developing a useful set of end-to-end tests is non-trivial. With Hypothesis, however, it just generated sequences of interactions for me and found errors in the system. After each command (which may or may not change the application state), it issued queries in the invariant checks and verified the results against the model. As with example-based testing, these can be turned into hard-coded examples in your regression test suite.
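To give a flavor of what that looks like, here's a minimal stateful sketch with Hypothesis; the `KeyValueStore` is a toy stand-in I made up, not the application I actually tested:

    from hypothesis import strategies as st
    from hypothesis.stateful import RuleBasedStateMachine, invariant, rule

    class KeyValueStore:
        """Toy stand-in for the real system under test."""
        def __init__(self):
            self._data = {}
        def put(self, key, value):
            self._data[key] = value
        def get(self, key):
            return self._data.get(key)

    class StoreMachine(RuleBasedStateMachine):
        def __init__(self):
            super().__init__()
            self.sut = KeyValueStore()  # the system being driven
            self.model = {}             # simple reference model

        @rule(key=st.text(), value=st.integers())
        def put(self, key, value):
            # Each generated command is applied to both SUT and model.
            self.sut.put(key, value)
            self.model[key] = value

        @invariant()
        def sut_matches_model(self):
            # After every step, query the SUT and compare against the model.
            for key, value in self.model.items():
                assert self.sut.get(key) == value

    # Hypothesis generates (and shrinks) whole command sequences.
    TestStore = StoreMachine.TestCase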
For sure, the hardest part is creating meaningful generators for the problem at hand that can test interesting cases in a finite amount of time. That's where the combinatorial explosion takes place, in my experience.
I wanted to highlight one unexpected but very welcome side effect of having those stateful property tests: we could use them to design high-fidelity stubs. I wrote a follow-up blog post about it: https://blog.tiserbox.com/posts/2024-07-08-make-good-stubs-w...
> Since it came up in another thread (yes, it's trivial), a function `add` is no easier or harder to test with examples than with PBT
Come on, that example is practically useless for comparing both approaches.
Take a look at the article linked above. The amount of non-trivial code required to set up a PBT should raise an eyebrow, at the very least.
It's quite possible that the value of such a test outweighs the complexity overhead, and that implementing all the test variations with EBT would be infeasible, but choosing one strategy over the other should be a conscious decision made by the team.
So as much as you're painting PBT in a positive light, I don't see it that clearly. I think that PBT covers certain scenarios better than EBT, while EBT can be sufficient for a wide variety of tests, and be simpler overall.
But again, I haven't actually written PBTs myself. I'm just going by the docs and articles mentioned here.
> Come on, that example is practically useless for comparing both approaches.
Come on, I admitted it was trivial. It was a quick example that fit into a comment block. Did you expect a dissertation?
> that implementing all the test variations with EBT would be infeasible
That's kind of the point of my previous comment. PBTs will generate many more examples than you would create by hand. If you have EBTs already, you're one step away from PBTs (the generators; and to preempt another annoying "Come on", I never said those were trivial). And then you'll have more comprehensive testing than you would have had sticking to just your carefully handcrafted examples. This isn't the end of property-based testing, but it's a really good start, and the easiest way to bring it into an existing project because you can mostly reuse the existing tests.
Extending this, once you get used to it, to stateful testing (which many PBT libraries support, including Hypothesis), you can generate a lot of very useful end-to-end tests that would be even harder to come up with by hand. And again, if you have any example-based end-to-end or integration tests, once you come up with generators you can start converting them into property-based tests.
> but choosing one strategy over the other should be a conscious decision made by the team.
Ok. What prompted this? I never said otherwise. It's also not an either/or situation, which you seem to want to make it. As I wrote in that previous comment, you can use both and use the property-based tests to bolster the example-based tests, and convert counterexamples into more example-based tests for your regression suite.
Have you tried mise[1]? The last thing you probably want is to add another abstraction on top of this mess, but I've had good experiences with it, and it manages Rust, Go, Python, etc. environments very well.
IME getting any modern toolchain set up on different distros can be problematic, especially if you mix in the often-outdated distro packages. So using isolated environments with a tool built specifically for that works better.
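For example, a per-project `.mise.toml` pins the toolchains (the versions here are just illustrative):

    [tools]
    rust = "latest"
    go = "1.22"
    python = "3.12"

Then `mise install` in the project directory fetches everything, isolated from whatever the distro ships.

[1] https://mise.jdx.dev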
I currently use `bluetoothctl` with a wrapper script and `expect`, so that I can quickly fire off `bluetooth.sh <device name substring>`, and it handles the tedium of ensuring that the connection is established regardless of my audio settings. I still use `bluetoothctl` for manually scanning and pairing, but once a device is paired, I don't run it directly. So it would be great if I could solve both things with the same tool.
I would only really need an interactive TUI for scanning and pairing. Maybe not even for that. E.g. if I run `bt pair <some device name>`, the tool could scan available devices and try to pair with one that fuzzily matches the provided name. And `scan` could work similarly: e.g. `bt scan --duration 10s` could list the devices found within that window.
I'm not a big fan of interactive UIs if the same can be accomplished non-interactively. This allows the tool to be scripted, and can be much quicker to do what you want.
Does such a tool exist for Bluetooth? I'm tempted to whip something up myself, though I really have enough side projects as it is...
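In case anyone wants a starting point, here's a rough sketch of the non-interactive idea in Python; the `bt`-style CLI is hypothetical, it just shells out to `bluetoothctl`, whose `devices` output has the form `Device <MAC> <Name>`:

    #!/usr/bin/env python3
    """Sketch of `bt <name substring>`: fuzzy-match a known device and connect."""
    import subprocess
    import sys

    def known_devices():
        # `bluetoothctl devices` prints lines like "Device AA:BB:CC:DD:EE:FF Name".
        out = subprocess.run(["bluetoothctl", "devices"],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            parts = line.split(maxsplit=2)
            if len(parts) == 3 and parts[0] == "Device":
                yield parts[1], parts[2]  # (MAC address, display name)

    def main():
        query = sys.argv[1].lower()
        for mac, name in known_devices():
            if query in name.lower():
                subprocess.run(["bluetoothctl", "connect", mac], check=True)
                return
        sys.exit(f"No device matching {query!r}")

    if __name__ == "__main__":
        main()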
> And then my next windmill that I'm looking at is variable-sized text in the terminal. So when I'm catting a markdown file, I want to see the headings big.
Is this something people actually want?
One of the reasons I enjoy using the terminal is because the text is of a fixed size and monospaced. Even colors and bold text can be distracting at times. I certainly don't want my terminal to render Markdown...
I imagine the feature could be disabled, but still. I'm all for improving terminal UIs, but let's not turn them into a web browser.