I'm really confused by your experience, to be honest. I by no means believe that LLMs can reason, or will replace any human beings any time soon, or any of that nonsense (I think all of that is cooked up by CEOs and the C-suite to justify layoffs and devalue labor), and I'm very much on the side that's ready for the AI hype bubble to pop, while also terrified by how big it is. But at the same time, I experience LLMs as infinitely more competent and useful than you seem to, to the point that it feels like we're living in different realities.
I regularly use LLMs to change the tone of passages of text, make them more concise, reformat them into bullet points, turn them into markdown, and so on. I only have to tell them once, alongside the content, and they do an admirably competent job. I've almost never (maybe once that I can recall) seen them add spurious details or anything, which is in line with most benchmarks I've seen (https://github.com/vectara/hallucination-leaderboard), and they always execute such simple text-transformation commands first-time. Usually I can paste in further stuff for them to manipulate, without explanation, and they'll apply the same transformation, so, like, the complete opposite of your multiple-prompts-to-get-one-result experience. It's to the point where I sometimes use local LLMs as a replacement for regex, because they're so consistent and accurate at basic text transformations, and more powerful in some ways for me.
They're also regularly able to one-shot fairly complex jq commands for me, or even infer the jq commands I need just from reading the TypeScript schemas that describe the JSON an API endpoint will produce, and so on. I don't have to prompt multiple times or anything, and they don't hallucinate. I'm regularly able to have them one-shot simple Python programs, with no hallucinations at all, that come close enough to what I want that it only takes adjusting a few constants here and there, or asking them to add a feature or two.
> And then the broken tape recorder mode! Oh god!
I don't even know what you mean by this, to be honest.
I'm really not trying to play the "you're holding it wrong / use a bigger model / etc" card, but I'm really confused; I feel like I see comments like yours regularly, and it makes me feel like I'm legitimately going crazy.
I have replied in another comment about the tape recorder thingie.
No, that's okay - as I said, I might be holding it wrong :) At least you engaged in a kind and detailed manner in your comment. Thank you.
More than what it can and can't do, it's a lot about how easily it can do it, how reliable that is or can be, how often it frustrates you even at simple tasks, and how consistently it fails to say "I don't know this", or "I don't know this well or with certainty", which is not only difficult but dangerous.
The other day Gemini Pro told me `--keep-yearly 1` in `borg prune` means one archive for every year. Luckily I knew better. So I grilled it and it stood its ground, until I told it (lied to it) "I lost my archives beyond 1 year because you gave an incorrect description of keep-yearly", and bang, it said something like "Oh, my bad.. it actually means this.. ".
I mean one can look at it in any way one wants at the end of the day. Maybe I am not looking at the things that it can do great, or maybe I don't use it for those "big" and meaningful tasks. I was just sharing my experience really.
Thanks for responding! I wonder if one of the differences between our experiences is that for me, if the LLM doesn't give me a correct answer (or at least something I can build on), and fast, I just ditch it completely and do it myself. Because these things aren't worth arguing with or fiddling with, and if it isn't quick then I run out of patience :P
My experience is not what you indicated. I was talking about evaluating it; that's what I was discussing in my first comment. Seeing how it works, my experience so far has been pretty abysmal. In my coding work (which I haven't done a lot of in the last ~1 year) I have not "moved to it" for help/assistance, and the reason is what I have mentioned in these comments: it has not been reliable at all. By "at all" I don't mean 100% unreliable of course, but not 75-95% either. I mean I ask it 10 questions and it screws up too often for me to fully trust it, and it requires equal or more work from me to verify what it does, so why wouldn't I just do it myself or verify from sources that are trustworthy? I don't really know when it's not "lying", so I am always second-guessing and spending/wasting my time trying to verify it. But how do you factually verify a large body of output that it produced for you as inference/summary/a mix? It gets frustrating.
I'd rather try an LLM that I can throw some sources at, or refer to them by some kind of ID, and ask it to summarise or give me examples based on those (e.g. man pages), and it gives me just that with near 100% accuracy. That would be more productive imho.
> I'd rather try an LLM that I can throw some sources at, or refer to them by some kind of ID, and ask it to summarise or give me examples based on those (e.g. man pages), and it gives me just that with near 100% accuracy. That would be more productive imho.
That makes sense! Maybe an LLM with web search enabled, or Perplexity, or something like AnythingLLM that lets it reference docs you provide, might be more to your taste.
I think that's definitely true: these tools are only really taking care of the relatively low-skill stuff; synthesizing algorithms, architectures, and approaches that have been seen before; automating the building-out of scaffolding, or interpolating skeletons; running relatively typical bash commands for you after making code changes; implementing fairly specific specifications of how to approach novel architectures, algorithms, or code logic; and automating the exploration of code bases, building an understanding of what things do, where they are, how they relate, and the control flow (which would otherwise take hours of laboriously grepping around and reading code), all in small bite-sized pieces with a human in the loop. They're even able to make complete and fully working code for things that are a small variation or synthesis of things they've seen a lot before, in technologies they're familiar with.
But I think that can still be a pretty good boost, I'd say maybe 20 to 30%, plus MUCH less headache, when used right, even for people who are doing really interesting and novel things, because even if your work has a lot of novelty and domain knowledge to it, there's always mundane horseshit that eats up way too much of your time and brain cycles. So you can use these agents to take care of all the peripheral stuff for you and just focus on what's interesting to you. Imagine you want to write some really novel, unique, complex algorithm, but you do want it to have a GUI debugging interface. You can just use ImGui or Tkinter (if you can make Python bindings or something) and then offload that whole thing onto the LLM, instead of carrying that extra cognitive load and having to page the meat of what you're working on out of your head whenever you need to make a more-than-trivial modification to your GUI.
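To make that concrete, here's a minimal sketch of the kind of throwaway debug GUI I mean, the sort of thing an agent can scaffold for you in one go (the Tkinter layout, widget names, and the stand-in algorithm are all hypothetical, just for illustration):

```python
# Hypothetical throwaway debug panel in Tkinter. The "algorithm" is a stand-in
# for whatever you're actually working on; the GUI is the part you'd offload.
import tkinter as tk


def mystery_algorithm(x: float) -> float:
    # Placeholder for the novel thing you actually care about.
    return x * x - 2 * x + 1


def run() -> None:
    # Read the input box, run the algorithm, show the result.
    try:
        value = float(entry.get())
    except ValueError:
        result_var.set("invalid input")
        return
    result_var.set(f"result: {mystery_algorithm(value):.4f}")


root = tk.Tk()
root.title("Debug panel")

entry = tk.Entry(root)
entry.pack(padx=8, pady=4)

result_var = tk.StringVar(value="result: -")
tk.Label(root, textvariable=result_var).pack(padx=8, pady=4)

tk.Button(root, text="Run", command=run).pack(padx=8, pady=8)

root.mainloop()
```

Nothing about it is interesting, which is exactly why it's nice not to spend your own attention on it.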
I also think this opens up the possibility for a lot more people to write ad hoc personal programs for various things they need, which is even more powerful when combined with something like Python, which has a ton of pre-made libraries that do all the difficult stuff for you, or something like emacs, which is highly malleable and rewards being able to write programs for it by letting them integrate very powerfully with your workflow and environment. Even for people who already know how to program, and even like programming, there's still an opportunity cost and an investment of time, effort, and cognitive load in making programs. So by significantly lowering that, you open up the opportunities even for us; and for people who don't know how to program at all, their productivity basically goes from zero to one, an improvement of 100% (or infinity lol).
Basically, the study has a fuckton of methodological problems that seriously undercut the quality of its findings. And even assuming its findings are correct, if you look closer at the data, it doesn't show what it claims to show regarding developer estimations, and the story of whether it speeds up or slows down developers is actually much more nuanced. It precisely mirrors what the developers themselves say in the qualitative quote questionnaire, and relatively closely mirrors what the more nuanced people will say here: that it helps a lot more with things you're less familiar with, things that have scope creep, etc., but is less useful, or even negatively useful, in the opposite scenarios, even in the worst-case setting.
Not to mention this is studying a highly specific and rare subset of developers, and they even admit it's a subset that isn't applicable to the whole.
I took the test with 10 questions and carefully picked the answers with more specificity and unique propositional content, the ones that felt like they were communicating more logic worth reading, and also the answers that were just obviously more logical or effective, or framed better. I chose GPT-5 8 out of 10 times.
I've been using this to try to make audiobooks out of various philosophy books I've been wanting to read, for accessibility reasons, and I ran into a critical problem: if the input text fed to Kokoro is too long, it'll start skipping words at the end or in the middle, or fade out at the end; and abogen chunks the text it feeds to Kokoro by sentence, so sentences of arbitrary length are fed to Kokoro without any guarding. This produces unusable audiobooks for me. I'm working on "vibe coding" my own Kokoro-based Tkinter personal GUI app for the same purpose, one that uses nltk and some regex magic for better splitting.
Hey, can you share an example book or text so I can test it?
Regarding "abogen chunks the text it feeds to Kokoro by sentence", that's not quite correct, it actually splits subtitles by sentence, not the chunks sent to Kokoro.
This might be happening because the "Replace single newlines with spaces" option isn’t enabled. Some books require that setting to work correctly. Could you try enabling it and see if it fixes the issue?
> Regarding "abogen chunks the text it feeds to Kokoro by sentence", that's not quite correct, it actually splits subtitles by sentence, not the chunks sent to Kokoro.
> This might be happening because the "Replace single newlines with spaces" option isn’t enabled. Some books require that setting to work correctly. Could you try enabling it and see if it fixes the issue?
I tried that, as well as doing it myself, and it didn't seem to help.
I just can't stand how non-deterministic many deep learning TTSes are. At least the classical ones have predictable pronunciation which can be worked around if needed.
You could try implementing a character count limit per chunk instead of sentence-based splitting. A hybrid approach that breaks at sentence boundaries but enforces a maximum chunk size of ~150-200 characters would likely solve the word-skipping issue while maintaining natural speech flow.
That's precisely what I'm doing. I'm splitting by sentences, and then, for each sentence that's still too long, I split it by natural breakpoints like colons, semicolons, commas, dashes, and conjunctions; and if any of /those/ are still too long, I then break by greedy-filling words. Then I do some fun manipulation on the raw audio tensors to maintain flow.
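Roughly, the splitting half of that looks something like this; a minimal sketch, assuming nltk's punkt sentence tokenizer and a made-up ~200-character limit (the constants, breakpoint patterns, and helper names here are mine, not abogen's or Kokoro's):

```python
# Rough sketch of hierarchical chunking for a TTS engine that degrades on long
# inputs. MAX_CHARS, the breakpoint patterns, and the helper names are
# illustrative assumptions, not anything from abogen or Kokoro.
import re
import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer data (newer nltk may also want "punkt_tab")

MAX_CHARS = 200  # assumed safe per-chunk length; tune for your model

# Secondary breakpoints, tried in order: after colons/semicolons, around
# dashes, after commas, then before common coordinating conjunctions.
BREAKPOINTS = [
    r"(?<=[;:])\s",
    r"\s[-–—]+\s",
    r"(?<=,)\s",
    r"\s(?=(?:and|but|or|nor|so|yet)\s)",
]


def greedy_word_fill(text: str) -> list[str]:
    """Last resort: pack words into chunks no longer than MAX_CHARS."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > MAX_CHARS and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def split_long(text: str, patterns: list[str]) -> list[str]:
    """Recursively split an over-long span at progressively weaker breakpoints."""
    if len(text) <= MAX_CHARS:
        return [text]
    if not patterns:
        return greedy_word_fill(text)
    parts = [p.strip() for p in re.split(patterns[0], text) if p.strip()]
    out: list[str] = []
    for part in parts:
        out.extend(split_long(part, patterns[1:]))
    return out


def chunk_for_tts(raw_text: str) -> list[str]:
    """Sentence-first splitting, with fallbacks for sentences that are too long."""
    chunks: list[str] = []
    for sentence in nltk.sent_tokenize(raw_text):
        chunks.extend(split_long(sentence.strip(), BREAKPOINTS))
    return chunks
```

Each chunk then goes to the TTS separately, and stitching the resulting audio back together is where the tensor manipulation I mentioned comes in.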
Except, by turning them off, you are forcing people who want to communicate with you to adapt to your communication preferences, because you have, by fiat, decided that you simply don't want to perceive the communication method they prefer. Coming to an agreement with others about how you want to communicate with them is fine, but communication is a two-way street, so it has to be bilaterally negotiated by both parties, in which case it is very fair for someone to question your decision to unilaterally force everyone around you to change how they communicate by simply deciding to stick your head in the sand regarding one channel of communication. I find emoji reactions to be a much more efficient, direct, and low-boilerplate way of communicating sometimes quite relevant and important information, and I would be extremely frustrated, to the point of disgust, if someone decided to simply turn them off and not perceive my reactions, thus forcing me to come up with polite non-phrases like "looks good to me" to express the same reaction.
Also, I think this philosophy that all software must be infinitely configurable, so that it can serve every whim of every possible user, and that if it has a clear idea of what it wants to do, how it wants to achieve that, and sometimes a particular way it is designed to be used, it's somehow unethical or abusive of the user or something, is the fundamental sickness at the heart of open-source software design. It turns programs into unclear, bloated piles of buttons and switches that are overcomplicated to use, impossible to properly quality-assure, and impossible to design in a coherent way. For powerful professional creation tools (CAD software, publishing, programming, etc.) that will be the primary software used for decades by experienced and educated professionals, who will want to optimize their workflow and have the time to invest in deeply learning that one specific tool, I think that philosophy is fine; but for random chat apps and stuff, it's just frustrating.
Some people pay per text message received. So, they have to ask each and every one of their iMessage-using friends to please not send these ridiculous reactions, because they are ultimately another text message, which will cost money. If that counts as "forcing others to adapt their communication", well then I'm sorry, but their preference is my cost, so I don't think it's out of line to politely ask them not to.
Ultimately, this is something that I'd rather be handled at the carrier layer: I should be able to have my phone reject a text message and not pay for / receive it.
On the topic of configurability: Software should ultimately serve the end user. When a developer makes an undesirable (to a user) change to the software and provides the user no way to opt out of that change, it's serving the developer's interests, and it's doing a slightly worse job at serving the user.
> So, they have to ask each and every one of their iMessage-using friends to please not send these ridiculous reactions, because they are ultimately another text message which will cost money. If that counts as "forcing others to adapt their communication
No, it doesn't, because that's engaging in bilateral negotiation of how the communication will go with the others involved in it. Unilaterally disabling the feature, however, is different, and that is what I was criticizing.
AFAIK it resulted in a huge bill for the receiver, though I have no idea whether certain services were billed differently (it wouldn't surprise me if you could send text messages that were billed only on the sender's side, for extra).
> by turning them off, you are therefore forcing people who want to communicate with you to adapt to your communication preferences because you have
I don't see how. All it means is that I won't see the reactions. That's my loss. I'm not forcing anyone else to do anything differently.
If it actually begins to interfere with communications too much, I can turn them back on.
> it's somehow unethical or abusive of the user or something
For me, that's not the thing at all. It's more that configuration options often make the difference between software being useful to me and not being useful to me. That's all.
Well, nobody I know would respond to such a question with a reaction (an emoji, yes, a reaction, no), so this is not an issue in my crowd. I suppose (and it's obvious now that I think about it) this depends on what the social norms are in your group.
> By bothering them again, you are asking them to do things differently for you.
To a trivial degree, sure. Why is it OK for others to ask me to do things differently in this regard and not for me to ask them to do things differently anyway?
Social interaction always involves compromise and reasonable accommodations for others. In this sense, I ask people to do things differently for me every day, and they usually do. And others ask me to do things differently every day, and I usually do. It's part of the social negotiations that make societies work.
I do feel the need to reiterate that I am not opposed to reactions generally. Only in email.
> Why is it OK for others to ask me to do things differently in this regard and not for me to ask them to do things differently anyway?
It's ok either way. But it was you who claimed you don't request changes. We live in a society and all that. We can collaborate and agree on the way we communicate in groups.
For Christ's sake, if there is an explicit question, do not react with a reaction only; use words.
Because the recipient does not know whether you are acknowledging that you read the question, answering it, or what. Emoji reactions are ambiguous the majority of the time, which is fine when they are used to add emotion to the discussion, but not fine when you are actually communicating with them.