Just to clarify: there are only 365 possible birthdays (Jan 1 through Dec 31), but a 768-dimension continuous vector means there are 768 numbers, each of which can take values from -1 to 1 (at whatever precision floating point can represent). One float has about 2B representable values between -1 and 1 iirc, so 2B ^ 768 is a lot more than 365.
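If you want to sanity-check the "2B" figure: the bit patterns of non-negative IEEE 754 floats sort in the same order as their values, so you can count them directly. A minimal sketch, assuming float32 and counting ±0 once:

```python
import numpy as np

# Non-negative float32 bit patterns are ordered like integers, so the
# int32 view of 1.0 equals the count of floats in [0.0, 1.0).
floats_in_unit = int(np.float32(1.0).view(np.int32)) + 1  # +1 includes 1.0
total = 2 * floats_in_unit - 1  # mirror for negatives; count 0.0 once
print(f"float32 values in [-1, 1]: {total:,}")  # ~2.13 billion
```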
The TL;DW: in the '90s, Clippy was a symbol of a friendly product feature that wanted to help you do one thing (but you could opt out). Clippy wasn't stealing your data, serving you ads, or doing anything malicious; it just wanted to help you do one specific thing (e.g. write a letter). The Clippy movement is about sending a message to big tech that we don't appreciate ads in our start menu, having our data scraped and sold, being forced into dark patterns, having AI try to take jobs and/or destroy industries, blatant theft of work, etc. Basically: "make computers friendly again."
Yeah, Clippy was one of the early examples of infantilization and annoying anthropomorphization in software, no better than the "cutesy" error messages or engagement popups that plague us today. It should be an example of what not to do, but I guess nostalgia is a powerful drug.
HN equivalent: someone sees a link to an article and asks, "Why would I read it, when all relevant information has already been incorporated into the comments?" It's the "efficient comments" hypothesis: all information about the article relevant to a rational HN user is already in the comments.
Honestly, I feel my skills atrophying when I rely on AI too much, and many people I interact with are much weaker still (trying to vibe code without ever learning). To take your analogy further: having a single-speed bike lets you go further faster and doesn't have a big impact on your "skills" (physical ones, in this case), but deferring all transport to cars, and then to an electric scooter so you never have to walk, definitely will cause your endurance and physical ability to walk to disappear. We are creatures that require constant use and exercise of our capabilities, or the system crumbles. Especially for high-skill activities (language, piano, video games, programming), proficiency can wane extremely quickly without constant practice.
The only problem with this utopian view of the world is the game theory. My company, which doesn't hire juniors, gets to benefit from others training them up for free. As a result, companies will opt not to train juniors (since that is individually optimal), which means juniors don't get trained at all. Classic prisoner's dilemma / free-rider situation; a toy payoff sketch is below.
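A minimal sketch of that payoff structure, with made-up numbers purely for illustration (training a junior costs the trainer 4; every company gains 3 per company that trains, since a trained labor pool benefits everyone):

```python
TRAIN_COST, POOL_BENEFIT = 4, 3  # hypothetical numbers chosen to make the dilemma visible

def payoff(i_train: bool, they_train: bool) -> int:
    trained = int(i_train) + int(they_train)
    return trained * POOL_BENEFIT - (TRAIN_COST if i_train else 0)

for me in (True, False):
    for them in (True, False):
        print(f"I train={me!s:<5} they train={them!s:<5} my payoff={payoff(me, them)}")

# Whatever the other company does, my payoff is higher if I don't train
# (3 > 2 if they train, 0 > -1 if they don't), so "don't train" dominates --
# even though both training (2 each) beats neither training (0 each).
```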
It's different because the AI model will then automate the use of that knowledge, which for most people on this forum is how they make their livelihood. If OpenAI were making robots to replace plumbers, I wouldn't be surprised if plumbers said, "We should really stop giving free advice and training to these robots." It's in the workers' best interest to avoid getting undercut by an automated system that can only be built with the workers' free labor. And it's in the interest of the company to take as much free labor output (e.g. knowledge) as possible to automate a process so they can profit.
I have received free advice from actual plumbers (and mechanics and others, for that matter) that reduced my future need for their services.
> we should really stop giving free advice and training to these robots
People routinely give free advice and teach students, friends, potential competitors, and actual competitors on this same forum. Robots? Many also advocate for immigration and outsourcing, presumably because they make the calculus that it is net beneficial in some scenarios. People on this forum contribute to an entire ecosystem of free software, on top of which two kids can (and have) built $100 billion companies that use all such technology freely and without cost. Let's ban it all?
Sure, I totally get it if you want to make an individual choice for yourself to keep a secret sauce, not share your code, or put stuff behind a paywall. That is not the tone or the message here. There is a deep animosity here, advocating that everyone shut off their pipes to AI as if it were some malevolent thing, similar to how Ted Kaczynski saw technology at large.
Which ones in particular? Is your belief that all companies are inherently malevolent? If not, why don't you start one that is not? What's stopping you?
> Is your belief that all companies are inherently malevolent? If not, why don't you start one that is not?
Because the one I start will be beaten by the one that is malevolent if it has a weapon as powerful as AI. All these arguments of "we shared stuff before, so what's the problem?" are missing the point. The point is that this is about the concentration of power. The old sharing was about the distribution of power.
In an economy where ideas have value, it seems logical that we should have property protection for them, much like we do for physical goods. It's easy to argue "ideas should be freely shared," but if an idea takes 20 years and $100M to develop, and there are no protections for ideas, then no one will take the time to develop it. Most modern technology we have is due to copyright/patents (drugs, electronics, entertainment, etc.), because without those protections, no one would have invested the time and energy to develop it in the first place.
I believe you are probably only looking at the current state of the world and seeing how it "stifles competition" or "hampers innovation." Those allegations are probably true to some extent, especially in specific cases, but it's also missing the fact that without those protections, the tech likely wouldn't have been created in the first place (and so you still wouldn't be able to freely use the idea, since the person who invented it wouldn't have).
This is a kind of strange example, since the discovery tends to come from government-funded research, with the safety shown by private money.
The USSR went to space without those protections. It's not like property protections are the only thing that has driven invention.
MIT licenses are also pretty popular, as are Creative Commons licenses.
People also do things that don't make a lot of money, like teaching elementary school. It costs a ton of money to build and run all those schools, yet no intellectual property is created that can be sold or rented out.
I don't believe that nobody would want to build much of what we have now if there weren't IP around it. Making and inventing things is fun.
> I don't believe that nobody would want to build much of what we have now if there weren't IP around it. Making and inventing things is fun.
People write fanfiction without being paid; Avatar 2, however, cost hundreds of millions to produce [1]. The studio didn't spend this money for the heck of it; they spent it hoping to recoup their investment.
If no one can make money off of intellectual property, people will continue writing fanfiction. But why would a studio spend hundreds of millions making a blockbuster movie?
> The studio didn't spend this money for the heck of it; they spent it hoping to recoup their investment.
I wonder if the world would be a better place if we had fewer financial incentives to do things, in general?
> But why would a studio spend hundreds of millions making a blockbuster movie?
Under this hypothetical scenario, I believe there wouldn't be a "studio" in the first place.
There could be a group of people who want to express themselves, get famous or do something just for fun, without any direct financial gain. Sure, they wouldn't be able to pull off Avatar 2, but our expectations as consumers would also be different.
I note that production, or developing an idea, is not the same as having the idea. You can't deliberately have an idea by spending money, or have a better idea by spending more money. You can employ people to look at a problem and expect some of them to have reasonably good ideas about it -- people who were selected because they already have good ideas in that general area. Then you say "these ideas cost this much money to come up with," as if you had made the ideas happen by decree. That's not what you did; those ideas were latent. What you did was get them organized.
The opposite idea is intrinsic motivation: that artists make art because they love it, and that they were going to make the art (or come up with the ideas) anyway, even if you didn't pay them. But artists also love having comfortable lifestyles, maybe families, maybe expensive studio equipment, maybe parties. And although you can't force them to care about your project, you can certainly bribe them into seeing if they are interested. So you can bring out the ideas that they were supposedly going to have anyway -- but might not have been able to have without funding -- and you can steer the emphasis of their pre-existing interests.
Which is to say that creativity and money interact in a weird way, where ideas don't have a cost, but creative focus does.
> but if an idea takes 20 years and $100M to develop, and there are no protections for ideas, then no one will take the time to develop it
This sounds trivially true, but I have some trouble reconciling it with reality. For example, the Llama models probably cost more than this to develop, yet they are made freely available on GitHub. So while it's true that some things won't be built, I think it's also the case that many things would still be built.
I appreciate you giving the parent comment a fair chance.
As a society we're having trouble defining abstract components of the self (consciousness, intelligence, identity) as it is. What makes the legislative notion of an idea and its reification (what's actually protected under copyright law) secure from this same scrutiny? Then patent rights. And what do you think may happen if the viability of said economy comes into question afterwards?
It isn't about sales in the short term. Same with code: your project might even be OSS (so not expecting profit). It's about a system that exploits that work for one party's profit while putting the other party out of business. They reduced the writer's ability to ever be employed again, or to make royalties at the same level as before (not necessarily reducing the sales of existing titles, though that may occur as well).
Same with code: AI hoovering up all the code doesn't mean people won't use libCurl, but it does mean jobs are disappearing, and the people may not be around to write the next libCurl.
In training, the model is trained to predict the exact sequence of words in a text. In other words, it is reproducing the text repeatedly for its own training. The by-product of this training is that it influences the model weights to make the text more likely to be produced by the model -- that is its explicit goal. A perfect model would be able to reproduce the text perfectly (0 loss).
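For concreteness, here is a minimal sketch of that objective in PyTorch (the autoregressive `model` is a hypothetical stand-in; names are illustrative):

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    # Shift by one: the target at each position is the next token of the text.
    inputs, targets = token_ids[:-1], token_ids[1:]
    logits = model(inputs)  # shape: [seq_len - 1, vocab_size]
    # Cross-entropy reaches 0 only when the model puts probability ~1.0 on
    # every actual next token, i.e. when it can reproduce the text exactly.
    return F.cross_entropy(logits, targets)
```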
Real-world absurd example: Company Y hires a bunch of workers, gives them access to millions of books, and has them read the books all day. The workers copy the books word by word, but after each word they try to guess the next word that will appear. Eventually, they collectively become quite good at guessing the next word given a prompt text, even reproducing large swaths of text almost verbatim. The owner of company Y claims they owe nothing to the book owners, because this doesn't count as reading the books, and any reproduction is "coincidental" (even though reproduction is the explicit task of the readers). They then use these workers to produce works that compete with the authors of the books they never paid for.
It seems many people feel this is "fair use" when it happens on a computer, but would call it "stealing" if I pirated all of JK Rowling's books to train myself to be a better mimicker of her style. If you feel this is still fair use, then you should agree that all books should be free to everyone (as well as art, code, music, and any other training material).
> Second, the songs must share SUBSTANTIAL SIMILARITY, which means a listener can hear the songs side by side and tell the allegedly infringing song lifted, borrowed, or appropriated material from the original.
Music has had this happen numerous times in the US. The distinction isn't whether it's an exact replica; it's whether the song could be confused for the same style.
George Harrison lost a case over one of his songs ("My Sweet Lord"). There are many others.
The damages arise from the very process of stealing material for training. The justification "yes but my training didn't cause me to directly copy the works" is faulty.
I won't rehash the many arguments as to why the output is also a violation, but my point was more about the absurd view that stealing and using all the data in the world isn't a problem because the output is a lossy encoding (even though the explicit training objective is to reproduce the training text/images).
"Style" is an ambiguous term here, as it doesn't directly map to what's being considered. The case between "Blurred Lines" and "Got to Give It Up" is often considered one of style, and the Court of Appeals for the Ninth Circuit upheld copyright infringement.
However, AI has been shown to copy a lot more than what people consider style.
> In training, the model is trained to predict the exact sequence of words in a text. In other words, it is reproducing the text repeatedly for its own training.
That's called extreme overfitting. Proper training is supposed to give subtle nudges toward matching each source of text, and zillions of nudges slowly bring the whole thing into shape based on overall statistics, not any particular source. (But that does require properly removing duplicate copies of very popular texts, which seems to be an unsolved problem.)
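A toy numpy sketch of that intuition, with made-up numbers: each document contributes one small gradient nudge, so a text duplicated N times in the corpus gets an N-times-larger pull toward memorization:

```python
import numpy as np

rng = np.random.default_rng(0)
lr = 1e-3                # small learning rate: each document is a tiny nudge
weights = np.zeros(8)

# Pretend per-document gradients. Document 0 appears 1000 times in the
# corpus, so it delivers 1000 nudges in the same direction.
docs = [rng.normal(size=8) for _ in range(1000)]
corpus = docs + [docs[0]] * 999

for grad in corpus:
    weights -= lr * grad

# The duplicated document's direction dominates the learned weights
# (cosine similarity close to 1.0).
d = -docs[0]
print(weights @ d / (np.linalg.norm(weights) * np.linalg.norm(d)))
```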
So your analogy is far enough off that I can't give it a good reply.
> It seems many people feel this is "fair use" when it happens on a computer, but would call it "stealing" if I pirated all of JK Rowling's books to train myself to be a better mimicker of her style.
I haven't seen anyone defend the piracy, and the piracy is what this settlement is about.
People are defending the training itself.
And I don't think anyone would seriously say the AI version is fair use but the human version isn't. You really think "many people" feel that way?
There isn’t a clear line for extreme overfitting here.
To generate working code, the output must follow the API exactly. Nothing separates code from natural language as far as the underlying algorithm is concerned.
Companies slightly randomize output to minimize the likelihood of direct reproduction of source material, but that’s independent of what the neural network is doing.
You want different levels of fitting for different things, which is difficult: tight fitting on grammar, APIs, and idioms; loose fitting on creative text. And it's hard to classify it all up front. But still, if it can recite Harry Potter, that's not on purpose, and it's never trained to predict a specific source losslessly.
And it's not really about randomizing output. The model gives you a list of likely words, often with no clear winner. You have to pick one somehow. It's not like it's taking some kind of "real" output and obfuscating it.
> often with no clear winner. You have to pick one somehow. It's not like it's taking some kind of "real" output and obfuscating it.
It's very rare for multiple outputs to actually be tied such that the only option is to choose one at random. Instead, it's become accepted practice to make suboptimal choices, for a few reasons, one of which really is to decrease the likelihood of reproducing existing text.
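For reference, a minimal sketch of the usual mechanism (temperature sampling; the numbers are illustrative). Greedy decoding would always take the argmax; sampling at a nonzero temperature deliberately spreads probability onto lower-ranked tokens:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_token(logits: np.ndarray, temperature: float = 0.8) -> int:
    # temperature < 1 sharpens the distribution toward the top token;
    # temperature > 1 flattens it, making "suboptimal" picks more likely.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.9, 0.5])  # near-tie between the top two tokens
picks = [sample_token(logits) for _ in range(1000)]
print(np.bincount(picks) / 1000)    # top token wins most often, not always
```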