That code is still LGPL; it doesn't matter what some release engineer writes in the release notes on GitHub. All original authors and copyright holders must have explicitly agreed to relicense under a different license, otherwise the code stays LGPL licensed.
Also the mentioned SCOTUS decision is concerned with authorship of generative AI products. That's very different from this case. Here we're talking about a tool that transformed source code and somehow magically got rid of copyright due to this transformation? Imagine the consequences to the US copyright industry if that were actually possible.
> The ownership void: If the code is truly a “new” work created by a machine, it might technically be in the public domain the moment it’s generated, rendering the MIT license moot.
How would that work? We still have no legal conclusion on whether code generated by an AI model that was trained on all publicly available source (irrespective of type of license) is legal or not. IANAL but IMHO it is totally illegal as no permission was sought from authors of source code the models were trained on. So there is no way to just release the code created by a machine into the public domain without knowing how the model was inspired to come up with the generated code in the first place. Pretty sure it would be considered in the scope of "reverse engineering" and that is not specific only to humans. You can extend it to machines as well.
EDIT: I would go so far as to say the most restrictive license that the model is trained on should be applied to all model generated code. And a licensing model should be set up so that the original authors (all GitHub users who contributed code in some form) are reimbursed by AI companies. In other words, a % of profits must flow back to the community as a whole every time code-related tokens are generated. Even if everyone receives pennies it doesn't matter. That is fair. This should also extend to artists whose art was used for training.
> I would go so far as to say the most restrictive license that the model is trained on should be applied to all model generated code.
That license is called "All Rights Reserved", in which case you wouldn't be able to legally use the output for anything.
There are research models out there which are trained on only permissively licensed data (i.e. no "All Rights Reserved" data), but they're, colloquially speaking, dumb as bricks when compared to the state of the art.
But I guess the funniest consequence of the "model outputs are a derivative work of their training data" position would be that it'd essentially wipe out (or at the very least force a revert to a pre-AI era commit) every open source project which may have included any AI-generated or AI-assisted code, which currently pretty much includes every major open source project out there. And it would also make it impossible to legally train any new models whose training data isn't strictly pre-AI, since otherwise you wouldn't know whether your training data is contaminated or not.
> There are research models out there which are trained on only permissively licensed data
Models whose authors tried to train only on permissively licensed data.
For example https://huggingface.co/bigcode/starcoder2-15b tried to use a permissively licensed dataset, but it filtered only on repository-level license, not file-level. So when searching for "under the terms of the GNU General Public License" on https://huggingface.co/spaces/bigcode/search-v2 back when it was working, you would find it was trained on many files with a GPL header.
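To make the repository-level vs. file-level distinction concrete, here is a minimal sketch of a file-level check. It is not how any real dataset pipeline works as far as I know; the pattern list and the function names `has_copyleft_header` and `filter_files` are made up for illustration, and a serious filter would need many more patterns.

```python
import re

# Hypothetical pattern list, for illustration only. A real filter would need
# many more notices (other licenses, translated or reworded headers, etc.).
COPYLEFT_PATTERNS = [
    re.compile(
        r"under the terms of the GNU (Lesser |Affero )?General Public License",
        re.IGNORECASE,
    ),
    re.compile(r"SPDX-License-Identifier:\s*(A|L)?GPL", re.IGNORECASE),
]


def has_copyleft_header(source_text: str, header_lines: int = 40) -> bool:
    """Return True if the first `header_lines` lines carry a copyleft notice."""
    header = "\n".join(source_text.splitlines()[:header_lines])
    # Collapse whitespace and common comment characters so that a notice
    # wrapped across several comment lines still matches.
    normalized = re.sub(r"[\s#*/]+", " ", header)
    return any(p.search(normalized) for p in COPYLEFT_PATTERNS)


def filter_files(files: dict[str, str]) -> dict[str, str]:
    """Keep only files whose header shows no copyleft notice."""
    return {
        path: text
        for path, text in files.items()
        if not has_copyleft_header(text)
    }


# Toy "repository" whose top-level license might be MIT while one vendored
# file still carries a GPL header.
example_files = {
    "vendored.py": (
        "# This program is free software; you can redistribute it\n"
        "# under the terms of the GNU General\n"
        "# Public License as published by the FSF.\n"
        "print(1)\n"
    ),
    "own.py": "# SPDX-License-Identifier: MIT\nprint(2)\n",
}
kept = filter_files(example_files)
```

A repository-level filter would have kept both files here; the file-level pass keeps only `own.py`.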
I don't know how far it would get, but I imagine that a FAANG will be able to get the farthest here by virtue of having mountains of corporate data that they have complete ownership over.
I agree with your assessment. Which is why I was proposing a middle ground where an agreement is set up between the model training company and the collective of developers/artists et al., with a license under which they are rewarded for their original work in perpetuity. A tiny % of the profits can be shared, which would be a form of UBI. This is fair not only because companies are using AI generated output but because developers themselves are also paying for and using AI generated output that is trained on other developers' input. I would feel good (in my conscience) that I am not "stealing" someone else's effort and they are being paid for it.
> The ownership void: If the code is truly a “new” work created by a machine, it might technically be in the public domain the moment it’s generated, rendering the MIT license moot.
I'm struggling to see where this conclusion came from. To me it sounds like the AI-written work cannot be copyrighted, and so it's kind of like copy-pasting the original code. Copy-pasting the original code doesn't make it public domain. AI-generated code can't be copyrighted, or entered into the public domain, or used for purposes outside of the original code's license. What's the paradox here?
> To me it sounds like the AI-written work cannot be copyrighted
I think we haven't even begun to consider all the implications of this, and while people ran with that one case where someone couldn't copyright a generated image, it's not that easy for code. I think there needs to be way more litigation before we can confidently say it's settled.
If "generated" code is not copyrightable, where do we draw the line on what generated means? Do macros count? Does code that generates other code count? Protobuf?
If it's the tool that generates the code, again where do we draw the line? Is it just using 3rd party tools? Would training your own count? Would a "random" code gen where you pick the winners (by whatever means) count? Would brute-forcing the whole space (silly example but hey we're in silly space here) count?
Is it just "AI"-adjacent stuff that isn't copyrightable? If so how do you define AI? Does autocomplete count? Intellisense? Smarter intellisense?
Are we gonna have to have a trial where there's at least one lawyer making silly comparisons between LLMs and power plugs? Or maybe counting abacuses (abaci?)... "But your honour, it's just random numbers / matrix multiplications..."
They say "if" it's a new work, then it might not be copyrightable, I guess. You suppose that it's still the original work, and hence it's still got that copyright.
I think they are rhetorically asking if your position is correct.
If you ask an LLM to derive a spec that has no expressive element of the original code (a clean-room human team can carefully verify this), and then ask another instance of the LLM (with fresh context) to write out code from the spec, how is that different from a "clean room" rewrite? The agent that writes the new code only ever sees the spec, and by assumption (the assumption that's made in all clean room rewrites) the spec is purely factual with all copyrightable expression having been distilled out.
The new agent who writes code has probably at least parts of the original code as training data.
We can't speak of a clean room implementation from LLMs, since they are technically capable only of spitting out their training data in different ways, not of any original creation.
This has the potential to kill open source, or at least the most restrictive licenses (GPL, AGPL, ...): if a license no longer protects software from unwanted use, the only possible strategy is to make the development closed source.
Yes, this is the reason I've completely stopped releasing any open-source projects. I'm discovering that newer models are somewhat capable of reverse-engineering even compiled WebAssembly, etc. too, so I can feel a sort of "dark forest theory" taking hold. Why publish anything - open or closed - to be ripped off at negligible marginal cost?
People are just not realizing this now because it's mostly hobby projects and companies doing it in private, but eventually everyone will realize that LLMs allow almost any software to be reverse engineered for cheap.
See e.g. https://banteg.xyz/posts/crimsonland/ , a single human with the help of LLMs reverse engineered a non-trivial game and rewrote it in another language + graphics lib in 2 weeks.
It’s a real problem. I threw it at an old MUD game just to see how hard it is [0] then used differential testing and LLMs to rewrite it [1]. Just seems to be time and money.
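For anyone unfamiliar with the term, differential testing just means feeding the same inputs to the original and to the rewrite and flagging every place where the observable behavior differs. A minimal sketch, assuming both sides can be wrapped as plain Python callables; `differential_test`, `legacy_word_count` and `rewritten_word_count` are hypothetical names, not anything from the linked posts.

```python
from typing import Any, Callable, Iterable


def _observe(fn: Callable[[Any], Any], x: Any) -> tuple:
    """Run fn(x) and capture either its result or the exception type."""
    try:
        return ("ok", fn(x))
    except Exception as exc:  # a crash is observable behavior too
        return ("error", type(exc).__name__)


def differential_test(
    reference: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> list:
    """Return (input, reference_outcome, candidate_outcome) for every mismatch."""
    mismatches = []
    for x in inputs:
        expected = _observe(reference, x)
        actual = _observe(candidate, x)
        if expected != actual:
            mismatches.append((x, expected, actual))
    return mismatches


# Toy stand-ins for "the original" and "the rewrite" (both hypothetical).
def legacy_word_count(text: str) -> int:
    return len(text.split(" "))


def rewritten_word_count(text: str) -> int:
    return len(text.split())


report = differential_test(
    legacy_word_count, rewritten_word_count, ["a b", "a  b", ""]
)
```

Here `report` flags the double-space and empty-string inputs, which is the kind of feedback you would hand back to the LLM until the list comes back empty.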
Licensing issues aside, the chardet rewrite seems to be clearly superior to the original in performance too. It's likely that many open source projects could benefit from a similar approach.
I am not a lawyer, but from my understanding the legal precedent is NEC v. Intel which established that clean-room software development is not infringing, even if it performs the same functionality as the original.
As an aside, this clean room engineering is one of the plot points of Season 1 of the TV show Halt and Catch Fire where the fictional characters do this with the BIOS image they dumped.
This is precedent setting. In this case the rewrite was in the same language, but if there's a Python GPL project, and its tests (spec) were used to rewrite specs in Rust, and then an implementation in Rust, can the second project be legally MIT, or any other?
If yes, this in a sense allows a path around GPL requirements. Linux's MIT version would be out in the next 1-2 years.
> but if there's a Python GPL project, and its tests (spec) were used to rewrite specs in Rust, and then an implementation in Rust, can the second project be legally MIT, or any other?
Isn't that what https://github.com/uutils/coreutils is? GNU coreutils spec and test suite, used to produce a rust MIT implementation. (Granted, by humans AFAIK)
It's very important to understand "how" it was done. The GPL handles the "compile" step, and the result is still GPL. The clean room process uses 2 teams, separated by a specification. So you would have to
1. Generate specification on what the system does.
2. Pass to another "clean" system
3. Second clean system implements based just on the specification, without any information on the original.
That 3rd step is the hardest, especially for well known projects.
So what if a frontier model company trains two models, one including 50% of the world's open source projects and the second model the other 50% (or ten models with 90-10)?
Then the model that is familiar with the code can write specs. The model that does not have knowledge of the project can implement them.
Would that be a proper clean room implementation?
Seems like a pretty evil, profitable product "rewrite any code base with an inconvenient license to your proprietary version, legally".
That's why I carved it out to just the specs. If they can be read as "facts", then the new code is not derived but arrived at with TDD.
The thesis I propose is that tests are more akin to facts, or can be stated as facts, and facts are not copyright-able. That's what makes this case interesting.
I don't think you can classify "public data in" as public domain. Public data could also include commercial licenses which forbid using it in any way other than what the license states. Just because the source is open for viewing does not necessarily mean it is OSL.
That's the core issue here. All models are trained on ALL source code that is publicly available irrespective of how it was licensed. It is illegal but every company training LLMs is doing it anyways.
Copyright is not a blacklist but an allowlist of things kept aside for the holder. Everything else is fair game. LLM ingestion comes under fair use so no worries. If someone can get their hands on it, nothing in the law stops it from being ingested for training.
We can debate if this law is moral. Like the GP I too agree public data in -> public domain out is what's right for society. Copyright as an artificial concept has gone on for long enough.
I don't think so. It is nowhere near "limited use". The entirety of the source code is ingested for training the model. In other words, it meets the bar of the "heart of the work" being used for training. There are other factors as well, such as not harming the owner's ability to profit from the original work.
This hasn't gone to the Supreme Court yet. And this is just the USA. Courts in the rest of the world will also have to take a call. It is not as simple as you make it out to be. Developers are spread across the world with the majority living outside the USA. Jurisdiction matters in these things.
Copyright's ambit has been pretty much defined and run by the US for over a century.
You're holding out for some grace on this from the wrong venue. The right avenue would be lobbying for new laws to regulate the use of LLMs, not trying to find shelter in an archaic and increasingly irrelevant bit of legalese.
I think the more interesting question here would be if someone could fine tune an open weight model to remove knowledge of a particular library (not sure how you'd do that, but maybe possible?) and then try to get it to produce a clean room implementation.
I don't think this would qualify as clean room (the library was involved in learning to generate programs as a whole). However, it should be possible to remove the library from the OLMo training data and retrain it from scratch.
But what about training without having seen any human written program? Could a model learn from randomly generated programs?
> I don't think this would qualify as clean room (the library was involved in learning to generate programs as a whole)
Hm... I mean this is really one for the lawyers, but IMO you would likely successfully be able to argue that the marginal knowledge of general coding from a particular library is likely close to nil.
The hard part here imo would be convincingly arguing that you can wipe out knowledge of the library from the training set, whether through fine tuning or trying to exclude it from the dataset.
> But what about training without having seen any human written program? Could a model learn from randomly generated programs?
I think the answer at this point is definitely no, but maybe someday. I think it's a more interesting question for art since it's more subjective. If we eventually get to a point where a machine can teach itself art from nothing... first of all how, but second of all it would be interesting to see the reaction from people opposed to AI art on the basis of it training off of artists.
Honestly given all I've seen models do, I wouldn't be too surprised if you could somehow distill a (very bad) image generation model off of just an LLM. In a sense this is the end goal of the pelican riding a bicycle (somewhat tongue in cheek), if the LLM can learn to draw anything with SVGs without ever getting visual inputs then it would be very interesting :)
I mean in my opinion GPL licensed code should just infect models forcing them to follow the license.
You can do this a lot by saying things like: complete the code "<snippet from gpl licensed code>".
And if now the models are GPL licensed the problem of relicensing is gone since the code produced by these models should in theory be also GPL licensed.
Unfortunately, there is a dumb clause that computer generated code cannot be copyrighted or licensed to begin with.
Interesting questions raised by the recent SCOTUS refusal to hear appeals related to AI and copyrightability, and how that may affect licensing in open source.
Hoping the HN community can bring more color to this, there are some members who know about these subjects.
Can we do the same with Universal Music? Because that's easy and already possible. Or Microsoft Windows? Because we all know the answer: if it works, essentially any government will immediately call it illegal.
Because if this isn't allowed, that makes all of the AI models themselves illegal. They are very much the product of using others' copyrighted stuff and rewriting it.
But of course this will be allowed because copyright was never meant to protect anyone small. And that it's in direct contradiction with what applies to large companies? Courts won't care.
A lawyer could easily argue that the model itself stores a representation of the original, and thus it can never do a "fresh context".
And to be perfectly honest, LLMs can quote a lot of text verbatim.
[0] https://reorchestrate.com/posts/your-binary-is-no-longer-saf...
[1] https://reorchestrate.com/posts/your-binary-is-no-longer-saf...
Mark Pilgrim! Now that's a name I haven't read in a long time.
Is the "clean room" process meaningfully backed by legal precedent?
2. Dumped into a file.
3. claude-code that converts this to tests in the target language, and implements the app that passes the tests.
3 is no longer hard - look at all the reimplementations from ccc, to rewrites popping up. They all have a well defined test suite as common theme. So much so that tldraw author raised a (joke) issue to remove tests from the project.
If "tests" should mean a proper specification, let's say some IETF RFC of a protocol, then that would be different.
So, you can pilfer the commons ("public") but not stuff unavailable in source form.
If we expand your thought experiment to other forms of expression, say videos on YT or Netflix, then yes.
Both Meta and Anthropic were vindicated for their use. Only for Anthropic was there a fine, for not buying upfront.
Can you point to the clause? I have never seen it in any GPL license.
The key leap from GPT-3 to GPT-3.5 (aka ChatGPT) was code-davinci-002, which was trained on GitHub source code after the OpenAI-Microsoft partnership.
Open source code contributed much to LLMs' amazing CoT consistency. Without the open source movement, LLMs would have been developed much later.