title: GitHub Copilot and Copyright
url: https://mjtsai.com/blog/2021/07/07/github-copilot-and-copyright/
hash_url: 05391381e6

Rian Hunter (via Hacker News):

I do not agree with GitHub’s unauthorized and unlicensed use of copyrighted source code as training data for their ML-powered GitHub Copilot product. This product injects source code derived from copyrighted sources into the software of their customers without informing them of the license of the original source code. This significantly eases unauthorized and unlicensed use of a copyright holder’s work.

Julia Reda (tweet):

Since Copilot also uses the numerous GitHub repositories under copyleft licences such as the GPL as training material, some commentators accuse GitHub of copyright infringement, because Copilot itself is not released under a copyleft licence, but is to be offered as a paid service after a test phase. The controversy touches on several thorny copyright issues at once. What is astonishing about the current debate is that the calls for the broadest possible interpretation of copyright are now coming from within the Free Software community.

[…]

In the US, scraping falls under fair use, this has been clear at least since the Google Books case.

[…]

The short code snippets that Copilot reproduces from training data are unlikely to reach the threshold of originality. Precisely because copyright only protects original excerpts, press publishers in the EU have successfully lobbied for their own ancillary copyright that does not require originality as a precondition for protection. Their aim is to prohibit the display of individual sentences from press articles by search engines.

[…]

On the other hand, the argument that the outputs of GitHub Copilot are derivative works of the training data is based on the assumption that a machine can produce works. This assumption is wrong and counterproductive. Copyright law has only ever applied to intellectual creations – where there is no creator, there is no work. This means that machine-generated code like that of GitHub Copilot is not a work under copyright law at all, so it is not a derivative work either.

Luis Villa:

“independent creation” is a doctrine in US law that protects you if you write the same thing without knowing about the first thing. May or may not apply here, but I mention it because it is non-intuitive and speaks directly to “but what if the code is the same”.

There is an observable trend in US law, based on fair use and older notions in US copyright law of the need for creativity, that judges give a looooot of leeway to “machines that read”. Copilot fits pretty squarely in that tradition.

[…]

Article 4 of the 2019 Directive seems to clearly make Copilot’s training unambiguously legal in the EU, but authors can explicitly opt out.

[…]

Note that this is an interesting example of what I wrote about in the context of databases, where rights are not the same across countries, making it hard to write a generic global license.

James Grimmelmann:

Almost by accident, copyright law has concluded that it is for humans only: reading performed by computers doesn’t count as infringement. Conceptually, this makes sense: Copyright’s ideal of romantic readership involves humans writing for other humans. But in an age when more and more manipulation of copyrighted works is carried out by automated processes, this split between human reading (infringement) and robotic reading (exempt) has odd consequences: it pulls us toward a copyright system in which humans occupy a surprisingly peripheral place. This Article describes the shifts in fair use law that brought us here and reflects on the role of robots in copyright’s cosmology.

[…]

Infringement is for humans only; when computers do it, it’s fair use.

Previously:

GitHub Support just straight up confirmed in an email that yes, they used all public GitHub code, for Codex/Copilot regardless of license.

Adam Jacob:

Those of us who remember when open source was the novel underdog, allowing us to learn, grow, and build things our proprietary peers could not - we tend to see the relationship to corp $ in OSS as a net benefit, pretty much always.

That’s because we remember when it wasn’t so, and it took a lot of work to make it legit. But if you started your career with that as the ground truth, you’re much more likely to see the problematic aspects of it; that your open code can be used by folks in ways you dislike.