How GitHub Copilot Could Steer Microsoft Into A Copyright Storm



Just make it impossible to copyright code.

If only there were a way to copy it in another direction, like Left ...

I think I meant the other way around, as in making it impossible to copyright code.

The GNU license -- commonly called a Copyleft -- kinda has that effect. Similarly, the BSD and MIT licenses (copyrights) are very permissive, usually only requiring retention of the notice and authors ... (apologies if you already know all this...)

>The GNU license -- commonly called a Copyleft -- kinda has that effect. Similarly, the BSD and MIT licenses (copyrights) are very permissive, usually only requiring retention of the notice and authors ... (apologies if you already know all this...)

The GPL (what you're calling "the GNU" license) is not like that at all (and is very different from the BSD/MIT ones).

If you use GPL code in your code you *must* also release your code under the GPL, or a GPL-compatible license (of which there are very few). I

While I agree with what you posted, OP said "making impossible to copyright code", which I took to (generally) mean allowing others to freely use, share, and reuse that code -- as copyright is often used to prevent sharing and/or reusing of code -- for free anyway. In that case, the GPL, BSD and MIT licenses accomplish that by either requiring or allowing the code to be reused, etc ... Whether or not I'm explaining myself adequately, your descriptions of things make me think we're on the same page.

If that were the case, there would be no problem. But no, copyleft is copyright (with bad licensing terms). Even permissive licenses still keep copyright. You need to specifically put things in the public domain, such as with a CC0 license, if you don't want the work to be copyrighted.

"CopyLeft" is a stupid, nonsensical term. Code cannot be both free and copyrighted. You can't have it both ways. It is either free (as in "freedom") or it isn't. The purpose of copyright is to place restrictions on something. That is the exact opposite of what FOSS is supposed to be.

In many countries code just is copyright by default and there's nothing the author can do about that. That doesn't mean there have to be restrictions, just that you get rid of those restrictions by giving an open license. See the Creative Commons CC-0 license for example where the summary is really simple but the legal text is quite complex.

There are two simple points of view here, and it's a question of whether you want your software to be good for people and society or you want it to be good for develop

>That is the exact opposite of what FOSS is supposed to be.

Nope. I think you get into the paradox of intolerance area when you don't have restrictions like the GPL.

I also think that some people use the GPL exactly to extract a cost to further development on the code they created. This is basically a contract. This is my code. Do this with it or GET THE F OUT. It's still free to use, and open for all to see. But taking the work itself and extending it should *also* be open, not closed. And without

Why? Does the original source vanish once copied?

The only thing that I agree with is: don't claim it's yours when it's not, and don't claim it's someone else's when it's yours. That is trademark, and it protects people from lying.

Attribution, while it seems OK, becomes cumbersome when you include libraries that include libraries ...

Every modern invention/idea is built on other people's inventions/ideas. We have had fantastic innovation on the shoulders of others, and the creators owe society for that inspiration. Limite

It is still less work to do the attribution than it is to write those libraries yourself, so I don't see a problem.

A copyright protects the expression, presentation or arrangement of a creator’s ideas, but not the ideas themselves. Consider that many people could have the same idea, but they might express those ideas in vastly different ways. Those methods of expression are protected, but the shared idea is not.

Functionality is not supposed to be copyrightable, anyway, patents are for that. So unless the code is expressed in a creative way (and comments in the code that Copilot copied may fall under that), and the code's language did not limit how that functionality could be expressed, it should not be copyrightable. Courts may rule otherwise - there seems to be no end to expanding "rights" of companies at the expense of individuals.

You also have the question of whether the CoPilot code could ever be considered copyrightable; it is simply acting as an interpretation of a generic idea.

As an example, the summary's sparse matrix code... just how much work was required to get CoPilot to spit it back out?

There is a difference between creating an algorithm from scratch and what CoPilot does. All GitHub/Microsoft does is what your average coder does, search StackOverflow for a matching description and then copy/paste the actual code. It's not regenerating new code on demand.

The question is whether those particular pieces of code are art (copyrightable) or if they are a mathematical expression or a list of facts.

Though code shouldn't be patentable.

And instead allow it to be patented?

If CoPilot isn't handling licensing, it's not ready for release and should be avoided at all costs.

You only license once you've determined that what you want to do would otherwise be a copyright violation. If Microsoft believes they aren't violating copyright (and then if a court agrees with them) then they don't need any licensing, so whatever licenses are offered are irrelevant.

This is a really weird situation. Look at the extremes and try to figure out where copilot is:

At one extreme, you literally look at the code and copy it.

At the other extreme is clean room design. Someone looks at the code, ex

The linked article shows that it's not just single isolated snippets: it copies multiple isolated snippets, up to the level where a whole file is clearly a derivative work of the original.

No, this is not really a "bug"; it is more a reveal of the true feature. CoPilot's database is a derivative work of the software it is trained on and so is subject to the GPL. Any software developed using CoPilot, even if the code is unrelated to any GPL package, is a derivative work of CoPilot and so is a derivative work of the GPL software used to train CoPilot.

Otherwise referred to as "tripe." Look, they stole my for-loop!! This lawyer and his douchebag client need to take a hike.

Only problem with that is both sides want to keep copyrighting.

On the right, they see it as a money stream and are willing to spend big to protect that. On the left, they see it as a defence against the bullying of the right; especially when hiding the attribution for financial gain. An honesty keeper.

That doesn't really cover the full spectrum. The vast majority on the left and the right don't care about copyright at all. People who profit from copyright care about it a lot, and are willing to pay for it in campaign donations. A significant minority have a nuanced opinion opposing copyright, but generally aren't willing to pay for it in campaign donations.

tl;dr the establishment supports copyrights, the anti-establishment opposes them but most people don't care.

>Look, they stole my for-loop!!

LOL this. Might not even be a good for-loop, but people become attached to such things and don't want to let them go.

>>Look, they stole my for-loop!!

>LOL this. Might not even be a good for-loop, but people become attached to such things and don't want to let them go.

Now, how about, "Look, they stole my copyrighted sparse matrix transposition code"?

Sparse matrix transposition code isn't exactly new; it is taught very early on in any data science course. It is likely that many people copied or reinvented it before and introduced it into GPL/open source projects.

>GitHub Copilot -- a programming auto-suggestion tool trained from public source code on the internet -- has been caught generating what appears to be copyrighted code,

When individuals violate Microsoft's copyright, it's called "piracy" instead of copying. When Microsoft violates an individual's copyright, it's called "generating" instead of copying (or piracy).

He's making the whole "tools" vs "paraphernalia" shtick others noticed.

>When Microsoft violates an individual's copyright, it's called "generating" instead of copying (or piracy).

Sometimes they call it "innovating".

The argument being made for its legitimacy is effectively "we stole from so many people that it can't be illegal".

Isn't that similar to the argument being made for all those AI/ML generated images recently?

>Isn't that similar to the argument being made for all those AI/ML generated images recently?

It absolutely is. On the AI-generated image front one can argue that only the style is being copied: pointillism, surrealism, medieval painting, etc. That's not copyrightable: no artist has ever been sued for copyright infringement (successfully at least) just because their painting is in the same style as another artist's. But in the case of Copilot, is the generated code really original, or is it just composed of chunks of the source material? If the latter, it would be a derivative work and thus copyright in

Only creative aspects of code are copyrightable, not functional aspects. An algorithm for sparse matrix transposition is going to be extremely functional and have little or no creative aspect and thus quite likely not protected by copyright.
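To make the "functional" point concrete, here is a minimal sketch of the textbook counting-sort approach to transposing a matrix stored in compressed sparse row (CSR) form. It is written from the general description of the technique, not taken from the book or from any Copilot output, and the function name `csr_transpose` and its argument names are assumptions chosen for illustration. Once the storage format is fixed, the three steps (count, prefix-sum, scatter) are largely dictated by it.

```python
def csr_transpose(n_rows, n_cols, indptr, indices, data):
    """Transpose a CSR matrix; returns (indptr_t, indices_t, data_t) in CSR form.

    Hypothetical helper for illustration only.
    """
    nnz = indptr[n_rows]

    # Step 1: count the entries in each input column (= each output row).
    counts = [0] * n_cols
    for j in indices[:nnz]:
        counts[j] += 1

    # Step 2: prefix sums give the start offset of each output row.
    indptr_t = [0] * (n_cols + 1)
    for j in range(n_cols):
        indptr_t[j + 1] = indptr_t[j] + counts[j]

    # Step 3: scatter every entry into the next free slot of its output row.
    next_slot = indptr_t[:n_cols]  # slicing makes a copy we can advance
    indices_t = [0] * nnz
    data_t = [0] * nnz
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            dest = next_slot[j]
            indices_t[dest] = i
            data_t[dest] = data[k]
            next_slot[j] += 1

    return indptr_t, indices_t, data_t
```

For the 2x3 matrix [[1, 0, 2], [0, 3, 0]], calling `csr_transpose(2, 3, [0, 2, 3], [0, 2, 1], [1, 2, 3])` returns the CSR arrays of its 3x2 transpose. Two people writing this independently would mostly differ in names and loop style, which is roughly the distinction between functional and creative expression being argued here.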

But it might still be easier to copyright that section, since it does "something", by copyrighting the entire block it does something in/to. People are trying to copyright syntax, which is very limited, and calling it style, which is applied to something that actually runs.

Yeah but even with a sparse matrix transposition, if you're smart you'll do a clean-room implementation if you're serious about avoiding copyright violations.

Forget the copyright dispute for a minute. CoPilot is fundamentally flawed for another reason, and that's because it is trained on unchecked, unreviewed code samples. Most code I come across in public repos is of horrible quality - student homework, experimental projects, online tutorials, people just tinkering with some new libraries, and so on. Of course, there are also some good projects out there, but according to Sturgeon's law "90% of everything is crap". Since AI/machine learning/neural network training is fundamentally dependent on the volume of data points used to strengthen relevant signals, all those outposts of good code become insignificant among all the other poop floating around, and the system starts suggesting crap code with all its functional, architectural and security issues.

So, Tay has gone on to study CS?

And remember, you can apply Sturgeon's Law recursively as well on the other 10% as many times as you need.

it still does give interesting suggestions for: print("kill or print("women can't

>They're probably already filtering the cruft

I imagine they are having just as difficult a time doing this successfully as with any other AI classification project of human based inputs. "Good code" is like trying to determine pornography - "I know it when I see it".

It's quite satisfying to just do the work.

>Also, most open-source code is covered by a license, which imposes additional legal requirements.

Considering the wholesale violations of "legal requirements" people perform when stealing music, videos, or games, this comment should never be included when discussing open source.

The code in question appears to have been published in the book 'Direct Methods for Sparse Linear Systems' and does not have an obvious license or restriction in that text.

If I'm reading a book that has an example of how to e.g. handle file IO, I'd assume that it was acceptable to use that example code in my work. Am I wrong in assuming that?

If I am wrong, I owe a deep apology to the authors of the 'Turbo C++ Professional handbook'. The first non-basic dev work I ever did was built by stitching together examples from that text. (This was ~1991 and I didn't have internet access, so this book and the language reference manual were how I learned C++.)

Bias disclosure: I work for Microsoft in a non-development role unrelated to Visual Studio, CoPilot, or Github.

You say "stupid and obviously" illegal.

I can search Google today by typing in things, and it brings back suggestions. Some of those suggestions may include copyrighted material that the author intended to be publicly available but still under their copyright. I can then decide to copy that material directly into my editor, and then use it. I made the decision, Google just made a suggestion when I searched for something.

Now, let's replace Google with Co-pilot:

I can search Co-pilot today by typing in thing

>Some of those suggestions may include copyrighted material that the author intended to be publicly available but still under their copyright.

Fair use: the search term is used to search, not to be put in a paper. The code is designed to go into your project. It is blatantly illegal to be pulling copyrighted code through software and putting it into another project.

>Some of those suggestions may include copyrighted material that the author intended to be publicly available but stil

I don't have this book but there is probably a section somewhere stating the readers' rights and limitations to use the code from the book. It's standard and all books have them.

I looked but didn't find one. Ignoring that, even with the fuzziness on this specific case, there is still a problem here. The source that copilot kicks back needs to be sufficiently original that it isn't obviously someone else's work.

On a related note, are there any free or cheap IP scanning tools as mentioned in TFA?

Please explain why Microsoft did not train Copilot on their own code if the generated code cannot possibly infringe on the copyright of the source material. Nat Friedman claims that "training ML systems on public data is fair use" but training it on their own code would have avoided this controversy entirely. Also it would have been the perfect source material for people who are mostly going to use it to write Windows applications.

Clearly Microsoft knows Copilot is likely to infringe on the copyright of th

>Please explain why Microsoft did not train Copilot on their own code if the generated code cannot possibly infringe on the copyright of the source material.

You have as much data as I do. Same circus, different tent.

Is this type of thing an issue for DALL-E generated images too? (I'm asking in ignorance; I genuinely don't know.)

Or I think it could be even better. Users should be able to select which licenses they agree to get their source code generated from. Then, as the system proposes snippets or whatever it does, it also generates a new license for the user. If the user selected GPL or LGPL or CC-BY-like licenses, the generated license lists the names of all the people whose code was used to train the dataset. Alternatively, because it's very likely to be an extremely long list, it could link to a webpage listing all those peo

Now we see how Microsoft owning Github can be used to break open source. By adding a handy tool that helpfully "suggests" copyright-violating insertions, they can encourage developers to sprinkle the bulk of the open-source codebase with IP-law violations.

Interestingly, this can also be described as a form of "Embrace, extend, extinguish", though the boobytrap is of an entirely different nature.

The problem with the argument that a code generator, whether or not trained on FOSS, emits copyrighted code is that it doesn't take into account the basic premise of "scènes à faire": given a set of design constraints and desired output, a given set of developers working on a specific function or procedure will likely create nearly-identical code.

Scènes à faire is a long-standing defense approach in copyright cases, albeit primarily in the intellectual-property worlds of graphic arts, photography, and other visual media. However, nothing says the approach can't be applied to the intellectual-property world of code design and deployment. And, given the propensity for developers and coders to reuse code under the dictum of "laziness is a virtue", trying to pin down a given code snippet for copyright violation, especially if created independently from the claimed copyrighted snippet, is likely to be a Sisyphean task.

Just my two cents' worth.

>The problem with the argument that a code generator, whether or not trained on FOSS, emits copyrighted code is that it doesn't take into account the basic premise of "scènes à faire": given a set of design constraints and desired output, a given set of developers working on a specific function or procedure will likely create nearly-identical code.

And now apply that to the copyrighted sparse matrix transposition code. Somehow I doubt your premise, given this evidence. Sure, there may be some
