It’s been about six weeks since GitHub launched Copilot in technical preview, and the initial excitement has given way to a more heated conversation. The AI pair programmer, built on OpenAI’s Codex model, can generate surprisingly competent code from natural language prompts and context. But the open source community is increasingly asking an uncomfortable question: is Copilot built on a foundation of licensing violations?
The debate has been simmering across GitHub Issues, Twitter threads, Hacker News, and mailing lists, and it touches on fundamental questions about how we think about code, copyright, and the training data that powers machine learning models.
How Copilot Works (And Why It Matters)
Copilot is powered by OpenAI Codex, a descendant of GPT-3 that’s been fine-tuned on publicly available code — primarily from GitHub’s own repositories. When you type a comment or start writing a function, Copilot predicts what you’re likely to write next, offering multi-line suggestions that can range from boilerplate to surprisingly sophisticated implementations.
The technical achievement is impressive. In my own testing during the preview, Copilot correctly generated working implementations for common patterns in Python and JavaScript, often needing only a function name and docstring as input. It handles API calls, data transformations, and algorithmic patterns with reasonable accuracy.
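To make that concrete, here’s the kind of prompt that tends to work well: a function signature plus a docstring, from which Copilot proposes a body. The completion below is representative of what I saw in testing, not verbatim model output:

```python
import re

def slugify(title: str) -> str:
    """Convert a post title to a URL-friendly slug.

    e.g. "Hello, World!" -> "hello-world"
    """
    # Given only the signature and docstring above as a prompt, Copilot
    # proposes a body much like the following.
    slug = title.lower()
    slug = re.sub(r"[^a-z0-9]+", "-", slug)  # collapse runs of non-alphanumerics
    return slug.strip("-")
```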
But here’s where it gets complicated: the model was trained on code from public GitHub repositories. Those repositories carry licences — GPL, MIT, Apache, BSD, and many others. Each licence comes with specific obligations about how the code can be used, modified, and redistributed. The question is whether training an ML model on that code, and then reproducing patterns from it in suggestions, constitutes a use that’s governed by those licences.
The Licensing Argument
The Free Software Foundation has raised concerns about Copilot’s relationship with copyleft licences like the GPL. The GPL requires that derivative works also be released under the GPL. If Copilot has learned patterns from GPL-licensed code and suggests those patterns to users who incorporate them into proprietary software, is that a GPL violation?
There are reasonable arguments on both sides:
The “it’s fair use” position: Training an ML model on publicly available code is a transformative use. The model doesn’t store or reproduce code verbatim — it learns statistical patterns and generates new code based on those patterns. This is analogous to a developer reading open source code to learn patterns and then writing similar code in their own projects.
The “it’s a laundering machine” position: Copilot has been shown to reproduce substantial portions of its training code verbatim in some cases, including recognisable snippets from well-known projects — sometimes complete with the original comments. If the model can reproduce exact code from its training data, it’s not just learning patterns; it’s memorising and regurgitating copyrighted material, potentially stripping licence attributions in the process.
The truth likely sits somewhere in between, and current copyright law isn’t well-equipped to adjudicate the question. The concept of “fair use” in the context of ML training data is largely untested in courts, at least with respect to code.
The Consent Problem
Beyond the legal question, there’s an ethical dimension that resonates with me more strongly. Many open source developers chose specific licences deliberately. A developer who licenses their code under the GPL is making a philosophical statement: this code should remain free and open. A developer who chooses MIT is saying something different: use this however you want, just keep the attribution.
Neither of those developers explicitly consented to their code being used as training data for a commercial AI product. GitHub’s Terms of Service do grant certain rights to GitHub regarding hosted content, but the interpretation of whether those rights extend to ML training is contested.
This is particularly pointed because GitHub is owned by Microsoft, and Copilot is a paid product (currently free in preview, but pricing is coming). Open source developers contributed their code freely, and a commercial entity is now using that collective work to build a revenue-generating service. Even if it’s legally permissible, the optics are troubling for a company that positions itself as a champion of open source.
Practical Implications for Developers
If you’re using Copilot in the technical preview, here are some practical considerations:
Review suggestions carefully: Don’t blindly accept Copilot’s output. Beyond the licensing question, there are quality and security concerns: the model can suggest code with bugs, vulnerabilities, or anti-patterns (see the first sketch after this list).
Be aware of verbatim reproduction: If a suggestion looks too specific or includes unusual variable names and comments, it may be a near-verbatim reproduction of training data. Consider searching GitHub for a distinctive string from the suggestion to check (a sketch of one approach follows this list).
Understand your project’s licence obligations: If you’re working on a permissively licensed project, the risk is lower. If you’re working on proprietary code, accepting GPL-derived suggestions could create compliance issues.
Keep an eye on the legal landscape: This is an evolving situation. The Software Freedom Conservancy and others are actively researching the legal dimensions. Court cases or regulatory guidance could change the picture significantly.
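On the quality and security point from the first item above, the failure mode is often subtle: a suggestion can be syntactically fine and still dangerous. This illustrative pair (my own example, not actual Copilot output) shows the kind of thing to reject and what to accept instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")

def find_user_unsafe(name: str):
    # The kind of suggestion to reject: interpolating user input into SQL
    # invites injection (try name = "' OR '1'='1").
    return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(name: str):
    # The version to accept: a parameterised query keeps input as data.
    return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()
```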
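And for checking suspected verbatim reproduction, GitHub’s code search API is one option. Here’s a minimal sketch, assuming the requests library is installed and a personal access token is available in a GITHUB_TOKEN environment variable; note that code search requires authentication and only indexes files on default branches, so absence of results isn’t proof of originality:

```python
import os
import requests

def search_github_code(snippet: str) -> list:
    """Search public GitHub code for a distinctive string from a suggestion.

    Returns the full names of repositories whose indexed files contain it.
    """
    resp = requests.get(
        "https://api.github.com/search/code",
        params={"q": f'"{snippet}"'},  # quote the snippet to match it as a phrase
        headers={
            # GITHUB_TOKEN is an assumed environment variable holding a
            # personal access token; code search requires authentication.
            "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github.v3+json",
        },
    )
    resp.raise_for_status()
    return [item["repository"]["full_name"] for item in resp.json()["items"]]

# Example: check an unusual comment or identifier that appeared in a suggestion.
# print(search_github_code("evil floating point bit level hacking"))
```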
My Take
I’ve been writing and using open source software for most of my career, and I have mixed feelings about Copilot. The technology is genuinely useful — it accelerates boilerplate coding and can help developers explore unfamiliar APIs. As a productivity tool, it’s impressive.
But I’m uncomfortable with the training data approach. The open source ecosystem is built on a social contract: developers share their work under specific terms, and users respect those terms. Using that collective output as training data for a commercial product without explicit consent feels like it bends, if not breaks, that social contract.
I don’t think the answer is to stop building AI coding tools — the genie isn’t going back in the bottle. But I’d like to see more transparency from GitHub and OpenAI about the training data, better mechanisms for developers to opt out of training, and serious engagement with the licensing questions rather than hand-waving about fair use.
The open source community built the platform that Copilot stands on. The least GitHub can do is address that community’s concerns with the seriousness they deserve. This conversation is far from over, and how it’s resolved will set important precedents for the intersection of AI and open source for years to come.
