Introducing code referencing for GitHub Copilot
Today, we’re announcing a private beta of GitHub Copilot with code referencing that includes a filter to detect code suggestions matching public code on GitHub.
Make more informed decisions about the code you use. In the rare case where a GitHub Copilot suggestion matches public code, this update will show a list of repositories where that code appears and their licenses. Sign up for the private beta today.
Over the course of the last year, GitHub Copilot, the world’s first at-scale AI pair programmer trained on billions of lines of public code, has attracted more than 1 million developers and helped over 27,000 organizations build faster and more productively. During that time, many developers told us they want to see when GitHub Copilot’s suggestions match public code.
Today, we’re announcing a private beta of GitHub Copilot with code referencing that includes an updated filter which detects and shows context of code suggestions matching public code on GitHub. When the filter is enabled, GitHub Copilot checks code suggestions with surrounding code of about 150 characters and compares it against an index of all the public code on GitHub.com. Matches—along with information about every repository in which they appear—are displayed right in the editor. Developers can now choose whether to block suggestions containing matching code, or allow those suggestions with information about matches.
Why? Some want to learn from others’ work, others may want to take a dependency rather than introduce new app logic, and still others want to give or receive credit for similar work. Whatever the reason, it’s nice to know when similar code is out there.
Let’s see how it works.
How GitHub Copilot code referencing works
With billions of files to index and a latency budget of only 10-20ms, it’s a miracle of engineering that this is even possible. Still, if there’s a match, a notification appears in the editor showing: (1) the matching code, (2) the repositories where that code appears, and (3) the license governing each repository.
Why code referencing matters
In our journey to create a code referencing tool, we discovered a few interesting things:
First, our previous research suggests that matches occur in less than one percent of GitHub Copilot suggestions. But that one percent isn’t evenly distributed across all use cases. In the context of an existing application with surrounding code, we almost never see a match. But in an empty or nearly empty file, we see matches far more often.
Suggestions are heavily biased toward the prompt so GitHub Copilot can provide suggestions tailor-made for your current task. That means, in an existing app with lots of context, you’ll get a suggestion customized for your code. But in an empty, or nearly empty file, there’s little to no context. So, you’re more likely to get a suggestion that matches public code.
We’ve also found that when suggestions match public code, those matches frequently appear in dozens, if not hundreds of repositories. In some ways, this isn’t surprising because the models that power GitHub Copilot are akin to giant probability machines. A code fragment that appears in many repositories is more likely to be a “pattern” detected by the model—similar to the patterns we see elsewhere in public code.
For example, research on Java projects finds that up to 11% of repositories may contain code that resembles solutions posted to Stack Overflow, and the vast majority of those snippets appear without attribution. Another study on Python found that many matches are too generic to trace to an original usage. A smaller-scale study found that Stack Overflow answers contain code from Android applications.
Finally, the repositories with matching code are often governed by multiple, sometimes conflicting licenses, which makes attributing a match to its source more challenging. By consulting a list of references, developers can now:
- Determine whether, what, and who to attribute rather than having matches simply blocked from the outset.
- Learn from other developers by studying their approaches to similar problems.
- Take a dependency on an open source library to avoid new business logic.
- Evaluate the context of code before accepting a matching suggestion.
- Discover new projects and contribute upstream.
Building a better developer experience and community
This is just the beginning. As GitHub continues to innovate, we will always strive to help developers stay in the flow, build creatively, and maintain a transparent connection to the community. We’re excited for you to try GitHub Copilot with code referencing.
Happy coding.
Tags:
Written by
Related posts
Announcing GitHub Secure Open Source Fund: Help secure the open source ecosystem for everyone
Applications for the new GitHub Secure Open Source Fund are now open! Applications will be reviewed on a rolling basis until they close on January 7 at 11:59 pm PT. Programming and funding will begin in early 2025.
Software is a team sport: Building the future of software development together
Microsoft and GitHub are committed to empowering developers around the world to innovate, collaborate, and create solutions that’ll shape the next generation of technology.
Does GitHub Copilot improve code quality? Here’s what the data says
Findings in our latest study show that the quality of code written with GitHub Copilot is significantly more functional, readable, reliable, maintainable, and concise.