GitHub's claims under scrutiny over Copilot's questionable code quality metrics

GitHub’s claim that its Copilot AI generates superior-quality code has come under scrutiny, with Romanian software developer Dan Cîmpianu leading the charge against its statistical validity.

The tech giant recently published a study claiming that developers using Copilot were 56% more likely to pass all unit tests, wrote 13.6% more error-free lines of code, and produced code that was 1–3% more readable, reliable, maintainable, and concise. Additionally, the study reported a 5% higher likelihood of Copilot users having their code approved.

The study involved 243 Python developers with at least five years of experience. Participants were divided into two groups: one using Copilot and one not. The assigned task was to create a basic web server for managing restaurant reviews. Code submissions were evaluated through peer review by the participants themselves. However, only 1,293 reviews were conducted instead of the expected 2,020, a discrepancy that raised questions about the process.

Cîmpianu criticised the study on multiple fronts. He argued that the chosen task, a simple CRUD application, is widely documented in online tutorials and likely included in Copilot’s training data, potentially biasing the results. He also highlighted inconsistencies in the reporting of key metrics, such as the claim that 60.8% of Copilot users passed all tests compared to 39.2% of non-users, which was not clearly supported by the data provided. Moreover, GitHub’s claim that Copilot users wrote 13.6% more error-free lines of code was criticised as misleading: it amounted to just two additional lines per error, and the errors counted were stylistic concerns and linter warnings rather than functional issues.
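To see why the reporting of the pass-rate figures matters, the short Python sketch below compares the two percentages quoted above in two different ways. It uses only the numbers given in this article. How those numbers relate to the study's raw data is not stated here, so the variable names and the interpretation are assumptions made for illustration.

```python
# Figures as quoted in the article; their exact meaning in the study is an assumption.
copilot_share = 60.8       # percent attributed to the Copilot group
non_copilot_share = 39.2   # percent attributed to the non-Copilot group


def relative_increase(a: float, b: float) -> float:
    """Percentage by which a exceeds b, measured relative to b."""
    return (a / b - 1) * 100


def point_difference(a: float, b: float) -> float:
    """Plain difference between two percentages, in percentage points."""
    return a - b


rel = relative_increase(copilot_share, non_copilot_share)
diff = point_difference(copilot_share, non_copilot_share)
total = copilot_share + non_copilot_share

print(f"relative increase:    {rel:.1f}%")
print(f"percentage-point gap: {diff:.1f}")
print(f"sum of both figures:  {total:.1f}%")
```

The relative increase comes out at about 55%, close to the headline "56% more likely" figure, while the gap in percentage points is 21.6. The two figures also sum to 100%, which would be consistent with them being shares of a single group rather than separate pass rates for each group, although the article does not say which reading is correct.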

Cîmpianu also took issue with GitHub’s claims of a 1–3% improvement in code readability and maintainability, noting that such assessments are highly subjective and were not backed by transparent evaluation criteria. He further questioned the decision to use developers involved in the study as reviewers, suggesting that an independent review process would have been more reliable.

Cîmpianu's criticism echoes findings from other studies. A 2023 report by GitClear indicated that GitHub Copilot reduced overall code quality, while research from Bilkent University found that AI tools such as Copilot, ChatGPT, and Amazon Q Developer often produced code with stylistic flaws. Their output required significant manual correction, with issues in Copilot-generated code taking an average of 9.1 minutes to resolve.

GitHub’s study sheds light on a significant trend: the growing reliance on AI in software development. While Copilot and similar tools can provide valuable assistance, their current limitations highlight the importance of developer oversight. For Cîmpianu, though, the stakes are higher:

"If you can't write good code without an AI, then you shouldn't use one in the first place."

The debate underscores a broader concern about AI's role in creative and technical fields. Tools like Copilot are reshaping how we create, but they are not without controversy.

Via: The Register