Before I say anything here, allow me to admit that this piece is not going to age well. We are constantly learning, exploring, and pushing the boundaries of how large language models (LLMs) apply to mass refactoring. It's worth writing something now, though, because we do know some things.
There is no doubt that Copilot has increased the speed of my code authorship. I have some heartburn when I hear simplifications like "Copilot writes 30% of new code" because the reality is more complicated than that. It is above all a human-machine interaction, where suggestions in many cases save me a lot of mundane typing but are just as often completely bogus. And that's fine!
Generative AI is best in single-point suggestive applications.
The same AI prompt applied to the same input will not always generate the same output. You can quickly test this with this unit test, which successfully moves a switch default case about 70% of the time. When I am authoring new code, success ratios like this lead to fantastic time savings, because I'm rolling the dice on practically every single word I type.
If I were trying to move the default case of every switch statement across 500 million lines of code, 70% success would imply a lot of manual review.
- Single-point: designed to operate at a single cursor position, which makes it ideal for human-machine interaction since a human always has a single focus point.
- Suggestive: not guaranteed to be accurate. Humans very quickly accept or discard suggestions.
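To put rough numbers on that, here is a minimal sketch in Python. The 70% figure is the one quoted above; the occurrence counts and the function names `expected_failures` and `all_succeed_probability` are made up for illustration, and the model assumes every attempt succeeds or fails independently, which is a simplification.

```python
def expected_failures(occurrences, success_rate):
    """Expected number of changes that come out wrong and need a manual fix."""
    return round(occurrences * (1 - success_rate))


def all_succeed_probability(occurrences, success_rate):
    """Chance that every single change in a batch is correct (independence assumed)."""
    return success_rate ** occurrences


# Hypothetical batch sizes, using the 70% success rate from the text.
print(expected_failures(1_000, 0.7))        # 300
print(all_succeed_probability(10, 0.7))     # roughly 0.028
```

Even a batch of ten changes is unlikely to come back fully correct at this success rate, so at scale every change has to be treated as unverified until someone looks at it.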
Rule-based systems are best in multi-point authoritative applications.
A rule-based approach to refactoring switch statements could be provably correct 100% of the time, but the recipe is more complicated to develop. This is the recipe using an AI prompt and this is the same recipe written for 100% accuracy.
- Multi-point: designed to operate at many (potentially hundreds of millions of) distinct cursor positions in the source code.
- Authoritative: guaranteed to be accurate.
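To give a feel for why the rule-based version takes more thought, here is a toy sketch in Python. It is not the actual recipe and does not use any real refactoring framework's API: `Case`, `terminates`, and `move_default_last` are hypothetical names, and statements are modeled as plain strings. The point is that the rule fires only when fall-through cannot change behavior, and that the same input always yields the same output.

```python
from dataclasses import dataclass

# Statements that leave the switch, so control cannot fall into the next case.
TERMINATORS = ("break", "return", "throw")


@dataclass(frozen=True)
class Case:
    label: str    # e.g. "case 1" or "default"
    body: tuple   # statements as plain strings (toy model, not a real syntax tree)


def terminates(case):
    """True if the case's last statement starts with a terminator keyword."""
    if not case.body:
        return False
    words = case.body[-1].replace(";", " ").split()
    return bool(words) and words[0] in TERMINATORS


def move_default_last(cases):
    """Return the cases with `default` moved to the end, or unchanged if unsafe."""
    idx = next((i for i, c in enumerate(cases) if c.label == "default"), None)
    if idx is None or idx == len(cases) - 1:
        return list(cases)  # no default, or it is already last
    # Moving is only safe when nothing falls out of the default case,
    # nothing falls into it from the preceding case, and the current last
    # case would not start falling into the relocated default.
    if not terminates(cases[idx]):
        return list(cases)
    if idx > 0 and not terminates(cases[idx - 1]):
        return list(cases)
    if not terminates(cases[-1]):
        return list(cases)
    return [c for i, c in enumerate(cases) if i != idx] + [cases[idx]]


switch = [
    Case("case 1", ("a();", "break;")),
    Case("default", ("d();", "break;")),
    Case("case 2", ("b();", "break;")),
]
print([c.label for c in move_default_last(switch)])
```

Most of the code is guard conditions rather than the move itself, which is where the extra development effort for an authoritative rule tends to go.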
There is a tradeoff between recipe development speed and manual review time.
For refactoring operations that have a relatively small number of touches in a large codebase (say 1,000 occurrences), I believe leveraging prompt-based approaches to make the change and accepting more manual review is worthwhile.
For core recipes that are going to be used repeatedly or serve as building blocks for moving up the value chain, 100% accuracy is a surer footing.
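The tradeoff can be sketched as a simple cost model. Every number below is a hypothetical placeholder rather than a measurement: the development hours, the minutes of review per change, and the assumption that a prompt-based change needs every occurrence reviewed while a rule-based one needs none. The function names `total_hours` and `break_even_occurrences` are likewise just for illustration.

```python
def total_hours(dev_hours, occurrences, review_fraction, review_minutes_each):
    """Development time plus the time spent manually reviewing changes."""
    review_hours = occurrences * review_fraction * review_minutes_each / 60
    return dev_hours + review_hours


def break_even_occurrences(prompt_dev_hours, rule_dev_hours, review_minutes_each):
    """Occurrence count at which reviewing every prompt-based change costs as
    much as the extra development time of the rule-based recipe."""
    return (rule_dev_hours - prompt_dev_hours) * 60 / review_minutes_each


# Hypothetical inputs: 2 hours to write a prompt-based recipe, 40 hours for a
# rule-based one, 2 minutes to review each prompt-based change.
for occurrences in (1_000, 100_000):
    prompt_cost = total_hours(2, occurrences, 1.0, 2)
    rule_cost = total_hours(40, occurrences, 0.0, 2)
    print(occurrences, round(prompt_cost, 1), round(rule_cost, 1))

print(break_even_occurrences(2, 40, 2))
```

Under these made-up inputs the prompt-based approach wins at around 1,000 occurrences and loses badly at 100,000; different inputs move the crossover point, but the shape of the tradeoff stays the same.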
Neither is a replacement for the other: they are just two different tools, each suited to a different kind of application.