title: The Illusion Of Developer “Productivity” Opens The Door To Snake Oil
url: https://codemanship.wordpress.com/2023/09/25/the-illusion-of-developer-productivity-opens-the-door-to-snake-oil/
hash_url: d048e59b32
There’s been much talk about measuring the productivity of software developers, triggered by a report from management consultants McKinsey claiming to have succeeded where countless others over many decades have failed.
I’m not going to dwell on the contents, as I prefer not to flatter it with that kind of scrutiny. Suffice to say, their ideas are naive at best. File in the usual place with your used egg shells and empty milk cartons.
What McKinsey’s take suffers from is very, very common: they’ve mistaken activity for outcomes. Activity is easy to measure at an individual level, outcomes not so much. In fact, outcomes are often difficult to quantify at all.
My first observation is that it’s not individual developers who produce outcomes; outcomes are achieved by the team. Just as there are players in a football team who rarely score goals themselves, but without whom fewer goals would be scored, there are usually people in a dev team who would look “unproductive” by McKinsey’s yardstick, but without whom the team as a whole would achieve much less. (See Dan North’s brilliant skewering of their metrics, using the highly valuable Tim Mackinnon as the example; I know just how valuable he is, because I’ve worked with him.)
My second observation is that outcomes in software development are rarely what they seem. Is our goal really to deliver code? Or is it to solve customers’ problems? Think of a doctor: is their goal to deliver treatments, or is it to make us better?
We in software, sadly, tend to be in the treatments business, not in the patients business. We’re Big Pharma. And in the same way that Big Pharma invests massively in persuading us that we have the illness their potion cures, we have a tendency to try to get the customer’s problem to fit our solution. And so it is that “productivity” tends to be about the potion, and not the patient.
And so I wholeheartedly reject this individualist, mechanistic approach to measuring developer productivity. It’s nonsense. But I can understand why the idea appeals to managers in particular. The Illusion of Control™ has a strong pull in a situation where, in reality, managers have no real control beyond what to fund and what not to fund, and who to hire and who to fire. Who wouldn’t want those decisions to appear empirical and rational, and not the gambles they actually are?
But more important to me is how this illusion can impact the very real business of solving customers’ problems with software. When all our focus is on potions and not patients, it’s easy for Snake Oil to creep into the process.
At the time of writing, there’s much talk and incredible hype about one particular snake oil that promises much but, as far as I’ve been able to verify with concrete examples, delivers little to nothing for patients: Large Language Models.
Code generation using LLMs like ChatGPT is, like all generative A.I., impressive but wrong. Having spent more than one hundred hours experimenting with GPT-4 and trying to replicate some of the claims people are making, I’ve seen how the illusion of productivity can suck us in. Yes, you are creating code faster. No, that code doesn’t work a lot of the time.
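To make that concrete, here’s a hypothetical sketch of the kind of thing I mean (my own illustrative example, not output from my experiments): code that reads fine, runs, and is wrong in a way that only bites on certain inputs.

```typescript
// A classic JavaScript pitfall that plausible-looking generated code
// often falls into: sorting numbers without a comparator. This compiles,
// runs, and looks finished. It isn't correct.
function median(values: number[]): number {
  // BUG: Array.prototype.sort() with no comparator sorts lexicographically,
  // so [2, 10, 1] becomes [1, 10, 2] and the "median" comes back as 10.
  const sorted = [...values].sort();
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

console.log(median([2, 10, 1])); // prints 10; the real median is 2
```

Measure “lines shipped” and this counts as productivity. Measure working software and it doesn’t.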
But if we measure our productivity by “how far we kick the ball” instead of “how many goals the team scores”, that can seem like a Win. It’s the same trap as thinking that skimping on developer testing, or skipping it altogether, helps us deliver sooner. Deliver what, exactly? Bugs?
On their website, GitHub claim that 88% of developers using Copilot feel more productive. But what percentage of developers also feel that skipping some developer testing helps them deliver working software sooner? I could take a wild guess at somewhere in the ballpark of 88%, perhaps.
They did a study, of course. (There’s always a study!) They tasked developers with writing a web server in JavaScript from scratch, some using Copilot, some doing it all by hand. And lo and behold, the developers who used Copilot completed that task in 55% less time. Isn’t it marvellous how vendor-funded studies always seem to back up their claims?
But let’s look a little closer, shall we? First of all, since when were customer requirements like “Write me a web server”? A typical software system used in business, for example, will have complex rules that are usually not precisely defined up front, but rather discovered through customer feedback. And in that sense, how quickly we converge on a working solution will depend heavily on iterating, and on our ability to evolve the code. This wasn’t part of their exercise.
Also, ahem… if there was ever a problem that Copilot was already trained on, it’s JavaScript web servers. People have noted how good GPT-4 is at solving online coding problems that were published before its training data cut-off date, but not so hot at solving problems published after it. I’d like to see how it performs on a genuinely novel problem. (In my own experiments with it, poorly.)
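For a sense of just how well-trodden that territory is, here’s roughly what the minimal version of the study’s task looks like in Node.js (my sketch; the study’s actual exercise was more involved than this):

```typescript
// A minimal Node.js web server. Variations on this exact snippet appear
// in thousands of tutorials, Stack Overflow answers and public repos,
// which is precisely the material Copilot's training data is saturated with.
import { createServer } from "node:http";

const server = createServer((req, res) => {
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("Hello, world\n");
});

server.listen(3000, () => {
  console.log("Listening on http://localhost:3000");
});
```

Completing a task like this faster tells us very little about how a tool copes with a problem it hasn’t effectively memorised.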
And two more observations:
First, this study focuses on developers working alone for a relatively short amount of time. Let’s see how it performs when a team works on a more complex problem, each member on their own part of the system, for several days. That’s a lot of rapidly-changing context for an LLM. It’s easy to fool ourselves into believing something makes us better at running marathons because it helped us run the 100m dash faster.
Secondly, GitHub’s musings on measuring developer productivity suffer from a very similar “potions over patients” bias to the McKinsey report.
And vendors have a very real incentive to want us to believe that the big problems in software development can be solved with their tools.
Given the very high stakes for our industry – probably visible from space by now – I think it would be useful to see bigger, wider and more realistic studies of the impact of tools like Copilot on the capability of teams to solve real customer problems. As with almost every super-duper-we’ll-never-be-poor-or-hungry-again CASE tool revolution that’s come before, I suspect the answer will be “none at all”. But you can’t charge $19 a month for “none at all”. (Well, okay, you can. Just as long as there are enough people out there who focus on potions instead of patients.)
But here’s the thing: I suspect bigger, wider, longer, more realistic studies of the impact on development team productivity might reveal simply that we still don’t know what that means.