The Dual-Edged Sword of using LLMs for Developer Productivity

Today’s LLMs have been rushed into adoption, and organizations must learn to recognize and navigate their pitfalls. Doing so will let them leverage LLMs far more effectively.

I’ll never forget the first time I saw ChatGPT generate code. It wasn’t just IntelliJ suggesting a quick fix or refactor for some trivial mistake. No, this was a tiny white dot moving on the screen, generating an entire method, then another, and another — in seconds, complete with instructions on how to run it. It was magical.

And I wasn’t alone in those thoughts. People were buzzing about how OpenAI’s ChatGPT was a breakthrough that could upend entire industries while creating new ones. At the height of the hype, there were whispers of singularity, of a world transformed — for better or worse, depending on your outlook.

But as with all bubbles, the frenzy began to settle, and we started to explore how to integrate LLMs into our workflows practically. For many who were only tangentially aware of the field of ML, LLMs (large language models) became household terms. These models are an advanced variant of neural networks, a specific approach to machine learning. Essentially, they predict the next word in a sequence. And that’s all they do. But they do it extraordinarily well.
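To make "predict the next word" concrete, here is a toy sketch of my own (nothing like a production LLM, which uses a deep neural network trained on vast corpora, but the generation loop has the same shape): count which word follows which in a corpus, then repeatedly pick a likely next word and append it.

```python
# Toy illustration of "predict the next word": a bigram model built from a
# tiny corpus. Real LLMs are vastly more sophisticated, but the loop --
# pick a likely next word, append it, repeat -- is the same shape.
from collections import Counter, defaultdict

corpus = "the model predicts the next word and the next word after that".split()

# Count which word follows which.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def generate(start, length=5):
    out = [start]
    for _ in range(length):
        options = following.get(out[-1])
        if not options:
            break
        # Greedy decoding: always take the most frequent next word.
        out.append(options.most_common(1)[0][0])
    return " ".join(out)

print(generate("the"))  # → the next word and the next
```

Swap the greedy pick for weighted random sampling and you get the "temperature" behavior that chat interfaces expose.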

Eagle-eyed corporations have jumped on the bandwagon of leveraging LLMs in a variety of ways. Here, I talk about one of the more obvious and popular ones — developer efficiency. I’ve been using LLM-based developer-experience (devX) tools at work and at home, and have also worked with and managed software engineers who have leveraged them in various ways.

Andy Jassy announced on X that Amazon saved 4,500 developer-years of work by using LLMs for software upgrades. These numbers are massive — it’s nothing short of a revolution in this field. But amid all this excitement, I’ve also learnt that there are important caveats we must talk about.

What we can learn from the humble pocket calculator


An AI generated image of a pocket calculator in an article which talks about the pitfalls of using AI

Back when I was a kid, one hot summer day as I was struggling to solve a tedious math problem, I threw my hands up in the air and asked my dad — “Why can’t I just use the calculator? This is such a waste of time”. And he said — “You need to know how to do math without one. You may not always have a calculator”. To which I countered — “How often will I need to know what 134.42 divided by 9 is?”

A decade later, I still haven’t needed to know what 134.42 divided by 9 is. And yet, the ability to calculate quickly has been useful. Not essential, not critical, but useful no doubt. If I’m in a grocery store and I see similar products, that quick mental math helps me compare the value. Or if I’m glancing at the receipt to roughly confirm the total matches my purchase, it helps to be good at this. I’m not whipping out my calculator each time.

Calculators took a decade to roll out from the minds of researchers to the masses. This gave policymakers time to adapt (and calculator manufacturers time to build what customers wanted). Meanwhile, it took only months from the public beta release of ChatGPT for organizations to adopt tools like GitHub Copilot.

When the calculator was massively adopted, educators modified their approach: kids first build foundations incrementally, and only once those are in place do they move on to the calculator as a tool for more advanced math. A similar approach has to be deliberately cultivated in organizations.

A tool, not a crutch

An AI generated image of a tiny crutch and a massive tool in an article which talks about the pitfalls of using AI

I noticed that Violet and Ryan, mid-level engineers on my team working on the same project, suddenly appeared massively productive in their code reviews. Lots of comments, lots of approvals, and lots of revisions requested quickly. When we talked about this in our 1:1, Violet mentioned how Copilot was helping her accelerate her velocity.

“It’s so much easier now. When Ryan publishes his PR, I run it by Copilot to understand what he did, and even ask for suggestions I should comment on,” she confided in me.

I appreciated how she was using the newly introduced Copilot and filed it away in the back of my mind for a while.

Until I revisited that code a few weeks later. It was hard to understand: long, messy lines of code with no clear separation of boundaries. Methods that handled multiple responsibilities. Field names that were verbose but just didn’t seem right. Coupling where it wasn’t necessary.

In our next 1:1, while Violet and I were talking about that project, I mentioned how I was finding that bit of code hard to grasp. She said, “That’s alright, since you’re not familiar with it. I just use GitHub Copilot to explain it to me.”

And then it dawned on me. Ryan and Violet had made Copilot a crutch, not a tool. It was just so easy to ask Copilot to explain the code — why bother to read it yourself? Why run through each line and analyze various ways of refactoring it when LLMs can do it so much faster? Indeed, the proof was in the pudding: together they shipped the project ahead of the deadline.

But I wasn’t convinced just yet. As an experiment, I asked Chris, who had never worked on this codebase and wasn’t using Copilot, to make a small change. And sure enough, he struggled with it. It was just too complex.

There was no incentive for Ryan and Violet to write good code.

After all, they could understand poorly written code now, and even faster than before. But poorly written code does not scale. Even the best LLM is only as good as its masters. If we ask it to explain code, it isn’t here to speak out of turn and comment on how the code is poorly written and must be rewritten. It will simply explain what the code does. A wiser engineer would instead refactor the code so that it explains itself.
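To make "code that explains itself" concrete, here is a hypothetical before/after sketch (the names and logic are invented for illustration): the first version works but needs an LLM, or a patient colleague, to decode it; the second carries its own explanation.

```python
# Before: terse names and packed conditions. A reader has to reverse-engineer
# what "d", "t", and the tuple positions mean.
def proc(d, t):
    r = []
    for k, v in d.items():
        if v[1] and v[0] > t:
            r.append(k)
    return r

# After: the same behavior, but the code is its own explanation.
def active_users_over_threshold(users, login_threshold):
    """Return names of active users with more logins than the threshold."""
    return [
        name
        for name, (login_count, is_active) in users.items()
        if is_active and login_count > login_threshold
    ]

users = {"alice": (5, True), "bob": (1, True), "carol": (9, False)}
print(proc(users, 2))                          # → ['alice']
print(active_users_over_threshold(users, 2))   # → ['alice']
```

Same output, but only one of these survives a code review, or a new teammate, without an interpreter.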

The tool had become a crutch.

Explaining poorly written code in PRs is not the only risk here.

Ryan and Violet used Copilot to generate unit tests, but never questioned whether the unit tests were appropriate.

They used it to add validation that was simply unnecessary.

And perhaps most interestingly, Violet would review Ryan’s Copilot-augmented code using Copilot herself, which meant it was no longer a peer review. It was AI evaluating itself.

The following week, we sat down and discussed how we could establish guidelines to use CoPilot responsibly.

Instead of using Copilot to understand a coworker’s code, could we use it to suggest refactored code?

That’s a good start. But we would just be blindly following whatever Copilot tells us “good” code is.

Instead of using Copilot to suggest refactored code, could we use it to explain the principles of good code through that refactored code?

That way, we would learn the basis for why Copilot believes something is better, which is a pretty good place to be.

I would recommend going a step further: also understand how Copilot came to that conclusion (with sources), and challenge that opinion when you’re not convinced.

Another way for organizations to adopt LLMs is to establish guardrails around what kind of areas it can provide opinions on. For example — the ability to identify poorly written exception handling, or missing logging and metrics. This allows developers to “learn” from the LLM, while still establishing their critical thinking skills.
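As a concrete illustration of the exception-handling guardrail, here is a hypothetical Python sketch (the `Database` class and order-saving scenario are invented): the first version swallows failures silently, exactly the pattern such a guardrail could flag; the second logs with context and re-raises.

```python
import logging

logger = logging.getLogger(__name__)

class Database:
    """Hypothetical persistence layer that can fail."""
    def write(self, order):
        raise ConnectionError("db unavailable")

db = Database()

# Before: the exception is swallowed silently. No log line, no metric, and
# the caller has no idea the write failed.
def save_order_silently(order):
    try:
        db.write(order)
    except Exception:
        pass  # the failure disappears here

# After: the failure is logged with context and re-raised so callers can react.
def save_order(order):
    try:
        db.write(order)
    except Exception:
        logger.exception("Failed to save order %r", order)
        raise
```

A guardrail scoped to patterns like this lets developers learn from the LLM’s suggestions without outsourcing their judgment on design.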

When overconfidence leads to trouble


An AI generated image of an overconfident elephant in an article which talks about the pitfalls of using AI confidently

I was embarking on a personal project, something I was building from scratch to, ironically, leverage LLMs. And I needed code written fast. I didn’t care about code quality. I didn’t care for this to last. I just cared about speed. And ChatGPT delivered. Or so I thought.

As the project got more complex, I found myself slowing down. I felt myself writing code for problems that surely had already been solved.

Did a library exist for that somewhere? Maybe I’ll do a Google search on it.

And sure enough, the first result (minus the ads) took me right to what I needed.

Why didn’t ChatGPT tell me this? I was so annoyed. It had confidently listed three different libraries to solve my problem, but never this one.

So I asked ChatGPT to tell me more about this library. And it did know about it, spitting out details on how to use it.

Well, why didn’t you tell me earlier? I asked the person behind the curtain, annoyed, acutely aware I’d only receive an AI-generated apology in response.

I reflected on this and realized how the human-like responses make me trust ChatGPT far more than I trust a web search.

I’m well aware of ChatGPT’s overconfidence, and yet when it says the best way of doing X is using libraries A, B, and C, I run with it.

Sure enough, I noticed these biases among my engineers too, who were starting to make decisions based on these recommendations. If ChatGPT said explicit null values are good, they’re good. After all, it has learned from the cumulative sum of human history to come up with that answer.

Now, to be clear, ChatGPT can greatly accelerate projects when working in unfamiliar areas, and I highly recommend leveraging it in those situations — as long as we also remember that LLMs shouldn’t earn our trust by default. Every new prompt is a reset in trust.

Garbage in = garbage out

An AI generated image of garbage in an article which is about AI but is not garbage


Perhaps the biggest risk of an LLM-dominated future is the problem of training data. It’s well documented at this point that the world is running out of meaningful data to train LLMs on. Code-focused LLMs have been trained on highly regarded sources like Stack Overflow threads and research papers, but also on more subjective, erratic-quality data like personal blogs and social media threads.

Meanwhile, more and more of the content being generated today is itself produced with LLMs. This means the training data could be the same as what is being served — a classic ML feedback loop problem, which risks reinforcing biases and stifling discovery. Kevin Roose, a journalist at the NYT, ran an interesting experiment in which he manipulated ChatGPT to enhance his perceived image. Sure enough, within a few weeks, he had the AI responding differently to prompts about him as a public figure (you should read the whole post, it’s fascinating). The same goes for generated code, where opinions on the internet can quickly be presented as facts.
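To see how quickly a train-on-your-own-output loop can entrench one opinion, here is a toy simulation of my own (the "tabs vs. spaces" framing is invented): each generation is "trained" on samples of the previous generation's output, and diversity eventually collapses to a single answer.

```python
# Toy model of the feedback loop: a "model" that only ever repeats opinions
# sampled from its predecessor's output. Pure resampling noise is enough to
# drive the minority view extinct -- no actual bias required.
import random

random.seed(0)

# The initial "internet": a mildly skewed mix of coding opinions.
opinions = ["tabs"] * 55 + ["spaces"] * 45

generation = 0
while len(set(opinions)) > 1:
    # Each generation's training data is sampled from the previous output.
    opinions = random.choices(opinions, k=100)
    generation += 1

print(f"After {generation} generations, the model only ever says: {opinions[0]}")
```

Real training pipelines have many more moving parts than this sketch, but the direction of the risk is the same: what was once one opinion among several becomes the only answer the model knows.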

Oversimplification isn’t always the best

An AI generated image of a big red button fixing everything in an article which says that AI cannot completely fix our communication


Contrary to popular belief, developers don’t just sit and code all day. Some of us talk to humans too, if the need arises.

A while ago, my favorite collaboration tool released a way to summarize conversations using LLMs. When I came back from a week-long vacation, I was greeted with a long thread and a private message from one of the participants: could you help resolve this asap?

As I tried to make sense of that thread, it went down several rabbit holes — investigations into issues that turned out to be non-issues, deadline changes, alternative design options, misunderstandings, incomplete thoughts, or guesses.

But then I remembered the fancy new AI summarization tool that had just launched. One click, and I had a crisp summary of what happened. It said: Brian concluded we cannot do option A and we should consider option B.

With that, I went to my coworker Brian and asked him why we were considering design B when we had already aligned previously on A. He said he was still aligned with A, but recent changes in the service along with timeline changes meant that B was an appealing option.

Ah, that makes sense — I thought.

LLMs have a hard time understanding nuance and individual personality traits.

When Brian says, “We can’t do option A anymore,” he means, “A is challenging but still possible if someone takes the lead to unblock us.” I would have known that if I had carefully read the thread. But the LLM doesn’t know that. Just this September, a new study concluded that AI is much worse than humans at summarizing.

The interesting thing with any tool that becomes critical is that if it works for 99 percent of situations but fails for 1 percent, the tool still fails. We cannot trust it. Imagine 1 percent of your grocery receipts having wildly incorrect totals. Or 1 percent of cars on the road spontaneously combusting. LLMs, by their very nature, are probabilistic; they are still in their infancy and far more likely to be inaccurate. The stakes aren’t high at the moment, which is why we all love these tools. At worst, it’s a few more lines of conversation with a human. But if our adoption outpaces the rate at which quality improves, these mistakes will be costly.

Security and Privacy

An AI generated image representing the security and privacy risks in an article which talks about the pitfalls of using AI


Plenty of organizations are waking up to the threat of LLMs leaking sensitive data. A few months ago, Slack faced flak for using customer data to train AI models by default. In many other situations, employees may deliberately or inadvertently feed information into LLM prompts that their organizations would rather not share.

Those who sell LLMs would love to use consumer data to train their models to be better. But like with existing ads and recommender systems, it’s a dance between privacy and quality (or profits). Organizations adopting LLMs for dev tooling should be aware of how their data may be used.

Closing thoughts

LLM-based dev tools and enterprise products are an accelerated way to make developers productive and let them focus less on mundane tasks. Many organizations are rapidly adopting them for that advantage, and much has been written about it. We have only begun this journey, and the landscape a few years from now will look quite different, even if the AI under the hood stopped improving in accuracy today.

The organizations that learn to recognize and navigate the pitfalls of today’s hastily adopted LLMs will be the ones that leverage them most effectively.

The world of software development is advancing at a breakneck pace, and the tools and frameworks we will take for granted tomorrow are still in their infancy. Coding with them can be a wild, unpredictable ride, with bugs that defy logic and fixes that play hide-and-seek in the depths of your codebase.

Disclaimer: While the events and characters in this post are fictional, they are inspired by real conversations and experiences, either my own or those I have observed.
