back to article AI can improve on code it writes, but you have to know how to ask

Large language models (LLMs) will write better code if you ask them, though it takes some software development experience to do so effectively – which limits the utility of AI code help for novices. Max Woolf, senior data scientist at Buzzfeed, on Thursday published an experiment in LLM prompting, to see whether LLMs can …

  1. RosslynDad
    Facepalm

    Await Fix

    As an experiment with Ollama I asked it to write a .NET task-based multi-threaded web download method in C# (as you do). It came up with something plausible that used the C# await keyword, however it hadn't marked the method as async. I said I thought the async was needed and it responded with something akin to "Yes, you're right. Here's the proper version." Which included the async keyword. So it can correct it generated output, but why give me erroneous code in the first place?

    1. Brewster's Angle Grinder Silver badge

      Re: Await Fix

      Why do humans make mistakes and need to alter their first drafts?

      1. Anonymous Coward
        Anonymous Coward

        Re: Await Fix

        Because they can't fully remember the thousands of pages of specs and code they've read, so occasionally forget things?

        The "AI", however...

        1. Roland6 Silver badge

          Re: Await Fix

          >” The "AI", however...”

          Demonstrated it too makes mistakes and overlooks stuff; otherwise this experience t would not have shown so much improvement.

          1. doublelayer Silver badge

            Re: Await Fix

            I disagree there. The AI did almost the same kind of thing that I would have done. I wouldn't have done the first step, because my bias is always to do things incrementally when possible to avoid wasting space, but the rest of the steps match my typical workflow:

            Stage 1: You want a solution to your problem. Write some code that solves the problem.

            Stage 2: You want the code to run faster. Identify the simplest bottleneck in the code I wrote and fix it, again as quickly as possible.

            Stage 3: Improve the tools I'm using to speed it up.

            Stage 4: Can I parallelize?

            The code didn't bother with stage 5: try to find ingenious mathematical solutions to get it done faster, and it probably couldn't find any. It instead used a different stage 5: add extra stuff nobody asked for. However, the progression otherwise makes sense. I also tend to try to get some information before starting this process on how fast the code needs to be so if you need stage 3 performance, I don't bother with the first two. Since many problems are just fine with stage 1 performance*, it isn't automatically bad to start there.

            * For example, if they're going to run a script automatically once a week and the simple version takes ten minutes to run. I could spend three hours writing new code and the script would now take eight minutes to run. The time savings would take two years to equal out, so they don't care. But I could also take an extra eight hours and the script would take ten seconds to run, so now the time savings equal out in less than a year. They still don't care, because for something that runs only once a week and doesn't need human attention, nobody cares that the computer does it for ten minutes. If they run it every hour or if it's going to need to scale to millions of requests or entries, I would speed it up. Otherwise, they want lowest maintenance costs and time to write it, with execution speed a sacrifice they don't even notice they've made.

  2. thosrtanner
    Facepalm

    That's a fairly loose definition of improve if it goes faster but has more bugs in it. Or am I being difficult if I expect 'improve' to not include 'more bugs' as a result?

    1. Roland6 Silver badge

      The numbers don’t add up….

      Taking the actual numbers given, and doing the math, I don’t believe the cumulative levels of improvement claimed.

      The final improvements seem to indicate the code does nothing more than a “Blue Peter”: “here’s one I produced earlier”.

  3. b0llchit Silver badge
    WTF?

    Doing less worse faster

    If I am getting this right,... The programmer using the LLM needs to review the generated code and re-engineer the prompt for each and every time something does not pass review. And we are talking about rather simple problems.

    How is this effective in a larger setting? Normal problems are not simple because they are much larger, intertwined with other problems and have a large set of boundary conditions that must be fulfilled. The programmer might just put his/her thinking cap on, think up a solution and tap it into an editor. Few iterations with proper and appropriate test-cases and done. And all of that without trying to teach an LLM to do its job worse than any experienced programmer.

    1. Rich 2 Silver badge

      Re: Doing less worse faster

      Doing less worse faster?

      That’s “Agile” isn’t it?

  4. chuckufarley
    Joke

    So now software devs have to...

    ...learn to write Rust and how to write prompts for LLMs? Queue riots.

    1. Fonant

      Re: So now software devs have to...

      s/Queue/Cue/

      soz!

      1. chuckufarley

        Re: So now software devs have to...

        You think they won't be lining up for it?

  5. Anonymous Coward
    Anonymous Coward

    I have a very simple view of LLM's and "AI" writing code. I remember learning. I remember all those tutorials available on the internet and how they could get me started but missed out so much. I remember all the times I had to look things up when things didn't work and how half the time they didn't fix it or if they did but they introduced another issue. I remember learning to go to the language itself and the documentation to fully understand something to the point where I was happy with it and confident I knew exactly what it was doing and why. Learning the correct way to interact with said language or piece of code.

    An LLM can't do that. It's using all the reference data I used but it can't tell the right way from wrong. It's a nice idea don't get me wrong. I'm not a luddite. I just hope we don't get to a point where it's the defacto way of doing things but I know how the world works and if you can get people to code on the cheap then people will do it.

    1. Handlebars

      It's using the same reference data but not in the same way that you you do.

  6. Fonant
    FAIL

    I tried out LLM code writing in PhpStorm. The LLM would generate multiple lines of code that nearly, but not actually, did what I wanted it to do. Plausible but incorrect: LLMs are bullshit generators, nothing more. You cannot trust them to be accurate or correct.

    1. Anonymous Coward
      Anonymous Coward

      To precie your comment.

      LLMs are Plausible Bullshit generators that cannot be trusted to be accurate or correct !!!

      Ding ding ding ....

      Got it in one !!!

      :)

    2. Dom 3

      [fx: waves at fonant]

      "In other words, to get a good answer from an LLM, it helps to have a strong background in the topic of inquiry."

      It's *essential*. If you can't spot when the makey-uppy machine has made something up that is Just Plain

      Wrong, you're hosed. Doesn't matter whether it's code or the script for a "factual" radio show.

      (Yes I am looking at you, Radio 4).

  7. Anonymous Coward
    Anonymous Coward

    Why bother?

    "find the difference between the smallest and the largest numbers whose digits sum up to 30 [for] integers between 1 and 100,000". By inspection:

    smallest = 3,999

    largest = 99,930

    difference = 95,931

    done!

    Rationale: with 1,000,000 integers of value 1 to 100,000 randomly generated, each integer has, on average, possibility of being generated 10 times. So all integers between 1 and 100,000 are quite likely to be produced at least once. Ergo, take the max and min, subtract, and done. If AI existed, it would have come up with this ...

    1. doublelayer Silver badge

      Re: Why bother?

      But there is a chance that at least one of 3,999 or 99930 do not appear in your set. True, that solution is probably more likely to pass tests than the more advanced methods the LLM did try because the chance that the knot-cutting solution is incorrect is 0.0045%, but it is guaranteed to be wrong for that 0.0045% of cases.

  8. Brewster's Angle Grinder Silver badge

    What the report doesn't say is it ends up vectorising and parallelising the code. And switches from an interpreted library to a JIT library.

  9. Blackjack Silver badge

    Well, there is a few videos by a guy who codes games for the Nintendo 64 (yes really) and he tested AI helping him to code, he got mixed results.

    Granted if he tried again now the tesults may be better or worse because these AI chatbots eat a lot of garbage data and just because they get feed similar garbage data a million times it doesn't stop the data from being wrong.

    If you want AI to help coding get a specific AI that does that, these glorified chatbots are just too unreliable.

    1. Anonymous Coward
      Anonymous Coward

      Makes me wonder if, if he posted the code he eventually wrote for this particular problem, future requests would simply show a copy of his code instead of trying to build one.

  10. Gene Cash Silver badge

    I love AI

    Because it slows down the idiots in the team even more, and makes me look even better.

    Stop dicking around with the AI and just write the fuckin' code! It'll take half the time!

  11. DS999 Silver badge

    I'm not following here

    Can you really get the AI to give you code, then repeatedly tell it "make that code better" and get 4x, 5x and 99x (!) improvements in performance?

    They said they repeated the experiment with prompt engineering, which requires more work on the part of the programmer and gained "similar" speedups?

    What's the incentive to do the prompt engineering if you can cut and paste "make that code better" a few times in a row for the same gains you get via prompt engineering? OK there's the issue of the code needing to have bugs fixed during that process and (apparently) not during the prompt engineering, but this is a sample size of one. It may turn out that bugs are just as likely via either optimization method, in which case prompt engineering sounds like a lot of wasted time when you can basically say "yeah that sucks, do it better!"

    1. doublelayer Silver badge

      Re: I'm not following here

      Yes, in the same way that you can do that to an interview candidate. This problem is exactly the kind of thing I've been asked during interviews, where they are looking for me to demonstrate the most efficient way of achieving a goal that nobody cares about. When you're actually working there, they don't care in the slightest how long the computer takes because 0.5s of server time is free unless it's at scale, and instead they're optimizing for lowest writing time and/or lowest difficulty during maintenance. Writing it more efficiently is something the programmer tries to do even though management is mostly ignoring whether they have.

      What this won't do so well at is getting bugs out of some code. As demonstrated, as it tried more and more advanced methods, it introduced more and more bugs. It also started expanding outside its brief. For example, the first version creates a function which does the task provided and returns the result, an integer, as an integer. By version number three, it was printing it out to the console with extra text. It wasn't told to do that. Maybe we wanted to use that result in a calling function, but now we're going to have to fix that. That expansion, by guessing requirements, just increases the number of places where bugs can crop up. That massive speedup, by the way, wasn't from the LLM finding a new, more efficient algorithm. It was by using a library to run it outside of the Python interpreter. Of course, if you need this answer really fast, not writing the number-crunching part in Python might have been a starting point.

      The bot's output was somewhat impressive to me, but mostly because I've seen much worse code generated for even simpler cases. It wasn't because it did anything particularly difficult.

  12. orly_andico

    Why is it that all stories on AI are either mindless boosting or doom and gloom.

    Like all things, the reality lies somewhere in between.

    Code generation / coding assistants are one of the most useful use case for LLMs, because (1) there's a lot of decent code e.g. on Github, Linux kernel, that can be used for training; and (2) it's fairly easy to measure success - the code works, or it doesn't. And it runs fast, or it doesn't. Will there be subtle bugs? possible, hence the danger of blindly accepting generated code.

    But human coders can also introduce subtle bugs. The difference is that the LLM's can generate significant amounts of code quickly and so there's the temptation to just accept it, and not adequately review it.

    But any programmer or developer who dismisses LLM's as stochastic parrots or BS generators... will get left behind. The productivity benefits are significant enough that an experienced developer with these tools, will outperform an equally competent developer who doesn't use LLM's.

    Here's a very good read from a credible and realistic (i.e. not AI-hyping) author (may be paywalled) - https://newsletter.pragmaticengineer.com/p/ai-tooling-2024

    1. Anonymous Coward
      Anonymous Coward

      "it's fairly easy to measure success - the code works, or it doesn't."

      I'm remembering a tale (can't remember from where) where a company was asked to write a bit of code, and were provided a set of test cases. They returned code that passed every test case perfectly - but failed any other condition. Turns out they had simply written it as a giant switch statement, giving the correct output for each test case, instead of writing the real functionality that the test cases were intended to test.

      Just because the code works on your test cases, doesn't mean it actually works correctly. And while there's a fair bit of decent code on various sites, there's even more awful code (or code fit only for demo purposes) which the AI will happily copy without any understanding of its faults and limitations.

      Want good code? Have a good coder write it, instead of having it written by a probability generator.

      1. Roland6 Silver badge

        > I'm remembering a tale…

        In developing conformance tests for the ISO OSI protocols back in the 1980s, the tests and test suites were deliberately designed to make it easier to actually implement the protocols rather than a test responder…

  13. O'Reg Inalsin

    That's just a simple O(n) algorithm, so unless it goofed by sorting first, it's just a question of optimization. Not knowing knowing what it did makes the article seem speculative and a little boring. (Sorry).

    However I'll just chime in now with a boring personal anecdote. Yesterday I asked copilot to write something in bash to parse a section of a .ssh/configuration file. It used awk in way I hadn't seen before and saved me some trouble. While I don't think that AGI, I do think it was something even better that could be called AUI (actually useful information).

    1. doublelayer Silver badge

      "Not knowing knowing what it did makes the article seem speculative and a little boring. (Sorry)."

      We do know what it did. All the generations are posted here. Most of the optimization that it did was farming out the calculations to libraries. It did quite a few things that are typical of good software, but almost in the way that people showing off would. The rest of the optimization was adding additional features that nobody asked for, which could be positive or negative. Sure, some longlasting tasks might benefit from a metrics reporting server, but for a task that originally took half a second to execute back when it was allocating a few megabytes when it didn't have to, that might not be the most useful and it was definitely not specified.

  14. trevorde Silver badge

    Meanwhile, in the real world...

    * write some [your language here] code to [your problem here] which does not contain any GPL code

    * explain to me what the product manager wants because he's' making no sense

    * write some code to call [co-worker's name here] function WHICH HE STILL HASN'T IMPLEMENTED!

    * query our database to get [describe your data here], even though half the information is not there

  15. captain veg Silver badge

    half of anything measurable is below average

    LLM promotors seem to imagine that a big enough training set fixes all problems. Well, it does if you are trying to emulate the mediocre.

    If you want above-mediocre quality (for your objective function) then you have to train on only the best (for your objective function) data. But no one does that.

    I flatter myself that I can, and do, write better than average quality code. The idea that I would use an automated system to produce what is, by definition, mediocre code (provided that I used the system well) is laughable.

    I'm prepared to accept that such a system might save some typing, but (a) I can type pretty fast, and (b) formulating a solution happens inside the brain, not on the monitor and keyboard.

    -A.

  16. ComputerSays_noAbsolutelyNo Silver badge
    Joke

    isn't this the same conclusion that was arrived on the internet?

    The Internet: all the combined knowledge of humanity at your fingertips ... however, being able to google "neuro-surgery" doesn't turn you into a neuro-surgeon.

    ... 20 years later ...

    AI: being able to ask HallucinaGPT about neuro-surgery doesn't turn you into a neuro-surgeon

  17. Toastan Buttar
    Joke

    Engineer's itemised bill joke.

    This reminds me of the old joke about the retired engineer who fixes a huge problem for a company by turning a single screw through 90 degrees. He presents his client with a bill for £10,000. The client is shocked at the cost and asks for a complete itemised bill for the job. The engineer issues another bill in response:

    1. Turning the screw - £1.00

    2. Knowing which screw to turn - £9,999.00

    I suspect that "knowing which prompts to ask" will similarly require a level of skill and wisdom which can only be gained through years of experience.

POST COMMENT House rules

Not a member of The Register? Create a new account here.

  • Enter your comment

  • Add an icon

Anonymous cowards cannot choose their icon

Other stories you might like