Who needs GitHub Copilot when you can roll your own AI code assistant at home

Code assistants have gained considerable attention as an early use case for generative AI – especially following the launch of Microsoft's GitHub Copilot. But, if you don't relish the idea of letting Microsoft loose on your code or paying $10/month for the privilege, you can always build your own. While Microsoft was among the …

  1. bud-weis-er

    This is cool stuff, thanks for posting.

    I was coding the other day using Google to find information - suddenly had a "moment" where I realised I was being an old fart, and went back to LLMs. They're not perfect, but wow, they get you results you'd consider brilliant (so not good for beginners) very fast.

    1. Gene Cash Silver badge

      That's probably the best example of how Google's results have gone to shit with all their gaming and monetization of the algorithm.

      1. bud-weis-er

        Jeez yeah. If any company was ripe to have its market share ripped off it, it's Google.

        I like LLMs (totally respect anyone who disagrees), and am happy to pay $30 per month for the best. Shame Google has sold its soul (read: those damn adverts / sponsored links) and can't offer a cash-paid service for best-of-breed search.

    2. elsergiovolador Silver badge

      An LLM is like having a somewhat knowledgeable but not-so-smart junior developer at your fingertips.

      You have to hand-hold them to the solution.

      The problem starts when you begin writing to your colleagues as if they were LLMs too.

      1. This post has been deleted by its author

      2. BeardyMike123

        ollama run theregister:405b

        You are a commenter on a tech news website, responding to a comment with compassion, empathy, and grace, but also including a small amount of humour. The following text is the comment you are responding to.

        "

        I think it IS about time we started talking to our colleagues as if they were LLMs. I'd love for them to provide even the smallest amount of context when they talk to me. I'm not asking for 128k worth... just a little hint as to what they need, instead of their inane blabber about their personal lives and how 'it's tough to be alone this time of year' or 'ever since my wife died...'.

        "

    3. O'Reg Inalsin

      Switch hitter

      I find that a Programmer Chat covers 80% of what traditional internet search covers, and vice versa. A Programmer Chat is usually a little faster, but occasionally the back and forth doesn't converge - a Programmer Chat is frequently as dumb as a rock and repeats itself during the back and forth. So I use both.

      You say "not so good for beginners" but I think it's useful when writing in an unfamiliar or rusty language or library. A Programmer Chat excels at knowing common facts like basic syntax. E.g., for me, my "bash" is always rusty.

  2. Anonymous Coward

    Thanks for sharing this, especially as I'm interested to see if it's less prone to hallucination than, say, CP (or its nightly version) in VS Code. At times it's useful for boilerplate stuff, but a few days ago it presented a snippet using an imaginary package (one that never existed in any registry or on GitHub - I checked). Or an answer combining API calls across versions (so doubling the errors).

    It is faster than Google, but for one thing it'd be nice if it actually sourced its answers and gave a traceable confidence score.

    IOW, it's good for low-entropy / high-likelihood tasks, but each snippet is still very much 'trust but verify line by line'.

    1. elsergiovolador Silver badge

      a snippet using an imaginary package

      You open another LLM session, paste the snippet, and tell the LLM to create a library that would make the snippet work.

      Push the library to GitHub.

      Reap profit.

    2. DoctorPaul Bronze badge

      but each snippet is still very much 'trust but verify line by line'

      So what's the point exactly?

  3. elsergiovolador Silver badge

    Anonymice

    Continue collects anonymized telemetry data including:

    Whether you accept or reject suggestions

    Okay.

    1. gratou

      Re: Anonymice

      It could probably be written in the passive form.

      Whether suggestions are accepted or rejected.

      But I agree, it's confusing as is.
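
      (Side note for the telemetry-averse: per Continue's docs, the anonymous telemetry can be switched off entirely in its config file. A minimal sketch, assuming the default ~/.continue/config.json location:

        {
          "allowAnonymousTelemetry": false
        }

      The key name is from Continue's documentation; worth double-checking against whatever version you have installed.)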

  4. Lee D Silver badge

    I wonder how anyone can prove "clean-room" type code if they have an AI assistant sitting in their IDE.

    Honestly, it should taint all the software it touches - you have no idea what it was trained on and whose code it's putting into yours.

    All you need is a few lines of Oracle's or someone else's code in there and you are toast.

    Even hobby projects like emulators, etc. - how do you know it wasn't trained on say a Windows source leak, or a poorly reverse-engineered bit of code?

    I don't see how companies are letting people use them on their codebases; it has the potential to taint the whole thing and - as we've found - even Google aren't immune to claims of code-copying.

    1. Charlie Clark Silver badge

      I don't think you can prove it to be clean-room at all if you're using an LLM that is not created solely from your own codebase. However, most of the code scraped will be from open source repositories, and most of these now have extremely permissive copyright licences. And, following on from the various copyright suits from authors and musicians, I suspect we'll soon be seeing some kind of disclaimers for the various models as to what they're based on, and potentially even attribution for open source packages.

      In addition, I suspect that the "innovative" and "novel" test for generated code will not succeed in most generated samples, except for the completely shit ones!

      1. Michael Strorm Silver badge

        > However, most of the code scraped will be from open source repositories

        "Most" in this case wouldn't be good enough here though, any more than 99% cyanide-free water is mostly safe for drinking(!)

        If the risk is there, it's still there.

        > I suspect we'll soon be seeing some kind of disclaimers for the various models as to what they're based on,

        That would require *far* more human supervision, slowing down the automated consumption of the vast quantities of text the training depends upon. It would also result in less training data than the current models have.

        Whether that would be practical at all is questionable. It would likely make them much less competitive than any rivals, so I suspect they'd seek any excuse to avoid that.

        Big Tech essentially depends on automating anything that can be automated quickly and cheaply, and ignoring or railroading their way through any issues that can't. The same applies here.

        > and potentially even attribution for open source packages.

        As far as I'm aware, that's not how current LLMs work, since they're based on the combined textual statistics of numerous sources. And it wouldn't be possible for existing models to work backwards to their training data to figure out where a particular answer came from, even if - in effect - it *was* inadvertently plagiarising a specific source in some cases.

      2. Anonymous Coward

        most of these now have extremely permissive copyright licences

        I suspect quite a large part of open source code is licensed under the GPL rather than an "extremely permissive copyright licence" (GitHub is full of it), and a company could definitely get into legal trouble for using GPL code without sticking to the licence provisions.

    2. elsergiovolador Silver badge

      How do you know whether the engineer you've hired learned to code from reading leaked source code?

      It's a bit of a pointless worry.

      1. fromxyzzy

        The problem is that when the company/person the code is lifted from looks for someone to sue, the buck stops with the user of the LLM.

        So your theoretical engineer could get the blame and get hung out to dry, but if you use an LLM and it steals code, you're that engineer and you probably don't even realize it.

        AI code is unreliable on both a functional and a legal level.

        1. Michael Strorm Silver badge

          Also, does Elsergio mean intentional plagiarism, or literally programmers and LLMs that "learned" from having studied plagiarised code at some point in the past?

          Because it's very unlikely that a human would unintentionally regurgitate that exact same proprietary code, whereas that's a much greater risk with LLMs because they don't "learn" like a human.

  5. Michael Wojcik Silver badge

    Who needs GitHub Copilot when...

    ... you can learn to be a competent programmer?

    It's really no wonder software is so terrible, when the predominant attitude appears to be "screw learning how these things work, I'm just going to bang out some code and hope".

    1. Rich 2 Silver badge

      Re: Who needs GitHub Copilot when...

      I think the fact that you have (at the time of posting this) one down-vote just emphasises your argument. No wonder modern software is increasingly utterly shite

    2. Manzo

      Re: Who needs GitHub Copilot when...

      Although the code generation is awful, it is great for commenting the source code and writing documentation.

      1. Rich 2 Silver badge

        Re: Who needs GitHub Copilot when...

        No. You should comment your code as you write it. Not write undocumented incomprehensible code and then expect an algorithm running on some computer goodness-knows-where to try and fill in for your deficiencies and sloppiness

        Yet again, this just highlights the “modern” way of writing software - “write it as fast as possible, doesn’t matter if it doesn’t work correctly - we can fix it next sprint, right after adding some new shiny icons and colours. Comments? That’s SO last century”

  6. osxtra

    Leaning Toward Learning

    Call me old-fashioned, but it seems using AI to "help" write code, term papers, laws, etc. is misguided at best, and outright dangerous for humanity at worst.

    There are plenty of good use cases for the technology - iterating through billions of permutations to help discern what *might* become a new blood pressure drug comes to mind - but it seems to be antithetical to straight-up learning.

    True intelligence is partly knowing how to combine data, knowing not only what to include in solving whatever problem may be at hand, but also what to ignore.

    I fear that overuse of this technology will result in humans losing sight of how to think, and that just can't be a good thing.

    There's a difference between getting a refresher on the capital of Zimbabwe if it comes up in conversation, and remembering how to solve a differential equation.

    Having it readily available to spit out whatever answer is being sought, the tool becomes nothing more than a crutch, and you never absorb knowledge yourself because you can always get the machine to regurgitate it for you.

    Sure, if in a hurry I'll query StackOverflow - it's way more convenient than slogging through many, many boring pages of documentation - but am not just a copy/paste sort of fellow, which is the vibe I get from all these "assistants". I want to know *why* the solution works, not just that it *does* work.

    Plus, who knows if they're even correct? I've talked to many a folk who doesn't have a clue but is still willing to spout as if it's gospel. Do most people fact-check the results they get? Survey seems to say "no".

    Do we really want to put all of our faith in this still relatively infant technology?

    Am sorely hoping that at least in its current state, AI is a modern Hula Hoop, down the road used by some enthusiasts, but not in the mainstream.

    Otherwise, we might as well just stop trying to improve our own minds and rely sadly and solely on the tool, awaiting the day it goes away and we've completely lost the ability to think for ourselves.

    1. O'Reg Inalsin

      Learning to reason vs Memorization

      LLMs excel at memorization but have limited reasoning ability - especially limited ability to "think outside of the box".

      So far I haven't seen an LLM reliably grasp the bird's-eye view unless it is regurgitating what someone else said.

      So I don't really see any possibility of letting an LLM write a program that would actually work, let alone design one, unless it's a template-type project.

      Of course if you are writing an undergraduate essay or advertising copy, it's a different story.

    2. beeka

      Re: Leaning Toward Learning

      If you want to know *why* the solution works, not just that it *does* work, you can always ask the LLM.

      While you can use them in a "do my homework for me" way, which could lead to the knowledge drain you fear, I tend to use them like an intern: doing the things I know how to do but don't have the time / energy to do. So even though I know how to write a recursive descent parser, I can feed an LLM a bunch of BNF and it generates code in seconds that would take hours to write. You still need to understand what it has generated, as it isn't perfect. Getting it to rough out tests or documentation also helps battle inertia around those tasks: it's easier to review / clarify / extend something that exists than stare at a blank screen.
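
      To make that concrete, here's roughly the shape of thing you'd be asking it to generate - a minimal, hand-written Python sketch of a recursive descent parser for a toy expression grammar (illustrative only; the grammar and names are invented for the example):

        # Grammar:
        #   expr   ::= term (("+" | "-") term)*
        #   term   ::= factor (("*" | "/") factor)*
        #   factor ::= NUMBER | "(" expr ")"
        import re

        def tokenize(src):
            # integers plus single-character operators and parentheses
            return re.findall(r"\d+|[-+*/()]", src)

        class Parser:
            def __init__(self, tokens):
                self.tokens, self.pos = tokens, 0

            def peek(self):
                return self.tokens[self.pos] if self.pos < len(self.tokens) else None

            def eat(self, expected=None):
                tok = self.peek()
                if tok is None or (expected is not None and tok != expected):
                    raise SyntaxError(f"expected {expected!r}, got {tok!r}")
                self.pos += 1
                return tok

            def expr(self):
                # expr ::= term (("+" | "-") term)*
                value = self.term()
                while self.peek() in ("+", "-"):
                    op, rhs = self.eat(), self.term()
                    value = value + rhs if op == "+" else value - rhs
                return value

            def term(self):
                # term ::= factor (("*" | "/") factor)*
                value = self.factor()
                while self.peek() in ("*", "/"):
                    op, rhs = self.eat(), self.factor()
                    value = value * rhs if op == "*" else value / rhs
                return value

            def factor(self):
                # factor ::= NUMBER | "(" expr ")"
                if self.peek() == "(":
                    self.eat("(")
                    value = self.expr()
                    self.eat(")")
                    return value
                tok = self.eat()
                if not tok.isdigit():
                    raise SyntaxError(f"unexpected token {tok!r}")
                return int(tok)

        print(Parser(tokenize("2 * (3 + 4)")).expr())  # prints 14

      Scale that up with precedence levels, an AST, and error recovery, and the hours saved add up quickly.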

  7. Anonymous Coward

    European coding assistant

    Anyone know one or more European coding assistants, perhaps based on a Mistral AI model? Any experience with Mistral's own Codestral tool?

    If my interactions with a coding assistant are going to be used for training their model and making them richer/more powerful, I'd rather do that with a European provider rather than risking helping yet another Silicon Valley fake-it-till-you-make-it, our-users-are-our-victims, TESCREAL scumbag company.
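
    (For what it's worth, Continue - the extension the article covers - lists Mistral as a model provider, so pointing it at Codestral should mostly be a config exercise. A sketch based on my reading of Continue's provider docs, untested, with the apiKey as an obvious placeholder:

      {
        "models": [
          {
            "title": "Codestral",
            "provider": "mistral",
            "model": "codestral-latest",
            "apiKey": "<YOUR_MISTRAL_API_KEY>"
          }
        ]
      }

    That keeps the editor-side tooling the same while swapping in a European-hosted model.)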

  8. Pete Sdev Bronze badge

    Licenses?

    It would have been useful if the article had mentioned the sources the various models were trained on, and the licences of those sources.

    For example, if the model was trained on GPL'd code and you've accepted a coding suggestion derived from that training input, then your code arguably may also have to be released under the GPL. There's a potential legal and liability minefield here.

  9. O'Reg Inalsin

    I'd rather be a training data generator for an open source LLM model.

    Most of the time (with hard exceptions) I don't mind feeding even Copilot. But I'd rather be feeding an open source model. I think one day in the not-too-distant future the price of Copilot is going to rise to cover costs and profit, so it might go up from $10 to $100 (and vary by region), or the quality might drop. So I'm bookmarking this helpful info and will try it out. Thanks!
