GitHub claims source code search engine is a game changer

GitHub has a lot of code to search – more than 200 million repositories – and says last November's beta version of a search engine optimized for source code has caused a "flurry of innovation." GitHub engineer Timothy Clem explained that the company has had problems getting existing technology to work well. "The truth is …

  1. captain veg Silver badge

    Why?

    Why would you want to search code?

    I mean, why would you want to perform a textual search of a horde of strangers' code, the intention of which is unknown to you?

    If it's for a solution to an actual problem then, without some knowledge of how to solve that problem, you can't know what to search for.

    Unless, of course, you're relying on comments to describe what the code does, or is intended to do. Which is eminently subvertible, whether for lolz or for nefarious pursuit of illicit profit.

    -A.

    1. b0llchit Silver badge
      Facepalm

      Re: Why?

      It is a "look at us!" project. The "we can do this" adventure of a "see how good we are" endeavour.

      There is no point to any literal textual search in code of that size. If you want to find illicit copying you need to do a lot more than simple-words-and-phrases comparisons. Algorithms and solutions cannot be found on a word-by-word basis. This type of search also discards any code structure and context.

      And, if you really need to find your own code in 15e9 lines of community code... well, you should have kept your copy. Alternatively, it would be much faster to use a search engine.

      So we are clearly left with the "look at us!" explanation.

      1. claimed

        Re: Why?

        So now there is a way to search, super fast, a collection of version controlled projects.

        So now if we make the web a collection of projects and shove it into git, we can search that pretty fast without all the wazoo that google is doing (maybe...).

        Anything that might improve Bing would be good, so when I accidentally use it from the start menu I'm not completely enraged...

    2. jake Silver badge

      Re: Why?

      It'll be GREAT!

      For patent trolls.

      Methinks kids looking for easy homework answers and other plagiarizers will also find it handy.

      Other than that ...

      1. vtcodger Silver badge

        Re: Why?

        It's not just patent trolls. Take malware authors for example. Assume they've found an exploitable flaw in some library. But does anyone actually use said library? GitHub code search to the rescue!!! I'm sure many other equally beneficial uses will surface.

    3. Notas Badoff

      Re: Why?

      On the other side, looking at other people's code is finding out how 'standard' APIs really work. Sometimes, *if* they really work. Heck, *if* they are really utilized by anyone!

      Programming blindly gets you pokes in the eye. I don't trust documentation to call out the pain points and gotchas in real world usage. Do you really think you can duplicate years' worth of painfully discovered production oddities in your pitiful few 'tests'? I'll search through the code of successful mature projects to find "here be dragons" warnings. Long time ago reading jQuery source convinced me R.O.U.S. exist!

      I'm not trying to copy their code, just avoid their gray hairs.

      1. Richard 12 Silver badge
        Facepalm

        Re: Why?

        Most of those who need to know will never look.

        I remember reviewing some pull requests in 2020 from a guy (yes, it was a guy) of their C library for reading/writing the file format they'd invented.

        The file format they'd come up with was ok. No obvious major flaws, reasonable attempt at documentation. Not great, but not terrible.

        The C library was completely unusable.

        They'd used the code golf technique of "everything in one file using C macros to use the same file as both header and source".

        The code used static variables, and the callbacks for file operations had no "cookie" for any context. So not even re-entrant.

        When we asked him to change the API to add the cookie and remove the statics to make it at least re-entrant, they flat refused and insisted nobody ever does that for C APIs.

        Nevermind that the stb files he had copypasted into his did exactly that.
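
The "cookie" idea above can be sketched in a few lines. The thread is about a C API; this Python version only illustrates the principle, and `read_records`, `collect` and `read_nested` are hypothetical names invented for the example, not part of any library mentioned here. The reader hands the caller's own context object back to every callback, so no module-level (static) state is needed.

```python
from typing import Callable, List

# Hypothetical callback signature: the reader passes the caller's cookie back
# untouched, so the callback never needs module-level (static) state.
RecordCallback = Callable[[List[bytes], bytes], None]


def read_records(data: bytes, on_record: RecordCallback, cookie: List[bytes]) -> None:
    """Split data into lines and hand each one to on_record with the cookie."""
    for line in data.splitlines():
        on_record(cookie, line)


def collect(cookie: List[bytes], record: bytes) -> None:
    """Store the record in whatever list the caller supplied as the cookie."""
    cookie.append(record)


def read_nested(outer: bytes, inner: bytes):
    """Start a second read from inside a callback of the first."""
    outer_seen: List[bytes] = []
    inner_seen: List[bytes] = []

    def on_outer(cookie: List[bytes], record: bytes) -> None:
        cookie.append(record)
        # Re-entering the reader is safe: each read keeps its state in its own cookie.
        read_records(inner, collect, inner_seen)

    read_records(outer, on_outer, outer_seen)
    return outer_seen, inner_seen
```

Had `collect` written to a single shared global instead, the nested read would have mixed its records into the outer read's results.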

        1. Roland6 Silver badge

          Re: Why?

          >When we asked him to change the API to add the cookie and remove the statics to make it at least re-entrant, they flat refused and insisted nobody ever does that for C APIs.

          Probably didn't understand re-entrancy...

          It was something I started to ask about at interviews back in the 90's after asking a contract VB developer a question on the topic and getting the response "what's that?" followed by "why would you want to do that?". I decided this and some other questions were a rough-and-ready guide to the level of formal training and experience they had; in general, those who hadn't studied computing at university couldn't satisfactorily answer the question.

      2. jake Silver badge

        Re: Why?

        "Long time ago reading jQuery source convinced me R.O.U.S. exist!"

        jQuery hasn't really been around (and in common use) long enough to have a "long time ago".

        1. Disgusted Of Tunbridge Wells Silver badge

          Re: Why?

          I hate to break it to you but August 2006 was a long time ago.

          1. Roland6 Silver badge

            Re: Why?

            Hate to break it to you, but 2006 is recent, pre-1970 is a long time ago. But then my initial experience was doing development at a company where the expectation was that the system would run without major failure for 20+ years...

            1. captain veg Silver badge

              Re: Why?

              About 20 years ago I created (at the demand of Manglement) a system of software authorisation which issued keys with a finite lifetime. I allowed a sufficient number of bits to encode a lifetime of, about, 20 years.

              You can guess what happened about 20 years later.

              -A.

              1. Bebu Silver badge

                Re: Why?

                "You can guess what happened about 20 years later."

                A nice little retirement earner :)

              2. captain veg Silver badge

                Re: Why?

                Still working for the same org.

                I swapped out as quietly as I could the token-based authorisation for the SAML-based authentication that had been introduced subsequently. I pretty-much got away with it.

                -A.

          2. jake Silver badge

            Re: Why?

            "August 2006 was a long time ago."

            I hate to break it to you, but no. It really wasn't.

            1. diodesign (Written by Reg staff) Silver badge

              2006 to 2023

              It's the best part of two decades.

              C.

              1. jake Silver badge
                Pint

                Re: 2006 to 2023

                Let's be generous and say it was in wide-spread use by 2008. That's a decade and a half, give or take.

                From my perspective, a decade and a half is not "a long time ago". It's not even a generation.

                Shall we all have a beer and agree it's a subjective call?

                1. Michael Wojcik Silver badge

                  Re: 2006 to 2023

                  Yeah. Much of my day-to-day is for a component that was introduced in 2006, which makes it newer than most of the other components my team and I are responsible for. Some of those components date back to the late 1980s.

                  I still have maintenance responsibility for a commercial product that hasn't seen an update in two decades.

                  2006 is nothing. And September is Eternal.

        2. captain veg Silver badge

          Re: Why?

          It's long enough that young web developers not only spurn it, but probably don't even know what it is.

          Which is good. At best it was only ever an answer to a stupid question.

          -A.

          1. Roland6 Silver badge

            Re: Why?

            >It's long enough that young web developers not only spurn it

            Does not need to be that long ago for this to happen. Unix and C took off in the 1980s in part because that is what new graduates had used at university and they didn't want to sully their hands with anything proprietary...

            I suspect many new graduates today will probably not want to have anything to do with on-prem because it is all cloud now and cloud has only been a thing for a few years...

            1. jake Silver badge

              Re: Why?

              "Unix and C took off in the 1980s in part because that is what new graduates had used at university and they didn't have enough money to purchase anything proprietary..."

              FTFY

      3. Michael Wojcik Silver badge

        Re: Why?

        Just reading a few of John Resig's posts ought to be enough to warn people away from jQuery.

        This is a man who threw a public tantrum because a popular implementation of the language conformed to the specification, rather than to his preferences.

    4. Disgusted Of Tunbridge Wells Silver badge

      Re: Why?

      Perhaps if you want to find working implementations of code that uses a specific feature. E.g.: if you want your C program to download a web page, you might want to see an example of a C program using libcurl to download a web page.

      1. jake Silver badge

        Re: Why?

        If you don't know how to call an external program in C and then use the results, no search engine on Earth will be able to help you. Instead, you should probably take a course on programming in C ... but that would be "hard", and so out of the question, right?

        1. captain veg Silver badge

          Re: Why?

          Unusually, you seem to have missed the point.

          The OP mentions libcurl. Not curl.

          I've never used it, but so far as I can tell libcurl is a library that exposes the curl program's functionality to C code.

          If you're writing code for a Unix-like system then you certainly know that reading from a socket and writing to a file descriptor is no big deal. So I'm not sure who this library is for.

          -A.

          1. that one in the corner Silver badge

            Re: Why?

            > as I can tell libcurl is a library that exposes the curl program's functionality to C code.

            More like, libcurl *is* all the cURL functionality, the curl exe is little more than a honking great CLI argument parser to expose libcurl functionality to the shell - as is done with plenty of other libraries (e.g. just using a simple REPL to drive a language implemented inside a library: Lua, SQLite etc)

            > you certainly know that reading from a socket and writing to a file descriptor is no big deal

            True, but that is only the simplest part of the problem

            > So I'm not sure who this library is for

            Anyone who doesn't want to re-invent the wheel by not only writing to and then reading from a socket but also knowing exactly *what* to read and write in order to operate the protocol required, namely HTTP to get a web page (as in the example). Or FTP if you give libcurl an ftp:// URL, or a Gopher URL, or POP3 or SMTP or any of the other protocols that cURL can handle. Not forgetting how to cope with the various error responses (e.g. "page permanently moved to") or...

            You could always invoke the curl exe from your app but that is a bit of a faff, IMO, compared to calling the library function.

          2. Michael Wojcik Silver badge

            Re: Why?

            > so far as I can tell libcurl is a library that exposes the curl program's functionality to C code

            It's more that curl is a command-line interface to libcurl.

            > If you're writing code for a Unix-like system then you certainly know that reading from a socket and writing to a file descriptor is no big deal.

            But a great many people get it wrong nonetheless. I have seen far too many misuses and abuses of the socket API (and the name resolver API, etc) to think this is a good idea.

            And more important, HTTP is no longer a trivial protocol. Yes, if you only need to fetch something once in a while, you can probably get away with just implementing client-side HTTP/1.0. But it won't do for anything ambitious, and any version of HTTP beyond that is significantly harder to get right (particularly regarding security).

            And then there's TLS, and no one is competent to implement TLS. (If you're one of the exceptions to this rule, you already know that you shouldn't implement TLS either. There are very, very few people who should even try to implement TLS.)

            Now, all that said, it's a terrible example. Searching GitHub for code that calls libcurl will very likely give you examples of using libcurl poorly. And you won't learn anything from copying them. The only way to use libcurl properly is to understand the protocols it implements, and then its API and architecture. If you can't write it yourself with nothing beyond the occasional reference to the libcurl documentation, you need to put the API down and back away slowly.

            Source-code search, like Copilot, is just another form of learned helplessness and a way to encourage the worst development practices. It optimizes for bad behavior.

        2. JDX Gold badge

          Re: Why?

          >>If you don't know how to call an external program in C and then use the results, no search engine on Earth will be able to help you. Instead, you should probably take a course on programming in C ... but that would be "hard", and so out of the question, right?

          Ah yes, you just woke up one morning with the knowledge how to run an external process like Neo learning Drunken Boxing. That's exactly something you would find via a search engine. This has to be the most block-headed stupid post I've seen for a while.

          1. jake Silver badge

            Re: Why?

            "Ah yes, you just woke up one morning with the knowledge how to run an external process"

            No, dumbass. That's why I suggested taking a course.

            "This has to be the most block-headed stupid post I've seen for a while."

            Backatcha.

            1. Michael Wojcik Silver badge

              Re: Why?

              Right. There are programmers who learn about what they need to use; and there are programmers who just search for something that looks like it might be right.

              The former have managed to do plenty of damage over the years, certainly. But the latter are an unmitigated disaster.

              1. jake Silver badge

                Re: Why?

                "But the latter are an unmitigated disaster."

                Congrats, you have won the Understatement Of The Week Award!

                You can collect your prize in the usual place before Close of Business on Friday.

    5. JDX Gold badge

      Re: Why?

      I often search code where I work, to see if there is an example of an API method in use.

      You might also want to search for projects using a specific library - either for example usage or because you want to see if anyone is using a feature you wish to deprecate, or how widely used your library is, or because you want to use (or avoid) a certain library in a project you use due to licensing or known vulnerabilities.

      There are quite a few use-cases for this when you engage the brain instead of just trying to be negative about anything. Apart from anything else, it sounds like an impressive piece of technology which may well be usable in other contexts.

      1. captain veg Silver badge

        Re: Why?

        What you seem to be describing is "documentation".

        You can make a case that documentation can be part of the source code. I think that case is wrong. In-source documentation is, explicitly or implicitly, just comments. Sometimes characterised as code-smell.

        The problem with comments is not that they smell, but that they are often ignored by the maintenance programmer when the code is updated. A comment which describes how the code used to be, rather than how it is now, is worse than useless.

        Documentation also can be stale. This is something to be taken up with whoever is supposed to keep the documentation up to date.

        > There are quite a few use-cases

        I doubt that there are any "use-cases" that couldn't be described as "uses". Why add two useless syllables?

        -A.

        1. jake Silver badge

          Re: Why?

          "code-smell"

          Whenever I hear that term, I take a couple cautious sniffs and say "That ain't the code, mate, it's you" ... Generally gets the point across.

      2. jake Silver badge
        Pint

        Re: Why?

        "engage the brain instead of just trying to be negative about anything"

        I did think about it, albeit not for long, and I came up with the most likely uses. Nothing negative at all.

        Touchy much? Everything OK?

        Regardless, have a beer. It's good for what ales[0] one.

        [0] No, that's not a typo ...

  2. T. F. M. Reader Silver badge

    Beyond "why?"

    Not sure the numbers add up.

    45 million GitHub repositories, which together amount to 115TB of code and 15.5 billion documents.

    Assuming a typical LOC is about 55-60 bytes, this works out to 45M repos with 1.3 MILLION LOC and 345 docs on average. I have trouble believing that. A million lines of code is a huge project, and I doubt a typical GitHub repo is huge.

    I might assume that the LOC count includes all the revisions of everything, so they count every line and every doc multiple times. But then they are searching it wrong (apologies, Steve) and it is still not quite in the realm of believable.

    I'd like to see how they arrived at the count.
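
The averages can be checked with a few divisions. This is a sketch that takes the quoted figures at face value and assumes decimal terabytes and a 57.5-byte line (the midpoint of the 55-60 byte guess); the constant and function names are made up for the example.

```python
# Figures quoted in the comment above; decimal terabytes are an assumption.
TOTAL_BYTES = 115 * 10**12
REPOSITORIES = 45 * 10**6
DOCUMENTS = 15.5 * 10**9
BYTES_PER_LINE = 57.5  # midpoint of the 55-60 byte guess


def per_repo_averages() -> dict:
    """Return average documents, bytes and lines per repository, plus lines per document."""
    bytes_per_repo = TOTAL_BYTES / REPOSITORIES
    return {
        "docs_per_repo": DOCUMENTS / REPOSITORIES,
        "bytes_per_repo": bytes_per_repo,
        "lines_per_repo": bytes_per_repo / BYTES_PER_LINE,
        "lines_per_doc": TOTAL_BYTES / DOCUMENTS / BYTES_PER_LINE,
    }
```

With these inputs the document count comes out at about 344 per repository, in line with the figure above, but the size is only about 2.56 MB, or roughly 44,000 lines, per repository. That is far below 1.3 million, so that number presumably rests on different inputs or units.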

    1. Richard 12 Silver badge

      Re: Beyond "why?"

      It's definitely doing a huge amount of double counting.

      They do mention later that it's only 25TB once de-duplicated.

      A lot of larger projects have hundreds of forks, most of which were used to write one merge request, which got merged back into upstream and the fork was never touched again.

      A million lines also isn't that many in a longstanding "real" application or library. Pretty sure all the projects I work on are well over a million lines - representing a decade or more of work by many people.

    2. Justthefacts Silver badge

      Re: Beyond "why?"

      Test vectors? Some types of projects do take this approach, typically signal-processing or image-processing, although it’s not so common. A few projects with hundreds of 5MB test-vector files would soon rack up the data size.

      I notice that GitHub limits per repo are <5GB recommended, <1GB ideal, and a hard max of 75GB before you get a stern warning. Limits tend to occur as a result of occurrences, so I reckon 1-5GB per repo must be not uncommon. If that were code alone, even with a 4x factor from revision control, that suggests lots of 30MLOC repos out there, which is more than I would expect. So I reckon much of it might be “other data” of some sort.

    3. captain veg Silver badge

      Re: a typical LOC is about 55-60 bytes,

      Mine got a lot longer since I bought a 4K monitor.

      -A.

      1. jake Silver badge

        Re: a typical LOC is about 55-60 bytes,

        "Mine got a lot longer since I bought a 4K monitor."

        Oh. You're one of those.

        1. captain veg Silver badge

          Re: a typical LOC is about 55-60 bytes,

          Yes, sorry.

          I think everyone should have a 4K monitor, at least. It's great.

          In my very first job writing code most of us had Wyse 30 terminals, a basic 80-column job.

          Those working on a lucrative contract with a big bank had Wyse 50s, which could do 132 columns*.

          There is no substitute for square inches.

          -A.

          * I suppose that I ought to clarify for the younger readers that, in the old days, screens displayed pure text, and that each character occupied the same pixel-width (often eight). So X-columns meant that the screen had enough horizontal dots to display that many characters on a line.

          1. jake Silver badge

            Re: a typical LOC is about 55-60 bytes,

            The first thing I plug into any new computer is a so-called "dumb" terminal. I use it for debugging, writing, and coding (and as a life-boat on the very, very rare GUI crash).

            The way I figure it, any line of code that is over 75 or so characters hasn't been thought out properly[0]. Several hundred lines of code that alternate between under 25 lines and over 80 lines quickly becomes unreadable.

            [0] Yes, there are exceptions. But they are rare exceptions.

            1. jake Silver badge
              Pint

              Re: a typical LOC is about 55-60 bytes,

              "Several hundred lines of code that alternate between under 25 characters and over 80 characters quickly become unreadable."

              FTFM

              Friday afternoon brainfarts are brainfarts. Is it beer thirty yet?

            2. captain veg Silver badge

              Re: a typical LOC is about 55-60 bytes,

              That's OK, all of mine are over 80 characters.

              Well, all the important ones. Some of the comments are shorter.

              I use a single space for indentation too, if that helps harden your judgment.

              -A.

              P.S. Do you really plug in a dumb terminal, or was that licence? I mean, apart from anything else, you'd need an RS232 (or similar) interface and these are not fitted to the average desktop these days, let alone laptop. If you really meant that you ssh in, well, moi aussi.

  3. Andy 73 Silver badge

    They should share the code for that..

    ..if only there was a place to do it.

  4. andy 103
    WTF?

    The opposite to how all other search engines work then?

    If I Google something, generally speaking I know what I'm searching for.

    "cheap flights to Majorca August"

    Uses natural language to describe something that I need more specific results for.

    How the hell does a code search engine work?

    If I'm looking for code I generally wouldn't have a clue what that code was... which is why I'd be trying to look it up! If I was using something like Stack Overflow I could describe it in natural language but that's a step before getting to the code itself.

    Given that I don't know what the end result (the code) is, how the fuck is this useful or even usable?

    1. klh

      Re: The opposite to how all other search engines work then?

      It's not for general search. It's for things like "in how many places this function is called in this project or group of projects" - it's probably the same thing that also powers their references/declaration functionality.

  5. sw guy

    grep ?

    So you want to compare it with grep, with all its options?

    Including regular expressions in legacy syntax, in egrep syntax or in Perl syntax?

    I wonder how this works with indexing to save time...
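
One common way to make substring search cheaper than a full grep is a trigram inverted index. Nothing in this thread confirms that GitHub does it this way, so treat the following as a generic sketch; `TrigramIndex` and its methods are hypothetical names. The index maps every three-character sequence to the documents containing it, a query intersects those sets to get candidates, and only the candidates are scanned for real.

```python
from collections import defaultdict
from typing import Dict, List, Set


def trigrams(text: str) -> Set[str]:
    """Return every three-character substring of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, name: str, text: str) -> None:
        """Store the document and record which trigrams it contains."""
        self.docs[name] = text
        for gram in trigrams(text):
            self.postings[gram].add(name)

    def search(self, needle: str) -> List[str]:
        """Return the names of documents containing needle, sorted."""
        grams = trigrams(needle)
        if not grams:
            # Needle shorter than three characters: the index cannot narrow it down.
            candidates = set(self.docs)
        else:
            candidates = set.intersection(
                *(self.postings.get(gram, set()) for gram in grams)
            )
        # The index only yields candidates; confirm each one with a real scan.
        return sorted(name for name in candidates if needle in self.docs[name])
```

Regular expressions need an extra step that is not shown here: working out which trigrams any match must contain before the index can be consulted.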

  6. Bebu Silver badge

    Only 100Tb/25Tb?

    I was surprised that github has only 25Tb of distinct code on it. I would have guessed in the 1000s of Tb (Pb). As the late Prof. Julius Sumner-Miller would have said "Who would have thunk it?"

    I would have thought the structure of the data could make indexing/searching easier. eg Classifying documents into various categories such as C code, English text etc and preprocessing them separately.

    C or any programming language has a grammatical structure with which an abstract representation (AST) can be constructed. Structure searches should then be possible eg "is this code fragment ~like~ any other in the repo?" I have a nasty suspicion that in general these types of (tree) searches are "hard."

    A nice side effect would be picking up syntactically invalid code :)
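
A very coarse version of such a structure search can be sketched with Python's standard `ast` module. It parses only Python, not C, so it stands in for the idea rather than for anything GitHub has described, and `shape` and `_StripNames` are hypothetical names. Identifiers are blanked out so that two fragments compare equal when their syntax trees match, and source that fails to parse is reported as `None`.

```python
import ast
from typing import Optional


class _StripNames(ast.NodeTransformer):
    """Replace variable, argument and function names with a placeholder."""

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = "_"
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = "_"
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        node.name = "_"
        self.generic_visit(node)
        return node


def shape(source: str) -> Optional[str]:
    """Return a name-independent dump of the syntax tree, or None if invalid."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return None  # the side effect mentioned above: invalid code is detected
    return ast.dump(_StripNames().visit(tree))
```

Exact equality of shapes is the easy case. Blanking every name also makes `x + y` indistinguishable from `x + x`, and matching fragments that are merely similar, or buried inside larger trees, is where the search gets hard.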

    1. captain veg Silver badge

      Re: Only 100Tb/25Tb?

      There's only so many ways to code bubblesort.

      -A.

    2. captain veg Silver badge

      Re: Only 100Tb/25Tb?

      > I was surprised that github has only 25Tb of distinct code on it

      I can't claim to be any kind of expert on Github, but I would imagine that they have some sort of de-duplication mechanism in their back-end store.

      There's only so many ways that you can code bubblesort.

      -A.
