UK's grand plan to fuel AI with public data faces uphill battle

The UK's hopes of fueling cutting-edge AI development and applications with a National Data Library (NDL) could be dashed unless it makes datasets easier to use. With misleading titles and non-existent metadata, the data currently available cannot support any meaningful analysis, a study from the Open Data Institute (ODI) found.

  1. Anonymous Coward

    Shock horror: the AI won't fix the decades of mess, so it can't get accurate info. It's starting to feel like never making things work well was a good idea after all. The only thing that'll save us from the AI is the poor quality of data accessibility... Thanks, years of bureaucratic misalignment.

    I often find that the latest touted benefits assume you've already implemented the previous tower of fads. I have yet to see that done anywhere, so I wonder who ever sees these benefits they tout...

    1. Guy de Loimbard Silver badge

      "Shock horror the AI won't fix"

      AI really isn't as smart as the smack talk and hype would lead you to believe.

      It's got some uses, I'll concede, but it's not the silver bullet that would justify the power and resources it's consuming.

    2. Roland6 Silver badge

      Definitely shock horror, given "misleading titles and non-existent metadata, the data currently available cannot support any meaningful analysis" is a good description of unstructured data, which is exactly what all the current mass-market LLMs have been trained on…

      The shock horror is how so many believe: garbage in, intelligent insights out…

    3. Flocke Kroes Silver badge

      I am probably a bit out of date, but if I understand LLMs correctly, their first step is to convert words into tokens. Presumably, when faced with a table of numbers in training data, they will assign a token ID to each unique number in the table. Pretend a few of these tables are followed by a number, with words saying it is the average of one of the columns. Now someone comes along with a query of "what is the average of some data not in the training set?". The LLM will dutifully find the sequences of number tokens that vaguely resemble the query data and pick one of the most popular "average" tokens to incorporate into the response.

      This procedure explains why LLMs were absolutely useless at maths. I have heard they've got a little better. I am not exactly brimming with confidence that there would be any value in feeding tidied-up data sets into an LLM.
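
      The number-chopping step is easy to see for yourself, for what it's worth. A minimal sketch using the tiktoken library (the "cl100k_base" encoding is an assumption here; other models' tokenisers differ in detail):

        # Show how a tokeniser chops numbers into arbitrary chunks, so the
        # model sees token IDs rather than numeric values. The choice of
        # "cl100k_base" is an assumption; encodings vary by model.
        import tiktoken

        enc = tiktoken.get_encoding("cl100k_base")

        for text in ["1234567", "3.14159", "average of column B"]:
            tokens = enc.encode(text)
            pieces = [enc.decode([t]) for t in tokens]
            # e.g. "1234567" comes back as chunks like "123"|"456"|"7"
            print(f"{text!r} -> {pieces}")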

  2. Steve Davies 3 Silver badge
    Big Brother

    National Data Library?

    Hell No.... [see Icon]

  3. Jamie Jones Silver badge

    "With misleading titles and non-existent metadata, the data currently available cannot support any meaningful analysis, a study from the Open Data Institute (ODI) found."

    Hmmm, cataloguing that sounds like a job for AI!

    Oh...

  4. theOtherJT Silver badge

    As below, so above.

    I was over at my parents' place this weekend. Dad is trying to organize his "documents" - which in this case turns out to mean a collection of files going back to the Windows 3.11 days that have been copied, zipped, copied, renamed, put on an external hard drive, copied to a new computer, zipped again, renamed again, copied to a NAS, backed up from the NAS to another machine, renamed again* ... and so on and so on for about 30 years.

    An entire afternoon's digging left me with the impression that in about 600G of data spread across no fewer than 15 separate disks, mostly various sizes of spinny - some of them even 40-pin IDE** - there was probably about 5G of actual stuff, the rest being either duplicates or desperately irrelevant garbage like the drivers for a scanner he'd owned 15 years ago that had somehow ended up in the "Family Photos (2) - Copy" directory. Probably twice.

    I tried all sorts of things that I thought would be clever to automate cleaning this up (the exact-duplicate pass below, for a start), but after a couple of hours of messing around I came to the conclusion that the only way to clean this mess up is with the Mk1 eyeball and some common sense. It's going to take a human with some context to work out what the hell to do with all of this, and even then it's pretty much inevitable that something that should not be lost is going to get lost.
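
    That first pass, for the record: group files by size, then by hash, so only byte-for-byte duplicates get flagged. A minimal sketch - the "dads_files" root is hypothetical - which catches renamed copies but does nothing about which version is current:

      # Group files by size, then by SHA-256, flagging exact duplicates.
      # Renamed copies are caught; judging the "real" version is not.
      import hashlib
      from collections import defaultdict
      from pathlib import Path

      def sha256_of(path: Path) -> str:
          h = hashlib.sha256()
          with path.open("rb") as f:
              for chunk in iter(lambda: f.read(1 << 20), b""):
                  h.update(chunk)
          return h.hexdigest()

      by_size = defaultdict(list)
      for p in Path("dads_files").rglob("*"):  # hypothetical root
          if p.is_file():
              by_size[p.stat().st_size].append(p)

      for size, paths in by_size.items():
          if len(paths) < 2:
              continue  # a unique size cannot have an exact duplicate
          by_hash = defaultdict(list)
          for p in paths:
              by_hash[sha256_of(p)].append(p)
          for dupes in by_hash.values():
              if len(dupes) > 1:
                  print(f"{size} bytes, identical: {[str(d) for d in dupes]}")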

    We like to think that digital data will exist for ever and that all the problems of degradation through duplication and physical decomposition don't apply. We are wrong.

    In short, none of this should come as a surprise to anyone, and if they're serious about cleaning this up the only realistic answer is an absolute army of civil servants doing it by hand. It'll take decades, and by the time they're done, whatever format they've converted it all into will be obsolete.

    * lost, found, and finally recycled into soft compost.

    ** and finding a working IDE card to read those was a challenge, let me tell you.

    1. Paul Crawford Silver badge

      Re: As below, so above.

      If you have a NAS that does de-duplication, such as the option in ZFS (TrueNAS, etc.), then you might be surprised by how much it saves - precisely because of the duplication you saw, even where files were copied/renamed.
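
      The trick is that ZFS de-duplicates at the block level, so file names and timestamps are irrelevant. A rough sketch of estimating the potential saving - block size and mount point are assumptions, and real ZFS dedup also needs blocks to align, which they do for straight copies:

        # Hash fixed-size blocks and count unique ones to estimate how
        # much block-level de-duplication could save. The 128 KiB block
        # matches the ZFS default recordsize, but is illustrative only.
        import hashlib
        from pathlib import Path

        BLOCK = 128 * 1024
        seen, total = set(), 0

        for p in Path("nas_share").rglob("*"):  # hypothetical mount point
            if not p.is_file():
                continue
            with p.open("rb") as f:
                while chunk := f.read(BLOCK):
                    total += 1
                    seen.add(hashlib.sha256(chunk).digest())

        if total:
            print(f"{total} blocks, {len(seen)} unique "
                  f"(~{total / len(seen):.1f}x dedup ratio)")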

      1. theOtherJT Silver badge

        Re: As below, so above.

        Not really the point - there's so much crap in here he can't find anything. Not to mention the endless copies of "Untitled Document.doc" that aren't the same file but are scattered all over the place.

        Saving space isn't the problem; I could just buy a cheap 1T SSD and dump the lot on there, he has so little data overall. The problem is that over the decades it's become impossible to work out which "design-plan (final).2002.-revision1.2005 (1) - COPY.dwg" is actually the latest version of the damn thing. You can't even rely on the metadata, because a bunch of them got copied via some godawful "backup" software at some point, so everything that did now thinks it was created on 1st January 1970.

    2. Like a badger Silver badge

      Re: As below, so above.

      "and if they're serious about cleaning this up the only realistic answer is an absolute army of civil servants doing it by hand. It'll take decades,"

      It might not take as long as you think. Much of the data referred to is of high quality and assured by qualified statisticians (I work with some of these people in a different context); the problem is often finding it and understanding it. Often you need to know what exists to be able to find what you want, or you need to get lucky with search terminology, because the document or file title isn't what works for normal people, or indeed for an LLM.

      Data in large collections often has file names that are fine for regular and technical users but aren't so helpful for normal people, and what's more there's no consistency across government. The people who maintain each data set, THEY know what exists, where it is, what it pertains to, where the data came from, etc. So much of what's needed is some sort of naming or classification system that an LLM can use (sketched below), and that shouldn't take too long to design and then apply - if there's the will, of course.
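
      For the sake of argument, the sort of catalogue entry that would help. Every field name here is invented for illustration; a real effort would follow a standard such as the W3C's DCAT:

        # An invented catalogue record: one entry per data set, carrying
        # the context that currently lives only in the stewards' heads.
        from dataclasses import dataclass, field

        @dataclass
        class DatasetRecord:
            official_title: str       # what technical users call it
            plain_title: str          # what a normal person searches for
            description: str
            keywords: list[str] = field(default_factory=list)
            steward: str = ""         # the team that knows the data
            provenance: str = ""      # where the figures came from
            last_assured: str = ""    # when a statistician signed it off

        record = DatasetRecord(
            official_title="QRT-HMO-2023-v4",  # hypothetical identifier
            plain_title="Quarterly housing completions, England",
            description="New dwellings completed per quarter, by region.",
            keywords=["housing", "construction", "completions"],
            steward="Housing statistics team",
            provenance="Building control and warranty scheme returns",
            last_assured="2024-11",
        )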

    3. Roland6 Silver badge

      Re: As below, so above.

      >” 40 pin IDE”

      There are IDE desktop enclosures (5.25 inch) and pocket drives (2.5 inch) with either USB or SATA connectors. Otherwise this might be of use: https://www.amazon.co.uk/SATA-Converter-Adapter-Bi-Directional-Conversion/dp/B0B3CXJD79

      One of my “fun” projects is the discovery of 3 ancient external 5.25 inch disk enclosures, complete with (pre-IDE) disks, with full-sized DB25 parallel port connectors… unfortunately, as they predate Bitcoin, there won't be a “lost” crypto wallet waiting to be discovered.

      1. theOtherJT Silver badge

        Re: As below, so above.

        Appreciated, but I've got one now. Well, actually I've got 5, but only 2 of them worked :/

    4. ecofeco Silver badge
      Windows

      Re: As below, so above.

      A perfect example and perfect summation of what it will really take.

      I've worked at places that lost entire servers in their system. They simply could not physically find them. Nor access them. Let alone a few files.

      They could see them in the network reports; they just had no idea where they actually existed on the planet. And the only way to log into them was to find them and manually reset them. Not even joking.

      I've... I've seen things, man. Things I wish I never had. ---------------------->>>>>

  5. alain williams Silver badge

    Let MPs and top-tier civil servants ...

    be the first ones to have all of their data given to these AIs. I would like to see what they say when it is regurgitated, or ends up being abused by the USA (in spite of contracts and reassurances).

  6. Will Godfrey Silver badge
    Mushroom

    NO

    Nuff sed.

  7. Bebu sa Ware Silver badge
    Windows

    Data v. Information

    The NDL - lite or otherwise - is probably replete with "data", but due to the inaccuracy or absence of metadata, of integrity checks, of provenance, etc., it is likely an information desert punctuated with lush oases of misinformation.

    I recall that, many years ago, the "semantic web" was intended to address these issues. I had picked up a remaindered text on it but, while interesting bedtime reading, it was remote from my occupation, so I have no idea where that went. That the text was remaindered hardly augured a bright future.

    Not restricted to governmental records. How many readers have had in their care ½" tapes of experimental or observational data with the only metadata being the paper label on the tape container detailing the instrument and date? The actual tape being binary sequences of integers and floating point numbers in whatever endianness and FP format the instrument or data capture program used.
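
    Anyone who has played that guessing game knows the same eight bytes decode to wildly different values depending on what you assume. A minimal sketch - the bytes are made up, and struct only speaks IEEE 754, so VAX or IBM hex floats are another adventure entirely:

      # Decode the same made-up eight bytes under different assumptions
      # about endianness and number format; at most one guess is "right".
      import struct

      raw = bytes.fromhex("3f9df3b642280000")

      for label, fmt in [("big-endian float32 pair",    ">2f"),
                         ("little-endian float32 pair", "<2f"),
                         ("big-endian float64",         ">d"),
                         ("big-endian int32 pair",      ">2i")]:
          print(f"{label:27} -> {struct.unpack(fmt, raw)}")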

    If you are really lucky, someone will recall that the source of the Fortran program that wrote the tapes with the data from the long-defunct instrument was archived on some 8" CP/M floppies stored in a cardboard box at the back of the compactus. Scientific program source documentation from that period was typically very good, explaining pretty much everything needed to transform useless data into valuable information.

    Only "lucky" I suppose if hunting around for an 8" floppy drive as well as something to read a CP/M disk and learning more about tape technology than you ever wanted to know, isn't a nightmare. ;)

    1. theOtherJT Silver badge

      Re: Data v. Information

      Oh god, yes, the academic data. I once got emailed - back when I was working for The University - a file named "output.o" by a desperate graduate student, who had been sent it by their tutor with no explanation of how to read it, just before said tutor went on holiday.

      On closer inspection it was about 15M of text data in the format [$long_floating_point_number,$longer_floating_point_number], 12 times per line, over and over and over. I had no idea what it could possibly represent, and neither did the poor grad student.

      I'm guessing it was the output of some script or other in a totally custom format that the professor had made up on the fly. When faced with something like that you're screwed.

  8. elsergiovolador Silver badge

    Mess

    Just give more billions to the usual suspects. Maybe they won't fix the mess, but it will be the greatest British mess.

  9. ecofeco Silver badge
    Mushroom

    They want to what?!

    Oh dear god.

    Dear effing god.
