OpenAI pulls AI text detector due to it being a bit crap

OpenAI has taken down its AI classifier months after it was released due to its inability to accurately determine whether a chunk of text was automatically generated by a large language model or written by a human. "As of July 20, 2023, the AI classifier is no longer available due to its low rate of accuracy," the biz said in …

  1. Anonymous Coward
    Anonymous Coward

    Correlating Commentards Causes Confusion

    It will be interesting to compare reactions on this article against where each person's previous posts stood on the "ChatGPT just stores the text it saw in its training and prints it out again, word for word" versus the "no it doesn't, it uses (long winded description)" debate.

    You lot are all going to post as AC now, aren't you!

    1. Falmari Silver badge

      Re: Correlating Commentards Causes Confusion

      @AC "It will be interesting to compare reactions on this article against where each person's previous posts stood on the "ChatGPT just stores the text it saw in its training and prints it out again, word for word" versus the "no it doesn't, it uses (long winded description)" debate."

      Why, is it because OpenAI's classifier "struggled with prose it hadn't seen in its training dataset"? That just means their classifier does not really work, because it failed to accurately classify data not in the training data. They exclude data that is in the training set because that data will be classified as it was trained to be classified.

      You can't test the accuracy of a model, no matter how it works (whether it just stores the text or uses (long winded description)), on its training data, as the model has been trained to fit that data. Therefore the results will fit the model.
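      The point above can be sketched with a toy "memorising" classifier (purely hypothetical, nothing to do with OpenAI's actual model): it scores perfectly on its own training data and no better than a blind default on anything held out.

```python
# Toy "classifier" that just memorises (text, label) pairs. Evaluated on
# its own training data it looks flawless; on unseen text it falls back
# to a blind default. All data below is made up for illustration.

def train(examples):
    """'Training' is nothing more than memorising the pairs."""
    return dict(examples)

def classify(model, text, default="human"):
    """Return the memorised label, or the blind default for unseen text."""
    return model.get(text, default)

def accuracy(model, examples):
    return sum(classify(model, t) == label for t, label in examples) / len(examples)

train_set = [("the cat sat on the mat", "ai"),
             ("it was a dark and stormy night", "human")]
test_set = [("wholly new prose, never seen in training", "ai"),
            ("another unseen chunk of text", "ai")]

model = train(train_set)
print(accuracy(model, train_set))  # 1.0 - perfect, but meaningless
print(accuracy(model, test_set))   # 0.0 - useless on held-out data
```

      Which is exactly why any honest accuracy figure has to come from held-out data.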

      BTW unlike you my post is not AC.

      1. Anonymous Coward
        Anonymous Coward

        Re: Correlating Commentards Causes Confusion

        You are halfway there (see "not quite", below)

        Those who claim that ChatGPT (almost) always spits out unaltered material from its training set will expect the detector to rarely see anything that doesn't match its training set, and that both the positive and negative training inputs will look almost exactly the same (assuming that they use the same human-generated inputs as used with ChatGPT, which is the sensible thing to do).

        == Why, is it because OpenAI's classifier "struggled with prose it hadn't seen in its training dataset"? That just means their classifier does not really work, because it failed to accurately classify data not in the training data.

        Not quite: it can also mean that the classifier is working perfectly and there simply is no useful difference between the inputs: the ChatGPT text has nothing in it that allows it to be distinguished from the human generated text.

        So, if ChatGPT only spits out existing text then the detector's failure is what you would expect.

        == BTW unlike you my post is not AC.

        Well done you. That last line was just a rather obvious "poke a stick into the cage", with a nice big exclamation mark to boot. Commentards usually have thicker skins and shrug those off as being rather too obvious to react to.

      2. Anonymous Coward
        Anonymous Coward

        Re: Correlating Commentards Causes Confusion

        "BTW unlike you my post is not AC"

        But OP wants an experiment relating to poster's tags - if they didn't post as AC that would skew the results.

        Don't you want to see even social science done the best it can be?

    2. doublelayer Silver badge

      Re: Correlating Commentards Causes Confusion

      One reason that won't be very interesting is that there basically wasn't any debate on that topic. GPT sometimes prints large chunks of text verbatim, but even more frequently mashes up small bits from lots of chunks and returns those. Either is pretty easy to prove, since it can be made to either quote something accurately which can be verified (depending on what it quoted), or to incorrectly state something which its training text would have accurately stated, demonstrating that it had modified the original text to get there. So the answer to the debate is that it does both.

      It's also not interesting because what would a failed classifier prove about what GPT was doing? Yes, one thing they could have done is to have the classifier classify anything from the training data as AI-generated and everything not in that as not. Obviously, that wouldn't produce the results they were going for, so they didn't do that. The other method they used also didn't work, probably because these LLMs have mastered English to the extent that their output and human output are hard to tell apart just on the basis of word usage or sentence structure, although depending on the subject matter, it might be more obvious to humans who have more context. Of course, nothing guarantees that they were competent while making that classifier, so it could have failed for more basic architectural reasons. I have little confidence that they will ever be able to make an accurate classifier for this purpose.

      1. Anonymous Coward
        Anonymous Coward

        Failing at the impossible - shocking

        English and most every human grammar make this task provably impossible. (most including every single human language I have ever looked at in depth, but since I don't know every human grammar let's leave that open ended for now)

        Fake-AI shilling brogrammers can stick their heads in the sand, but the problems were laid out by the linguists in the pre-computing era. People can't do this with other people, and a computer can't do what the people can't do. Maybe hire some more linguists and philosophy majors, and then listen to the ones who tell you it can't be done, and why.

        (Not going into more detail as I am leaving to meet an underemployed philosophy major who used to work for Google for drinks. He deserves the paycheck, so post a job for it if you want the answer, or work it out on your own and claim the prize for yourself.)

    3. MyffyW Silver badge

      Re: Correlating Commentards Causes Confusion

      I can barely be arsed to write "no it doesn't, it uses (tokenisation)", but it looks like I now have.

      Non-anon because as the beautifully-flawed Wendy James once sang, baby, I don't care.

      1. Falmari Silver badge
        Pint

        Re: Correlating Commentards Causes Confusion

        Have a beer for the Transvision Vamp reference :) ---->

  2. Howard Sway Silver badge

    AI classifier is no longer available due to its low rate of accuracy

    So, you've built an AI tool to detect AI text because AI text is mostly inaccurate, and that tool's output is also mostly inaccurate.

    In other words a failed AI fails to detect other failed AI.

    This stuff is amazing. Definitely going to replace all our jobs.

    1. heyrick Silver badge

      Re: AI classifier is no longer available due to its low rate of accuracy

      Don't laugh - manglement will be looking at costs, not accuracy...

      1. ethindp

        Re: AI classifier is no longer available due to its low rate of accuracy

        Accuracy might not matter now, but I'll be laughing when businesses who are jumping on the AI train realize that everyone the AI "hires" is highly unqualified and unable to perform the duties assigned to them, or they're unable to get the AI to do what they want, so have to hire people who will tell the AI what to do because... The AI isn't actually intelligent, as much as OpenAI would like you to believe that it is. Or, even worse: a business does something monumentally stupid like putting an AI in charge of finances and suddenly the corporation is investigated for fraud because the AI "decided" to do something unlawful.

    2. Michael Strorm Silver badge

      Re: AI classifier is no longer available due to its low rate of accuracy

      Yo dawg, etc.

    3. MrAptronym

      Re: AI classifier is no longer available due to its low rate of accuracy

      It's going to replace our jobs to save costs... just before our companies go under when it turns out the AI can't do our jobs.

  3. that one in the corner Silver badge

    Do OpenAI really want a working LLM detector? Or is it a bluff?

    Training your neural net, to turn it into an LLM, requires an amount of feedback to direct the learning process: a few short years ago we were being told of "new" techniques like GANs[1], where the training eats up even more machine cycles by training two nets to basically fight each other. So long as you have the cycles available, GANs are cheaper and quicker than using humans to read all the output and grade it. You can use other software (e.g. to stop the LLM getting away with just spitting out nonsense words that aren't in the dictionary, or not allowing long repetitions of one word) but that is going to be very limited in scope.

    So, what are the chances that ChatGPT was created alongside its GAN nemesis: one generates text that is as good as humans can manage, the other tries to spot the difference. When the training is complete, you hope that the generator is fooling the detector as much as possible, preferably all the time.
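    The adversarial setup described here can be caricatured in a few lines (not a real GAN: no nets, no gradients, just the generator-vs-detector loop, with every name and sample made up):

```python
# Crude caricature of GAN-style training: a "generator" tries to fool a
# "discriminator" that scores how machine-like a text looks. Real GANs
# co-train two neural nets by gradient descent; this toy just picks the
# candidate with the lowest detector score. All samples are invented.

HUMAN_SAMPLES = ["the weather is mild today", "cats sleep most of the day"]
HUMAN_VOCAB = {w for s in HUMAN_SAMPLES for w in s.split()}

def discriminator(text):
    """Fraction of words not seen in human samples: higher = more 'machine'."""
    words = text.split()
    return sum(w not in HUMAN_VOCAB for w in words) / len(words)

def generator_step(candidates):
    """The generator keeps whichever candidate best fools the detector."""
    return min(candidates, key=discriminator)

candidates = ["zxqv blort cats sleep",
              "cats sleep most of the day",
              "mild weather qq today"]
best = generator_step(candidates)
print(best)                 # the all-human-vocabulary candidate wins
print(discriminator(best))  # 0.0 - indistinguishable, by this detector
```

    When training ends with the detector unable to score the generator's output above the human baseline, you have exactly the situation described here: a detector that, by construction, can no longer tell the two apart.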

    OpenAI just happen to have a detector available, do they? I wonder where that came from. And they have let people feed into it lots and lots of samples of the sort of text that the public want to verify.

    Now OpenAI admit that the detector can not spot the differences and have taken it away. At the same time, promising to come up with some kind of watermarking scheme, just what the administration would like to see. A watermarking scheme that, if they develop it first, would be pushed by the administration (with subtle hints from Microsoft) as something that every LLM creator should make use of. For a suitable licence fee, of course.

    By the way, you know those restrictions that all the big boy LLMs have? About not allowing you to use their output to train another LLM? Do you think the adversarial detector counts as an LLM? It is large, it is a model and it is only concerned with language... So if you suddenly turn around and claim to already have software that can detect ChatGPT output, how did you manage to create it? Want to prove in court that it isn't an LLM? Hey, did you know that OpenAI (or another Microsoft subsidiary) has used a very similar technique?

    [1] knowing what this stands for doesn't really help, but just in case you don't know: Generative Adversarial Network.

  4. Anonymous Coward
    Anonymous Coward

    Stochastic Parrot shall snitch unto Stochastic Parrot

    Stochastic Parrot.

  5. ChoHag Silver badge

    If a student's essay is written in the same voice as all the other essays that student has written, then it's probably not AI. If the level of knowledge they display is about equal to the amount they've been taught, it's probably not AI.

    But to know that, you'd have to take the time to teach them.

    1. heyrick Silver badge

      And bother to read their homework rather than just copy-pasting it into an "is it AI?" box.

    2. MrAptronym

      As a TA, I had to grade essays from a 100+ person class. Not going to lie: I would not get to know the writing style for like 95% of them. Grading all those essays in addition to my grad school work left me with a minute or two to grade a 1 to 2 page essay. I would also specifically try not to look at names to avoid any bias.

    3. Anonymous Coward
      Anonymous Coward

      Even then

      What you outline will only catch obvious cut and paste plagiarism and 3rd party authorship. That is the first round in the adversarial cat and mouse game the LLMs will engage in.

      Some of the coders won't give a shit, but it's trivial to provide feedback to these systems to let them adapt to the cheat detection software.

      It has already been shown that an LLM with front-ended adversarial training will consistently pass these cheat detection systems, while actual humans get marked as false positives. Especially those with anything other than a middle-of-the-road profile, and that's not just race or native language issues: even regional idiom and accent inflect writing and trip these things up, because the statistical models they are using are factually broken.

      This is more snake oil sold by idiots to other idiots.

  6. Seajay
    Boffin

    Barking up the wrong tree?

    Attempting to determine if something is AI-written or not seems doomed to failure in the long run. Sure, at the moment it might be possible to discern some patterns (although obviously even this is difficult, given the article), but as AI generation improves the output will become indistinguishable from human-written text.

    Instead of trying to spot AI text in terms of student submissions, you really need to look at how students are assessed, and what new and maybe even so far untried methods there might be for determining knowledge and understanding.

    If you want to be really radical, you could say that those who "cheat" by using these tools will be found out eventually - either that or they get along just fine in life, using the tools they have to hand - and is that such a bad thing either? Discuss.

    1. Mike 137 Silver badge

      Re: Barking up the wrong tree?

      "what new and maybe even so far untried methods there might be for determining knowledge and understanding"

      There's an old and well tried but latterly abandoned method that worked superbly -- have an expert in the subject talk to the student, asking them to explain and justify whatever assertions they make. Oh sorry, I forgot -- that's just too expensive despite (here in the UK) the course costing said student £9000 per year. If we turn higher education into an expensive diploma mill we must expect street wise students to respond appropriately by fiddling.

      1. Anonymous Coward
        Anonymous Coward

        With one small caveat

        That the session be recorded so the student and reviewer can get a review for fairness and bias. One of the reasons these were dropped had nothing to do with the money, it was the difficulty of screening for fairness and bias in the questioning and grading. Another is that some people are terrible talking things out on their feet under pressure, due to social anxiety, neurodiversity, language proficiency, and a slew of other issues. That said, in some form it's probably still a useful tool.

        One alternative is to prevent the verbal interview from determining grading, but make it a checkpoint for credit and progression. If the kid paid a couple grand for the class, I'm less concerned about them getting do-overs than about them being able to demonstrate knowledge and proficiency in the subject matter. Verbal reviews would also help with other forms of cheating, and with students cramming stuff into short-term memory before an exam and forgetting it a week later.

        Too many students I sat classes with let the material go in one ear and out the other, and were only worried about the grade. Not good in medical or engineering degrees.

    2. Ken Moorhouse Silver badge

      Re: either that or they get along just fine in life, using the tools they have to hand

      But when they venture out into the real world, if they can't do tasks without AI as a crutch, then that reduces the usefulness of the work force by sinking it to the lowest common denominator.

      We are gradually being deskilled in almost every aspect of our lives. Cooking, diy, car repair, darning socks. Add to that: writing a letter or report, coding, researching using primary records. Let's not make things even worse.

      1. Will Godfrey Silver badge

        Re: either that or they get along just fine in life, using the tools they have to hand

        Ummm. Some of us are, and some of us are not.

      2. Aitor 1

        Re: either that or they get along just fine in life, using the tools they have to hand

        Why?

        Ai is a tool, and I don't require to know how to paint in order to take pictures.

      3. Seajay

        Re: either that or they get along just fine in life, using the tools they have to hand

        Interesting... does it though? "...if they can't do tasks without AI as a crutch, then that reduces the usefulness of the work force..."

        The point is, if they are using tools to do the job anyway, then the job is being done. So are they less useful or just resourceful in using tools to help them? We all use tools to make things easier - this is just another step!

        If on the other hand the tools mean they are doing the job badly, won't that be self limiting? (I'm not stating any of this as fact, but I do think it's an interesting discussion, and not necessarily as clear cut as it appears!)

        1. Ken Moorhouse Silver badge

          Re: if they are using tools to do the job anyway, then the job is being done

          Provided the information is accurate. At the present time, we can 'triangulate' our results against other sources, but when those other sources become submerged into a soup of dubious sources then there is no concrete reference.

          Let's take genealogy as an example. I can go onto a website and get census results for a particular family in a particular house in 1901. This information is primary data, in a way: it was collected by an 'official' visitor to the house in 1901, who recorded it. The person giving that information may not be in a position to give it in a coherent way, they may not understand the question, they may not know the answer, they may not be able to read what the census official wrote down, and they may even evade giving the correct information. But it is up to us to evaluate its likelihood of accuracy and to delve deeper in other ways. Then, over 100 years later, that information has been transcribed, with further errors being made in the transcription: ages could be miscalculated, names misspelt. In most instances at present it is possible to go in, look at the original scanned record and try to interpret it to see if it has been properly transcribed. Now imagine all of the original scans of those records being discarded. Data has been lost, and forever so, unless a researcher has microfilms of the original scans and has recorded and uploaded them somewhere for others to find.

          Now a lot of fair-weather genealogists would accept the results they get from a quasi-primary source such as this as gospel, and continue to build their family tree based on inaccurate information. More diligent researchers will 'triangulate' that data against birth records (which can be inaccurate), baptismal records (which can be inaccurate), marriage records (which can be inaccurate), death records (ditto), gravestone markings (ditto)... you get the picture, I hope.

          The point is that the 'primary data' amongst that lot is vital to keep on the surface somewhere, in libraries, churches, etc. in its original recorded form. It occupies space, and is relatively difficult to access, but it is necessary for researchers to evaluate as facts come to light. I suspect however it will eventually all be destroyed because "why do we need it?" When that happens we have closed the trap-door of history down upon ourselves and are reliant on nebulous inferences of it instead. The storage of those nebulous inferences likely is extremely voluminous and of questionable veracity.

          People do say that this type of search is not what AI is about, and it should not be used to produce inferences of this sort, but does it tell you this if you search for them? Does the person using these tools know how to use them? There is a very real danger that they don't. What happened to the expression GIGO? (Garbage In Garbage Out). I don't think I've heard it uttered, of late. Time to revisit it, I feel.

          ===

          There is the efficiency angle too, which nobody would consider because it is hidden away in a datacenter somewhere, out of sight, out of mind. The fact is that datacenters are using heavy resources and are bad for the earth.

  7. Primus Secundus Tertius

    No worse than the average human

    One might almost think that AI writing is no worse than average human writing. Far too often we flatter ourselves.

    1. katrinab Silver badge
      Boffin

      Re: No worse than the average human

      AI is programmed to mimic human writing, and therefore is unable to tell the difference between its writing and human writing.

      If it was able to tell the difference, its writing wouldn't have those differences.

  8. The Central Scrutinizer

    OpenAI has taken down its AI classifier months after it was released due to its inability to accurately determine whether a chunk of text was automatically generated by a large language model or written by a human.

    And there endeth the story.

  9. Anonymous Coward
    Anonymous Coward

    Other Available Classifiers For Written Submissions.......

    (1) Is any of the writing true? If no, then perhaps we need to mark it "False". Do we care how it was written?

    (2) Are any of the assertions in the writing novel? If no, then mark it "Seen this a million times".

    ...........If yes, perhaps it's worth examining the assertions, irrespective of how they were written.

    (3) Does the writing break the law? Ah.....now we need to find out who is responsible! If it's AI, we need to find the owner of the AI. Otherwise, find the real human being.

    ...........This type of material will no doubt be attractive to lawyers! Perhaps lawyers will be huge buyers of AI dedicated to finding illegal texts?

    Other commentards here on El Reg will no doubt be more able than I am at finding other classifiers.....but these three might be a start!!

    Oh....and about so called "training databases".....how much content in these databases would be marked "False"? How much content might break the law? I think we should be told!!!!

  10. FrogsAndChips Silver badge

    Detecting human edits

    The classifier didn't work very well on writing that had been AI-generated and edited by humans

    Sure, but does it still classify as AI-text if it's been edited? Depends on the level of editing, I guess (with 80% accuracy).

  11. Ken Moorhouse Silver badge

    AI Classification is Important

    If the concept of AI is considered to be something humankind should get involved with (I think it should be outlawed) then working out if something is AI generated is vital. Why? Simply because if you feed AI generated nonsense into the AI learning pool then the quality of output from the AI system will decline further.

    Outlawing the use of AI is also going to be impossible without an AI classifier.

    We are already in a vicious circle, heading downwards very quickly.

  12. steelpillow Silver badge
    Holmes

    So, we have a new Turing test.

    A candidate is truly intelligent if it can distinguish between other intelligences which are or aren't able to do the same. Discuss.

  13. amanfromMars 1 Silver badge

    What do you think comes next after an Almightily Capitalised CyberSpace Venture*

    The Register asked OpenAI for further comment and any predicted release date for a new build of the classifier.

    :-) El Reg doing the NEUKlearer HyperRadioProACTive IT Quantum Communication thing ..... both biting the hand that feeds IT and licking it to reveal that which it is trying to be stolen and owned by A.N.Others as if rightly entitled to and by it being of their own invention and creation?

    Bravo, El Reg. That's more like it .. in these pathetic days and energetic 0days of turmoil and conflict, misleading information and CHAOS [Clouds Hosting Advanced Operating Systems]

    * ..... A Grand AIMaster Piloted ProgramMING Project ‽ .....https://forums.theregister.com/forum/all/2023/07/20/ultra_ethernet_consortium_ai_hpc/#c_4701629

    1. decentralised

      Re: What do you think comes next after an Almightily Capitalised CyberSpace Venture*

      https://www.nationaldefensemagazine.org/articles/2023/7/25/defense-department-needs-a-data-centric-digital-security-organization

      Indeed, this looks like a damn nasty minefield. Especially with our universal propensity of "cost-cutting".

  14. Bebu Silver badge
    Headmaster

    Halting problem?

    I would have thought AI detecting AI would be equivalent to the halting problem ie formally undecidable.

    The halting problem was normally expressed in terms of a Turing Machine (when I was a student) so Alan Turing gets another guernsey here :)

    Although I suspect a bullshit detector can be provably 100% effective.

    1. Anonymous Coward
      Anonymous Coward

      Re: Halting problem?

        While not related explicitly to the halting problem, you are on the right path that there is a foundational problem with what they are trying to do. The crux is that the middle of this is just plain text. The complexity of the language is high and the medium is low. So whether it's two people, two machines, or a well trained parrot, there can't be a general solution to detecting the authorship or nature of the author from plain text.

        You might be able to do it sometimes, but without huge constraints on the system, the author can simply say things in neutral enough language that there isn't any actual evidence of difference and still communicate. There are an infinite number of banal and trivial conversations, statements and essays that no classifier can ever sort reliably. As an example, if you assigned every person on earth to generate a four-sentence, one-paragraph essay on cats, you'd inevitably get a few exact duplicates, in dozens of languages. Then there is the issue of what your classifier would do with people who have never seen a cat.

        Or, to put it another way: was this sequence of numbers generated by a machine, a two year old, or a former QC tester?

      123456789

      The best you can do is guess.
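        That ceiling can even be counted out in a few (hypothetical) lines: a deterministic classifier must give every string a single label, so on texts that both humans and machines produce equally often, no classifier can beat a coin flip.

```python
from collections import Counter

# Upper bound on any deterministic classifier: for each distinct text it
# can only ever answer one label, so the best it can do is always guess
# that text's majority label. Sample data is invented for illustration.

def best_possible_accuracy(samples):
    by_text = {}
    for text, label in samples:
        by_text.setdefault(text, Counter())[label] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_text.values())
    return correct / len(samples)

# "123456789" written 50 times by humans and 50 times by machines:
samples = [("123456789", "machine")] * 50 + [("123456789", "human")] * 50
print(best_possible_accuracy(samples))  # 0.5 - a coin flip, at best
```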

  15. Duncan10101
    FAIL

    They made a machine that passes the Turing test ...

    ... and now everyone's crying that it passes the Turing test.

    Wikipedia describes the test like so: If the evaluator could not reliably tell the machine from the human, the machine would be said to have passed the test.

    So ... whose bright idea was it to put research money into being able to "Reliably tell the machine from the human"?

    THAT'S ITS FUCKING DEFINITION.

  16. Anonymous Coward
    Anonymous Coward

    Well, ChatGPT thinks the US constitution was AI written

    Here https://arstechnica.com/information-technology/2023/07/why-ai-detectors-think-the-us-constitution-was-written-by-ai/

  17. Ken Moorhouse Silver badge

    Every time I play with ChatGPT I am gobsmacked by its stupidity

    I have a client who is an actor. I just asked ChatGPT "Who is <name>?"

    It responded with what, on the face of it, looks like his Wikipedia entry, but somehow pulled out of the ether the date that he died. Er? He's still living. I've just emailed him to find out where his stipulated date of death (exactly specified as 18 June 2018) came from. I suspect he will give an opinion on some of the other tripe that it has come up with too...

    1. Ken Moorhouse Silver badge

      Re: Every time I play with ChatGPT I am gobsmacked by its stupidity

      Just had a reply from my actor client, very little of ChatGPT's output was accurate.

      ChatGPT's responses remind me of dreams. A lot of dreams that I can remember when I wake up are to do with things I've done, people I've met. But in a lot of cases the people involved are swapped with someone different. The term that best describes this is 'to convolve'. A scene is set where various people are convolved together and presented to me in the dream.

      In the question I asked, ChatGPT asserted that my client had been in Brush Strokes, Last of the Summer Wine, EastEnders, Z-Cars, Dad's Army and Blake's 7, but he was in none of those; he was in numerous others instead. Presumably there may be another actor who had been in these productions where ChatGPT says not.

      1. that one in the corner Silver badge

        Re: Every time I play with ChatGPT I am gobsmacked by its stupidity

        You weren't tempted to ask ChatGPT which actor(s) it thought had appeared in that list of productions? See if it attempts to be consistent.

        1. Ken Moorhouse Silver badge

          Re: You weren't tempted to ask ChatGPT which actor(s) ... had appeared in that list of productions?

          Good thinking: So I did, here's its response (my client is Michael Kilgarriff btw):-

          The actor who appeared in all of the TV shows "Brush Strokes," "Last of the Summer Wine," "EastEnders," "Z-Cars," "Dad's Army," and "Blake's 7" is Christopher Beeny. He was a British actor known for his work in these popular television series. Christopher Beeny passed away on January 3, 2020
