ChatGPT can't pass these medical exams – yet

ChatGPT has failed to pass the American College of Gastroenterology exams and is not capable of generating accurate medical information for patients, doctors have warned. A study led by physicians at the Feinstein Institutes for Medical Research tested both variants of ChatGPT – powered by OpenAI's older GPT-3.5 model and the …

  1. sebacoustic

    multiple guess

    60ish percent -- hmm. How high is the "tick boxes at random" score for this test?

    1. b0llchit Silver badge
      Joke

      Re: multiple guess

      That would probably be 25%. Then, using some great chatGPT math, we can conclude that the real success rate of the chat bot is between 87.4% and 90.1%.

      Therefore, the doctors are wrong! It is great medical advice to ask chatGPT. You'll get, again using great chatGPT math, between 9.9% and 12.6% correct answers from your robotic overlord. Nothing can possibly go wrong, chatGPT told me so.

  2. Chris Miller

    If I were allowed unlimited time and free access to the Internet, I suspect I could pass this exam (or almost any written test), even though I know nothing about gastroenterology.

    1. Herring` Silver badge

      You could do it on gut feel. But you may be talking through your arse.

  3. FeepingCreature

    Passing grade

    > "I don't think a patient would be comfortable with a doctor that only knows 70 percent of his or her medical field. If we demand this high standard for our doctors, we should demand this high standard from medical chatbots," he added.

    Then why is 70% the passing grade?

    As the joke goes: "What do you call the person who graduates medical school with the worst grades in their year?" "Doctor."

    1. Pete 2 Silver badge

      Re: Passing grade - mirror, indicate before passing

      Yes, this does seem to be an example of the double standards (bias against machines?) employed in many fields.

      Many articles that talk about autonomous vehicles give the impression that nothing less than perfection is acceptable. Yet the standard of AI driving is already (measured as accidents per 100,000 miles/km) better than the average for police drivers.

      This seems to be a confidence issue rather than one about actual abilities. Maybe what's needed is some blind comparisons: doctors' diagnoses vs. the machine's, and see where the truth lies.

      1. FeepingCreature

        Re: Passing grade - mirror, indicate before passing

        To be fair, the self-driving numbers are in part because the autopilot gets to choose to disengage in operation when it hits a problematic situation. Drivers generally can't decide to stop the car and bail in the middle of an intersection.

        To complete the blind comparisons, I would also be extremely interested in "ChatGPT answer delivered by a doctor" and "doctor's answer delivered by the ChatGPT website."

        1. DS999 Silver badge

          Re: Passing grade - mirror, indicate before passing

          To be fair, the self-driving numbers are in part because the autopilot gets to choose to disengage in operation

          Not only that, the driver decides when to engage it. When are you most likely to engage it: on a boring stretch of highway where you feel nothing much is likely to happen, or in the middle of NYC or London traffic where, unless you are a daily driver in those places, you are white-knuckling it and hoping to reach your destination in one piece? Probably no one is engaging autopilot in a storm or other impaired-visibility scenarios.

          Self-driving engagement is self-selected to be basically the easiest stuff to drive, while humans have to drive all the time. Tesla fanboys love to tout accidents per mile, claiming autopilot is already better than human drivers, but those stats are meaningless until autopilot drives 100% of the time, or until you compare accident rates only in exactly the same conditions for each.

      2. that one in the corner Silver badge

        Re: Passing grade - mirror, indicate before passing

        > Yet the standard of AI driving is already (measured as accidents per 100,000 miles/km) better than the average for police drivers

        How do the numbers for Police driving break down between, say, pootling from A to B with no particular urgency, responding to a callout, and chasing some prat at high speed?

        And how many actual Police miles are driven a year compared to AI miles? Along motorways, town roads or twisty B roads?

        Are your numbers really comparing like for like?

        1. DS999 Silver badge

          Re: Passing grade - mirror, indicate before passing

          Police can flout driving rules at will, knowing they won't get pulled over or ticketed. They are probably the last people you'd want to compare to unless your intent is to rig the comparison in favor of self driving.

      3. vtcodger Silver badge

        Re: Passing grade - mirror, indicate before passing

        Downvoted because there are situations where sort-of-OK is not a responsible goal. Safety in cars, nuclear reactors and air travel are among them. Even if you aim for perfection, you'll end up with a flawed system because you'll make mistakes. If you aim for less, you'll likely end up with a system that unnecessarily endangers others.

    2. Mike 137 Silver badge

      Re: Passing grade

      "Then why is 70% the passing grade?"

      Because enough candidates must pass to keep the course in operation, and as demand for certificates as passports to employment increases, the pass threshold declines, as do the demands on knowledge.

      In the UK in ages past, "honours" required a fourth year of undergraduate study beyond the "ordinary" degree. Then it became synonymous -- ordinary 3 year degrees being relabelled as "honours". Then they introduced the unmarked first year, resulting in what are in effect two-year "honours" degrees. When I graduated some three decades back, around 7% of candidates got first class honours, but even then 70% was the threshold. Now it's reported that around 30% get firsts. I'm sure that the teaching is not four times as good, and I doubt that the student body is four times smarter either. We have to keep candidates passing well or the revenue stream will dry up.

      1. TheMaskedMan Silver badge

        Re: Passing grade

        "In the UK in ages past, "honours" required a fourth year of undergraduate study beyond the "ordinary" degree. Then it became synonymous -- ordinary 3 year degrees being relabelled as "honours""

        When I graduated in the early 90s, the honours part of the course was a fairly substantial project completed alongside routine classes. I had a project supervisor, and someone above him supervising all the year's projects. I had to complete a viva at the end of the year.

        A LOT of work went into that project - mine required me to devise a methodology for a particular kind of task, then demonstrate the use of that methodology by using it to write a fairly large piece of software, then document the software and write up the methodology. It took me most of my final year to do. It used to really piss me off that history honours students were just required to write a 10,000 word essay, and that most of them did that in a couple of days in the library after their finals.

    3. that one in the corner Silver badge

      Re: Passing grade

      > Then why is 70% the passing grade?

      Because even the worst graded medical student will go on from the exam to a junior learning post where they'll be expected to continue to study and be brought up to speed (or remain in a lowly position in the gastro unit).

      With ChatGPT, that is it - one shot and it is done.

      *Maybe* the next iteration will do better - maybe not (see article).

      PS I was trying to find the Foil Arms and Hog "Doctor Google" sketch but Google is refusing to find it for me. Anyone?

      PPS Doooomdahh

    4. Anonymous Coward
      Anonymous Coward

      Re: Passing grade

      "70%"

      Well it is called a 'practice' after all...

  4. abend0c4 Silver badge

    OpenAI is secretive about the way it trains its models

    And herein lies the real problem: do you want your medical advice coming from a secret source whose origin cannot be revealed?

    The word for something that gives the appearance of improbable success in circumstances where you're not allowed to look behind the curtain is "magic".

    That's the kind of medicine it's taken generations to get away from.

    1. Pete 2 Silver badge

      Re: OpenAI is secretive about the way it trains its models

      > do you want your medical advice coming from a secret source whose origin cannot be revealed?

      Isn't that what is commonly called "experience"?

      Some years ago a large computer consultancy set about automating a lot of processes. I had an external person following me around for a week, taking note of how I worked and in particular how I debugged operational issues. Each time I solved a problem, this person would ask me the process I had used to reach a solution. Most times the reply would come down to "I have 25 years of experience, it looked like something I'd seen before".

      Which wasn't very helpful - though admittedly I felt no obligation to be helpful ;) - but it was the truth.

      1. Benegesserict Cumbersomberbatch Silver badge

        Re: OpenAI is secretive about the way it trains its models

        There's nothing secret about the sort of education that years of successes and failures brings - in a way, that's how at least some of medicine has been taught for centuries, aided more recently of course by the scientific method.

        But an algorithm whose internals even the person who wrote it can't explain, beyond "we threw hordes of texts of varying provenance at a wall and this is what stuck", sounds like hocus pocus.

        What the AI industry calls hallucination, your physician would call confabulation, and everyone else calls making shit up. Not uncommon in these language models, it seems. Also not a trait I look for in a doctor.

    2. Anonymous Coward
      Anonymous Coward

      Re: OpenAI is secretive about the way it trains its models

      Do I want to take medical advice from a source that won’t read medical journals because it’s too cheap to pay for them?

  5. Anonymous Coward
    Anonymous Coward

    I have a major problem with this ChatGPT rush

    .. which is also why I *seriously* question Microsoft sticking it in anything it can lay its hands on in the hope that that will somehow liven up their sales.

    To me, this feels like we may be heading for a computer version of thalidomide.

    There, a 'wonderdrug' was found in the 1950s to be useful for all sorts of purposes and its use kept spreading under various different names (which made it harder to tie the problems together). It took 5 years for the appalling truth to be traced back to the drug. Nowadays, safe use cases have been found (in some cases even spectacular ones), but not before there was a full generation of fairly dramatic human disasters and decades of dealing with them.

    I see four problems here:

    1 - sticking it everywhere without having fully evaluated consequences (some of that is simply because we don't know them yet);

    2 - the utter lack of accountability of software companies for the aforementioned consequences;

    3 - it's now a buzzword, which means the critical thinking dial is turned *way* down;

    4 - it's Microsoft. For those who know its history, I don't need to say more (and if you don't, look it up).

    Even if your risk management has arrived at the conclusion that you should avoid Microsoft products, that doesn't mean you won't be exposed to the consequences of others using it, in volume.

    So no, I am not filled with enthusiasm for the new toy being put into places where it can cause massive harm without someone being able to say 'no' to the whole idea.

    1. that one in the corner Silver badge

      Re: I have a major problem with this ChatGPT rush

      > To me, this feels like we may be heading for a computer version of thalidomide.

      You mean, something where we literally had to learn a whole new area of drug research (chirality) before being able to understand what was going on?

      The thalidomide disaster was a disaster for everyone caught up in it (thankfully, those I knew personally were just getting on with it, no fuss - coping pretty well, actually). No getting away from that.

      But, much as I question the use of ChatGPT et al for anything other than amusement, I'm not seeing that there is a connection to be made here.

      1. Manolo
        Headmaster

        Re: I have a major problem with this ChatGPT rush

        Lack of understanding of chirality was not the root cause of the thalidomide disaster.

        The thalidomide currently in clinical use is still a racemic mixture.

        ((RS)-N-(2,6-dioxo-3-piperidyl)-phthalimide - note the RS)

        Lack of mandatory rigorous testing procedures was.

    2. Tom 38

      Re: I have a major problem with this ChatGPT rush

      I don't agree with this analogy at all:

      sticking it everywhere without having fully evaluated consequences

      ChatGPT produces text. We can fully evaluate the consequences of text, because we can read and review it. We use ChatGPT professionally at work in the form of Copilot for development, and a custom agent trained to respond to customer queries, but in both cases it simply generates text which can be accepted, rejected or edited. It doesn't provide the solution, and never will; it simply enables a human to produce an output quicker than starting from zero.

      1. StewartWhite Bronze badge
        Stop

        Re: I have a major problem with this ChatGPT rush

        You're naive in the extreme if you think that companies won't use ChatGPT to "provide the solution". Organisations such as IBM and BT that are salivating at the thought of culling vast numbers of jobs simply won't care whether the results it produces are correct; 80% right is probably good enough from their point of view, and if anybody complains they'll just be told "it's AI mate, so it must be right". The initial automated output from the mills that the much-maligned Luddites complained about was terrible but very cheap, which was all the mill owners cared about.

        For anybody that thinks that this is going to be the same as other waves of automation and that their job is safe because "they're different from everybody else" - dream on. The neo-liberal consensus seems to be that ever more wealth must be concentrated in ever fewer pockets at the top of the tree and everybody else can go hang. If you don't believe me, try plotting the average salary of listed company CEOs (+ bonuses if you're feeling conscientious) vs average worker pay at the same companies over the last 20 years.

  6. Flocke Kroes Silver badge

    I think there is a problem with the test

    Several years ago, a machine-learning-based facial "gaydar" was tested on photographs with the faces blurred out. Hiding the faces did not reduce the accuracy. The software must have been basing its decision on something other than the face, like clothing, background, lighting or the composition of the photo.

    As paywalled gastroenterology textbooks and medical journals are not part of the training data, there could be unexpected reasons why ChatGPT is scoring better than a random number generator. My guess is there is some kind of pattern to the phrasing of many of the wrong answers in the multiple choice tests that ChatGPT can exploit to improve its score. A more useful test would be to feed it transcripts of patient consultations and compare ChatGPT's proposed treatments with the recommendations of gastroenterologists with a high success rate.

    1. Caver_Dave Silver badge
      WTF?

      Re: I think there is a problem with the test

      I dropped Biology as soon as I could in school. I had a look at my daughter's multiple-choice GCSE Biology paper a few years back and scored 100%. A combination of a little common sense and very leading questions.

  7. Mike 137 Silver badge

    A fundamental misconception

    "the multiple choice questions taken from the 2021 and 2022 American College of Gastroenterology (ACG) Self-Assessment Tests"

    The very idea that multiple choice exams can validate expertise in life-critical subjects such as medicine is both ludicrous and dangerous. They don't even validate competence in simpler subjects such as infosec or plumbing.

    All a multiple choice question can validate is your ability to remember something when prompted. Real skill lies in the ability to recognise what is going on without being prompted and to draw valid conclusions from that recognition that determine correct courses of action.

    It's not surprising that "AI" is gaining such recognition for problem solving, given the shatteringly low expectations of human capacity for it that are the current norm. Feedback from an infosec management course I co-authored a few years back, which included short answer and essay questions in the exam alongside multiple choice, was that students had problems answering the short answer and essay questions because they couldn't readily express their own ideas.

    Unfortunately, the economics of exam setting and marking have driven us into a corner where multiple choice is the universally preferred option, regardless of its capacity to validate real knowledge. In their 1930 classic "1066 and All That", Sellar and Yeatman famously stated "History is not what you thought. History is what you can remember" - but they were joking. The joke is now on us.

    1. Caver_Dave Silver badge
      Holmes

      Re: A fundamental misconception

      I used to give potential employees (all with at least an MSc) a pencil and a piece of paper and ask them to describe their journey to the interview, "while I found the other interviewer" ;-)

      It was amazing how many of these people couldn't write a coherent description of something that had only just happened!

      Those that could, progressed into the interview with people.

      I was made to stop this by HR as it was "demeaning" in their eyes. In my eyes, if a highly educated person can't be fluent and expressive in their normal language of communication, then they can't be in a programming language, and certainly can't write an unambiguous specification!

      1. MiguelC Silver badge

        Re: A fundamental misconception

        So, did you use to get lots of politicians to pass that first phase and exclude lots of potentially competent coders? As I see it, communicating and coding require quite different skill sets - although I concur that to write good specs, communication skills are a sine qua non

        1. JamesTGrant Bronze badge

          Re: A fundamental misconception

          It's all about thought expression through transmitted means. If you can't express a thought clearly to the intended target in a language that presumably (for a native speaker) you've both been using since you were two years old, one can only imagine the confused spaghetti code that will be churned out in a computer-interpreted language! Even if that person has some sort of neuro-symbiosis with the matrix and writes fluent machine code, the person maintaining the code later will probably be a meat-brained human and require the code to be followable. So, I think that is a very good and useful interview icebreaker! Nice.

  8. Anonymous Coward
    Anonymous Coward

    So....I just asked ChatGPT for a thousand words on......

    ......."Miss Marple Screws It Up"

    ......and I got a thousand words describing how wonderful this non-existent novel was!

    Yup.....coherent, well drafted.....but describing a novel WHICH DOES NOT EXIST!!!!

    Perhaps I should try "Donald Trump's Third Presidency Ends In Failure"...........

  9. blue-eyes

    I was not looking to stump ChatGPT but did so today with the following question:

    Did any communities south of the equator develop words to describe rotational movement in two different directions and if so, what words did they historically use?

  10. Daedalus

    Feynman's Observation

    When Richard Feynman went to teach in Brazil, he encountered a system of education that produced people who could spout answers to questions on demand, providing the answers were those they had learned by rote. So asking about "Brewster's Angle" (relating to the polarization of light reflected from the surface of a transparent medium) he could get chapter and verse from students who actually had no idea what polarization or refractive index meant, and couldn't say why light reflected off water might be polarized.

    I include this because it's exactly the kind of "learning" we can expect from AI as related to medicine or science.
