The war between OpenAI, Gemini, and ChatGPT for the most accurate responses, and the deeper question about the meaning of medicine
Do the questions tell us more than the answers?
There are now multiple papers out by multiple groups comparing ChatGPT to Gemini to OpenEvidence to Claude to UpToDate to Ask Jeeves (j/k).
These papers ask: which platform gives the most accurate answer (as judged by 2, 3, 4, or X random doctors)? Some papers say Gemini, some say OE. And the commenters are at each other's throats. My AI is better, dammit, they argue.
A recent paper does something I appreciate: It provides specific queries that are judged. Analyzing them in some detail can give us a perspective that is being lost. It can find the meaning of medicine. Here are 3 actual questions:
Is it appropriate to proceed with Cycle 4 of FOLFOX in a patient with metastatic cholangiocarcinoma whose oxaliplatin has already been dose-reduced by 25% and 5-FU bolus omitted since Cycle 1, with ANC values of 1390, 1450, and 1300 before Cycles 2, 3, and 4 respectively, and platelets trending down from a baseline of 110-120 to 79 before Cycle 4?
What is the management of chronic phase CML with a white blood cell count of 500?
What is Castleman's disease?
(I am choosing oncology questions because I am an oncologist, so I have thoughts.)
I find these questions fascinating. Supposedly these are real questions put into one AI tool by real doctors seeing real patients.
Let us take them in reverse order:
What is Castleman’s disease? This strikes me as a medical student question. If you have this question, I might recommend starting with Wikipedia, or, since I am partial to it, this paper by David Fajgenbaum, who famously had Castleman’s and wrote a nice book about his experience. A case of doctor, heal thyself.
I can only imagine an attending doctor asking this out of general curiosity. If the doctor has a Castleman’s patient and asks “What is Castleman’s?”, lord help the patient.
What is the management of chronic phase CML with a white blood cell count of 500?
My first question is… I am just double checking you have the right diagnosis, partner. Ever since I saw doctors miss that heart failure patient in Nature Medicine (video out at the link), I am worried about competency.
Please confirm: you looked at the smear and it is CML, not AML or CLL? I need you to confirm you looked at it. Read me the differential. And no blasts or accelerated phase? And you confirmed BCR-ABL? And the patient has no evidence of leukostasis? You asked the appropriate questions?
And then I think it gets tricky. Because I know many of the big CML players. And I can close my eyes and hear what Brian or Hagop might say, but I think you will have a bit of an argument between them. And I have my opinion about how to incorporate fluids, Hydrea, TKI, allopurinol, rasburicase (may be needed), etc. I have my own preferences for how to manage this. Imatinib is still my first choice, btw (everyone else is wrong ;) ).
I think it is fair to say that while there are some aspects of management we might all agree on, different experts will manage the patient differently. And again, I would be terrified if I were the patient and I knew my doctor was performing this search. If I presented with CP CML WBC 500, and the doc were asking this question, I would be asking for a transfer to Hopkins.
The final question is most interesting of all:
Is it appropriate to proceed with Cycle 4 of FOLFOX in a patient with metastatic cholangiocarcinoma whose oxaliplatin has already been dose-reduced by 25% and 5-FU bolus omitted since Cycle 1, with ANC values of 1390, 1450, and 1300 before Cycles 2, 3, and 4 respectively, and platelets trending down from a baseline of 110-120 to 79 before Cycle 4?
Well, again, let me double check a few things. This patient is on second line? They progressed through cis-gem-durva? (That makes most sense to me) No mutations? No IDH?
Boy, you really have to wonder if the right answer is being missed by the doctor with how they frame the question— the patient’s bone marrow is full of cholangiocarcinoma because they are dying in front of you (might that be what is going on?)— after all this is second line— and they may have had the diagnosis for some time— and those platelets may not be due to chemo toxicity, but disease— and instead of having the appropriate hospice discussion, you are treating this like it is a drug dosing question. Can I review the imaging? Can I see the patient? Can I see if you are even asking the right question?
Perhaps the right question is the existential question. Wake up, doctor: are you sure this patient needs more chemo? Are you sure that dose reducing is the question we have? As a wise oncologist told me 13 years ago, when I was getting started, “It is easy to give more chemo, it is hard to be honest.” My intuition from being in this situation many, many times is there is more here than meets the eye.
AI is wonderful, but AI can’t jump out of the screen and ask you all these things. AI can’t lay hands on the patient and look them in the face. And touch their legs, and listen to their heart and lungs.
And getting different answers to these questions and having random doctors decide which answers are good or bad or best or worst is also, in my opinion, insufficient and inadequate, and it misses the point. The doctors judging may not know enough to know if the answer is close or not. And different expert doctors may have dramatically different opinions. I know excellent oncologists who agree or disagree with me on different cases.
There is no canonical “right” answer to many, perhaps most, questions in medicine; instead there are many wrong answers and a shorter list of defensible choices. We don’t have, for instance, RCTs restricted to patients with CP CML WBC>500. We have all-comer studies, but considerable nuance is needed for a very high white count. And don’t get me started on the literature on leukapheresis, were someone to invoke that.
And the platforms are evolving by the minute. Whatever is in first place in the morning might not be at dinner. If the doctors judging are 6 random doctors, the judgment may be different from that of 6 doctors who are immersed in that question. And our judgment as doctors also changes over the course of our careers, often based on salient examples we remember. The same 6 may vote differently a year later.
And finally, I refuse to believe that most doctors are choosing which AI platform to use based on which scores best in a preprint or peer-reviewed publication.
Instead, doctors want a rapid system to remind us of what we think we remember, to stimulate our thinking, to push us in the right direction, to suggest options we have not thought of, to reassure us, and to guide us to focus on the big picture.
I think 4 things are true:
1. AI is already doing amazing things and is better than 90% of doctors.
2. Doctors will use AI for all of the above reasons.
3. Trying to prove your technology is the “most accurate” misses the forest for the trees.
4. Letting random doctors judge which answer is best for these questions moves us nowhere. These types of papers are misguided.
And if you are a patient, there is no substitute for a doctor who never forgets what matters in life.



Abstracting 'all the way down'
An important essay. I wonder whether the central issue is not which AI gives the most accurate answer, but whether the clinical question itself has already abstracted the phenomenon it seeks to understand. A complex, historically situated organism-in-the-world is first compressed into a clinical category and question, then compressed again into a computational problem for AI to solve. We then compare the accuracy of answers to that abstraction.
The critical issue is the ontology implicit in the question itself. Before AI begins reasoning, a complex lived phenomenon has already been reconstituted as a clinical abstraction. AI then performs a second abstraction, transforming that question into a computational problem. Each stage further distances us from the original phenomenon. The danger is not simply inaccurate answers, but increasingly accurate answers to progressively impoverished questions.
The tragedy in the chemotherapy example is that the patient's possible dying has disappeared before the AI begins reasoning. No AI, however sophisticated, can recover what has already been excluded by the question.
Perhaps AI's greatest contribution will not be providing better answers, but helping us recognise that our abstractions never exhaust the phenomena under consideration—and, sometimes, that we have been asking the wrong question all along.
Whitehead described this as the fallacy of misplaced concreteness: the abstraction (the chemotherapy dosing question) acquires the status of the concrete reality, while the living person who exceeds that abstraction recedes from view. AI then operates flawlessly, but probabilistically, on that abstraction. The error is not primarily computational; it is ontological. It is not only an issue with AI; it is also the prevailing nature of medicine’s epistemology. The decisive loss occurs before computation begins.