Agentic AI Comes to Medicine
Two new Nature papers introduce agentic AI in healthcare: MIRA achieves end-to-end emergency department management with superior diagnostic accuracy over physicians; AMIE demonstrates non-inferior or better outpatient management across three visits. While limited, they mark a shift from narrow AI support to autonomous management.
Expansion of Capabilities With Two New Medical AI Models
Eric Topol
Jun 17, 2026
It was just a matter of time. Agentic autonomous AI has already been applied to life science and many other domains, and today there were 2 notable publications in Nature that move this concept forward for healthcare. One is called MIRA, from Jacob Kather and colleagues in Germany, and the other is called AMIE, from Mike Schaekermann and colleagues at Google (acronyms defined below). This work is getting well beyond AI support for narrow applications, such as help in making diagnoses, to full management and end-to-end care plans. They are both very complicated papers with a lot to unpack, including tens of pages of supplementary information to fully describe what they assessed. In this issue of Ground Truths, I’m going to get to the core results and implications. First, a summary table that compares the 2 systems.
Note the Towards in the title of the 2 papers:
Summary of MIRA
This was designed to be embedded in a health system EHR to provide reasoning and action steps. There were 2 agents, the patient and the AI physician (MIRA). MIRA queried the patient’s history and the physical exam results, and could order labs, blood cultures, scans, medications, procedures, surgery, and triage for hospital admission. This was done in 500 established, real emergency department cases, with the MIRA results directly compared to 4 board-certified (BC) physicians, and also to a hybrid group of 2 BC physicians and 2 residents. (I won’t review the results of the hybrid group further since their performance in all tasks was lower than the 4 BC physicians.) It simulated the sequential way a patient’s data would be interrogated and processed. MIRA was enabled with 11 different tools and choices from >85,000 action options, operating in a standards-compliant framework for multi-step reasoning (using FHIR, ICD-10, RxNorm, ATC, LOINC, and SNOMED CT). The system was built on OpenAI’s GPT-4o.
The overall diagnostic accuracy for MIRA was 87.8% compared with 78.1% for BC physicians. That increase was especially notable for specific diagnoses like pancreatitis (95.2% vs 78.6%) and appendicitis (100% for MIRA, 88% for BC physicians). While MIRA ordered more blood tests (51% vs 28%), that resource consumption was countered by ordering substantially fewer scans. For therapy, MIRA surpassed the BC physicians, 53.5% vs 38.3%, for correctly ordering procedures such as laparoscopic appendectomy or cholecystectomy (Figure below). Other advantages for MIRA therapy included better IV fluid management and analgesic adherence to guidelines, and an overall 35% increase in alignment with clinical guidelines compared with the BC physicians. Of 468 medications ordered by MIRA, 99.8% were correct for indication and safety (such as allergy, interactions, and kidney dosing). MIRA triaged more cases than physicians for hospital admission, which reflects that it is not economically driven.
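To put the MIRA numbers side by side, here is a minimal Python sketch that turns the reported percentages into absolute gaps (percentage points) and relative gains. The figures are the ones quoted in this summary; the dictionary layout and the `compare` helper are my own illustration, not anything taken from the paper.

```python
# Reported percentages from the summary: (MIRA, board-certified physicians).
reported = {
    "overall diagnosis": (87.8, 78.1),
    "pancreatitis": (95.2, 78.6),
    "appendicitis": (100.0, 88.0),
    "correct procedure ordered": (53.5, 38.3),
}


def compare(ai_pct, md_pct):
    """Return (absolute gap in percentage points, relative gain in percent)."""
    absolute = ai_pct - md_pct
    relative = 100.0 * absolute / md_pct
    return round(absolute, 1), round(relative, 1)


# Hypothetical helper output: one (absolute, relative) pair per metric.
gaps = {name: compare(ai, md) for name, (ai, md) in reported.items()}

for name, (absolute, relative) in gaps.items():
    print(f"{name}: +{absolute} points, +{relative}% relative")
```

The distinction matters when reading these results: a 9.7-point gap in overall diagnostic accuracy is a relative gain of about 12%, while the procedure-ordering gap is larger on both scales.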
Because the design of the system would allow leakage of the case data to the AI, considerable effort was made to avoid premature information flow. That worked well, with 0 of 933 cases exhibiting any leak. 880 adversarial prompts were tested and the system held up well, as it did for stress-testing attempts at hacking, medico-legal threats, and other patient agent trickery. MIRA was also assessed against multiple patient perturbations, including high anxiety, non-English speaking, paranoia, and diagnostic denial, without any effect on its performance.
Summary of AMIE
This system had a very different design, focused on longitudinal assessment of outpatients, with the primary goal of developing first-rate management plans. Like MIRA, there were 2 agents used. The Dialogue Agent was conversational, interacting with the patient and representing fast, System 1 thinking (à la Danny Kahneman, using Gemini 1.5 Flash). It ran asynchronously to the Mx (management) agent, which represented slow, System 2 thinking and used long-context processing (even though it was quick). AMIE assessed 100 patients with 3 visits (each separated by ~2 days) spanning 5 different specialties. The results were compared with 21 BC primary care physicians. A noteworthy feature was Ensemble Refinement, which took 4 different treatment plans and came up with a consensus, mimicking a real medical treatment board, as is typically seen with cancer management. The massive set of >600 clinical guidelines was fully tokenized (not just parts of them) to provide the grounding and citations for management. This ensemble only took about 80 seconds to produce. Like MIRA, this was all text-based, which was defended as necessary in AMIE for the intent to maintain blinding. 30 physicians rated the performance of AMIE outputs vs the 21 BC PCPs. Unlike MIRA, patient actors were used. The performance wheel below shows the differences for AMIE vs the PCPs for 6 metrics, all leaning to AMIE for being slightly or substantially better.
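The idea of merging 4 candidate plans into one consensus can be illustrated with a toy example. To be clear, this is not how the paper's Ensemble Refinement works internally (that is done by the language model itself); it is a simple majority-vote stand-in, and the `consensus_plan` function and the candidate plans are hypothetical, invented only to show the concept.

```python
from collections import Counter


def consensus_plan(candidate_plans, min_votes=None):
    """Keep plan items that appear in at least min_votes candidates.

    By default an item needs a strict majority of the candidate plans.
    """
    if min_votes is None:
        min_votes = len(candidate_plans) // 2 + 1
    # Count each item once per plan, even if a plan repeats it.
    votes = Counter(item for plan in candidate_plans for item in set(plan))
    return sorted(item for item, n in votes.items() if n >= min_votes)


# Hypothetical candidate plans, for illustration only.
candidates = [
    ["check penicillin allergy", "amoxicillin 500 mg", "follow-up visit"],
    ["check penicillin allergy", "amoxicillin 500 mg"],
    ["amoxicillin 500 mg", "follow-up visit", "chest x-ray"],
    ["check penicillin allergy", "amoxicillin 500 mg", "follow-up visit"],
]

plan = consensus_plan(candidates)
print(plan)
```

With 4 candidates, an item needs 3 votes to survive, so the one-off chest x-ray is dropped while the items most plans agree on are kept.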
Overall, for management reasoning, AMIE was non-inferior to the 21 PCPs. By the 3rd outpatient visit, the rating of AMIE’s management plan was 98% vs 81% for the PCPs. Preciseness of treatment was 95% vs 67%. Expert guidelines alignment was 100% vs 86%, respectively. So while the overall management was non-inferior, there were several ratings that favored AMIE’s management.
A new benchmark for medication management was developed, called RxQA, built on 600 questions to board-certified pharmacists and the content from the UK and US formularies. AMIE’s medication management outperformed the PCPs for the correct medication, dose, duration, side-effects, and follow-up evaluation. This was found with the more difficult cases even when the PCPs were tested open-book (58% vs 48% in favor of AI).
The Implications
The new studies raise the level of capabilities for medical AI beyond what has been previously studied, which was relatively narrow, mainly support in making a diagnosis or answering questions on a medical exam or from a patient. MIRA took on autonomous agent end-to-end assessment and action plans for each patient presenting to an emergency department. AMIE was geared to provide longitudinal assessment across 3 outpatient visits. Both used only 2 agents, one interacting with the patient and the other to do the AI work.
There are many reasons to think these are preliminary findings that do not reflect the real world of medicine. Both MIRA and AMIE are text-only AI, meaning all the other things that are part of medicine, from the patient’s non-verbal communication and tone of voice to the review of actual medical images, were not included. The cases used in both studies were “clean,” complete data from established datasets. MIRA interactions were capped at 20 conversation turns. AMIE used patient actors. There was nothing done that truly represents the practice of medicine, which is typically characterized by incomplete and conflicting data. The 3 outpatient visits in AMIE were set up 2 days apart, hardly simulating how hard it is to get appointments with doctors! Likewise, only 5 specialties of medicine were addressed.
But there were some findings that can be viewed as filling in gaps of current medical care. The improvement in diagnostic accuracy with MIRA was clear, although the comparator group was very small, with only 4 board-certified physicians. Getting 100% correct diagnosis and ordering a laparoscopic procedure for appendicitis is impressive, as was the accuracy of the medications selected. The breadth of conditions in MIRA was limited (n=8) even though there was an array of diagnostic tests, procedures and surgeries as part of the management option mix. The lack of economic incentives in MIRA’s management is desirable and even resulted in fewer expensive scans being ordered.
For AMIE, a big advantage was the longitudinal context across 3 clinic visits. Unlike real patients today, the Patient Dialogue Agent didn’t have to fill out forms. There was memory and efficiency (not what we see in US healthcare). That agent was set up to be empathetic and attentive, which showed up in the patient preferences, a pretty big gradient favoring the AI.
Both of these medical AI models were remarkably precise with respect to sticking to guidelines or providing specific plans. For example, the PCP in the AMIE study wrote “give an antibiotic,” whereas the AI prescribed amoxicillin 500 mg orally, 3 tablets daily for 7 days, and wrote to check for a penicillin allergy.
But here is the rub. Guidelines in medicine are meant to apply to the vast majority of patients, without specific regard for needs, fears, prior patient experiences, cost, and many other factors. Many present-day guidelines are provided by “experts,” that is, they represent opinions that are eminence-based, not evidence-based medicine. So this very tight, superior alignment of the AI with guidelines is not necessarily a good thing, despite the claims of great “precision.” Frankly, over-adherence to guidelines may presage the loss of the art of medicine, not taking in the human factor of each patient, and the human-to-human bond that would be the foundation of the patient-doctor relationship.
Where is this headed? The large language models (LLMs) will keep getting better. In fact, the ones used in these 2 reports are already obsolete. In the agentic AI era, there could literally be hundreds of specialized agents, such as one for labs, one for scans, one for sensors, one for environmental exposures, one for genetics/genomics, and so on. Today, with these new papers, we are only seeing the rudimentary use of agents.
You can think of MIRA and AMIE as a major step forward within the constraints of a simulation, not real medicine. But the improvements in AI’s capabilities are coming fast, and it would not be surprising to see some of the benefits here extended to the actual practice of medicine. To prove that, we ideally will need randomized trials of 3 strategies to assess outcomes: (1) end-to-end medical AI; (2) human clinicians only; and (3) combining both. That won’t happen soon since we are still on a steep slope of model improvement and such a large trial would not be easy to get funded, much less executed. In the meantime, we’ve got new evidence for potential ways that generative AI will improve medical communication, diagnosis, and management.
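For readers curious what allocating patients to those 3 strategies could look like mechanically, here is a minimal sketch of permuted-block randomization, a standard way to keep trial arms balanced. The arm labels, the `block_randomize` function, the block size of 3, and the fixed seed are all assumptions for illustration; no such trial has been designed or described in either paper.

```python
import random

# Hypothetical arm labels for the 3 strategies discussed above.
ARMS = ["AI only", "clinicians only", "AI plus clinicians"]


def block_randomize(n_participants, arms=ARMS, seed=0):
    """Assign participants to arms in shuffled blocks of len(arms).

    Every complete block contains each arm exactly once, so arm sizes
    never differ by more than one participant.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    allocation = []
    while len(allocation) < n_participants:
        block = list(arms)
        rng.shuffle(block)
        allocation.extend(block)
    return allocation[:n_participants]


schedule = block_randomize(9)
for participant, arm in enumerate(schedule, start=1):
    print(participant, arm)
```

A real trial would also need allocation concealment, stratification by site or condition, and pre-specified outcomes, none of which this sketch attempts.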
NB This post was written by me, no AI. I have no COI related to the content of the post.
A big thanks to Ground Truths subscribers from every US state and 212 countries. Your subscription to these free essays and podcasts makes my work in putting them together worthwhile. If you’re not a subscriber, please join!
If you found this interesting PLEASE share it!
Paid subscriptions are voluntary and all proceeds from them go to support Scripps Research. They do allow for posting comments and questions, which I do my best to respond to. Please don’t hesitate to post comments and give me feedback. Let me know topics that you would like to see covered.
Many thanks to those who have contributed—they have greatly helped fund our summer internship programs for the past two years. It enabled us to accept and support a record number of 51 summer interns coming in 2026! These are high school, college and medical students selected from thousands of applicants. We couldn’t do this expanded program without the funds coming in through Ground Truths.