From breadth to depth in clinical artificial intelligence evaluation

A large-scale benchmark of 87 clinical text tasks across nine languages reveals just how far large language models remain from mastering real-world medical records, and raises the question of what comes next.

Division of Neurology, University of Alberta, Edmonton, Alberta, Canada

Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA

Harvard Combined Dermatology Program, Harvard Medical School, Boston, MA, USA

Department of Dermatology, Mass General Brigham, Boston, MA, USA

Correspondence to Liam G. McCoy.

L.G.M. and D.W. report paid consulting for Meta Platforms via Magnit Global.

McCoy, L.G., Wu, D. From breadth to depth in clinical artificial intelligence evaluation. Nat. Biomed. Eng. (2026). https://doi.org/10.1038/s41551-026-01691-x

Version of record: 24 June 2026

DOI: https://doi.org/10.1038/s41551-026-01691-x