Performance of Algorithms Submitted in the 2023 RSNA Screening Mammography Breast Cancer Detection AI Challenge.

Document Type

Article

Publication Date

8-1-2025

Keywords

JMG, Humans, Female, Breast Neoplasms, Mammography, Middle Aged, Algorithms, Aged, Artificial Intelligence, Australia, Early Detection of Cancer, United States, Sensitivity and Specificity, Breast, Radiographic Image Interpretation, Computer-Assisted

JAX Source

Radiology. 2025;316(2):e241447.

ISSN

1527-1315

PMID

40793948

DOI

https://doi.org/10.1148/radiol.241447

Abstract

Background: The 2023 RSNA Screening Mammography Breast Cancer Detection AI Challenge invited participants to develop artificial intelligence (AI) models capable of independently interpreting mammograms. Purpose: To assess the performance of the submitted algorithms, explore the potential for improving performance by combining the best-performing AI algorithms, and investigate how performance was influenced by the demographic and clinical characteristics of the evaluation cohort. Materials and Methods: A total of 1687 AI algorithms were submitted from November 2022 to February 2023. Of these, 1537 algorithms were assessed using an evaluation dataset from two sites, one in the United States and one in Australia. Cancer cases were identified at screening and confirmed with pathologic examination; noncancer cases were followed up for at least 1 year. Results for ensemble models of top algorithms were computed by recalling a case when any of the included algorithms indicated recall. Odds ratios (ORs) were used to investigate differences in AI performance when the dataset was stratified by clinical or demographic characteristics. Results: The evaluation dataset consisted of 5415 women (median age, 59 years [IQR, 52–66 years]). Among the 1537 AI algorithms, the median recall rate, sensitivity, specificity, and positive predictive value (PPV) were 1.7%, 27.6%, 98.7%, and 36.9%, respectively. For the top-ranked algorithm, the recall rate, sensitivity, specificity, and PPV were 1.5%, 48.6%, 99.5%, and 64.6%, respectively. Ensemble models of the top 3 and top 10 algorithms had a sensitivity of 60.7% and 67.8%, respectively; the corresponding recall rates were 2.4% and 3.5%, and the corresponding specificities were 98.8% and 97.8%. Lower sensitivity was observed for the U.S. dataset than for the Australian dataset (top 3 ensemble model: 52.0% vs 68.1%; OR = 0.51; P = .02), and greater sensitivity was observed for invasive cancers than for noninvasive cancers (top 3 ensemble model: 68.0% vs 43.8%; OR = 2.73; P = .001). Conclusion: The different AI algorithms identified different cancers during screening mammography, and ensemble models had increased sensitivity while maintaining low recall rates.
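The ensemble rule and the summary metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`ensemble_recall`, `screening_metrics`, `odds_ratio_from_rates`) and the toy decisions are hypothetical, and the odds-ratio helper takes the plain ratio of odds of two rates, which may differ from how the study estimated its ORs. The sketch assumes every metric denominator is nonzero.

```python
from typing import Sequence


def ensemble_recall(decisions: Sequence[Sequence[bool]]) -> list[bool]:
    """Combine per-algorithm recall decisions with an "any" rule.

    decisions[a][c] is True when algorithm a recalls case c. A case is
    recalled by the ensemble when at least one algorithm recalls it.
    """
    return [any(per_case) for per_case in zip(*decisions)]


def screening_metrics(recalled: Sequence[bool],
                      has_cancer: Sequence[bool]) -> dict[str, float]:
    """Recall rate, sensitivity, specificity, and PPV from binary outcomes."""
    tp = sum(r and c for r, c in zip(recalled, has_cancer))
    fp = sum(r and not c for r, c in zip(recalled, has_cancer))
    fn = sum(not r and c for r, c in zip(recalled, has_cancer))
    tn = sum(not r and not c for r, c in zip(recalled, has_cancer))
    return {
        "recall_rate": (tp + fp) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


def odds_ratio_from_rates(p_a: float, p_b: float) -> float:
    """Ratio of the odds p/(1-p) of two rates (illustrative assumption)."""
    return (p_a / (1 - p_a)) / (p_b / (1 - p_b))


# Toy data (hypothetical): 8 cases, the first 3 are cancers.
has_cancer = [True, True, True, False, False, False, False, False]
algo_a = [True, False, False, False, False, False, False, False]
algo_b = [False, True, False, False, True, False, False, False]

combined = ensemble_recall([algo_a, algo_b])
metrics = screening_metrics(combined, has_cancer)

# Sensitivities reported in the abstract for the top 3 ensemble model.
or_site = odds_ratio_from_rates(0.520, 0.681)      # U.S. vs Australian
or_invasive = odds_ratio_from_rates(0.680, 0.438)  # invasive vs noninvasive
```

In the toy data each algorithm finds a different cancer, so the ensemble detects two of three cancers where each algorithm alone detects one, at the cost of one extra false-positive recall. Applied to the rounded sensitivities in the abstract, the plain ratio of odds gives about 0.51 and 2.73, which agrees with the reported ORs.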
