This systematic review assesses the predictive performance and clinical outcomes of Artificial Intelligence (AI) and Machine Learning (ML)-based triage systems in Emergency Departments (EDs). Analyzing 14 retrospective observational studies published between 2021 and 2026, the review finds that AI/ML models show moderate to excellent retrospective predictive performance for various ED outcomes, including mortality and critical illness, particularly with ensemble tree-based and Natural Language Processing (NLP)-enhanced approaches. However, the evidence base is critically limited by an overreliance on heterogeneous retrospective designs, insufficient reporting of model calibration, scarce external validation, and an absence of prospective validation. Consequently, the authors conclude that the evidence for clinical applicability remains weak, and they emphasize the urgent need for rigorous prospective validation, comprehensive calibration reporting, and randomized controlled trials measuring patient-centered outcomes before widespread clinical implementation.
Emergency departments (EDs) face significant challenges, including increasing patient volumes and overcrowding, which traditional, clinician-judgment-based triage systems struggle to manage due to inter-observer variability and limited predictive accuracy. Artificial Intelligence (AI) and Machine Learning (ML) are emerging as promising tools to enhance triage by analyzing complex healthcare data (e.g., EHRs, vital signs, clinical notes) to predict critical outcomes like admission, critical illness, and mortality. Despite their potential, the translation of AI/ML models into routine care is hindered by issues such as dataset bias, lack of external validation, and limited evidence of real-world effectiveness. This systematic review aims to synthesize existing evidence on AI/ML-based ED triage systems, focusing on their predictive performance and clinical outcomes, while critically appraising methodological gaps to inform future development and cautious implementation.
This section details the systematic review's methodology and presents its findings.
The systematic review adhered to PRISMA 2020 guidelines, using the PICOS framework to define eligibility criteria. A comprehensive literature search was conducted across PubMed, Scopus, CINAHL, IEEE Xplore, and Web of Science for publications between January 2021 and December 2026. Two independent reviewers screened titles, abstracts, and full texts, resolving disagreements through discussion. Data extraction utilized a piloted, standardized form, capturing study characteristics, AI/ML model types, input variables, outcomes, validation strategies, and performance metrics. Due to substantial clinical and methodological heterogeneity in populations, outcomes, triage systems, and model designs, a narrative synthesis approach was adopted instead of meta-analysis. The Prediction Model Risk of Bias Assessment Tool (PROBAST) was used to assess the methodological quality and applicability concerns of included studies across four domains: participants, predictors, outcomes, and analysis.
The search identified 1,847 records, with 14 retrospective observational studies (published 2021-2026) from eight countries ultimately meeting the inclusion criteria after a rigorous screening process. The study selection process is illustrated in the PRISMA flow diagram (Figure 1).
All 14 included studies used retrospective observational designs, primarily drawing on single- or multi-center Electronic Health Record (EHR) data from EDs, with sample sizes ranging from 657 to over 2.6 million patient visits. The AI/ML models spanned a wide spectrum of approaches, including supervised machine learning classifiers such as Random Forest (RF), XGBoost, Logistic Regression (LR), Decision Trees (DTs), Support Vector Machines (SVMs), and various other ensemble methods (e.g., CatBoost, AdaBoost). Natural Language Processing (NLP) and Large Language Models (LLMs) were used in four studies to extract features from unstructured triage notes or audio transcripts. Prediction targets were diverse, encompassing triage level assignment, hospital admission, ICU admission, mortality (2-, 7-, and 30-day), ED disposition, short ED length of stay, and the need for critical interventions. Most studies reported internal validation, with several also performing external or temporal validation. Explainability methods, such as SHAP values, were used in some studies to enhance interpretability.
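As a rough illustration of how temporal validation differs from a random internal split, the sketch below holds out the most recent visits by arrival date, so the model is evaluated on patients seen after the training period. The record layout (`arrival`, `admitted`), the cutoff date, and the function name `temporal_split` are hypothetical and are not taken from any included study.

```python
from datetime import date


def temporal_split(visits, cutoff):
    """Split visit records into a training set (before cutoff) and a
    temporal validation set (on or after cutoff)."""
    train = [v for v in visits if v["arrival"] < cutoff]
    test = [v for v in visits if v["arrival"] >= cutoff]
    return train, test


# Hypothetical toy records; real studies would use EHR extracts.
visits = [
    {"arrival": date(2021, 3, 1), "admitted": 0},
    {"arrival": date(2021, 9, 15), "admitted": 1},
    {"arrival": date(2022, 2, 10), "admitted": 0},
    {"arrival": date(2022, 6, 30), "admitted": 1},
]

train, test = temporal_split(visits, cutoff=date(2022, 1, 1))
print(len(train), "training visits;", len(test), "temporal validation visits")
```

External validation goes one step further by evaluating on data from a different institution or population, which a date-based split within one dataset cannot reproduce.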
Predictive performance, primarily measured by the Area Under the Receiver Operating Characteristic curve (AUC-ROC), ranged from moderate to excellent (0.642 to 0.991) across studies. AUC-ROC was high for mortality prediction (0.874-0.933 for 2- to 30-day mortality) and pediatric critical illness (0.991). For ED disposition outcomes, ensemble models achieved AUC-ROCs around 0.90. Accuracy varied, with one study reporting 88.9% for ED disposition prediction. Sensitivity and specificity also varied by outcome. A significant limitation was the infrequent reporting of calibration: only three studies provided calibration metrics, which are essential for judging model reliability. Positive and Negative Predictive Values (PPV/NPV) were reported for some outcomes, with high NPVs (e.g., 0.93 for hospital admission) suggesting potential for ruling out adverse events.
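Because PPV and NPV depend on outcome prevalence as well as on sensitivity and specificity, the same model can look very different across outcomes. The sketch below applies Bayes' rule to show that dependence. The function name `predictive_values` and all input values are illustrative assumptions, not figures from the included studies.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test given its sensitivity,
    specificity, and the outcome prevalence (all in [0, 1])."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")

    # Expected proportions of each confusion-matrix cell.
    tp = sensitivity * prevalence
    fn = (1.0 - sensitivity) * prevalence
    tn = specificity * (1.0 - prevalence)
    fp = (1.0 - specificity) * (1.0 - prevalence)

    if tp + fp == 0 or tn + fn == 0:
        raise ValueError("PPV or NPV is undefined for these inputs")

    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv


# Illustrative inputs only: the same test at a common and a rare outcome.
for prevalence in (0.10, 0.01):
    ppv, npv = predictive_values(0.80, 0.90, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

With sensitivity and specificity held fixed, PPV falls as the outcome becomes rarer while NPV rises, which is one reason a high NPV alone says little about performance on rare events.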
While most studies focused on predictive performance, evidence of direct clinical impact was limited. Yu et al. demonstrated generalizable two-day mortality prediction with an interpretable score model across multiple hospitals. Tsai et al. showed that an AI ECG recommendation system significantly reduced missed ECGs and improved 48-hour detection of clinically actionable arrhythmias, a rare instance of direct care process improvement. Elhaj et al. noted increased high-acuity detection and fast prediction times, suggesting real-time deployment feasibility. However, no studies reported prospective interventional trials or randomized controlled testing measuring patient-centered outcomes like mortality reduction, reduced waiting times, or improved ED crowding metrics, highlighting a gap between predictive capability and proven clinical benefit.
Using the PROBAST tool, most studies exhibited a low risk of bias in the participant, predictor, and outcome domains. However, the analysis domain showed considerable variation: 11 studies were rated low risk, two as high risk (due to issues like insufficient missing data handling or lack of calibration reporting), and two as unclear risk. Overall, 10 studies had a low risk of bias, two were high risk, and two were unclear risk. These findings indicate that while many AI/ML triage systems were developed with sound participant and outcome methods, statistical analysis often presented limitations, raising concerns about the reliability and potential overfitting of some models.
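For readers unfamiliar with how PROBAST domain ratings combine, the sketch below encodes the tool's published rule for the overall judgment: any high-risk domain makes the study high risk, otherwise any unclear domain makes it unclear, and only a study rated low in all four domains is low risk overall. The summary above does not state how the review derived its overall ratings, so treating this rule as the one applied is an assumption. The dictionary keys and the function name `overall_risk` are hypothetical.

```python
DOMAINS = ("participants", "predictors", "outcomes", "analysis")


def overall_risk(ratings):
    """Combine per-domain ratings ('low', 'high', 'unclear') into an
    overall risk-of-bias judgment using the worst-domain rule."""
    values = [ratings[d] for d in DOMAINS]
    for v in values:
        if v not in ("low", "high", "unclear"):
            raise ValueError(f"unexpected rating: {v!r}")
    if "high" in values:
        return "high"
    if "unclear" in values:
        return "unclear"
    return "low"


# Hypothetical study with a weak analysis domain.
example = {"participants": "low", "predictors": "low",
           "outcomes": "low", "analysis": "high"}
print(overall_risk(example))
```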
This section synthesizes the key findings and addresses methodological gaps and their implications.
AI/ML models exhibit moderate to excellent discrimination for ED outcomes, with AUC-ROC values often exceeding 0.90. However, the considerable heterogeneity across studies in model types, prediction targets, and validation methods makes direct comparison and meta-analysis difficult. High AUC values require cautious interpretation: unless complemented by other metrics, they can mask overfitting or poor performance on clinically severe but rare outcomes. The retrospective evidence identified no direct patient-level benefits, such as reduced mortality or improved patient flow, indicating that these technologies remain at an early validation stage.
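One reason AUC-ROC needs complementary metrics is that it measures ranking only. The sketch below computes AUC as the probability that a randomly chosen positive case is scored above a randomly chosen negative case (ties count as one half) and shows that rescaling every score leaves the AUC unchanged, even though the rescaled values are no longer plausible probabilities. The function name `auc_roc` and the toy data are illustrative assumptions.

```python
def auc_roc(y_true, y_score):
    """AUC-ROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Toy labels and scores (illustrative only).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Dividing every score by 10 preserves the ranking, so AUC is identical.
shrunk = [s / 10 for s in y_score]

print(auc_roc(y_true, y_score), auc_roc(y_true, shrunk))
```

The pairwise loop is quadratic in the number of cases, which is fine for a sketch but not for datasets with millions of visits.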
A critical deficiency is the widespread lack of calibration reporting: only 3 of the 14 studies reported calibration, even though calibration is fundamental to ensuring that predicted probabilities accurately reflect observed event frequencies in clinical decision-making. Similarly, external validation on separate datasets was rare (only 5 studies), limiting confidence in the generalizability of models across diverse real-world settings. Internal validation alone often overestimates performance, and significant performance decay is common when models are deployed in new environments.
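To make the notion of calibration concrete, the sketch below computes two common summaries: the Brier score (mean squared difference between predicted probability and observed outcome) and a binned table comparing mean predicted risk with observed event rate. Equal-width bins are one of several possible choices and are an assumption here, as are the function names and the toy data. None of this reproduces the method of any included study.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_table(y_true, y_prob, n_bins=5):
    """Return (count, mean predicted risk, observed event rate) for each
    non-empty equal-width probability bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        # A probability of exactly 1.0 falls into the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((y, p))

    rows = []
    for members in bins:
        if members:
            count = len(members)
            mean_pred = sum(p for _, p in members) / count
            observed = sum(y for y, _ in members) / count
            rows.append((count, mean_pred, observed))
    return rows


# Toy outcomes and predicted risks (illustrative only).
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.8, 0.9]

print("Brier score:", round(brier_score(y_true, y_prob), 3))
for count, mean_pred, observed in calibration_table(y_true, y_prob, n_bins=2):
    print(f"n={count}  mean predicted={mean_pred:.2f}  observed rate={observed:.2f}")
```

In a well-calibrated model the mean predicted risk and the observed event rate agree closely within each bin. Large gaps indicate that the predicted probabilities cannot be taken at face value, whatever the AUC.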
The review identified a diverse range of AI/ML approaches, from traditional classifiers to ensemble methods, neural networks, NLP, and LLMs. While some studies reported that ensemble or hybrid (NLP-enhanced) models outperformed simpler classifiers internally, the profound heterogeneity across studies prevents robust cross-study comparisons of model classes. NLP and LLMs show promise in extracting nuanced clinical information from unstructured text, potentially improving predictive performance, but practical implementation challenges and potential biases need careful consideration.
Only a few studies went beyond predictive metrics to report clinical impact. The most notable demonstrated that an AI ECG recommendation system improved care processes by reducing missed ECGs. However, no studies provided evidence from prospective interventional trials or randomized controlled trials on patient-centered outcomes. Interpretability, often addressed through methods like SHAP values or intrinsically transparent models, is crucial for building clinician trust and facilitating adoption. The field is still in an early translational stage, with models often failing during deployment due to poor human-computer interaction or misalignment with clinical workflows.
The PROBAST assessment revealed that while most studies had low risk of bias in the participant, predictor, and outcome domains, the analysis domain was frequently problematic. Common deficiencies included inadequate handling of missing data, absence of calibration assessment, and insufficient validation to guard against overfitting. This highlights the need for future studies to adhere to reporting guidelines like TRIPOD+AI, provide comprehensive reporting of both discrimination and calibration, perform external validation, and clearly detail missing data handling and class imbalance strategies.
This systematic review has several limitations. Substantial heterogeneity among included studies precluded meta-analysis and quantitative pooling of results. All identified studies were retrospective, making them susceptible to inherent biases and limiting causal inferences. Publication bias towards favorable results may exist, though it was not formally assessed. The search was limited to English-language peer-reviewed journals, and the PROBAST assessment involved some subjective judgments. Generalizability is also limited by the predominance of single-center studies from high-income countries and the lack of representation from low- or middle-income settings.
Future research must prioritize rigorous external validation across diverse healthcare systems and comprehensive calibration reporting. Prospective studies, ideally randomized controlled trials, are essential to measure the impact of AI-based triage on patient-centered outcomes like mortality, length of stay, and waiting times. Implementation science should address workflow integration, user acceptance, alert fatigue, and explainability. Cost-effectiveness analyses are also needed to justify investment. Currently, clinical adoption is not fully justified; only parsimonious, interpretable, and externally validated models are likely to succeed in real-world deployment, serving as promising adjuncts rather than replacements for clinician judgment.
AI and ML models demonstrate moderate to excellent predictive performance for various ED triage outcomes, including mortality, ICU admission, disposition, and critical illness. However, drawing robust comparative conclusions about different AI approaches is hindered by significant heterogeneity across studies. Crucial evidence gaps persist, particularly regarding calibration reporting, external validation, and the lack of prospective or randomized studies evaluating actual clinical utility and patient-centered outcomes. Thus, while promising, these technologies require further rigorous research before widespread clinical implementation.