FormalPara Take home message

In this retrospective cohort study of 14,343 patients, seven of 32 previously published prognostic scores fairly predicted 30-day in-hospital mortality using routinely collected clinical and biological data (area under the ROC curve > 0.75). The 4C Mortality Score and the ABCS stand out because they performed as well in our cohort as in their initial validation cohorts, during both the first and subsequent epidemic waves, in younger and older patients alike, and showed satisfactory calibration. Their ability to guide clinical management decisions and appropriate resource allocation should now be evaluated in future studies.

Introduction

Since the end of 2019, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has spread worldwide [1]. At the end of May 2021, there were over 167 million confirmed cases and over 3.4 million deaths from the coronavirus disease 2019 (COVID-19) around the world [2]. Hospital facilities have, thus, faced an unparalleled influx of patients. The evolution of hospitalized patients varies widely, from those requiring no or low-level oxygen support to those progressing to acute respiratory or hemodynamic failure requiring admission to intensive care units (ICU) [3, 4]. Accurate outcome prediction with scores based on patient characteristics (age, sex, comorbidities, clinical state, laboratory and imaging results, etc.) helps optimize healthcare delivery when medical resources are limited [5]. Such scores can also be used to select patients with a homogeneous risk of a given outcome for inclusion in clinical studies.

Various scores have been developed since the beginning of the outbreak, and older ones, routinely used in community-acquired pneumonia and other conditions, have also been tested in the setting of COVID-19. A systematic review updated in July 2020 found 39 published prognostic scores estimating mortality risk in COVID-19 patients and 28 aiming to predict progression to severe or critical disease. All scores were rated at high or unclear risk of bias. Only a few had undergone external validation, with shortcomings including unrepresentative patient sets, small derivation samples and insufficient numbers of outcome events [6]. Moreover, the worldwide applicability of these prediction scores remains an open question: healthcare systems and patient profiles differ between countries [7] and may affect these scores' performance.

The aim of this study was to evaluate the accuracy of published scores to predict in-hospital mortality or ICU admission in SARS-CoV-2-infected patients, using a large multicenter cohort from the Greater Paris University Hospitals (GPUH).

Methods

Study reporting

Our manuscript complies with the relevant reporting guidelines, namely the REporting of studies Conducted using Observational Routinely collected health Data (RECORD) statement [8] and the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement [9]. Completed checklists are available in Appendix 2.

Study design and setting

We conducted a retrospective cohort study using the GPUH’s Clinical Data Warehouse (CDW), an automatically filled database containing data collected during routine clinical care in the GPUH. GPUH is a public institution comprising 39 hospitals (22,474 beds) spread across Paris and its region, accounting for 1.5 million hospitalizations each year (10% of all hospitalizations in France). Data from patients hospitalized for COVID-19 in GPUH were used to evaluate the accuracy of published prognostic scores for COVID-19. Final data extraction was performed on May 8th, 2021. The GPUH’s CDW Scientific and Ethics Committee (IRB00011591) granted access to the CDW for the purpose of this study, and no linkage was made with other databases.

Inclusion and exclusion criteria

The patient selection process is summarized in Fig. 1. All patients with a result found in the database for reverse transcriptase-polymerase chain reaction (PCR) for SARS-CoV-2 in a respiratory sample were screened. Patients were included in the study if they met both of the following criteria:

  • A hospital stay with an International Classification of Diseases, 10th edition (ICD-10) code for COVID-19 (U07.1),

  • At least one positive respiratory PCR for SARS-CoV-2 from 10 days before to 3 days after hospital admission.

Patients were excluded from the study if they met at least one of the following criteria:

  • PCR result considered unreliable (i.e., validation by the biologist recorded before the time of PCR sample collection, or more than 20 days after it),

  • Asymptomatic positive PCR result during a COVID-unrelated hospitalization or COVID considered as hospital-acquired (i.e., a first positive PCR sample collected more than 3 days after hospital admission),

  • Direct ICU admission (i.e., time between recorded hospital admission and recorded ICU admission less than 2 h and no visit in another GPUH hospital in the preceding 24 h),

  • Age < 18 years, not recorded, or unknown,

  • Hospitalization in the Georges Pompidou European hospital, one of the 39 GPUH hospitals (all biological and clinical data from this hospital were missing, due to interoperability issues with the CDW).

To ensure a follow-up of at least 30 days for all hospitalized patients, only patients with a PCR performed before March 30th were considered.

Fig. 1

1. Where validation by a biologist occurred before or more than 20 days after the recorded sample collection date and time. 2. Patients from Georges Pompidou European Hospital were excluded, as all biological and clinical data from this hospital were missing due to interoperability issues with the CDW. 3. Hospitalizations with no ICD-10 code for Covid-19, or with an ICD-10 code for Covid-19 and a first positive PCR sample obtained more than 10 days before or more than 3 days after admission. 4. Hospitalizations for Covid-19 with ICU transfer within 2 hours following hospital admission, and no visit in any other GPUH hospital in the preceding 24 hours

Flow chart of selected patients.

Data collection

The reference date used for baseline characteristics was the date of hospital admission for COVID-19. The following data were collected:

  • Demographic data and data on hospital admission.

  • Medical history (based on ICD-10 codes for current or previous hospital visits; the list of codes used is based on a previously published work [10]).

  • Vital signs and biological values (the first value found in the database from 24 h before to 48 h after hospital admission was retrieved for each patient, as a delay can exist for logistical reasons between true and recorded admission date; values obtained in ICU were not considered).

  • Outcomes (in-hospital mortality, ICU admission and invasive mechanical ventilation within 30 days from admission).

Of note, invasive mechanical ventilation is always performed in ICU in France.

Selection of published scores

The selection of high-quality published scores was performed using “COVID-19 Evidence Alerts” (https://plus.mcmaster.ca/Covid-19/), a service provided by McMaster University in which evidence reports on COVID-19 published in all journals included in MEDLINE are critically appraised for scientific merit based on prespecified criteria (see https://hiru.mcmaster.ca/hiru/InclusionCriteria.html). All studies identified by the “Clinical Prediction Guide” filter were systematically screened by two independent investigators (L.A. and P.S.), and discrepancies were adjudicated by a third investigator (Y.L.). Studies were included if they met all of the following criteria:

  • studies on prognostic scores predicting ICU transfer or in-hospital mortality for patients hospitalized for COVID-19, including scores primarily developed for other purposes prior to the pandemic,

  • meeting all the prespecified criteria for “higher quality” (i.e., generated in one or more sets of real patients; validated in another set of real patients; study providing information on how to apply the prediction guide); or studies excluded from this category only due to the lack of an independent validation cohort, but in which derivation and validation were performed in different samples from the same cohort (split validation),

  • computable with the data collected in the CDW.

The last search in “COVID-19 Evidence Alerts” was performed on April 3rd, 2021. The score selection process and reasons for exclusion are detailed in Appendix 3 and Figure S1, and information on the scores included in the study is provided in Tables S1 and S2.

Statistical analysis

Aberrant values for biological tests and vital signs were handled as described in Table S3. Missing data were handled by multiple imputation (mice function of the mice package, 50 imputed datasets with 15 iterations, predictive mean matching for quantitative variables, after log or square-root transformation when needed to approximate normality), under the missing-at-random hypothesis. Outcome variables were included in the dataset used for imputation. Rubin’s rule was used to pool the estimates obtained in each imputed dataset. The variables used for multiple imputation are detailed in Table S4.
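Rubin's pooling rule can be sketched in a few lines; the following is an illustrative Python version (the study used R's mice and psfmi packages), with a normal approximation for the 95% CI rather than the Barnard–Rubin t-distribution adjustment:

```python
import math

def rubin_pool(estimates, variances):
    """Pool estimates from m multiply imputed datasets with Rubin's rule."""
    m = len(estimates)
    q_bar = sum(estimates) / m                    # pooled point estimate
    u_bar = sum(variances) / m                    # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    t = u_bar + (1 + 1 / m) * b                   # total variance
    se = math.sqrt(t)
    return q_bar, t, (q_bar - 1.96 * se, q_bar + 1.96 * se)  # approximate 95% CI

# Hypothetical example: an AUC estimated in three imputed datasets
auc, var, ci = rubin_pool([0.79, 0.80, 0.78], [0.0001, 0.0001, 0.0001])
```

The between-imputation term inflates the pooled variance, so confidence intervals widen as imputations disagree more, reflecting the uncertainty introduced by missing data.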

For each score included in the analysis and each outcome, discrimination was assessed by drawing a receiver operating characteristic (ROC) curve and computing the corresponding area under the curve (AUC). DeLong’s method [11] was used to estimate the variance in each dataset; results were pooled with Rubin’s rule and used to compute pooled 95% confidence intervals.
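For a single dataset, the AUC and its DeLong variance can be computed directly from the structural components of the Mann–Whitney statistic. A minimal Python sketch (illustrative only; the study used R's pROC package):

```python
import numpy as np

def delong_auc(pos, neg):
    """AUC and its DeLong variance, given scores of positive and negative cases."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    m, n = len(pos), len(neg)
    # psi[i, j] = 1 if pos[i] > neg[j], 0.5 if tied, 0 otherwise
    psi = (pos[:, None] > neg[None, :]).astype(float) \
        + 0.5 * (pos[:, None] == neg[None, :])
    auc = psi.mean()
    v10 = psi.mean(axis=1)    # structural components over positive cases
    v01 = psi.mean(axis=0)    # structural components over negative cases
    var = v10.var(ddof=1) / m + v01.var(ddof=1) / n
    return auc, var

# Hypothetical risk scores (higher = higher predicted risk of death)
auc, var = delong_auc(pos=[0.8, 0.6, 0.9], neg=[0.3, 0.5, 0.7])
# auc ≈ 0.889
```

The square root of the variance gives the standard error used to build the per-dataset confidence intervals before pooling.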

First, we assessed the performance of each score to predict the available outcome closest to the one used in the original study, with the adaptations required for computation with the available data. AUCs in our cohort and in previously published studies were compared using a Z-test for independent samples. Second, we assessed the performance of each score to predict 30-day in-hospital mortality and the composite of 30-day in-hospital mortality or ICU transfer. Third, we used a Z-test for paired data following DeLong’s method [11] to compare the accuracy of scores with an AUC > 0.75 to predict 30-day in-hospital mortality. Sensitivity analyses were conducted on subgroups of age (≤ 65 or > 65 years old) or wave of admission (before or after June 15th, 2020, a graphically determined threshold), considering only complete cases (i.e., only patients with all data available to compute a given score), and considering the area under the precision-recall curve instead of the ROC curve (pr.curve function of the PRROC package). Heterogeneity of AUC between subgroups was assessed using an interaction term between the score and the grouping variable in a logistic regression model predicting the outcome.
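The Z-test for two AUCs estimated on independent samples reduces to a simple formula. A stdlib-only Python sketch, with hypothetical AUC and standard-error values (not figures from the study):

```python
import math

def z_test_independent_auc(auc1, se1, auc2, se2):
    """Two-sided Z-test comparing AUCs estimated on independent samples."""
    z = (auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2)
    # two-sided p value from the standard normal CDF, via the error function
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical: AUC 0.793 (SE 0.005) in one cohort vs 0.767 (SE 0.010) in another
z, p = z_test_independent_auc(0.793, 0.005, 0.767, 0.010)
```

Because the two cohorts share no patients, the variances simply add; the paired DeLong comparison used within our cohort instead accounts for the covariance between two scores computed on the same patients.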

Post hoc analyses were performed to further characterize the scores that best predicted 30-day in-hospital mortality (AUC > 0.75). Calibration curves were drawn by plotting the observed mortality rate in each class as a function of the predicted probability of mortality, with patients grouped by deciles of predicted probability. For each score, a logistic regression model using its predictors was built to predict 30-day in-hospital mortality and fitted on our data. Variable importance was determined using the absolute value of the t-statistic for each predictor in this model (varImp function of the caret package). Revised calibration curves were drawn using the probabilities predicted by these logistic regression models refitted on our data.
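Grouping patients by deciles of predicted probability to obtain calibration points can be sketched as follows (an illustrative Python version run on synthetic data; the study used R):

```python
import numpy as np

def calibration_by_decile(pred_prob, outcome):
    """Mean predicted probability and observed event rate per decile of risk."""
    pred_prob = np.asarray(pred_prob, float)
    outcome = np.asarray(outcome, int)
    order = np.argsort(pred_prob)
    deciles = np.array_split(order, 10)      # 10 groups of (almost) equal size
    points = []
    for idx in deciles:
        points.append((pred_prob[idx].mean(),  # mean predicted probability
                       outcome[idx].mean()))   # observed mortality rate
    return points  # plotted points; the diagonal = perfect calibration

# Synthetic example: a well-calibrated score on simulated patients
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 5000)        # predicted probabilities
y = rng.random(5000) < p           # simulated outcomes matching those probabilities
points = calibration_by_decile(p, y)
```

A score that overestimates risk, as most scores did in our later waves, yields points lying below the diagonal: the observed rate in each decile is lower than the mean predicted probability.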

All tests are two-sided, and a p value < 0.05 was considered significant. Continuous variables are reported as mean (standard deviation) when normally distributed and as median [interquartile range] otherwise. Binary variables are reported as number of patients with a positive result (percentage). Analyses were performed using R software version 4 (packages mice, pROC, psfmi, Amelia, PRROC, caret).

Results

Baseline characteristics and outcomes of patients included in the study

We included 14,343 patients in the validation cohort (Fig. 1). The first hospital admission for COVID-19 occurred on January 29th, 2020 and the last on April 6th, 2021. Patients’ baseline characteristics are summarized in Table 1 and outcomes in Table 2. Baseline characteristics appeared similar during the first and subsequent waves (Table S5). The initial care site appeared to be an important determinant of missing vital signs and biological values (Table S6); multiple imputation was therefore stratified by center. In-hospital mortality at day 30 was 18% overall, significantly higher during the first wave than during subsequent waves, and significantly higher in patients older than 65 years (Figure S2, p < 0.001 for log-rank test).

Table 1 Baseline characteristics of patients included in the study
Table 2 Outcomes of patients included in the study

Selected scores and their performance to predict the original outcome

Thirty-two scores [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37] were included in the study: 23 were specifically derived in COVID-19 patients and 9 were pre-existing scores developed for other purposes and tested in COVID-19 patients (Table 3, Tables S1 and S2, Appendix 3). Among the 27 scores with a 95% CI available in previous reports to estimate the AUC variance, 19 (70%) had a significantly lower AUC in our cohort (Table 3). The 4C Mortality Score was the only one with a significantly higher AUC in our cohort than the previously published value (p < 0.001).

Table 3 Summary of scores included in the study and comparison to previously published data

Performance to predict 30-day in-hospital mortality and the composite of 30-day in-hospital mortality or ICU admission

Results are summarized in Table S7, and Figure S3 shows the ROC curves of the three most accurate scores for each outcome. None of the included scores had very high accuracy for predicting 30-day in-hospital mortality alone, or the composite of 30-day in-hospital mortality or ICU admission (all AUC < 0.8). The AUC was higher for predicting 30-day in-hospital mortality alone than for the composite outcome for 25/32 scores (78%).

Seven scores had an AUC > 0.75 to predict 30-day in-hospital mortality (Table 4). The 4C Mortality Score and the ABCS had the highest AUCs to predict 30-day in-hospital mortality (4C Mortality Score: 0.793, 95% CI 0.783–0.803; ABCS: 0.790, 95% CI 0.780–0.801). Their AUCs did not differ significantly from each other (p = 0.61) but were significantly higher than those of the remaining scores (p < 0.01 for all comparisons). The CORONATION-TR score had the highest AUC to predict 30-day in-hospital mortality or ICU admission (AUC 0.724, 95% CI 0.714–0.733). Table S8 provides the sensitivities and specificities of these scores to predict in-hospital mortality using cut-off values from previous reports, and Figure S4 shows the Kaplan–Meier curves for in-hospital mortality for the three best-performing scores.

Table 4 Detailed characteristics of scores with an AUROC > 0.75 to predict 30-day in-hospital mortality in the analysis using multiple imputed data

Sensitivity and post hoc analyses

Among the seven scores with an AUC > 0.75 to predict 30-day in-hospital mortality, accuracy was not significantly altered by the wave of admission for any of them (Table S9). Accuracy was significantly lower in the subgroup of patients > 65 years old for two of them (RISE-UP and COVID-19 SEIMC; Table S10). The AUC was < 0.75 in the complete-case analysis for one of them (CORONATION-TR; Table S7). The 4C Mortality Score ranked first to predict in-hospital mortality in both the analyses using multiply imputed data and those using complete cases (Table S7).

The main results were unchanged when using the area under the precision-recall curve instead of the ROC curve to measure discriminative ability: the 4C Mortality Score and the ABCS ranked first and second to predict 30-day in-hospital mortality, and the CORONATION-TR score ranked first to predict 30-day in-hospital mortality or ICU transfer (Table S11).

As shown by the calibration curves (Figure S5), the risk of 30-day in-hospital mortality was overestimated by 6/7 scores (all but the CORONATION-TR), most notably by the COVID-GRAM and ANDC scores. Overestimation was overall less pronounced during the first epidemic wave than during subsequent waves (Figure S5) and was corrected after revision of the logistic coefficients (Figure S6).

In the variable importance analysis, age was the most important factor to predict 30-day in-hospital mortality in 5 scores (4C Mortality, ANDC, CORONATION-TR, COVID-GRAM, RISE-UP), troponin positivity in 1 score (ABCS), and low estimated glomerular filtration rate in 1 score (COVID-19 SEIMC) (Figure S7).

Discussion

Key results

Most scores (19/27 with data available for comparison) had significantly lower accuracy in our study than in previously published studies, and most scores (25/32) had lower accuracy for predicting the composite outcome of 30-day in-hospital mortality or ICU admission than for 30-day in-hospital mortality alone. Seven scores had high accuracy (AUC > 0.75) for the prediction of 30-day in-hospital mortality: the 4C Mortality and ABCS scores had significantly higher AUC values than the other scores; the CORONATION-TR score was the most accurate to predict in-hospital mortality or ICU admission; the RISE-UP and COVID-19 SEIMC scores were less accurate in the subgroup of patients > 65 years old. The discriminative performance of these scores was not altered by the wave of admission, despite changes in clinical care such as the wider use of corticosteroids and lower use of invasive ventilation during the subsequent waves. In contrast, calibration was poorer during the second and subsequent waves than during the first.

Limitations and strengths

We conducted a large, multicenter, independent study to validate systematically selected prognostic scores for COVID-19, using routine clinical care data. The selection criteria were chosen to identify the most promising scores, although many of them had not yet been externally validated or had been validated only in small cohorts. The outcomes used in our study (in-hospital mortality, ICU admission and invasive mechanical ventilation) are of high clinical importance, objective, and reliably collected in the CDW.

The main limitations of our study are consequences of its retrospective design, with a risk of selection and information bias. Selection bias was controlled using objective and reproducible inclusion and exclusion criteria, based on both administrative (ICD-10 codes for COVID-19) and microbiological (PCR for SARS-CoV-2) information. This information is exhaustively recorded in the database, as ICD-10 codes for all hospital stays are independently assessed by a trained physician or technician before transmission to the national health insurance service for billing. Information bias for comorbidities and medical history was controlled by collecting ICD-10 codes for both index and previous visits, using a systematic procedure that was independently validated in a medico-administrative database whose structure is similar to ours [10]. Missing physiological values, such as oxygen saturation or respiratory rate, are explained by the multiplicity of templates available to record them in electronic health records; only a limited number of these templates are used to gather and aggregate these data in the CDW. Missing biological values, such as D-dimers, CRP or ferritin, are explained by unstandardized practices across GPUH hospitals. As a result, the rate of missing values varied across centers for physiological and biological variables (see Table S6), and was high for several important variables such as the Glasgow coma scale. To control for these biases, we used multiple imputation under the missing-at-random hypothesis [38], taking centers into account, and performed a confirmatory sensitivity analysis using complete cases only.

Several scores, based on machine- or deep-learning algorithms, or using data rarely collected during the initial evaluation of patients in clinical practice (such as myoglobin or interleukins), could not be computed in our cohort (see Appendix 3). Although the discriminative performance of many of them seemed high in previous studies, their use in clinical practice is more difficult, as it would require changing protocols for patients’ initial evaluation to add costly biological tests and, for machine- or deep-learning-based algorithms, setting up an automated system for computation. Further prospective pragmatic studies are needed on these matters.

Interpretation and generalizability

Our cohort includes patients from Paris and its suburbs, with various ethnicities and socioeconomic backgrounds [39]. Patients are treated in various hospitals, each with different resources and practices. Our validation study is strengthened by the number and diversity of included patients and settings, and by its independence from all cohorts used for the derivation and first validation of the investigated prognostic scores. Patients were consecutively recruited, and the number of outcome events was very large, overcoming two major shortcomings of previous validation studies. For example, several included scores were previously validated in fewer than 100 patients (Table 3). The waste of time and money caused by inappropriately designed or validated COVID-19 prognostic scores has been stressed in a living systematic review [6].

Using a cut-off value of 0.75 for AUC to predict in-hospital death, seven scores were identified as having a high accuracy. They differ in characteristics that may influence their choice for a given use in a given clinical context. For example, some scores use costly biological tests and are not appropriate for countries with limited resources; some use many variables and may be hard to compute at the bedside; some are less accurate in older patients; some are more accurate to predict ICU admission and therefore more suitable to predict the demand on healthcare systems. For the seven fairly accurate scores identified, we provide detailed characteristics that can help clinicians choose the best suited to their needs (Table 4). The 4C Mortality and ABCS scores appear to be the most promising ones, as they use a limited number of variables that are available in routine clinical care, had a fair accuracy in our external validation study, performed equally well during the first epidemic wave and subsequent waves, and in younger and older patients.

The risk of 30-day in-hospital mortality was overestimated by 6/7 scores (all but the CORONATION-TR), and more so during the second and subsequent waves. This can be explained by overall better outcomes during these later waves, as seen in our study and in others [40]. Many published scores were derived and validated on first-wave data. Revising the scores using local and current data is necessary if accurate estimates of the mortality risk are needed. Likewise, the thresholds indicating a high risk of poor outcome should be defined locally.

In the variable importance analysis, age was the most influential factor in 5/7 scores, even in those including many clinical and biological variables (for example, the CORONATION-TR score), underlining the importance of age in driving severity among hospitalized COVID-19 patients. Elevated baseline troponin was the most important factor in the ABCS, which showed good discrimination and calibration in our cohort. Troponin has previously been shown to be independently associated with mortality in both non-ICU [41] and ICU [42] patients, stressing its potential relevance for risk stratification at the bedside.

The place these scores could have in guiding therapeutic strategies remains to be determined. Their most promising use may be as a tool to guide hospital admission, in the context of a pandemic with high demand for and low supply of hospital beds, especially in low-income countries [43, 44]. Further studies should be conducted on this important issue.

Scores specifically derived for COVID-19 outperformed generic scores for infectious pneumonia or sepsis. This highlights the specificity of COVID-19 compared with other forms of pneumonia, with a key role for the inflammatory and pro-thrombotic status in driving severity [45,46,47]. However, given their simplicity of use and their good performance to predict in-hospital mortality in our cohort, scores such as CURB-65 or A-DROP could still be considered for risk stratification in COVID-19 patients. In contrast, sepsis scores such as qSOFA or SIRS seemed to offer no clear benefit for risk stratification. Their low specificity can be explained by the limited number of factors used for initial evaluation: many patients present with abnormal vital signs or white blood cell counts, and these factors alone are insufficient to identify patients at high risk of critical illness. Their low sensitivity can be explained by the fact that patients truly at risk of critical illness (particularly the elderly or patients with multiple comorbidities) may initially appear clinically stable before worsening suddenly and dramatically.

Accuracy was lower in our cohort for predicting ICU admission than for predicting in-hospital mortality, even for scores specifically aimed at this endpoint. This could partly be explained by the complexity of ICU admission criteria, which may differ across countries according to local guidelines and demography, and may vary over time with the pressure on ICU beds [48]. In France, for example, during the first wave of the pandemic, some patients with invasive mechanical ventilation urgently initiated in the emergency room or in general wards could not be transferred to the hospital’s own ICU due to a shortage of beds, and were transferred to other hospitals, either in the Paris region or in other regions [49].

In conclusion, several scores using routinely collected clinical and biological data have fair accuracy to predict in-hospital death. The 4C Mortality Score and the ABCS stand out because they performed as well in our cohort as in their initial validation cohorts, during both the first and subsequent epidemic waves, and in younger and older patients alike. Their use to guide appropriate clinical care and resource utilization should be evaluated in future studies.