Abstract
Introduction: Unhealthy drinking is prevalent in the United States, and yet it is underidentified and undertreated. Identifying unhealthy drinkers can be time-consuming and uncomfortable for primary care providers. An automated rule for identification would focus attention on patients most likely to need care and, therefore, increase efficiency and effectiveness. The objective of this study was to build a clinical prediction tool for unhealthy drinking based on routinely available demographic and laboratory data.
Methods: We obtained 38 demographic and laboratory variables from the National Health and Nutrition Examination Survey (1999 to 2016) on 43,545 nationally representative adults who had information on alcohol use available as a reference standard. Logistic regression, support vector machines, k-nearest neighbor, neural networks, decision trees, and random forests were used to build clinical prediction models. The model with the largest area under the receiver operating characteristic curve was selected to build the prediction tool.
Results: A random forest model with 15 variables produced the largest area under the receiver operating characteristic curve (0.78) in the test set. The most influential predictors were age, current smoker, hemoglobin, sex, and high-density lipoprotein. The optimum operating point had a sensitivity of 0.50, specificity of 0.86, positive predictive value of 0.55, and negative predictive value of 0.83. Application of the tool reduced the target sample by 75%.
Conclusion: Using commonly available data, a decision tool can identify a subset of patients who seem to warrant clinical attention for unhealthy drinking, potentially increasing the efficiency and reach of screening.
- Alcohol Drinking
- Alcoholism
- Area Under Curve
- Clinical Decision Rules
- Decision Trees
- Logistic Models
- Machine Learning
- Neural Networks (Computer)
- Nutrition Surveys
- Support Vector Machine
Introduction
An estimated 27% of adults in the United States drink alcohol at a level considered unhealthy,1 which is defined as consuming ≥1 drink per day for women or ≥2 for men or binge drinking (consuming ≥4 drinks on the same occasion for women or ≥5 for men) at least once in the past year.2 Consuming more than the recommended amount of alcohol is a major risk factor for health and social issues, injuries, accidents, and early death.3–5 Unhealthy drinking has been associated with cancer, pancreatitis, liver disease, psychopathology, sleep problems, hypertension, and other serious diseases,6–10 costing the United States $249 billion in 2010.11 Moreover, 88,000 deaths are attributable to consuming unhealthy levels of alcohol each year,12 making it the third leading preventable cause of death in the United States behind tobacco use and poor diet/lack of exercise.
The United States Preventive Services Task Force recommends screening for unhealthy drinking among adults ages 18 and older,13 and valid screening tools such as the Alcohol Use Disorders Identification Test (AUDIT),14 AUDIT-Consumption,15 and the Single Alcohol Screening Question16 exist for this purpose.
Primary care providers (PCPs) have an important role in identifying people with unhealthy drinking; yet, screening rates in primary care are low. In a representative survey of the US population, only 25% reported having been screened for alcohol use in the last year.17 Barriers to screening include lack of time and administrative support, need for modifications to office workflow, lack of training for PCPs, the stigma associated with alcohol misuse, and the fact that universal screening will not be applicable to the majority of patients.18–20 Efforts to impose universal screening through the use of electronic clinical reminders and/or performance measures have improved screening rates in some health care systems but are inconsistently used and can be hampered by low clinical staff buy-in.21,22
An alternative approach is a clinical prediction rule, which can automatically identify patients most likely to have unhealthy drinking, thereby reducing the burden on PCPs and staff. Previous research has shown that clinical prediction rules using prospectively collected data can successfully identify unhealthy drinking. Hartz et al23 used logistic regression and 40 laboratory values to distinguish 426 heavy drinkers from 188 light drinkers. Lichtenstein et al24 used linear regression plus clinical and laboratory values to predict heavy drinking. Harasymiw et al25,26 used discriminant function analysis to predict patient-reported alcohol use from a set of blood chemistry profiles. Korzec and colleagues27 built a predictive test for unhealthy drinking based on laboratory values and a clinical questionnaire using Bayesian networks. However, the generalizability of these studies is limited by small sample sizes and highly selected populations. Furthermore, questionnaires or prospective data collection offer little advantage over universal screening. Finally, neither logistic regression nor discriminant function analysis accommodates missing values, which are common in clinical data.
Clinical prediction rules using large, existing datasets and machine learning methods are gaining momentum in the medical literature and have been used to predict poststroke mortality,28 in-hospital mortality,29 peripheral artery disease and future mortality risk,30 infection in the emergency department,31 and mortality among colon cancer patients,32 to mention a few.
The purpose of this study was to build a clinical prediction rule for unhealthy drinking based on routinely collected demographic, clinical, and laboratory data and to compare its performance to a universal screening strategy. We hypothesized that a clinical prediction rule could discriminate patients with greater likelihood of unhealthy drinking from those with a low probability of unhealthy drinking who would not require further evaluation. The population of patients needing further evaluation would, therefore, be smaller, have a higher prevalence of unhealthy drinking, and yield more from additional evaluations. In this way, a prediction rule could save time and clinical resources, relieving providers from a function that is challenging to implement reliably.16,18,19,33
Materials and Methods
Data Source
Ideally, a clinical prediction model should be developed in the context in which it is intended to be used, based on data available in that context. However, drinking data are inconsistently recorded in electronic health records (EHRs). Therefore, to test our hypothesis that a machine learning approach could be used to build a model for identifying unhealthy drinking, we used a dataset that reliably collected drinking data from each patient.
We obtained deidentified demographic, clinical, and laboratory information on 43,545 nationally representative adults from the National Health and Nutrition Examination Survey (NHANES) from 1999 to 2016. To be included, the records needed responses to the alcohol questions to be used as a reference standard. Individuals younger than 18 years did not receive these questions. Demographic and clinical variables included age, sex, smoking status, height, weight, systolic and diastolic blood pressure, and resting heart rate. Laboratory data included 30 variables from routine clinical chemistries and hemograms (see Table 1). These variables were selected based on prior literature, clinical judgment, and the likelihood that the candidate predictor would be available in routine medical records.23,34,35 Drinking data were used to classify patients as having either unhealthy drinking or low-risk drinking. Unhealthy drinking was defined by ≥1 drink per day for women or ≥2 for men or binge drinking ≥1 per month in the past 12 months (≥4 drinks on the same occasion for women or ≥5 for men). Individuals not meeting criteria for unhealthy drinking were classified as low risk. This category includes nondrinkers.
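The reference-standard definition above can be written as a short rule. The following is a minimal sketch, not the study's code: the function name and its three inputs are hypothetical, and it assumes that daily consumption is expressed as an average number of drinks per day and that a binge episode has already been counted using the sex-specific threshold (≥4 drinks for women, ≥5 for men).

```python
def is_unhealthy_drinker(sex, avg_drinks_per_day, binge_episodes_per_month):
    """Apply the thresholds quoted in the text (hypothetical inputs).

    sex: "F" or "M"
    avg_drinks_per_day: assumed average daily consumption
    binge_episodes_per_month: occasions per month with >=4 (women)
        or >=5 (men) drinks, already counted upstream
    """
    daily_limit = 1 if sex == "F" else 2  # >=1 for women, >=2 for men
    return (avg_drinks_per_day >= daily_limit
            or binge_episodes_per_month >= 1)


# Anyone not meeting the criteria, including nondrinkers, is low risk.
status = "unhealthy" if is_unhealthy_drinker("M", 0.0, 0) else "low risk"
print(status)
```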
The data were randomly split into 3 independent sets: a training set (65%) for initial development of the model, a validation set (15%) to evaluate the initial model, and a test set (20%) to determine the final fit of the model to the data. The test set was stored separately until a final prediction algorithm was created and ready to use. Univariate analyses were performed to ensure the 3 random subsets were similar.
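A three-way random split of this kind might be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function name, the fixed seed, and the use of integer record IDs are all hypothetical, and because it cuts a shuffled list at exact fractions, the resulting set sizes will not match the sizes reported in the Results.

```python
import random


def split_records(record_ids, fractions=(0.65, 0.15, 0.20), seed=0):
    """Shuffle IDs and cut them into training, validation, and test lists."""
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = round(fractions[0] * len(ids))
    n_valid = round(fractions[1] * len(ids))
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]  # the remainder (about 20%)
    return train, valid, test


train, valid, test = split_records(range(43545))
print(len(train), len(valid), len(test))
```

Holding the test list aside until the final rule is fixed, as described above, keeps the final performance estimate independent of model tuning.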
Model Development and Selection
Six candidate machine learning methods were evaluated to determine the most appropriate approach to use for building a clinical prediction rule with this dataset. Logistic regression,36 support vector machines,37 neural networks,38 k-nearest neighbors,39 decision trees,40 and random forests41 were used individually to create clinical prediction rules for unhealthy drinking using the training set. These methods were chosen based on prior literature42,43 and because they each have unique advantages and disadvantages for classification (Appendix). Each method was tuned to maximize prediction in the training data using all 38 variables. The decision tree and random forest methods used techniques to extract information from missing values. Essentially, missing data were counted as another level or value of the variable. All resulting clinical prediction rules were run against the validation dataset, and the 1 with the largest area under the receiver operating characteristic curve (AUC)44 (the random forest) was selected as the target for further evaluation. Variables with an information gain of less than 2% (a measure of importance of each variable in predicting unhealthy drinking) were removed to create a more parsimonious and reproducible clinical prediction rule.45
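The selection step (compare candidate models by validation AUC and keep the largest) can be illustrated with a small sketch. The labels, scores, and model names below are toy values invented for illustration, not study data, and the pairwise AUC computation is a simple textbook formulation (quadratic in the number of records) rather than whatever routine the statistical packages used.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def pick_best_model(labels, scores_by_model):
    """Return the name of the model with the largest AUC, plus all AUCs."""
    aucs = {name: auc(labels, s) for name, s in scores_by_model.items()}
    return max(aucs, key=aucs.get), aucs


# Toy validation data (hypothetical): 1 = unhealthy drinking, 0 = low risk.
labels = [1, 0, 1, 0, 0, 1]
scores_by_model = {
    "model_a": [0.9, 0.2, 0.6, 0.4, 0.1, 0.7],
    "model_b": [0.5, 0.6, 0.7, 0.2, 0.3, 0.4],
}
best, aucs = pick_best_model(labels, scores_by_model)
print(best, aucs)
```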
Model Performance
We calculated the performance of the clinical prediction rule in the test set at various thresholds (estimated probabilities of unhealthy drinking), forming a receiver operating characteristic curve. Performance parameters included accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and workload improvement (“savings”). An operating threshold was chosen to optimize these values, with priority given to specificity over sensitivity. Accuracy was calculated as the number of correctly classified patients (true positives + true negatives) divided by the total population. The improvement in screening workload attributable to the clinical prediction rule (“savings”) was calculated as (1 − the positivity rate) and represents the reduction in the fraction of patients needing evaluation when using the prediction rule compared with the universal screening approach (100% evaluated).
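The performance parameters defined above all follow from the four cells of a confusion matrix at a given threshold. A minimal sketch, with a hypothetical function name and made-up counts that are not taken from the study:

```python
def screening_metrics(tp, fp, tn, fn):
    """Compute the performance parameters from confusion-matrix counts."""
    total = tp + fp + tn + fn
    positivity_rate = (tp + fp) / total  # fraction flagged for evaluation
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "savings": 1 - positivity_rate,  # versus evaluating 100%
    }


# Illustrative counts only (400 hypothetical patients).
metrics = screening_metrics(tp=50, fp=40, tn=260, fn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Repeating the calculation at each candidate threshold traces out the receiver operating characteristic curve and shows how savings trade off against sensitivity.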
Data management and statistical analyses were performed using Stata version 15 (Stata Corporation, College Station, TX), JMP Pro version 13 (SAS Institute Inc., Cary, NC), and Python version 3.6 (Python Software Foundation, Wilmington, DE). The University of Vermont Committees on Human Subjects determined that the study did not constitute human subjects research.
Results
Overall, the prevalence of unhealthy drinking was 26%. The 43,545 records were randomly assigned to training (n = 28,262), validation (n = 6474), and test (n = 8809) sets. There were no significant differences among the 3 sets for any of the 38 variables. A total of 6% of values were missing and 23% of records were missing at least 1 variable.
Table 1 shows demographic and laboratory values by the reference drinking status (unhealthy versus low risk). On average, respondents in the unhealthy drinking category consumed 4.1 drinks per drinking day. In contrast, low-risk adults (including abstainers) had 1.5 drinks per drinking day. Individuals with unhealthy drinking were more likely to be younger, male, and current cigarette smokers. Although the differences in many clinical and laboratory values were statistically significant, they were small and unlikely to be clinically important.
Table 2 shows a comparison of the AUCs of the various methods across the training, validation, and test sets and the performance parameters for each model in the validation set. The random forest model produced the largest AUC in both the training set (0.85) and the validation set (0.80) and outperformed the other machine learning methods in sensitivity, specificity, PPV, NPV, overall accuracy, and savings in the validation set (see Figure 1). The random forest model was used to build the final clinical prediction rule. It was the only method used in the final test set.
After selecting random forest as the final method, variables that contributed an information gain of <2% were dropped to create the most parsimonious model, ultimately including only 15/38 variables. The final model included the following predictors: age, current smoker, hemoglobin, sex, high-density lipoprotein, hematocrit, γ-glutamyl transpeptidase, mean cellular hemoglobin, uric acid, albumin, lactate dehydrogenase, mean corpuscular volume, systolic blood pressure, creatinine, and blood urea nitrogen (Table 3).
Compared with the presumed effects of universal screening (all patients are screened and all instances of unhealthy drinking are identified), the clinical prediction rule finds fewer unhealthy drinkers but at a much lower cost (see Figure 2). At a prevalence of 26% and at the optimum operating point, the clinical prediction rule has a sensitivity of 0.50, requiring that only 25% of the population undergo further evaluation (see Table 2). The PPV of 0.55 indicates that 55% of those flagged are identified as having unhealthy drinking, compared with 26% of all patients identified with universal screening. By eliminating 75% of the population with a relatively low risk of unhealthy drinking, the model increases the prevalence of unhealthy drinking in the identified group and lowers the number assessed from 43,545 to 10,886 in this population.
With the same prediction rule, the operating point could be shifted along the receiver operating characteristic curve to prioritize sensitivity. For example, an alternate operating point prioritizing sensitivity could produce a sensitivity of 0.88, specificity of 0.49, PPV of 0.38, and NPV of 0.92. However, 61% of the population (n = 26,562) would need to be evaluated.
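The relationship among prevalence, sensitivity, specificity, and the fraction of patients flagged can be checked with a short calculation. This is a sketch using only the rounded values reported in the text, so the derived figures agree with the reported PPV, NPV, and evaluated fraction only approximately; the function and variable names are hypothetical.

```python
def derived_from_rates(prevalence, sensitivity, specificity):
    """Derive positivity rate, PPV, and NPV from three reported rates."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    positivity = true_pos + false_pos  # fraction needing evaluation
    return {
        "positivity": positivity,
        "ppv": true_pos / positivity,
        "npv": (1 - prevalence) * specificity / (1 - positivity),
        "savings": 1 - positivity,
    }


# Operating point prioritizing specificity, then the one prioritizing
# sensitivity, both at the reported prevalence of 26%.
specific_point = derived_from_rates(0.26, 0.50, 0.86)
sensitive_point = derived_from_rates(0.26, 0.88, 0.49)
for label, point in (("specific", specific_point),
                     ("sensitive", sensitive_point)):
    print(label, {k: round(v, 2) for k, v in point.items()})
```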
Discussion
We used commonly available laboratory, clinical, and demographic information from a nationally representative dataset to build a clinical prediction rule for unhealthy drinking. The analysis, which includes more than 43,000 records, indicates that an automated tool can accurately identify unhealthy drinking by using commonly available secondary data, even with many missing values. Using a random forest model, we were able to predict unhealthy drinking with high specificity and modest sensitivity. Changing the operating point could allow for high sensitivity and modest specificity, if that were preferred. Random forest outperformed logistic regression and the other machine learning methods.
Prior studies on predicting unhealthy drinking have used classic statistical techniques with small data sets and limited computing power23,25–27,46 compared with more modern methods. These prospective studies had control over the recruitment process and the ability to minimize missing data, which may have helped their prediction results. In contrast, the current study used a large existing dataset and analytical methods that accounted for missing data.
In the curated NHANES dataset, individual values were missing less than 5% of the time, but in EHRs, we would expect many more missing values. Some machine learning methods, especially random forest, consider and use missing data to create the most robust model.47 Because all clinical data sources, including EHRs, have gaps, it is important that clinical prediction rules can account for missing data.
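One simple way to let a tree-based model use missing values, in the spirit of treating them as another level of the variable, is to map each value to a category and give missing entries a category of their own. The sketch below is an illustration only and is not claimed to be the mechanism used by the software in this study; the function name and the hemoglobin cut points are hypothetical.

```python
def to_level(value, cut_points):
    """Map a numeric value to an ordinal bin label; None gets its own level."""
    if value is None:
        return "missing"
    for i, cut in enumerate(cut_points):
        if value < cut:
            return f"bin_{i}"
    return f"bin_{len(cut_points)}"  # at or above the last cut point


# Hypothetical cut points for hemoglobin (g/dL), for illustration only.
hemoglobin_cuts = [12.0, 14.0, 16.0]
levels = [to_level(v, hemoglobin_cuts) for v in (11.0, 13.1, None, 16.0)]
print(levels)
```

With this encoding, a record with a missing laboratory value can still be routed down a tree branch instead of being imputed or dropped.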
We tested logistic regression and multiple machine learning methods on the training and validation sets. Random forest outperformed all other methods, likely because it is particularly robust to outliers, missing data, and nonlinear relationships.41 Although logistic regression is widely used in binary classification problems,48 results in the medical literature are inconclusive about whether logistic regression can predict as well as machine learning methods.28,29 A recent systematic review by Christodoulou et al49 found no performance benefit of machine learning methods over logistic regression. However, logistic regression and other methods that cannot handle missing data are not practical in a clinical setting because users would either need to impute the missing data before applying the rule or abandon prediction for many cases. Even in the NHANES data, a particularly well-groomed dataset, only 77% of records had complete data. The model for a medical domain should be chosen based on the problem to be solved; the understanding of the underlying biological, psychological, and social mechanisms; and the data available, rather than just whether the domain is medical or not.
The predictors of unhealthy drinking in the final model are biologically plausible and supported by the literature. Age, sex, smoking, and unhealthy drinking have been shown to be strongly correlated.1,50 Alcohol use is associated with increased levels of high-density lipoprotein, reportedly through an increased transport rate of apolipoproteins A-I and A-II.34 Others have used mean corpuscular volume, hemoglobin, γ-glutamyl transpeptidase, albumin, and systolic blood pressure in prediction models for heavy drinking.23–25 Despite race and ethnicity being associated with alcohol use, they were removed a priori due to common misclassification problems, especially in EHR data.51 To create the most parsimonious model, the random forest algorithm removed potential predictors that have a minimal effect on performance.
Universal screening results in many low-risk patients being offered an unnecessary intervention that PCPs are already reluctant to provide.16,18,19,33 This clinical prediction rule prioritizes specificity over sensitivity and identifies patients who are likely to truly be drinking at an unhealthy level. Therefore, the population appropriate for follow-up assessment is greatly reduced compared with universal screening, freeing up time and resources. The trade-off is that some patients with unhealthy drinking are incorrectly categorized as low risk, missing an opportunity to intervene. If the setting warrants, the model can operate at a higher sensitivity, with correspondingly lower specificity.
This study has limitations. First, the NHANES sample is meant to be representative of the general population of adults in the United States, which may be different from those seeking primary care. The study population undoubtedly included some adults who would not be subjects for screening because, for example, they had a previously diagnosed alcohol use disorder. Second, the NHANES data may not be representative of EHR data, which would be used in practice. EHRs are likely to have much more missing data. However, random forest models are robust to missing data. Third, NHANES questionnaires were administered in person, possibly introducing social desirability response bias.52 Therefore, alcohol and tobacco use may be underreported compared with self-report paper or electronic questionnaires. Because smoking was an important predictor in the model and alcohol use is the outcome, inaccurate reporting could result in misclassification. Nonetheless, self-report is the typical method for assessing smoking status and alcohol use in health care settings. Fourth, the prediction rule is not very transparent. Notably, it offers no single estimate of the relationship between any predictor and the outcome analogous to the odds ratio from a regression. A single predictor may seem to be harmful in some subgroups of patients and protective in others. Finally, we believe that this analysis overestimates the performance of universal screening because it assumes that all patients would be screened. In fact, a relatively low fraction of primary care patients are routinely screened with a validated tool such as the AUDIT.17
Conclusions
Motivated by critical barriers facing PCPs in identifying unhealthy drinking, we describe an alternative approach to routine universal screening: a clinical prediction rule based on existing data. This method could reduce the burden on PCPs and allow them to focus their attention on those who need it most. The virtue of the clinical prediction rule is not that it is perfectly accurate but that it is fast, inexpensive, unobtrusive, and identifies a subset of patients at a higher risk of unhealthy drinking.
Appendix
Notes
This article was externally peer reviewed.
Conflicts of Interest: none.
Funding: This work was supported by the National Institute of Alcohol Abuse and Alcoholism award number 1R41AA025297 to Gail L. Rose (PI).
To see this article online, please go to: http://jabfm.org/content/33/3/397.full.
- Received for publication November 15, 2019.
- Revision received January 31, 2020.
- Accepted for publication February 5, 2020.