Research Article: Original Research

A Machine Learning Approach to Identification of Unhealthy Drinking

Levi N. Bonnell, Benjamin Littenberg, Safwan R. Wshah and Gail L. Rose
The Journal of the American Board of Family Medicine May 2020, 33 (3) 397-406; DOI: https://doi.org/10.3122/jabfm.2020.03.190421
Levi N. Bonnell, MPH; Benjamin Littenberg, MD; Safwan R. Wshah, PhD; Gail L. Rose, PhD

From University of Vermont College of Medicine, Burlington (LNB, BL, GLR); University of Vermont, College of Engineering and Mathematical Sciences, Burlington (SRW).
Figures & Data

Figures

Figure 1.

    Random Forest AUC for training, validation, and test sets. Abbreviations: AUC, area under the receiver operating characteristic curve.

Figure 2.

    Population effect of using the clinical prediction rule to identify unhealthy drinking compared with universal screening. Abbreviations: NHANES, National Health and Nutrition Examination Survey; AUDIT-C, Alcohol Use Disorders Identification Test – alcohol consumption questions.
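The population effect that Figure 2 illustrates can be approximated from the test-set operating point in Table 2 (sensitivity 0.50, specificity 0.88) and the cohort prevalence implied by Table 1. A minimal sketch, with the arithmetic and the 100,000-patient population ours for illustration rather than the figure's exact model:

```python
# Illustrative arithmetic: how a prediction rule with the test-set
# operating point from Table 2 would triage a population, compared
# with universal AUDIT-C screening.

prevalence = 0.26   # unhealthy drinkers: 11,464 / 43,545 in Table 1
sensitivity = 0.50  # random forest, test set (Table 2)
specificity = 0.88

population = 100_000

unhealthy = population * prevalence
low_risk = population - unhealthy

# Flagged = true positives plus false positives at this operating point.
flagged = unhealthy * sensitivity + low_risk * (1 - specificity)
missed = unhealthy * (1 - sensitivity)

# Under the rule, only flagged patients go on to AUDIT-C screening;
# universal screening would administer it to everyone.
screens_saved = population - flagged

print(f"patients flagged for AUDIT-C: {flagged:,.0f}")
print(f"screens avoided vs universal: {screens_saved:,.0f}")
print(f"unhealthy drinkers missed:    {missed:,.0f}")
```

The tradeoff Figure 2 depicts is visible directly: most screens are avoided, at the cost of the unhealthy drinkers the rule fails to flag.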

Tables

Table 1.

    Characteristics of the Cohort Stratified by Unhealthy Drinking Status

| Variable | Unhealthy Drinkers (n = 11,464), % or Median | Low-Risk Drinkers (n = 32,081), % or Median | P Value* |
| --- | --- | --- | --- |
| Demographic Information | | | |
| Sex, male† | 67% | 42% | <0.001 |
| Smoking, current† | 36% | 15% | <0.001 |
| Age, years† | 38 | 53 | <0.001 |
| Clinical Information | | | |
| Height, cm | 171.2 | 165.4 | <0.001 |
| Weight, kg | 80 | 77.3 | <0.001 |
| Systolic blood pressure, mm Hg† | 120 | 122 | <0.001 |
| Diastolic blood pressure, mm Hg | 72 | 70 | <0.001 |
| Resting pulse rate, 60-second count | 72 | 72 | 0.13 |
| Chemistry | | | |
| Calcium, mg/dL | 9.4 | 9.4 | <0.001 |
| Chloride, mmol/L | 104 | 104 | <0.001 |
| Phosphorus, mg/dL | 3.7 | 3.7 | <0.001 |
| Potassium, mmol/L | 4 | 4 | 0.006 |
| Sodium, mmol/L | 139 | 139 | <0.001 |
| Blood urea nitrogen, mmol/L† | 4.3 | 4.6 | <0.001 |
| Creatinine, mg/dL† | 0.86 | 0.82 | <0.001 |
| Bicarbonate, mmol/L | 25 | 25 | <0.001 |
| Glucose, mg/dL | 90 | 93 | <0.001 |
| Uric acid, mg/dL† | 5.6 | 5.2 | <0.001 |
| Serum osmolality, mOsm/kg | 277 | 278 | <0.001 |
| Liver function | | | |
| Bilirubin, mg/dL | 0.7 | 0.6 | <0.001 |
| Alanine aminotransferase, U/L | 23 | 20 | <0.001 |
| Aspartate aminotransferase, U/L | 24 | 23 | <0.001 |
| Alkaline phosphatase, U/L | 65 | 68 | <0.001 |
| Gamma-glutamyl transpeptidase, U/L† | 23 | 19 | <0.001 |
| Lactate dehydrogenase, U/L† | 124 | 130 | <0.001 |
| Protein, g/dL | 7.2 | 7.2 | <0.001 |
| Albumin, g/L† | 44 | 42 | <0.001 |
| Hematology | | | |
| Hemoglobin, g/dL† | 14.8 | 13.9 | <0.001 |
| Hematocrit, %† | 43.4 | 41.2 | <0.001 |
| Mean corpuscular volume, fL† | 90.5 | 89.8 | <0.001 |
| Mean cellular hemoglobin, pg† | 30.9 | 30.5 | <0.001 |
| Red blood cell distribution width, % | 12.6 | 12.9 | <0.001 |
| White blood cell count, 1000/µL | 7.1 | 6.9 | <0.001 |
| Platelet count, 1000/µL | 8.1 | 8.1 | <0.001 |
| Lipids | | | |
| Total cholesterol, mg/dL | 194 | 193 | 0.25 |
| High density lipoprotein, mg/dL† | 51 | 50 | <0.001 |
| Calculated low density lipoprotein, mg/dL | 110 | 111.4 | 0.002 |
| Triglyceride, mg/dL | 118 | 121 | 0.18 |
* P value determined by χ2 or Wilcoxon rank-sum test. Because the rank-sum test considers the entire distribution of each group, it can detect statistically significant differences even when the medians are identical.

† Variables included in the final prediction model.
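The footnote's point, that a rank-sum test compares whole distributions and so can reach significance even when two medians coincide, can be demonstrated directly. A minimal sketch with hypothetical data (not the study's) and a hand-rolled Mann-Whitney U with its large-sample normal approximation:

```python
# Two hypothetical samples with identical medians but clearly different
# distributions: the Wilcoxon rank-sum (Mann-Whitney U) test still rejects.
from math import sqrt
from statistics import median

def mann_whitney_u(a, b):
    """U statistic: count of pairs where the a-value exceeds the b-value
    (ties contribute 0.5)."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)

a = list(range(99, 109)) + [109] + list(range(120, 201, 10))
b = list(range(10, 91, 10)) + [108] + list(range(109, 119))

assert median(a) == median(b) == 108.5  # identical medians by construction

u = mann_whitney_u(a, b)
n, m = len(a), len(b)
# Normal approximation to the null distribution of U (no tie correction).
z = (u - n * m / 2) / sqrt(n * m * (n + m + 1) / 12)
print(f"U = {u}, z = {z:.2f}")  # z ≈ 2.19: significant at the 0.05 level
```

This mirrors rows like potassium in Table 1, where both reported medians are 4 yet the test yields P = 0.006.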

Table 2.

    Performance of the Various Machine Learning Models in the Validation Set Using All 38 Variables*

| Model | Training AUC (95% CI) | Validation AUC (95% CI) | Sensitivity | Specificity | PPV | NPV | Overall Accuracy | Savings | Test AUC (95% CI) | Sensitivity | Specificity | PPV | NPV | Overall Accuracy | Savings |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Universal Screening (No rule) | — | — | 1.0 | 1.0 | 0.26 | 0.74 | 1.0 | 0% | — | 1.0 | 1.0 | 0.26 | 0.74 | 1.0 | 0% |
| Random Forest | 0.85 (0.84–0.86) | 0.80 (0.79–0.81) | 0.45 | 0.90 | 0.58 | 0.82 | 0.79 | 85% | 0.78 (0.77–0.79) | 0.50 | 0.88 | 0.55 | 0.83 | 0.76 | 75% |
| Support Vector Machines | 0.81 (0.80–0.82) | 0.77 (0.76–0.78) | 0.34 | 0.89 | 0.50 | 0.79 | 0.74 | 82% | — | — | — | — | — | — | — |
| Neural Networks | 0.79 (0.78–0.80) | 0.78 (0.77–0.78) | 0.36 | 0.90 | 0.58 | 0.80 | 0.76 | 82% | — | — | — | — | — | — | — |
| K-nearest Neighbors | 0.78 (0.78–0.79) | 0.75 (0.74–0.76) | 0.35 | 0.84 | 0.45 | 0.78 | 0.71 | 79% | — | — | — | — | — | — | — |
| Decision Trees | 0.77 (0.76–0.78) | 0.75 (0.73–0.76) | 0.34 | 0.90 | 0.56 | 0.79 | 0.75 | 83% | — | — | — | — | — | — | — |
| Logistic Regression | 0.76 (0.75–0.77) | 0.71 (0.70–0.73) | 0.48 | 0.85 | 0.55 | 0.81 | 0.74 | 76% | — | — | — | — | — | — | — |
* Sensitivity, specificity, PPV, NPV, overall accuracy, and savings are all calculated at the selected optimum operating point in each case.

PPV, positive predictive value; NPV, negative predictive value; AUC, area under the receiver operating characteristic curve; CI, confidence interval.
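Given a model's predicted probabilities and a chosen threshold, every column in Table 2 except AUC follows from a single confusion matrix. A minimal pure-Python sketch with hypothetical scores; note that "savings" here is our assumption (the fraction of patients the rule leaves unflagged and thus spares the AUDIT-C, consistent with Figure 2), not a formula stated in the table:

```python
# Operating-point metrics as in Table 2: threshold predicted probabilities
# and tabulate one confusion matrix. All inputs below are hypothetical.

def operating_point(y_true, y_score, threshold):
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted = score >= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / len(y_true),
        # Assumed definition: screens avoided relative to universal
        # screening, i.e., the fraction of patients left unflagged.
        "savings": (tn + fn) / len(y_true),
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.3, 0.7, 0.35, 0.2, 0.2, 0.1, 0.05]
m = operating_point(y_true, y_score, threshold=0.5)
print(m)
```

Sweeping the threshold traces the ROC curve whose area is the AUC reported in the table; the "optimum operating point" is one threshold chosen along that curve.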

Table 3.

    Information Gain of Variables Used in the Final Prediction Model

| Variable | Information Gain (%) |
| --- | --- |
| Age | 28.1 |
| Current smoker | 10.7 |
| Hemoglobin | 7.7 |
| Sex | 7.3 |
| High density lipoprotein | 6.3 |
| Hematocrit | 6.0 |
| Gamma-glutamyl transpeptidase | 5.4 |
| Mean cellular hemoglobin | 4.8 |
| Uric acid | 4.4 |
| Albumin | 3.7 |
| Lactate dehydrogenase | 3.2 |
| Mean corpuscular volume | 3.2 |
| Systolic blood pressure | 3.1 |
| Creatinine | 3.1 |
| Blood urea nitrogen | 3.0 |
Appendix.

    Selected Machine Learning Methods for Classification of Unknown Cases into Mutually Exclusive Categories

Random Forest

  Advantages:
  • Low computational cost
  • Uses missing data to inform the model
  • Can handle a large number of records and variables
  • Provides estimates of the information gained by each input variable
  • Works well with nonlinear data

  Disadvantages:
  • Not ideal for rare outcomes
  • Very difficult to interpret individual variable contributions to classification
  • Time-consuming hyperparameter tuning
  • Prone to overfitting

Support Vector Machines

  Advantages:
  • Low computational cost
  • Effective when the number of variables exceeds the number of records (very wide data)

  Disadvantages:
  • Needs a clear margin of separation between outcomes (unhealthy vs low-risk drinking)
  • Time-consuming hyperparameter tuning
  • Not efficient with a large number of records

Neural Networks

  Advantages:
  • Works well with nonlinear data
  • Extremely useful with a large number of predictors (high dimensionality, e.g., image data)
  • Any numeric data can be used

  Disadvantages:
  • High computational cost during training
  • Time-consuming hyperparameter tuning
  • Needs a relatively large number of records for the training set
  • Very difficult to interpret individual variable contributions to classification
  • Must have many records per variable
  • Prone to overfitting

K-Nearest Neighbors

  Advantages:
  • Very simple construction requiring minimal specifications (hyperparameters)
  • Intuitive methodology

  Disadvantages:
  • High computational cost
  • Challenging with a large number of variables (wide data)
  • Cannot handle imbalanced data
  • Very sensitive to outliers
  • Cannot handle missing data

Decision Trees

  Advantages:
  • Can handle missing data
  • No data preprocessing needed
  • Provides a highly intuitive explanation of the prediction

  Disadvantages:
  • Highly biased toward the training set
  • Relatively inaccurate compared with other models

Logistic Regression

  Advantages:
  • Common and understood by most
  • Relatively easy to implement
  • Loss function is always convex

  Disadvantages:
  • Proper selection of features is required
  • Cannot handle missing data
  • Needs data preprocessing to handle nonlinear data
  • Cannot handle a large number of categorical predictors
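The appendix singles out k-nearest neighbors for its intuitive methodology: a new case simply takes the majority label of its k closest training cases. A minimal pure-Python sketch with toy two-feature data (the study itself used 38 variables; feature names and values below are hypothetical):

```python
# k-nearest neighbors: classify a query point by majority vote among the
# k training records closest to it in feature space.
from collections import Counter
from math import dist

def knn_predict(train_x, train_y, query, k=3):
    # Rank training records by Euclidean distance to the query,
    # then take the most common label among the k nearest.
    neighbors = sorted(zip(train_x, train_y), key=lambda p: dist(p[0], query))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Toy data: (age, GGT) pairs; label 1 = unhealthy drinking (hypothetical).
train_x = [(25, 40), (30, 35), (28, 50), (60, 15), (55, 20), (65, 18)]
train_y = [1, 1, 1, 0, 0, 0]
print(knn_predict(train_x, train_y, (27, 45)))  # → 1
print(knn_predict(train_x, train_y, (62, 17)))  # → 0
```

Because Euclidean distance is scale-sensitive, features would be standardized before use in practice; the appendix's noted weaknesses (cost per query, wide data, imbalance) all follow directly from this brute-force distance ranking.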


Keywords

  • Alcohol Drinking
  • Alcoholism
  • Area Under Curve
  • Clinical Decision Rules
  • Decision Trees
  • Logistic Models
  • Machine Learning
  • Neural Networks (Computer)
  • Nutrition Surveys
  • Support Vector Machine
