Appendix.

Selected Machine Learning Methods for Classification of Unknown Cases into Mutually Exclusive Categories

Random Forest

  Advantages:

  • Low computational cost

  • Uses missing data to inform the model

  • Can handle a large number of records and variables

  • Provides estimates of the information gained by each input variable

  • Works well with nonlinear data

  Disadvantages:

  • Not ideal for rare outcomes

  • Very difficult to interpret individual variable contributions to the classification

  • Time-consuming hyperparameter tuning

  • Prone to overfitting
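As a minimal sketch (not taken from the paper), a random forest classifier fit with scikit-learn on synthetic data, assuming that library is available; the per-variable importances illustrate the "information gained by each input variable" noted above.

```python
# Hedged sketch: random forest classification on synthetic two-class data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real records: 500 cases, 10 variables.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Importance estimate for each input variable; the values sum to 1.
importances = clf.feature_importances_
```

Note that the importances rank variables but do not explain any single case's classification, which is the interpretability limitation listed above.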

Support Vector Machines

  Advantages:

  • Low computational cost

  • Effective when the number of variables exceeds the number of records (very wide data)

  Disadvantages:

  • Needs a clear margin of separation between outcomes (e.g., unhealthy drinking vs. low-risk drinking)

  • Time-consuming hyperparameter tuning

  • Not efficient with a large number of records
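A minimal sketch, again assuming scikit-learn: an RBF-kernel support vector machine on "wide" synthetic data (more variables than records), with the feature scaling SVMs generally require.

```python
# Hedged sketch: SVM on wide data (variables > records).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 80 records but 200 variables: wider than it is long.
X, y = make_classification(n_samples=80, n_features=200,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features first; SVM fit quality is sensitive to feature scale.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```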

Neural Networks

  Advantages:

  • Works well with nonlinear data

  • Extremely useful with a large number of predictors (high dimensionality, e.g., image data)

  • Any numeric data can be used

  Disadvantages:

  • High computational cost during training

  • Time-consuming hyperparameter tuning

  • Needs a relatively large number of records for the training set

  • Very difficult to interpret individual variable contributions to the classification

  • Must have many records per variable

  • Prone to overfitting
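A minimal sketch using scikit-learn's multilayer perceptron as the neural network (an assumption; the paper names no architecture). A comparatively large training set and an iteration cap raised for convergence reflect the training-cost and data-volume points above.

```python
# Hedged sketch: a small feed-forward neural network classifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Neural networks need relatively many records, so use 1,000 here.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(
    StandardScaler(),  # networks train poorly on unscaled inputs
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```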

K-Nearest Neighbors

  Advantages:

  • Very simple construction requiring minimal specifications (i.e., hyperparameters)

  • Intuitive methodology

  Disadvantages:

  • High computational cost

  • Challenging with a large number of variables (wide data)

  • Cannot handle imbalanced data

  • Very sensitive to outliers

  • Cannot handle missing data
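A minimal sketch, assuming scikit-learn: k-nearest neighbors with scaled features, since distance-based methods are sensitive to feature scale (and, as noted above, to outliers). The single main specification is k, the number of neighbors.

```python
# Hedged sketch: k-nearest neighbors classification.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k (n_neighbors) is essentially the only hyperparameter to choose.
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```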

Decision Trees

  Advantages:

  • Can handle missing data

  • No data preprocessing needed

  • Provides a highly intuitive explanation of the prediction

  Disadvantages:

  • Highly biased toward the training set

  • Relatively inaccurate compared with other models
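A minimal sketch, assuming scikit-learn: a depth-limited decision tree, with `export_text` printing the if/else splits that make the prediction intuitively explainable. The depth limit is one way to curb the bias toward the training set noted above.

```python
# Hedged sketch: a shallow, human-readable decision tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth=3 keeps the tree small enough to read and limits overfitting.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Plain-text dump of the split rules, one line per branch.
rules = export_text(clf)
```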

Logistic Regression

  Advantages:

  • Common and understood by most practitioners

  • Relatively easy to implement

  • Loss function is always convex, so a global optimum can be found

  Disadvantages:

  • Proper selection of features is required

  • Cannot handle missing data

  • Needs data preprocessing and handling to cover nonlinear data

  • Cannot handle a large number of categorical predictors
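A minimal sketch, assuming scikit-learn: logistic regression on complete (no missing values) numeric data, as the method requires. Because the loss is convex, the solver converges to a global optimum.

```python
# Hedged sketch: logistic regression for binary classification.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Complete numeric data; logistic regression cannot handle missing values.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```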