Abstract

Predicting diabetes remains a significant hurdle in the adoption of machine learning within the medical domain, because the data available for prediction are often of limited quality and the resulting predictions are difficult to interpret. In this study, four tree-based machine learning techniques (LightGBM, XGBoost, CatBoost, and Gradient Boosting) were applied to predict diabetes using the 2015 BRFSS dataset, and two ensemble methods (soft voting and stacking) were employed to improve predictive accuracy. Among the single classifiers, CatBoost achieved the highest accuracy (0.871), with an F1-score of 0.871 and a ROC–AUC of 0.921. Both ensemble methods improved on the individual models. The soft voting ensemble obtained an accuracy of 0.878 and a ROC–AUC of 0.928, whereas the stacking ensemble achieved the highest overall performance, with an accuracy of 0.883, an F1-score of 0.883, and a ROC–AUC of 0.934. Moreover, SHAP-based analysis identified general health, body mass index, and age as the most influential factors affecting the prediction outcomes. We conclude that ensemble learning improves predictive performance while still providing a level of interpretability when assessing the risk of developing diabetes.
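As a rough illustration of the soft voting idea mentioned in the abstract, the sketch below averages per-class probabilities from several classifiers and picks the class with the highest mean. It is not the authors' implementation: the function names `soft_vote` and `predict_labels`, the equal default weights, and the probability values are all assumptions made for this example, and no real model or BRFSS data is involved.

```python
from typing import List, Optional, Sequence


def soft_vote(
    prob_sets: Sequence[Sequence[Sequence[float]]],
    weights: Optional[Sequence[float]] = None,
) -> List[List[float]]:
    """Weighted average of class probabilities across models.

    prob_sets[m][i][c] is model m's probability that sample i belongs
    to class c. Equal weights are assumed when none are given.
    """
    if weights is None:
        weights = [1.0] * len(prob_sets)
    total = sum(weights)
    n_samples = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])
    averaged = []
    for i in range(n_samples):
        row = [
            sum(w * probs[i][c] for w, probs in zip(weights, prob_sets)) / total
            for c in range(n_classes)
        ]
        averaged.append(row)
    return averaged


def predict_labels(averaged: Sequence[Sequence[float]]) -> List[int]:
    """Pick the class index with the highest averaged probability."""
    return [max(range(len(row)), key=row.__getitem__) for row in averaged]


# Made-up probabilities (no diabetes, diabetes) from three hypothetical
# classifiers on two samples; these are not results from the study.
model_a = [[0.70, 0.30], [0.40, 0.60]]
model_b = [[0.60, 0.40], [0.55, 0.45]]
model_c = [[0.80, 0.20], [0.30, 0.70]]

averaged = soft_vote([model_a, model_b, model_c])
labels = predict_labels(averaged)
```

In this toy case the second model alone would assign the second sample to class 0, but the averaged probabilities favor class 1. Stacking differs in that a further model is trained on the base models' outputs rather than averaging them.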

Included in

Engineering Commons
