Comparative Analysis of Machine Learning Models for Predicting Hospital- and Community-Associated Urinary Tract Infections Using Demographic, Hospital, and Socioeconomic Predictors.

Abstract

BACKGROUND: Urinary tract infections (UTI) are among the most common infections encountered in both community and healthcare settings. Differentiating between community-associated UTI (CA-UTI) and healthcare-associated UTI (HA-UTI) is crucial for understanding their epidemiology, identifying risk factors, and developing appropriate treatment strategies. Machine learning (ML) techniques have shown significant potential in improving the accuracy of predicting these infections, enabling more effective interventions and better patient outcomes. While previous studies have demonstrated the utility of ML models in various healthcare settings, there is still a need for a comparative analysis of different ML approaches, particularly in distinguishing between CA-UTI and HA-UTI and assessing the risk of UTI among hospitalized patients.

OBJECTIVE: Using 2019-2023 patient demographic, hospital, and socioeconomic data, this study aims to build, validate, and compare five machine learning models, namely Decision Tree (DT), Neural Network (NN), Logistic Regression (LR), Random Forest (RF), and Extreme Gradient Boosting (XGBoost), for differentiating between HA-UTI and CA-UTI. Additionally, it seeks to identify key predictors of UTI among the demographic, hospital, and socioeconomic variables.

RESULTS: The DT model demonstrated the highest sensitivity, particularly in handling the highly imbalanced HAI data, with a sensitivity of 87%. LR achieved the best overall accuracy, at 95.9% for HA-UTI and 93.2% for HA-UTI vs. CA-UTI. RF performed best in cross-validation, reaching 99.1% for HA-UTI and 96.2% for HA-UTI vs. CA-UTI. NN showed the highest specificity, at 93.4%, for HA-UTI vs. CA-UTI. The AUC values further supported these findings, ranging from 71.9% for NN to 96% for RF, reflecting the robustness of these models across the annual datasets. Among the patient demographic, hospital, and socioeconomic variables, all models consistently identified nurse units (e.g., inpatient units and mental health units) as the most significant predictors of UTI. In addition to nurse units, LR and DT identified location (e.g., various clinics and medical centres) as a key predictor. For HA-UTI vs. CA-UTI, the key predictors varied across years, with patient age, median household income, and gender emerging intermittently.
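
The abstract reports accuracy, sensitivity, specificity, and AUC but does not include the evaluation code. As an illustration only, the sketch below shows one common way to compute these four metrics for a binary task, and why accuracy alone can mislead on imbalanced data. The function names and the toy labels are hypothetical and are not taken from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate), and specificity (true negative rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from the study): 2 positives among 20 cases.
toy_labels = [1, 1] + [0] * 18
always_negative = [0] * 20  # a trivial model that never predicts an infection
toy_metrics = classification_metrics(toy_labels, always_negative)

# Hypothetical scores for four cases, used to illustrate the AUC calculation.
toy_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

On the toy data, the trivial classifier scores high on accuracy and specificity while detecting no positive cases, which is why sensitivity is reported alongside accuracy when the positive class is rare.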

CONCLUSION: The predictive accuracy of the machine learning models is broadly similar, with some differences in sensitivity and specificity for both tasks: identifying HA-UTI and distinguishing HA-UTI from CA-UTI. Nurse units consistently emerge as the most significant predictors across all years. The importance of other predictors, such as socioeconomic factors and location, varies from year to year, highlighting the need to incorporate those variables into surveillance systems to optimize the accuracy of predictions.

Last updated on 05/09/2025