Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China.
Hospital of Traditional Chinese Medicine Affiliated to the Fourth Clinical Medical College of Xinjiang Medical University, Urumqi, China.
Biomed Res Int. 2021 Mar 29;2021:6696041. doi: 10.1155/2021/6696041. eCollection 2021.
To establish a machine learning model for identifying patients coinfected with hepatitis B virus (HBV) and human immunodeficiency virus (HIV) through two sexual transmission routes in Jiangsu, China.
A total of 14197 HIV cases transmitted by homosexual and heterosexual routes were recruited. After data processing, 12469 cases (HIV and HBV, 1033; HIV only, 11436) remained for further analysis, including 7849 cases with homosexual transmission and 4620 cases with heterosexual transmission. Univariate logistic regression was used to select variables with significant P values and odds ratios for multivariable analysis. In the homosexual and heterosexual transmission groups, 10 and 6 variables were selected, respectively. To identify HIV-infected individuals coinfected with HBV, machine learning models were constructed with four algorithms: Decision Tree, Random Forest, AdaBoost with decision tree (AdaBoost), and extreme gradient boosting decision tree (XGBoost). The detective value of each variable was calculated using the best-performing machine learning algorithm.
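As a rough illustration of the boosting step described above, here is a minimal sketch of AdaBoost over one-split decision trees (stumps) in plain Python. It is not the authors' implementation: the toy data, the function names, and the use of summed stump weights as a stand-in for a per-variable "detective value" are all assumptions made for illustration.

```python
import math

def stump_predict(row, j, thr, pol):
    """One-split tree: predict pol if feature j exceeds thr, else -pol."""
    return pol if row[j] > thr else -pol

def fit_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, thr, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, rounds=5):
    """Discrete AdaBoost with labels in {-1, +1}; returns a list of weighted stumps."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, thr, pol = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this stump
        model.append((alpha, j, thr, pol))
        # Increase the weight of misclassified cases, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, thr, pol))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, row):
    score = sum(a * stump_predict(row, j, thr, pol) for a, j, thr, pol in model)
    return 1 if score >= 0 else -1

def importance(model, n_features):
    """Assumed proxy for variable importance: share of stump weight per feature."""
    totals = [0.0] * n_features
    for a, j, _, _ in model:
        totals[j] += a
    s = sum(totals)
    return [t / s for t in totals]

# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[1, 5], [2, 3], [3, 8], [4, 1], [6, 7], [7, 2], [8, 6], [9, 4]]
y = [-1, -1, -1, -1, 1, 1, 1, 1]

model = adaboost(X, y)
print([predict(model, row) for row in X])
print(importance(model, 2))
```

In practice a library implementation would be used, together with held-out data for evaluation; the sketch only shows how reweighting misclassified cases and summing weighted stumps yields both a classifier and a ranking of variables.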
The AdaBoost algorithm showed the highest efficiency in both transmission groups (homosexual transmission group: accuracy = 0.928, precision = 0.915, recall = 0.944, F1 = 0.930, and AUC = 0.96; heterosexual transmission group: accuracy = 0.892, precision = 0.881, recall = 0.905, F1 = 0.893, and AUC = 0.98). As calculated by the AdaBoost algorithm, the detective value of PLA was the highest in the homosexual transmission group, followed by CR, AST, HB, ALT, TBIL, leucocyte, age, marital status, and treatment condition. In the heterosexual transmission group, the detective value of PLA was also the highest, followed by ALT, AST, TBIL, leucocyte, and symptom severity.
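The fourth metric reported for each group can be checked against the reported precision and recall, since the F1 score is their harmonic mean. The short sketch below does that arithmetic. The function name is an assumption, and small discrepancies are expected because the inputs are already rounded to three decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reported precision and recall for the two transmission groups.
homosexual = f1_score(0.915, 0.944)
heterosexual = f1_score(0.881, 0.905)

print(round(homosexual, 3))    # 0.929, within rounding of the reported 0.930
print(round(heterosexual, 3))  # 0.893, matching the reported value
```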
Univariate logistic regression combined with the AdaBoost algorithm could accurately screen risk factors for HBV coinfection in HIV-infected individuals without invasive testing. Further studies are needed to evaluate the utility and feasibility of this model in various settings.