Gerussi Alessio, Verda Damiano, Cappadona Claudio, Cristoferi Laura, Bernasconi Davide Paolo, Bottaro Sandro, Carbone Marco, Muselli Marco, Invernizzi Pietro, Asselta Rosanna
Division of Gastroenterology, Center for Autoimmune Liver Diseases, Department of Medicine and Surgery, University of Milano-Bicocca, 20900 Monza, Italy.
European Reference Network on Hepatological Diseases (ERN RARE-LIVER), San Gerardo Hospital, 20900 Monza, Italy.
J Pers Med. 2022 Sep 26;12(10):1587. doi: 10.3390/jpm12101587.
The application of machine learning (ML) to individual-level genetic data is an anticipated advance for the field, but it is still in its infancy. Here, we aimed to evaluate the feasibility and accuracy of an ML-based model for disease risk prediction in primary biliary cholangitis (PBC).
Variants reaching genome-wide significance in subjects of European ancestry in the recently released second international meta-analysis of genome-wide association studies (GWAS) in PBC were used as input data. Quality-checked, individual-level genomic data from two Italian cohorts were analysed. The ML workflow comprised the following steps: import of genotype and phenotype data, genetic variant selection, supervised classification of PBC by genotype, generation of "if-then" rules for disease prediction by a logic learning machine (LLM), and model validation in a different cohort.
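As a minimal sketch of what an "if-then" rule over genotypes can look like, the Python below evaluates one rule against a handful of individuals and computes its covering. Everything in it is illustrative: the variant names, the toy genotypes, the `Rule` class, and the `covering` helper are hypothetical, and covering is assumed here to mean the percentage of individuals in the rule's predicted class that satisfy the rule. It is not the LLM implementation used in the study.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Genotype encoded as risk-allele count (0, 1 or 2) per variant.
# Variant names below are placeholders, not variants from the study.
Genotype = Dict[str, int]


@dataclass
class Rule:
    """IF every condition holds THEN predict `predicted_label`."""
    conditions: Dict[str, Callable[[int], bool]]
    predicted_label: str

    def matches(self, genotype: Genotype) -> bool:
        return all(test(genotype[variant])
                   for variant, test in self.conditions.items())


def covering(rule: Rule, genotypes: List[Genotype], labels: List[str]) -> float:
    """Assumed definition: percentage of individuals belonging to the
    rule's predicted class that satisfy the rule."""
    in_class = [g for g, y in zip(genotypes, labels) if y == rule.predicted_label]
    if not in_class:
        return 0.0
    return 100.0 * sum(rule.matches(g) for g in in_class) / len(in_class)


# Hypothetical rule: IF variant_A dosage >= 1 AND variant_B dosage == 2 THEN PBC
rule = Rule(
    conditions={
        "variant_A": lambda dosage: dosage >= 1,
        "variant_B": lambda dosage: dosage == 2,
    },
    predicted_label="PBC",
)

# Toy individuals, invented purely for illustration.
genotypes = [
    {"variant_A": 1, "variant_B": 2},
    {"variant_A": 0, "variant_B": 2},
    {"variant_A": 2, "variant_B": 2},
    {"variant_A": 2, "variant_B": 1},
    {"variant_A": 1, "variant_B": 2},
]
labels = ["PBC", "PBC", "control", "PBC", "PBC"]

toy_covering = covering(rule, genotypes, labels)
print(f"covering = {toy_covering:.1f}%")
```

Because each rule is a conjunction of readable conditions, a prediction can be traced back to the specific genotypes that triggered it, which is the sense in which such rules are intelligible.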
The training cohort included 1345 individuals: 444 PBC cases and 901 healthy controls. After pre-processing, 41,899 variants entered the analysis. Several configurations of the feature-selection parameters were tested. The best LLM model reached an accuracy of 71.7%, a Matthews correlation coefficient of 0.29, a Youden's index of 0.21, a sensitivity of 0.28, a specificity of 0.93, a positive predictive value of 0.66, and a negative predictive value of 0.72. Thirty-eight rules were generated. The rule with the highest covering (19.14) included the following genes: RIN3, KANSL1, TIMMDC1, TNPO3. The validation cohort included 834 individuals: 255 cases and 579 controls. When the ruleset derived in the training cohort was applied to the validation cohort, the area under the curve (AUC) of the model was 0.73.
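The training-cohort metrics reported above are all functions of one confusion matrix, so their mutual consistency can be checked with a short Python sketch. The confusion-matrix counts below are not reported in the abstract: they are an approximate reconstruction, obtained by multiplying the rounded sensitivity and specificity by the numbers of cases and controls, so the derived values are expected to be close to the published ones rather than identical. The function name `classification_metrics` is illustrative.

```python
import math


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": sensitivity + specificity - 1,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Cohort sizes are taken from the abstract.
cases, controls = 444, 901

# ASSUMPTION: counts reconstructed from the rounded sensitivity (0.28) and
# specificity (0.93); these are not the authors' reported confusion matrix.
tp = round(0.28 * cases)
fn = cases - tp
tn = round(0.93 * controls)
fp = controls - tn

metrics = classification_metrics(tp, fn, tn, fp)
for name, value in metrics.items():
    print(f"{name:>12}: {value:.3f}")
```

The pattern of low sensitivity and high specificity means that the rules flag relatively few cases but rarely misclassify controls, which is why Youden's index and the Matthews correlation coefficient remain modest despite an accuracy above 70%.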
This study represents the first illustration of an ML model applied to common variants associated with PBC. Our approach is computationally feasible, leverages individual-level data to generate intelligible rules, and can be used for disease prediction in at-risk individuals.