Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK.
NIHR Southampton Biomedical Research, University Hospital Southampton, Southampton, UK.
J Crohns Colitis. 2023 Nov 8;17(10):1672-1680. doi: 10.1093/ecco-jcc/jjad084.
Inflammatory bowel disease [IBD] is a chronic inflammatory disorder with two main subtypes: Crohn's disease [CD] and ulcerative colitis [UC]. Prompt subtype diagnosis enables the correct treatment to be administered. Using genomic data, we aimed to assess whether machine learning [ML] can classify patients according to IBD subtype.
Whole exome sequencing [WES] data from paediatric/adult IBD patients were processed using an in-house bioinformatics pipeline. These data were condensed into a per-gene, per-individual genomic burden score, GenePy. Data were split into training and testing datasets [80/20]. Feature selection with a linear support vector classifier, and hyperparameter tuning with Bayesian Optimisation, were performed on the training data only. The supervised ML method random forest was used to classify patients as CD or UC, using three gene panels: 1] all available genes; 2] autoimmune genes; 3] 'IBD' genes. Model performance was assessed on the testing dataset using area under the receiver operating characteristic curve [AUROC], sensitivity, and specificity.
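The three evaluation metrics named above can be illustrated with a minimal, standard-library sketch. The labels and scores below are made up for illustration and are not study data; treating CD as the positive class (1) and using a 0.5 decision threshold are assumptions, since the abstract does not state either.

```python
def auroc(y_true, scores):
    """AUROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as half a win)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy example (hypothetical values): 1 = CD, 0 = UC.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]        # model-predicted P(CD)
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

auc = auroc(y_true, scores)
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"AUROC={auc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

AUROC depends only on how the scores rank the cases, whereas sensitivity and specificity depend on the chosen threshold, which is why the three are usually reported together.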
A total of 906 patients were included in the analysis [600 CD, 306 UC]. The training data comprised 488 patients, balanced to the size of the minority class [UC]. The autoimmune gene panel generated the best performing ML model [AUROC = 0.68], outperforming the IBD gene panel [AUROC = 0.61]. NOD2 was the top gene for discriminating CD from UC, regardless of the gene panel used. An absence of variation in genes that had high GenePy scores in CD patients was the best classifier of a diagnosis of UC.
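The training-set size of 488 is consistent with an 80/20 split followed by balancing to the minority class: 80% of 306 UC patients rounds down to 244, and matching them with 244 CD patients gives 488. The sketch below reproduces that arithmetic. It assumes a per-class (stratified) split, rounding down, and random undersampling of CD; the abstract does not state these details, and the patient identifiers and function names are hypothetical.

```python
import random


def stratified_split(ids_by_class, seed=0):
    """Split each class 80/20 (integer arithmetic, rounding down)."""
    rng = random.Random(seed)
    train, test = {}, {}
    for label, ids in ids_by_class.items():
        shuffled = list(ids)
        rng.shuffle(shuffled)
        cut = len(shuffled) * 4 // 5  # 80% of the class, rounded down
        train[label] = shuffled[:cut]
        test[label] = shuffled[cut:]
    return train, test


def undersample_to_minority(train, seed=0):
    """Randomly keep as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    n_min = min(len(ids) for ids in train.values())
    return {label: rng.sample(ids, n_min) for label, ids in train.items()}


# Hypothetical identifiers standing in for the 600 CD and 306 UC patients.
cohort = {
    "CD": [f"CD_{i}" for i in range(600)],
    "UC": [f"UC_{i}" for i in range(306)],
}

train, test = stratified_split(cohort)
balanced = undersample_to_minority(train)
n_train = sum(len(ids) for ids in balanced.values())

print({label: len(ids) for label, ids in balanced.items()}, n_train)
```

Under these assumptions the testing dataset keeps the original class imbalance (120 CD, 62 UC), so only the training data are balanced.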
We demonstrate promising classification of patients by subtype using random forest and WES data. Focusing on specific subgroups of patients, with larger datasets, may result in better classification.