Tellaetxe-Abete Maitena, Calvo Borja, Lawrie Charles
Molecular Oncology Group, Biodonostia Health Research Institute, Paseo Doctor Begiristain, 20014 Donostia/San Sebastian, Spain.
Intelligent Systems Group, Computer Science Faculty, University of the Basque Country, Paseo Manuel Lardizabal, 20018 Donostia/San Sebastian, Spain.
NAR Genom Bioinform. 2021 Oct 27;3(4):lqab092. doi: 10.1093/nargab/lqab092. eCollection 2021 Dec.
Increasingly, treatment decisions for cancer patients are based on next-generation sequencing results obtained from formalin-fixed and paraffin-embedded (FFPE) biopsies. However, this material is prone to sequence artefacts that cannot be easily identified. To address this issue, we designed a machine learning-based algorithm to identify these artefacts using data from >1 600 000 variants from 27 paired FFPE and fresh-frozen breast cancer samples. From these data, we assembled a series of variant features and evaluated the classification performance of five machine learning algorithms. Using leave-one-sample-out cross-validation, we found that XGBoost (extreme gradient boosting) and random forest obtained AUC (area under the receiver operating characteristic curve) values >0.86. Further testing on two independent datasets gave AUC values of 0.96, whereas previously published tools reached a maximum AUC value of 0.92 in the comparison. The most discriminating features were read pair orientation bias, genomic context and variant allele frequency. In summary, our results show a promising future for the use of FFPE samples in molecular testing. We built the algorithm into an R package called Ideafix (DEAmination FIXing) that is freely available at https://github.com/mmaitenat/ideafix.
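The evaluation scheme named in the abstract, leave-one-sample-out cross-validation scored by AUC, can be illustrated with a minimal sketch. This is not the Ideafix implementation and does not use XGBoost or random forest: the toy data, the function names and the midpoint-based stand-in scorer are all hypothetical, and the premise that artefacts tend to show lower variant allele frequency (VAF) is an assumption made only for the example.

```python
from collections import defaultdict

# Hypothetical toy data: one entry per variant (not taken from the paper).
SAMPLE_IDS = ["S1"] * 4 + ["S2"] * 4 + ["S3"] * 4
VAFS = [0.02, 0.05, 0.30, 0.45,
        0.03, 0.40, 0.04, 0.35,
        0.25, 0.06, 0.50, 0.30]
LABELS = [1, 1, 0, 0,
          1, 0, 1, 0,
          0, 1, 0, 1]  # 1 = artefact, 0 = real variant


def leave_one_sample_out(sample_ids):
    """Yield (held_out, train_idx, test_idx), holding out every variant of one sample."""
    by_sample = defaultdict(list)
    for i, sample in enumerate(sample_ids):
        by_sample[sample].append(i)
    for held_out, test_idx in by_sample.items():
        train_idx = [i for i, s in enumerate(sample_ids) if s != held_out]
        yield held_out, train_idx, test_idx


def roc_auc(labels, scores):
    """AUC as the chance a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_midpoint(vafs, labels):
    """Stand-in 'model': midpoint between the mean VAF of each class (assumption)."""
    art = [v for v, y in zip(vafs, labels) if y == 1]
    real = [v for v, y in zip(vafs, labels) if y == 0]
    return (sum(art) / len(art) + sum(real) / len(real)) / 2


def cross_validate(sample_ids, vafs, labels):
    """Return one AUC per held-out sample."""
    aucs = {}
    for held_out, train_idx, test_idx in leave_one_sample_out(sample_ids):
        cutoff = fit_midpoint([vafs[i] for i in train_idx],
                              [labels[i] for i in train_idx])
        # Higher score = more artefact-like (VAF further below the learned cutoff).
        scores = [cutoff - vafs[i] for i in test_idx]
        aucs[held_out] = roc_auc([labels[i] for i in test_idx], scores)
    return aucs


if __name__ == "__main__":
    print(cross_validate(SAMPLE_IDS, VAFS, LABELS))
    # prints {'S1': 1.0, 'S2': 1.0, 'S3': 0.75}
```

Splitting by sample rather than by variant keeps all variants from one biopsy on the same side of each fold, so the per-fold AUC reflects performance on a sample the model has not seen.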