Data Science, Color Genomics, Burlingame, California.
Scientific Affairs, Color Genomics, Burlingame, California.
Hum Mutat. 2020 Jun;41(6):1079-1090. doi: 10.1002/humu.24011. Epub 2020 Apr 1.
Advances in genome sequencing have led to a tremendous increase in the discovery of novel missense variants, but evidence for determining clinical significance can be limited or conflicting. Here, we present Learning from Evidence to Assess Pathogenicity (LEAP), a machine learning model that utilizes a variety of feature categories to classify variants, and achieves high performance in multiple genes and different health conditions. Feature categories include functional predictions, splice predictions, population frequencies, conservation scores, protein domain data, and clinical observation data such as personal and family history and covariant information. L2-regularized logistic regression and random forest classification models were trained on missense variants detected and classified during the course of routine clinical testing at Color Genomics (14,226 variants from 24 cancer-related genes and 5,398 variants from 30 cardiovascular-related genes). Using 10-fold cross-validated predictions, the logistic regression model achieved an area under the receiver operating characteristic curve (AUROC) of 97.8% (cancer) and 98.8% (cardiovascular), while the random forest model achieved 98.3% (cancer) and 98.6% (cardiovascular). We demonstrate generalizability to different genes by validating predictions on genes withheld from training (96.8% AUROC). High accuracy and broad applicability make LEAP effective in the clinical setting as a high-throughput quality control layer.
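The evaluation described in the abstract (an L2-regularized logistic regression and a random forest, each scored by AUROC on 10-fold cross-validated predictions, plus validation on genes withheld from training) can be sketched as follows. This is a minimal illustration using scikit-learn, not the authors' implementation. The data are synthetic stand-ins generated with `make_classification`. The feature count, hyperparameters, and the `gene_ids` grouping are assumptions made for the example and do not come from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold, StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a variant feature matrix (assumption: the real
# features would be functional predictions, population frequencies, etc.).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    weights=[0.8, 0.2], random_state=0,
)

# Hypothetical gene label per variant, used only for the held-out-gene check.
rng = np.random.default_rng(0)
gene_ids = rng.integers(0, 5, size=len(y))

models = {
    # LogisticRegression applies an L2 penalty by default; C is the inverse
    # regularization strength (value chosen arbitrarily here).
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}


def cross_validated_auroc(model, X, y, cv, groups=None):
    """AUROC computed on out-of-fold predicted probabilities."""
    proba = cross_val_predict(
        model, X, y, cv=cv, groups=groups, method="predict_proba"
    )[:, 1]
    return roc_auc_score(y, proba)


# 10-fold cross-validation: every variant is scored by a model that did not
# see it during training.
ten_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
aurocs = {
    name: cross_validated_auroc(model, X, y, ten_fold)
    for name, model in models.items()
}

# Held-out-gene validation: all variants from a given gene fall in the same
# fold, so each prediction comes from a model trained on other genes only.
held_out_gene_aurocs = {
    name: cross_validated_auroc(
        model, X, y, GroupKFold(n_splits=5), groups=gene_ids
    )
    for name, model in models.items()
}

for name in models:
    print(f"{name}: 10-fold AUROC = {aurocs[name]:.3f}, "
          f"held-out-gene AUROC = {held_out_gene_aurocs[name]:.3f}")
```

Scoring the pooled out-of-fold predictions, rather than averaging per-fold scores, yields a single AUROC per model. Grouping folds by gene is one way to approximate the "genes withheld from training" check. The numbers printed here reflect the synthetic data only and are not comparable to the reported results.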