Department of Family Medicine and Community Health, University of Kansas Medical Center, Kansas City, Kansas, USA.
Proteomics Clin Appl. 2021 May;15(2-3):e1900124. doi: 10.1002/prca.201900124. Epub 2021 Mar 12.
Human exome sequences contain 15,000-20,000 variants but many variants have unknown clinical impact. In silico predictive classifiers are recognized by the American College of Medical Genetics as a resource for interpreting these "variants of uncertain significance." Many in silico classifiers have been developed, of which PolyPhen-2 is highly successful and widely used. PolyPhen-2 uses a naïve Bayes model to synthesize sequence, structural and genomic information. I investigated whether predictive performance could be improved by replacing PolyPhen-2's naïve Bayes model with alternative machine learning methods.
Classifiers using the PolyPhen-2 feature set were retrained using extreme gradient boosting (XGBoost), random forests, artificial neural networks, and support vector machines. Classifiers were externally validated on "pathogenic" and "benign" ClinVar variants absent from the training datasets. Software is implemented in Python and is freely available at https://github.com/djparente/polyboost and the Python Package Index (PyPI) under the BSD license.
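The external validation described above depends on excluding every ClinVar variant that also appears in the training data. A minimal sketch of that filter follows. The record layout, field names, and protein identifiers are assumptions made for illustration; they are not the data format used by PolyBoost or PolyPhen-2.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Variant(NamedTuple):
    # Hypothetical record layout; field names are illustrative only.
    protein: str   # protein identifier
    position: int  # 1-based residue position
    ref: str       # reference amino acid
    alt: str       # substituted amino acid
    label: int     # 1 = pathogenic, 0 = benign


def variant_key(v: Variant) -> Tuple[str, int, str, str]:
    """Identify a substitution independently of its label."""
    return (v.protein, v.position, v.ref, v.alt)


def external_validation_set(
    clinvar: Iterable[Variant], training: Iterable[Variant]
) -> List[Variant]:
    """Keep only ClinVar variants whose substitution is absent from training."""
    seen = {variant_key(v) for v in training}
    return [v for v in clinvar if variant_key(v) not in seen]


# Toy data with made-up identifiers.
training = [
    Variant("PROT_A", 10, "A", "V", 1),
    Variant("PROT_B", 55, "G", "R", 0),
]
clinvar = [
    Variant("PROT_A", 10, "A", "V", 1),   # overlaps training, dropped
    Variant("PROT_A", 10, "A", "T", 0),   # same site, different substitution, kept
    Variant("PROT_C", 200, "L", "P", 1),  # unseen protein, kept
]

held_out = external_validation_set(clinvar, training)
```

Matching on the substitution rather than on the full record means a variant is excluded even if its label differs between the two sources, which is the stricter choice for avoiding leakage.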
An XGBoost-based classifier, designated PolyBoost (PolyPhen-2 Booster), improves discriminative performance and calibration relative to PolyPhen-2 in external validation on ClinVar.
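Discrimination and calibration are distinct properties of a probabilistic classifier. The abstract does not name the metrics used, so the sketch below assumes two common choices: the area under the ROC curve (discrimination) and the Brier score (a calibration-sensitive accuracy measure). The labels and scores are toy values, not results from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random pathogenic variant outscores a random
    benign one, counting ties as half (the Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome;
    lower is better, and 0 means perfectly confident and correct."""
    if len(labels) != len(scores) or not labels:
        raise ValueError("labels and scores must be non-empty and equal length")
    return sum((s - y) ** 2 for y, s in zip(labels, scores)) / len(labels)


# Toy example: 1 = pathogenic, 0 = benign; scores are predicted probabilities.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]

auc = roc_auc(labels, scores)
brier = brier_score(labels, scores)
```

A classifier can rank variants well (high AUC) while still reporting probabilities that are systematically too high or too low, which is why the two properties are assessed separately.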
PolyBoost analyzes PolyPhen-2 output and can be incorporated into existing bioinformatics workflows as a post-analysis method to improve interpretation of clinical exome sequences obtained to identify monogenic disease.