Clinical Bioinformatics Lab, Imagine Institute, Paris Descartes University, Sorbonne Paris Cité, 75015, Paris, France.
INSERM UMR 1163, Institut Imagine, 75015, Paris, France.
Genome Biol. 2019 Feb 11;20(1):32. doi: 10.1186/s13059-019-1634-2.
State-of-the-art methods for assessing the pathogenicity of non-coding variants have mostly been characterized on common disease-associated polymorphisms, where they show modest accuracy and strong positional biases. In this study, we curated 737 high-confidence pathogenic non-coding variants associated with monogenic Mendelian diseases. In addition to interspecies conservation, we explore a comprehensive set of recent and ongoing purifying selection signals in humans, which accounts for lineage-specific regulatory elements. Supervised learning using gradient tree boosting on these features achieves high predictive performance and overcomes positional bias. The resulting method, NCBoost, performs consistently across diverse learning and independent testing data sets and outperforms existing reference methods.
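To make the phrase "supervised learning using gradient tree boosting" concrete, here is a minimal, illustrative sketch in pure Python. It is not the NCBoost implementation: the two features (stand-ins for a conservation score and a human purifying-selection score), the synthetic data, the depth-1 trees, and all function names and hyperparameters are assumptions chosen only to show the technique.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Fit a depth-1 regression tree: the one-feature threshold split
    that minimizes squared error against the residuals."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lmean, rmean)
    return best[1:]  # (feature index, threshold, left value, right value)


def fit_boosting(X, y, n_rounds=20, learning_rate=0.3):
    """Gradient boosting for the logistic loss: each round fits a stump to
    the negative gradient (label minus predicted probability)."""
    pos = sum(y) / len(y)
    base = math.log(pos / (1.0 - pos))  # initial log-odds
    F = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(fi) for yi, fi in zip(y, F)]
        j, thr, lval, rval = fit_stump(X, residuals)
        stumps.append((j, thr, lval, rval))
        F = [fi + learning_rate * (lval if row[j] <= thr else rval)
             for fi, row in zip(F, X)]
    return base, learning_rate, stumps


def predict_score(model, row):
    """Return a pathogenicity-like probability in [0, 1] for one variant."""
    base, lr, stumps = model
    z = base
    for j, thr, lval, rval in stumps:
        z += lr * (lval if row[j] <= thr else rval)
    return sigmoid(z)


def make_toy_data(n, seed):
    """Synthetic data (assumption): label 1 variants have both toy
    features shifted upward relative to label 0 variants."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        shift = 1.5 if label else 0.0
        X.append([rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)])
        y.append(label)
    return X, y


X_train, y_train = make_toy_data(200, seed=0)
X_test, y_test = make_toy_data(100, seed=1)
model = fit_boosting(X_train, y_train)
scores = [predict_score(model, row) for row in X_test]
accuracy = sum((s >= 0.5) == bool(t) for s, t in zip(scores, y_test)) / len(y_test)
print(f"held-out accuracy on toy data: {accuracy:.2f}")
```

Evaluating on a separately generated test set mirrors, in a very reduced form, the abstract's distinction between learning data and independent testing data. A real analysis would use a maintained boosting library and curated variant annotations rather than this hand-rolled version.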