Douville Christopher, Masica David L, Stenson Peter D, Cooper David N, Gygax Derek M, Kim Rick, Ryan Michael, Karchin Rachel
Department of Biomedical Engineering and Institute for Computational Medicine, The Johns Hopkins University, Baltimore, Maryland.
Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff, UK.
Hum Mutat. 2016 Jan;37(1):28-35. doi: 10.1002/humu.22911. Epub 2015 Oct 26.
Insertion/deletion variants (indels) alter protein sequence and length, yet are highly prevalent in healthy populations, presenting a challenge to bioinformatics classifiers. Commonly used features (DNA and protein sequence conservation, indel length, and occurrence in repeat regions) are useful for inference of protein damage. However, these features can cause false positives when predicting the impact of indels on disease. Existing methods for indel classification suffer from low specificity, severely limiting clinical utility. Here, we further develop our variant effect scoring tool (VEST) to include the classification of in-frame and frameshift indels (VEST-indel) as pathogenic or benign. We apply 24 features, including a new "PubMed" feature, to estimate a gene's importance in human disease. When compared with four existing indel classifiers, our method achieves a drastically reduced false-positive rate, improving specificity by as much as 90%. This approach of estimating gene importance might be generally applicable to missense and other bioinformatics pathogenicity predictors, which often fail to achieve high specificity. Finally, we tested all possible meta-predictors that can be obtained from combining the four different indel classifiers using Boolean conjunctions and disjunctions, and derived a meta-predictor with improved performance over any individual method.
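The abstract does not spell out how the Boolean meta-predictors were enumerated, so the following is only a minimal sketch of one way it could be done, not the authors' implementation. It assumes each classifier emits a binary call (1 = pathogenic, 0 = benign) and builds every distinct predictor reachable from the individual classifiers by repeated AND and OR. The function names `monotone_combinations` and `evaluate` are hypothetical, and any votes and labels passed to `evaluate` would be the reader's own data.

```python
from itertools import product


def monotone_combinations(n_classifiers):
    """Enumerate every distinct predictor built from AND/OR of the inputs.

    Each predictor is a truth table: one 0/1 output per possible pattern
    of classifier votes, in the order given by `rows`.
    """
    rows = list(product([0, 1], repeat=n_classifiers))
    # Start from each classifier used alone.
    found = {tuple(row[i] for row in rows) for i in range(n_classifiers)}
    frontier = set(found)
    while frontier:
        new = set()
        for a in frontier:
            for b in found:
                conj = tuple(x & y for x, y in zip(a, b))  # a AND b
                disj = tuple(x | y for x, y in zip(a, b))  # a OR b
                for combo in (conj, disj):
                    if combo not in found:
                        new.add(combo)
        found |= new
        frontier = new  # stop once no new truth table appears
    return rows, found


def evaluate(table, rows, votes, labels):
    """Return (sensitivity, specificity) of one meta-predictor.

    `votes` holds one tuple of classifier calls per variant and `labels`
    the true class (1 = pathogenic, 0 = benign); both classes must occur.
    """
    lookup = dict(zip(rows, table))
    preds = [lookup[tuple(v)] for v in votes]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg
```

For four classifiers this closure yields 166 distinct non-constant combinations, each of which could then be scored with `evaluate` on a labeled benchmark and ranked by whichever trade-off between sensitivity and specificity is of interest.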