Chapman Timothy, Lassmann Timo
The Kids Research Institute Australia, 15 Hospital Ave, Nedlands, WA, 6009, Australia.
UWA Centre for Child Health Research, The University of Western Australia, 35 Stirling Hwy, Crawley, Western Australia, 6009, Australia.
BMC Genomics. 2025 May 28;26(1):540. doi: 10.1186/s12864-025-11711-w.
Whole genome sequencing offers significant potential to improve the diagnosis and treatment of rare diseases by enabling the identification of thousands of rare, potentially pathogenic variants. Existing variant prioritisation tools can be complemented by approaches that incorporate phenotype specificity and provide contextual biological information, such as tissue or cell-type specificity. We hypothesised that integrating single-cell gene expression data into phenotype-specific models would improve the accuracy and interpretability of pathogenic variant prioritisation.
To test this hypothesis, we developed IMPPROVE, a new tool that constructs phenotype-specific ensemble models integrating CADD scores with bulk and single-cell gene expression data. In total, we constructed 1,866 Random Forest models, each for an individual Human Phenotype Ontology (HPO) term, using both bulk and single-cell expression data.
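The integration step described above can be illustrated with a minimal sketch. It is not IMPPROVE's actual code: the function name `build_feature_rows`, the gene and tissue names, the toy values, and the choice to fill missing expression with 0.0 are all assumptions made for illustration. It shows one plausible layout in which each variant contributes one row holding its CADD score followed by the expression of its gene in each tissue or cell type.

```python
from typing import Dict, List, Sequence, Tuple

# (variant id, gene symbol, CADD score, known pathogenic for this phenotype?)
Variant = Tuple[str, str, float, bool]


def build_feature_rows(
    variants: Sequence[Variant],
    expression: Dict[str, Dict[str, float]],
    contexts: Sequence[str],
) -> Tuple[List[List[float]], List[int]]:
    """Build one feature row per variant for a single phenotype.

    Each row is [CADD score, expression in contexts[0], expression in contexts[1], ...].
    Genes or contexts missing from `expression` are filled with 0.0 (an assumption).
    """
    X: List[List[float]] = []
    y: List[int] = []
    for _variant_id, gene, cadd, is_pathogenic in variants:
        gene_expr = expression.get(gene, {})
        X.append([cadd] + [gene_expr.get(c, 0.0) for c in contexts])
        y.append(1 if is_pathogenic else 0)
    return X, y


# Hypothetical toy inputs, not data from the study.
variants = [
    ("var1", "GENE_A", 25.3, True),
    ("var2", "GENE_B", 4.1, False),
    ("var3", "GENE_C", 18.0, False),  # gene with no expression record
]
expression = {
    "GENE_A": {"heart": 12.0, "retina": 0.5},
    "GENE_B": {"heart": 0.0, "retina": 7.2},
}
contexts = ["heart", "retina"]

X, y = build_feature_rows(variants, expression, contexts)
```

A matrix of this shape could then be passed to a Random Forest implementation, for example scikit-learn's `RandomForestClassifier().fit(X, y)`, with one such model trained per HPO term.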
Our phenotype-specific models that use expression data predicted pathogenic variants better for 90% of the phenotypes (HPO terms) considered. Using single-cell rather than bulk expression data benefited the models, significantly shifting the proportion of pathogenic variants correctly identified at a fixed false positive rate (approximate Wilcoxon signed rank test). For 57 phenotypes, model performance differed substantially depending on the dataset used. Further analysis revealed biological links between the pathology and the tissues or cell types used by these 57 models.
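The comparison described above can be sketched in two steps: compute, per phenotype, the proportion of pathogenic variants recovered at a fixed false positive rate, then compare the paired values from bulk and single-cell models with a normal-approximation Wilcoxon signed rank test. This is an illustrative sketch, not the authors' implementation: the function names and toy inputs are hypothetical, and the test statistic here uses a tie correction but no continuity correction, which may differ from the variant used in the study.

```python
import math
from typing import List, Sequence, Tuple


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[int], max_fpr: float) -> float:
    """True positive rate at the most lenient threshold whose FPR stays <= max_fpr."""
    negatives = sorted((s for s, l in zip(scores, labels) if l == 0), reverse=True)
    positives = [s for s, l in zip(scores, labels) if l == 1]
    allowed_fp = int(math.floor(max_fpr * len(negatives)))
    if allowed_fp >= len(negatives):
        threshold = -math.inf
    else:
        # Calling "score > threshold" positive admits at most allowed_fp negatives.
        threshold = negatives[allowed_fp]
    return sum(s > threshold for s in positives) / len(positives)


def wilcoxon_signed_rank_approx(
    a: Sequence[float], b: Sequence[float]
) -> Tuple[float, float, float]:
    """Two-sided Wilcoxon signed rank test using the normal approximation.

    Returns (W+, z, p). Zero differences are dropped; tied |differences| share
    their average rank and the variance is tie-corrected.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks: List[float] = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return w_plus, z, p


# Hypothetical toy scores for one phenotype (label 1 = pathogenic).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 1]
tpr_strict = tpr_at_fpr(scores, labels, max_fpr=0.0)
tpr_lenient = tpr_at_fpr(scores, labels, max_fpr=0.25)

# Hypothetical paired per-phenotype values (e.g. single-cell vs bulk models).
w_plus, z, p = wilcoxon_signed_rank_approx([5, 7, 9, 11], [4, 5, 6, 7])
```

In practice the paired inputs to the test would be the per-phenotype proportions from the two model families, one pair per HPO term.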
Phenotype-specific models that integrate gene expression data with CADD scores show great promise in improving variant prioritisation. In addition to improving diagnostic accuracy, these models offer insights into the underlying biological mechanisms of rare diseases. Enriching existing pathogenicity-related scores with gene expression datasets has the potential to advance personalised medicine through more accurate and interpretable variant prioritisation.