BioFolD Unit, Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Via F. Selmi 3, Bologna 40126, Italy.
Department of Medical Sciences, University of Torino, Via Santena 19, 10126, Torino, Italy.
Nucleic Acids Res. 2023 Jul 5;51(W1):W451-W458. doi: 10.1093/nar/gkad455.
One of the primary challenges in human genetics is determining the functional impact of single nucleotide variants (SNVs) and insertions and deletions (InDels), whether coding or noncoding. Many methods have been developed to detect disease-related single amino acid changes, but only some can assess the impact of noncoding variations. CADD is the most commonly used and advanced algorithm for predicting the diverse effects of genome variations. It combines sequence conservation with functional features derived from ENCODE project data. Using CADD requires downloading a large set of pre-calculated information during installation. To streamline variant annotation, we developed PhD-SNPg, a lightweight, easy-to-install machine-learning tool that relies solely on sequence-based features. Here we present an updated version, trained on a larger dataset, that can also predict the impact of InDel variations. Despite its simplicity, PhD-SNPg performs similarly to CADD, making it well suited for rapid genome interpretation and as a benchmark for tool development.