Graduate Group in Genomics and Computational Biology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, 19104, USA.
Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA.
Nat Commun. 2023 Nov 28;14(1):7805. doi: 10.1038/s41467-023-43651-y.
Structural variants (SVs) represent a major source of genetic variation associated with phenotypic diversity and disease susceptibility. While long-read sequencing can discover over 20,000 SVs per human genome, interpreting their functional consequences remains challenging. Existing methods for identifying disease-related SVs handle only deletions and duplications and cannot prioritize the individual genes affected by an SV, especially for noncoding SVs. Here, we introduce PhenoSV, a phenotype-aware machine-learning model that interprets all major types of SVs and the genes they affect. PhenoSV segments and annotates SVs with diverse genomic features and employs a transformer-based architecture to predict their impacts under a multiple-instance learning framework. When phenotype information is available, PhenoSV further uses gene-phenotype associations to prioritize phenotype-related SVs. Evaluation on extensive human SV datasets covering all SV types demonstrates PhenoSV's superior performance over competing methods. Applications in diseases suggest that PhenoSV can determine disease-related genes from SVs. A web server and a command-line tool for PhenoSV are available at https://phenosv.wglab.org.
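To make the segmentation and multiple-instance learning idea concrete, the following is a minimal sketch, not PhenoSV's implementation. It assumes an SV is given as a half-open coordinate interval, splits it into contiguous segments (the "instances"), and pools per-segment scores into one SV-level ("bag") score. The function names `segment_sv`, `bag_score_max`, and `bag_score_noisy_or` are hypothetical, the per-segment scores are made-up inputs, and simple max and noisy-OR pooling stand in for the transformer-based aggregation described above.

```python
from typing import List, Sequence, Tuple


def segment_sv(start: int, end: int, n_segments: int) -> List[Tuple[int, int]]:
    """Split the half-open interval [start, end) into contiguous segments.

    Hypothetical helper: at most n_segments pieces are returned, and fewer
    if the interval is shorter than n_segments bases.
    """
    if end <= start or n_segments < 1:
        raise ValueError("need end > start and n_segments >= 1")
    length = end - start
    n = min(n_segments, length)
    # Integer arithmetic keeps boundaries exact and the segments gap-free.
    bounds = [start + (length * i) // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def bag_score_max(instance_scores: Sequence[float]) -> float:
    """Max pooling: the SV is as pathogenic as its most pathogenic segment."""
    if not instance_scores:
        raise ValueError("a bag needs at least one instance")
    return max(instance_scores)


def bag_score_noisy_or(instance_scores: Sequence[float]) -> float:
    """Noisy-OR pooling: probability that at least one segment is pathogenic,
    treating the per-segment probabilities as independent (an assumption)."""
    if not instance_scores:
        raise ValueError("a bag needs at least one instance")
    p_none = 1.0
    for p in instance_scores:
        p_none *= 1.0 - p
    return 1.0 - p_none


if __name__ == "__main__":
    segments = segment_sv(100, 200, 4)
    scores = [0.1, 0.9, 0.3, 0.2]  # made-up per-segment scores
    print(segments)
    print(bag_score_max(scores), bag_score_noisy_or(scores))
```

Both pooling rules encode the standard multiple-instance assumption that a bag is positive when at least one of its instances is positive; a learned aggregator such as attention replaces the fixed rule with weights fitted from data.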