School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, PR China.
State Key Laboratory of Organic Electronics and Information Displays & Institute of Advanced Materials (IAM), Nanjing University of Posts & Telecommunications, 9 Wenyuan Road, Nanjing 210023, PR China.
J Chem Inf Model. 2024 Feb 26;64(4):1407-1418. doi: 10.1021/acs.jcim.3c02019. Epub 2024 Feb 9.
Studying the effect of single amino acid variations (SAVs) on protein structure and function is integral to advancing our understanding of molecular processes, evolutionary biology, and disease mechanisms. Screening for deleterious variants is one of the crucial issues in precision medicine. Here, we propose a novel computational approach, TransEFVP, based on large-scale protein language model embeddings and a transformer-based neural network to predict disease-associated SAVs. The model adopts a two-stage architecture. In the first stage, a transformer encoder fuses the different feature embeddings. In the second stage, a support vector machine model quantifies the pathogenicity of SAVs after dimensionality reduction. On blind test data, TransEFVP achieves a Matthews correlation coefficient of 0.751, an F-score of 0.846, and an area under the receiver operating characteristic curve of 0.871, outperforming existing state-of-the-art methods. The benchmark results demonstrate that TransEFVP can be explored as an accurate and effective SAV pathogenicity prediction method. The data and code for TransEFVP are available at https://github.com/yzh9607/TransEFVP/tree/master for academic use.
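For readers less familiar with the reported metrics, the following is a minimal sketch of how a Matthews correlation coefficient and an F-score can be computed from binary predictions. It is not taken from the TransEFVP code. The function names and the toy labels are hypothetical, and the F-score is assumed here to be the balanced F1 variant, which the abstract does not state explicitly.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = pathogenic, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels for illustration only; these are not data from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
mcc_value = matthews_corrcoef(tp, tn, fp, fn)
f1_value = f1_score(tp, fp, fn)
print(f"MCC = {mcc_value:.3f}, F1 = {f1_value:.3f}")
```

Unlike the F-score, the MCC uses all four cells of the confusion matrix, so it remains informative when pathogenic and benign variants are imbalanced. That is a common reason for reporting both.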