Tariq Usman, Saeed Fahad
Knight Foundation School of Computing and Information Sciences, Florida International University (FIU), Miami, FL, USA.
Biomolecular Sciences Institute (BSI), Florida International University, Miami, FL, USA.
bioRxiv. 2024 Aug 22:2024.08.21.609035. doi: 10.1101/2024.08.21.609035.
Database search algorithms reduce the number of candidate peptides that must be scored by filtering on a single property, the precursor mass. While useful, filtering on one property may exclude non-abundant spectra and uncharacterized peptides, potentially exacerbating the effect. Here we present ProteoRift, a novel attention-based, multitask deep network that predicts multiple peptide properties (length, missed cleavages, and modification status) directly from spectra. We demonstrate that ProteoRift can predict these properties with up to 97% accuracy, reducing the search space by more than 90%. As a result, our end-to-end pipeline is shown to exhibit 8x to 12x speedups with peptide deduction accuracy comparable to algorithmic techniques. We also formulate two uncertainty estimation metrics, which can distinguish between in-distribution and out-of-distribution data (ROC-AUC 0.99) and predict which mass spectra will score highly against the correct peptide (ROC-AUC 0.94). These models and metrics are integrated in an end-to-end ML pipeline available at https://github.com/pcdslab/ProteoRift.
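To make the search-space reduction concrete, the following is a minimal sketch of how predicted properties could narrow a candidate list beyond a mass-only filter. It is not the ProteoRift implementation: the `Candidate` class, the `filter_candidates` function, the tolerance parameters, and all masses are hypothetical, illustrative values, and the predicted properties are simply hard-coded where a trained model would supply them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A database peptide (hypothetical structure, for illustration only)."""
    sequence: str
    mass: float            # toy neutral mass in Da, not a computed value
    missed_cleavages: int
    modified: bool


def filter_by_mass(candidates, precursor_mass, mass_tol):
    """Conventional single-property filter: keep peptides within a mass window."""
    return [c for c in candidates if abs(c.mass - precursor_mass) <= mass_tol]


def filter_candidates(candidates, precursor_mass, mass_tol,
                      pred_length, length_tol,
                      pred_missed_cleavages, pred_modified):
    """Multi-property filter: the mass window plus predicted length,
    missed-cleavage count, and modification status."""
    kept = []
    for c in filter_by_mass(candidates, precursor_mass, mass_tol):
        if abs(len(c.sequence) - pred_length) > length_tol:
            continue  # predicted length disagrees
        if c.missed_cleavages != pred_missed_cleavages:
            continue  # predicted missed cleavages disagree
        if c.modified != pred_modified:
            continue  # predicted modification status disagrees
        kept.append(c)
    return kept


# Toy database; the masses are made up so that four peptides share a mass window.
database = [
    Candidate("PEPTIDEK", 1000.2, 0, False),
    Candidate("AAAAAAAAAAAAK", 1000.4, 0, False),
    Candidate("PEPKTIDER", 1000.1, 1, False),
    Candidate("SAMPLEMK", 1000.3, 0, True),
    Candidate("FARAWAYK", 1500.0, 0, False),
]

mass_only = filter_by_mass(database, precursor_mass=1000.25, mass_tol=0.5)
kept = filter_candidates(
    database, precursor_mass=1000.25, mass_tol=0.5,
    pred_length=8, length_tol=1,          # assumed model outputs
    pred_missed_cleavages=0, pred_modified=False,
)
reduction = 1 - len(kept) / len(mass_only)
print(len(mass_only), len(kept), reduction)
```

In this toy case the mass window alone leaves four candidates to score, while the additional property filters leave one. The trade-off is that a wrong property prediction can discard the correct peptide, which is why a tolerance on length (and, in the abstract, uncertainty estimation) matters.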